R quanteda tokenization – Learning how to learn

When it comes to text analysis, a lot of articles would recommend clean the texts before moving forward, such as removing punctuation, lower letters, removing stop words, white space, removing numbers, etc. In the tm() package, all these can be done with function tm_map(). However, because quanteda’s philosophy is to keep the original corpus intact. All these have to be done during the step of tokenization.

Good news is, qunteda’s tokens() function can do all above with a few extra, except that it can’t do remove stop words.

system.time(tokenized_txt<-tokens(final_corpus_sample,remove_numbers = TRUE, remove_punct = TRUE, remove_separators = TRUE,remove_symbols=TRUE, remove_twitter=TRUE,remove_url = TRUE))

But then I found that you can use tokens_select() to remove the stopwords:

nostop_toks <- tokens_select(tokenized_txt, stopwords('en'), selection = 'remove')

After that, I built 2-6 grams:

system.time(tokens_2gram<-tokens_ngrams(nostop_toks,n=2)) system.time(tokens_3gram<-tokens_ngrams(nostop_toks,n=3)) system.time(tokens_4gram<-tokens_ngrams(nostop_toks,n=4)) system.time(tokens_5gram<-tokens_ngrams(nostop_toks,n=5)) system.time(tokens_6gram<-tokens_ngrams(nostop_toks,n=6))

The corresponding system.time are as following:

Learning how to learn

Tag: R quanteda tokenization

Coursera Data Science Specialization Capstone course learning journal 4 – Tokenize the Corpus