Coursera Data Science Specialization Capstone course learning journal 4 – Tokenize the Corpus

When it comes to text analysis, many articles recommend cleaning the text before moving forward: removing punctuation, converting to lowercase, removing stop words, stripping extra white space, removing numbers, and so on. In the tm package, all of this can be done with the function tm_map(). However, quanteda’s philosophy is to keep the original corpus intact, so all of this cleaning has to happen during tokenization.
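For comparison, a typical tm cleaning pipeline looks roughly like the sketch below. This is a minimal example of my own (not from the capstone code), assuming a small character vector as input:

library(tm)

# build a toy corpus from a character vector
sample_corp <- VCorpus(VectorSource("An Example DOC, with 123 numbers and   extra   spaces!"))

# the usual cleaning steps, each applied with tm_map()
sample_corp <- tm_map(sample_corp, content_transformer(tolower))       # lowercase
sample_corp <- tm_map(sample_corp, removePunctuation)                  # drop punctuation
sample_corp <- tm_map(sample_corp, removeNumbers)                      # drop numbers
sample_corp <- tm_map(sample_corp, removeWords, stopwords("english"))  # drop stop words
sample_corp <- tm_map(sample_corp, stripWhitespace)                    # collapse extra white space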

The good news is that quanteda’s tokens() function can do all of the above, plus a few extra things, except that it can’t remove stop words.

system.time(tokenized_txt <- tokens(final_corpus_sample, remove_numbers = TRUE, remove_punct = TRUE, remove_separators = TRUE, remove_symbols = TRUE, remove_twitter = TRUE, remove_url = TRUE))
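(One caveat: in more recent quanteda releases the remove_twitter argument has been dropped from tokens(), so this exact call may need adjusting depending on your version.) To sanity-check the result, it helps to look at the token counts and the first few tokens of one document. A quick check along these lines, assuming the tokenized_txt object created above:

head(ntoken(tokenized_txt))              # number of tokens in the first few documents
head(as.list(tokenized_txt)[[1]], 20)    # first 20 tokens of the first document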

But then I found that you can use tokens_select() to remove the stopwords:

nostop_toks <- tokens_select(tokenized_txt, stopwords('en'), selection = 'remove')
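As far as I can tell, quanteda also provides tokens_remove() as a shorthand for the same selection, so this should be equivalent:

nostop_toks <- tokens_remove(tokenized_txt, stopwords('en'))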

After that, I built 2- to 6-grams:

system.time(tokens_2gram <- tokens_ngrams(nostop_toks, n = 2))
system.time(tokens_3gram <- tokens_ngrams(nostop_toks, n = 3))
system.time(tokens_4gram <- tokens_ngrams(nostop_toks, n = 4))
system.time(tokens_5gram <- tokens_ngrams(nostop_toks, n = 5))
system.time(tokens_6gram <- tokens_ngrams(nostop_toks, n = 6))

The corresponding system.time results for each call are as follows:

[system.time timings for the five tokens_ngrams() calls]
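Going beyond these notes, the natural next step would be to tabulate n-gram frequencies by converting the tokens into document-feature matrices. A minimal sketch for the bigrams, assuming the objects created above (the same applies to the 3- to 6-gram objects):

dfm_2gram <- dfm(tokens_2gram)   # document-feature matrix of bigrams
topfeatures(dfm_2gram, 10)       # ten most frequent bigrams across the corpus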