Tag: text mining

  • Coursera Data Science Specialization Capstone course learning journal 4 – Tokenize the Corpus

    When it comes to text analysis, many articles recommend cleaning the texts before moving forward: removing punctuation, lowercasing, removing stop words, stripping extra white space, removing numbers, and so on. In the tm package, all of this can be done with the tm_map() function. However, because quanteda’s philosophy is to keep the original corpus intact, all of this has to be done during tokenization.
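
    For comparison, this is roughly what that cleaning step looks like with tm’s tm_map(); the corpus here ("raw_corpus") is a made-up example for illustration, not one from this project:

    # Hypothetical tm cleaning pipeline, for comparison only
    library(tm)
    raw_corpus   <- VCorpus(VectorSource(c("Some raw Text, with 123 numbers and punctuation!")))
    clean_corpus <- tm_map(raw_corpus, content_transformer(tolower))        # lowercase
    clean_corpus <- tm_map(clean_corpus, removePunctuation)                 # drop punctuation
    clean_corpus <- tm_map(clean_corpus, removeNumbers)                     # drop numbers
    clean_corpus <- tm_map(clean_corpus, removeWords, stopwords("english")) # drop stop words
    clean_corpus <- tm_map(clean_corpus, stripWhitespace)                   # collapse extra white space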

    The good news is that quanteda’s tokens() function can do all of the above and a few extras, except that it can’t remove stop words.

    system.time(
      tokenized_txt <- tokens(final_corpus_sample,
                              remove_numbers    = TRUE,
                              remove_punct      = TRUE,
                              remove_separators = TRUE,
                              remove_symbols    = TRUE,
                              remove_twitter    = TRUE,
                              remove_url        = TRUE)
    )

    But then I found that you can use tokens_select() to remove the stopwords:

    nostop_toks <- tokens_select(tokenized_txt, stopwords('en'), selection = 'remove')
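
    If I am reading the quanteda documentation right, tokens_remove() is just a shorthand for the same call, so this should be equivalent:

    nostop_toks <- tokens_remove(tokenized_txt, stopwords('en'))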

    After that, I built 2-6 grams:

    system.time(tokens_2gram <- tokens_ngrams(nostop_toks, n = 2))
    system.time(tokens_3gram <- tokens_ngrams(nostop_toks, n = 3))
    system.time(tokens_4gram <- tokens_ngrams(nostop_toks, n = 4))
    system.time(tokens_5gram <- tokens_ngrams(nostop_toks, n = 5))
    system.time(tokens_6gram <- tokens_ngrams(nostop_toks, n = 6))

    The corresponding system.time() results are as follows:

    [system.time output for the 2- to 6-gram calls]

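    As a quick sanity check on these n-grams, one could build a document-feature matrix from any of the tokens objects and eyeball the most frequent entries; a minimal sketch, assuming quanteda is loaded:

    # Hypothetical check: build a dfm from the bigram tokens and list the top bigrams
    dfm_2gram <- dfm(tokens_2gram)
    topfeatures(dfm_2gram, 20)   # the 20 most frequent bigrams in the corpus
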
  • Coursera Data Science Specialization Capstone Project – thoughts

    At last, I am at the capstone project. After three years of working on this Coursera specialization on and off, I am finally here.

    The project gives you a set of text documents and asks you to mine the texts and come up with your own model. So far, I am on week 2. I haven’t dived deep enough into the project yet, so I don’t know exactly how I am going to mine the texts or what kind of model I will use. But while preparing last week for our three-minute “what is your passion” presentation for our Monday team retreat at Leadercast, I came across Maslow’s hierarchy of needs. I think it would be neat to look at the words in each level of the hierarchy and see how frequently people use them in their daily blog posts, tweets, and news.

    Maslow's Hierarchy

    To do this, I need to:

    1. Obtain a dictionary with all words categorized into Maslow’s hierarchy.
    2. Run all words in the files against the dictionary to determine which level of the hierarchy they belong to (see the sketch after this list).
      1. Calculate the frequency of each unique word.
      2. Calculate the frequency of each level.
    3. It would be fun to look at the frequency of each level in general, then look at the correlations between levels.
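
    In quanteda terms, step 2 could look something like the sketch below. The word lists here are placeholders I made up for illustration; a real categorized dictionary would still be needed:

    library(quanteda)

    # Placeholder word lists for each Maslow level (illustration only)
    maslow_dict <- dictionary(list(
      physiological      = c("food", "water", "sleep"),
      safety             = c("safe", "security", "job"),
      belonging          = c("friend", "family", "love"),
      esteem             = c("respect", "confidence", "achievement"),
      self_actualization = c("creativity", "purpose", "growth")
    ))

    # Map each token to its Maslow level, then count hits per level
    toks <- tokens(c("I love my family and my job gives me purpose"))
    level_counts <- dfm(tokens_lookup(toks, maslow_dict))
    colSums(level_counts)   # frequency of each level across the documents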