Tag: R NLP tm

  • Coursera Data Science Specialization Capstone course learning journal -1

    I am finally at the last course of the Coursera Data Science Specialization. I already know that I will need to learn Python to become a real expert in data analysis in my field, but first I need to finish this specialization.

    It has been quite a steep learning curve, even though I have already finished the first nine courses. The reason is that this capstone course introduces an entirely new topic: Natural Language Processing. I have been reading a lot in the past days, including over the past weekend, trying numerous new packages, and failing. I first started with tm, the traditional R text-analysis package, and learned the basics of removing stop words, removing punctuation, stemming, removing numbers, stripping white space, etc. These are all done with the tm_map() function. Then there is the findFreqTerms() function to list the most frequent terms:

    con <- file("/Users/ruihu/Documents/DataScience/capstone/final/en_US/en_US.blogs.txt")
    ## file_length <- length(readLines(con))

    temp_matrix <- VCorpus(VectorSource(readLines(con, encoding = "UTF-8")))

    ## inspect(temp_matrix[[2]])
    ## meta(temp_matrix[[122]], "id")

    ## eliminate extra whitespace
    temp_matrix <- tm_map(temp_matrix, stripWhitespace)
    ## convert to lower case
    temp_matrix <- tm_map(temp_matrix, content_transformer(tolower))
    ## remove stopwords
    temp_matrix <- tm_map(temp_matrix, removeWords, stopwords("english"))

    crudeTDM <- TermDocumentMatrix(temp_matrix,
                                   control = list(stemming = TRUE, stopwords = TRUE))
    inspect(crudeTDM)
    crudeTDM_dis <- dist(as.matrix(crudeTDM), method = "euclidean")
    ## crudeTDM_no_sparse <- removeSparseTerms(crudeTDM, 0.9)
    ## inspect(crudeTDM_no_sparse)
    ## summary(crudeTDM_no_sparse)

    crudeTDMHighFreq <- findFreqTerms(crudeTDM, 1000, 1050)
    sort(crudeTDMHighFreq[-grep("[0-9]", crudeTDMHighFreq)])
    ## crudeTDM_no_sparseHighFreq <- findFreqTerms(crudeTDM_no_sparse, 1, 500)
    close(con)


    Then I realized that I still didn't know how to compute term correlations or create n-grams.
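
    For what it's worth, tm does offer starting points for both. Below is a minimal sketch, assuming the crudeTDM matrix and temp_matrix corpus from above; the example term "love", the correlation cutoff, and the bigram tokenizer built on the NLP package are my own illustration, not something from the course:

    ```r
    library(tm)
    library(NLP)

    ## Term correlations: terms that co-occur with a given term
    ## (the term "love" and corlimit = 0.5 are arbitrary choices)
    findAssocs(crudeTDM, "love", corlimit = 0.5)

    ## Bigrams: a custom tokenizer using NLP::ngrams, passed to
    ## TermDocumentMatrix via the control list
    BigramTokenizer <- function(x)
      unlist(lapply(ngrams(words(x), 2), paste, collapse = " "),
             use.names = FALSE)

    crudeTDM_bigrams <- TermDocumentMatrix(temp_matrix,
                                           control = list(tokenize = BigramTokenizer))
    findFreqTerms(crudeTDM_bigrams, 50)
    ```

    The same tokenizer generalizes to trigrams by changing the 2 to a 3.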

    I went back to the course discussion forums and found a bunch of helpful resources, which opened more doors, but of course also meant, first of all, more learning and reading to do.

    See this post for resources that I found for the capstone course.