I am finally at the last course of the Coursera Data Science Specialization. I already know that I will need to learn Python to become a real expert in data analysis in my field, but first I need to finish this specialization.
It has been quite a steep learning curve even though I have already finished the first nine courses. The reason is that this capstone course deals with an entirely new area: Natural Language Processing. I have been reading a lot over the past few days, including the past weekend, trying numerous new packages, and failing. I first started with the traditional R text analysis package tm. I learned the basics of removing stop words, removing punctuation, stemming, removing numbers, stripping white space, and so on, all done with the tm_map() function. There is also the findFreqTerms() function to list the most frequent terms:
library(tm)

## 'con' is a connection to the raw text file, e.g. con <- file("input.txt")
temp_matrix <- VCorpus(VectorSource(readLines(con, encoding = "UTF-8")))
## eliminate extra whitespace
temp_matrix <- tm_map(temp_matrix, stripWhitespace)
## convert to lower case
temp_matrix <- tm_map(temp_matrix, content_transformer(tolower))
## remove stop words
temp_matrix <- tm_map(temp_matrix, removeWords, stopwords("english"))
## build a term-document matrix, stemming terms along the way
crudeTDM <- TermDocumentMatrix(temp_matrix, list(stemming = TRUE, stopwords = TRUE))
## list the terms that occur between 1000 and 1050 times
crudeTDMHighFreq <- findFreqTerms(crudeTDM, 1000, 1050)
#crudeTDM_no_sparseHighFreq <- findFreqTerms(crudeTDM_no_sparse, 1, 500)
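The other cleanup steps mentioned above follow the same tm_map() pattern; a minimal sketch (stemDocument() needs the SnowballC package installed):

## remove punctuation and numbers
temp_matrix <- tm_map(temp_matrix, removePunctuation)
temp_matrix <- tm_map(temp_matrix, removeNumbers)
## stem each word down to its root
temp_matrix <- tm_map(temp_matrix, stemDocument)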
Then I realized that I still didn't know how to compute term correlations or create n-grams.
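From what I have pieced together so far, tm has a findAssocs() function for term correlations, and the tm FAQ suggests passing a custom tokenizer to TermDocumentMatrix() to get n-grams. A minimal sketch, where the term "oil" and the 0.8 correlation cutoff are just placeholders:

## terms whose frequencies correlate with "oil" at 0.8 or above
findAssocs(crudeTDM, "oil", 0.8)

## a bigram tokenizer built on NLP::ngrams(), as in the tm FAQ
library(NLP)
BigramTokenizer <- function(x)
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)

## term-document matrix whose terms are bigrams
crudeTDM_bigrams <- TermDocumentMatrix(temp_matrix,
                                       control = list(tokenize = BigramTokenizer))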
I went back to the course discussion forums and found a bunch of helpful resources, which opened more doors, though of course it also meant more reading and learning to do.
See this post for resources that I found for the capstone course.