Coursera Data Science Specialization Capstone course learning journal -1

I am finally at the last course of the Coursera Data Science Specialization. I already know that I will need to learn Python to become a real expert in data analysis in my field, but first I need to finish this specialization.

It has been quite a steep learning curve even though I have already finished the first nine courses, because this capstone course uses an entirely new scenario: Natural Language Processing. I have been reading a lot in the past days, including over the past weekend, trying numerous new packages, and failing. I first started with tm, the traditional R text-analysis package, and learned the basics of removing stop words, removing punctuation, stemming, removing numbers, stripping white space, and so on. These steps are done with the tm_map() function. The findFreqTerms() function then lists the most frequent terms:


library(tm)  # text-mining framework; stemming additionally requires the SnowballC package

## read the text file into a corpus ("con" is a connection opened earlier)
temp_matrix <- VCorpus(VectorSource(readLines(con, encoding = "UTF-8")))

## eliminate extra whitespace
temp_matrix <- tm_map(temp_matrix, stripWhitespace)
## convert to lower case
temp_matrix <- tm_map(temp_matrix, content_transformer(tolower))
## remove stopwords
temp_matrix <- tm_map(temp_matrix, removeWords, stopwords("english"))

## build the term-document matrix, stemming as we go
crudeTDM <- TermDocumentMatrix(temp_matrix,
                               control = list(stemming = TRUE, stopwords = TRUE))

## terms appearing between 1000 and 1050 times, with numeric "terms" filtered out
crudeTDMHighFreq <- findFreqTerms(crudeTDM, 1000, 1050)
sort(crudeTDMHighFreq[-grep("[0-9]", crudeTDMHighFreq)])
#crudeTDM_no_sparseHighFreq <- findFreqTerms(crudeTDM_no_sparse, 1, 500)

Then I realized that I still didn't know how to get term correlations or create n-grams.
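(For what it's worth, tm does offer findAssocs() for term correlations, and n-grams can be built by passing a custom tokenizer to TermDocumentMatrix. A minimal base-R sketch of the bigram idea, with no tm dependency, might look like this; make_ngrams and the sample sentence are mine, purely for illustration:)

```r
# Count bigrams in a token vector using only base R (illustrative sketch).
make_ngrams <- function(words, n = 2) {
  if (length(words) < n) return(character(0))
  # Slide a window of width n over the token vector and paste each window
  sapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "))
}

tokens  <- tolower(unlist(strsplit("the quick brown fox jumps over the lazy dog", "\\s+")))
bigrams <- make_ngrams(tokens, 2)
# Tabulate and sort to see the most frequent bigrams
head(sort(table(bigrams), decreasing = TRUE))
```

The same window-sliding idea extends to trigrams by calling make_ngrams(tokens, 3).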

I went back to the course discussion forums and found a bunch of helpful resources, which opened more doors but also, of course, meant more learning and reading to do.

See this post for resources that I found for the capstone course.


