I am finally at the last course of the Coursera Data Science Specialization. I already know that I will need to learn Python to become a real data analysis expert in my field, but first I need to finish this specialization.

The learning curve has been quite steep, even though I have already finished the first nine courses. The reason is that this capstone course uses an entirely new scenario: Natural Language Processing. I have been reading a lot over the past few days, including the past weekend, trying numerous new packages, and failing. I started with the traditional R text analysis package `tm` and learned the basics of removing stop words, removing punctuation, stemming, removing numbers, stripping white space, and so on. These are done with the function `tm_map()`. Then there is the `findFreqTerms()` function to list frequently occurring terms:

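```r
library(tm)  # stemming = TRUE below also needs the SnowballC package installed

# Point at the blog file; readLines() opens and closes the connection itself
con <- file("/Users/ruihu/Documents/DataScience/capstone/final/en_US/en_US.blogs.txt")
# file_length <- length(readLines(con))
temp_matrix <- VCorpus(VectorSource(readLines(con, encoding = "UTF-8")))
## inspect(temp_matrix[[2]])
## meta(temp_matrix[[122]], "id")

## Eliminate extra whitespace
temp_matrix <- tm_map(temp_matrix, stripWhitespace)
## Convert to lower case
temp_matrix <- tm_map(temp_matrix, content_transformer(tolower))
## Remove stop words
temp_matrix <- tm_map(temp_matrix, removeWords, stopwords("english"))

## Build the term-document matrix, stemming terms along the way
crudeTDM <- TermDocumentMatrix(temp_matrix,
                               control = list(stemming = TRUE, stopwords = TRUE))
inspect(crudeTDM)

## Euclidean distances between terms. as.matrix() makes the matrix dense,
## so this may run out of memory on the full corpus.
crudeTDM_dis <- dist(as.matrix(crudeTDM), method = "euclidean")

# crudeTDM_no_sparse <- removeSparseTerms(crudeTDM, 0.9)
# inspect(crudeTDM_no_sparse)
# summary(crudeTDM_no_sparse)

## Terms that occur between 1000 and 1050 times
crudeTDMHighFreq <- findFreqTerms(crudeTDM, 1000, 1050)
## Drop terms containing digits; grepl() still works when nothing matches,
## whereas x[-grep(...)] would return an empty vector in that case
sort(crudeTDMHighFreq[!grepl("[0-9]", crudeTDMHighFreq)])

# crudeTDM_no_sparseHighFreq <- findFreqTerms(crudeTDM_no_sparse, 1, 500)

close(con)
```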

Then I realized that I still didn't know how to compute correlations or create n-grams.
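Both tasks can be sketched within `tm` itself. What follows is only a minimal sketch, not necessarily the approach the course expects. It assumes a made-up three-sentence corpus, builds bigrams with a custom tokenizer in the style of the `tm` FAQ (using `ngrams()` and `words()` from the `NLP` package, which `tm` loads), and uses `findAssocs()` for term correlations. The names `toy_docs`, `toy_corpus`, `bigram_tokenizer`, `bigram_counts`, and `assocs` are hypothetical.

```r
library(tm)

# Toy corpus standing in for the blog text (made-up data for illustration)
toy_docs <- c("the quick brown fox jumps over the lazy dog",
              "the quick brown fox sleeps",
              "a lazy dog sleeps all day")
toy_corpus <- VCorpus(VectorSource(toy_docs))

# Bigram tokenizer: split a document into words, then join each
# pair of adjacent words with a space
bigram_tokenizer <- function(x) {
  unlist(lapply(NLP::ngrams(words(x), 2L), paste, collapse = " "),
         use.names = FALSE)
}

# Term-document matrix whose "terms" are bigrams
bigram_tdm <- TermDocumentMatrix(toy_corpus,
                                 control = list(tokenize = bigram_tokenizer))

# Total count of each bigram across all documents, most frequent first
bigram_counts <- sort(rowSums(as.matrix(bigram_tdm)), decreasing = TRUE)
print(head(bigram_counts))

# Correlations: terms whose per-document counts correlate with "lazy"
# at 0.5 or above
unigram_tdm <- TermDocumentMatrix(toy_corpus)
assocs <- findAssocs(unigram_tdm, "lazy", 0.5)
print(assocs)
```

The custom tokenizer relies on a `VCorpus`, as used above. On the real corpus, `as.matrix()` would likely be too large, so summing counts on the sparse matrix (for example with `slam::row_sums()`) may be a better fit.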

I went back to the course discussion forums and found a bunch of helpful resources, which opened more doors but, of course, also meant more learning and reading to do first.

See this post for resources that I found for the capstone course.