Coursera Data Science Specialization Capstone course learning journal -1

I am finally at the last course of the Coursera Data Science Specialization. I already know that I will need to learn Python to become a real expert in data analysis in my field, at least to get started. But first I need to finish this specialization.

It has been quite a steep learning curve, even though I have already finished the first nine courses. The reason is that this capstone course uses an entirely new scenario: Natural Language Processing. I have been reading a lot over the past few days, including the past weekend, trying numerous new packages, and failing. I started with the traditional R text analysis package, tm, and learned the basics: removing stop words, removing punctuation, stemming, removing numbers, stripping whitespace, etc. These are done with the function tm_map(). Then there is the findFreqTerms() function to list the most frequent terms:

library(tm)  ## provides VCorpus, tm_map, TermDocumentMatrix, findFreqTerms

con<-file("/Users/ruihu/Documents/DataScience/capstone/final/en_US/en_US.blogs.txt")
#file_length<-length(readLines(con))

temp_matrix <- VCorpus(VectorSource(readLines(con, encoding = "UTF-8")))

##inspect(temp_matrix[[2]])

##meta(temp_matrix[[122]],"id")

## eliminating extra whitespace
temp_matrix <- tm_map(temp_matrix, stripWhitespace)
## convert to lower case
temp_matrix <- tm_map(temp_matrix, content_transformer(tolower))

## Remove Stopwords
temp_matrix <- tm_map(temp_matrix, removeWords, stopwords("english"))

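## build the term-document matrix (stemming = TRUE needs the SnowballC package)
crudeTDM <- TermDocumentMatrix(temp_matrix, control = list(stemming = TRUE, stopwords = TRUE))
inspect(crudeTDM)
## distances between terms; as.matrix() makes the sparse matrix dense,
## so this only fits in memory for a small sample of the file
crudeTDM_dis<-dist(as.matrix(crudeTDM),method="euclidean")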
#crudeTDM_no_sparse<-removeSparseTerms(crudeTDM,0.9)
#inspect(crudeTDM_no_sparse)
#summary(crudeTDM_no_sparse)

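## terms that occur between 1000 and 1050 times
crudeTDMHighFreq <- findFreqTerms(crudeTDM, 1000, 1050)
## drop terms containing digits; grepl() is safe even when nothing matches
## (x[-grep(...)] returns an empty vector in that case)
sort(crudeTDMHighFreq[!grepl("[0-9]", crudeTDMHighFreq)])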
#crudeTDM_no_sparseHighFreq <- findFreqTerms(crudeTDM_no_sparse, 1,500)
close(con)
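
The script above only strips whitespace, lowercases and removes stop words. The other cleaning steps I mentioned (punctuation, numbers, stemming) would go through tm_map() in the same way. Here is a minimal sketch on a made-up one-line corpus; the names toy and clean are just for illustration, and stemDocument assumes the SnowballC package is installed.

```r
library(tm)

# Toy one-document corpus (made up for illustration)
toy <- VCorpus(VectorSource("Hello, world! 123 times"))

clean <- tm_map(toy, content_transformer(tolower))  # lower case
clean <- tm_map(clean, removePunctuation)           # drop punctuation
clean <- tm_map(clean, removeNumbers)               # drop digits
clean <- tm_map(clean, stripWhitespace)             # collapse repeated spaces
clean <- tm_map(clean, stemDocument)                # stem words (assumes SnowballC)

# Look at the cleaned text of the first document
content(clean[[1]])
```

Stripping whitespace after removing numbers avoids the double spaces that the removed digits leave behind.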


Then I realized that I still didn't know how to get correlations and create n-grams.
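
For reference, here is a minimal sketch of one possible way to do both with tm, using findAssocs() for correlations and a custom tokenizer built on ngrams() from the NLP package (which tm loads). The three-sentence corpus and the names docs, BigramTokenizer and bigramTDM are made up for illustration, and I have only tried this idea on tiny inputs, not on the full blogs file.

```r
library(tm)   # attaches NLP as well, which provides ngrams() and words()

# Toy corpus (made up for illustration)
docs <- VCorpus(VectorSource(c(
  "the quick brown fox jumps",
  "the quick brown dog sleeps",
  "a lazy dog sleeps all day"
)))

# Correlations: terms whose per-document counts correlate with "quick"
# at 0.5 or above
tdm <- TermDocumentMatrix(docs)
findAssocs(tdm, "quick", 0.5)

# n-grams: a tokenizer that turns each document into pairs of adjacent words
BigramTokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
}
bigramTDM <- TermDocumentMatrix(docs, control = list(tokenize = BigramTokenizer))

# Bigrams that occur at least twice in the toy corpus
findFreqTerms(bigramTDM, lowfreq = 2)
```

Changing the 2 in ngrams(words(x), 2) to 3 would give trigrams instead. Note that the custom tokenizer is used here with VCorpus, the same corpus type as in my script above.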

I went back to the course discussion forums and found a bunch of helpful resources, which opened more doors but, of course, also meant more learning and reading to do first.

See this post for resources that I found for the capstone course.

 

 
