Category: Learning is always fun

  • Coursera Data Science Specialization Capstone course learning journal – 1

    I am finally at the last course of the Coursera Data Science Specialization. I already know that I will need to learn Python to become a real expert in data analysis in my field, but for now I need to finish this specialization first.

    It has been quite a steep learning curve, even though I have already finished the first nine courses. The reason is that this capstone course uses an entirely new scenario: Natural Language Processing. I have been reading a lot over the past days, including the past weekend, trying numerous new packages, and failing. I started with the traditional R text analysis package, tm. I learned the basics of removing stop words, removing punctuation, stemming, removing numbers, stripping white space, and so on. These are done with the tm_map() function. There is also the findFreqTerms() function to list the most frequent terms:

    library(tm)

    con <- file("/Users/ruihu/Documents/DataScience/capstone/final/en_US/en_US.blogs.txt")
    temp_matrix <- VCorpus(VectorSource(readLines(con, encoding = "UTF-8")))
    close(con)

    ## inspect(temp_matrix[[2]])
    ## meta(temp_matrix[[122]], "id")

    ## eliminate extra whitespace
    temp_matrix <- tm_map(temp_matrix, stripWhitespace)
    ## convert to lower case
    temp_matrix <- tm_map(temp_matrix, content_transformer(tolower))
    ## remove stop words
    temp_matrix <- tm_map(temp_matrix, removeWords, stopwords("english"))

    ## build a term-document matrix, stemming as we go
    crudeTDM <- TermDocumentMatrix(temp_matrix, list(stemming = TRUE, stopwords = TRUE))
    inspect(crudeTDM)
    crudeTDM_dis <- dist(as.matrix(crudeTDM), method = "euclidean")
    ## crudeTDM_no_sparse <- removeSparseTerms(crudeTDM, 0.9)
    ## inspect(crudeTDM_no_sparse)

    ## list terms that appear between 1000 and 1050 times, dropping any containing digits
    crudeTDMHighFreq <- findFreqTerms(crudeTDM, 1000, 1050)
    sort(crudeTDMHighFreq[!grepl("[0-9]", crudeTDMHighFreq)])


    Then I realized that I still don’t know how to compute term correlations or build n-grams.
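
    For future reference, a bigram count can be sketched in base R without any extra packages. This is only a minimal illustration, not what the course materials prescribe; tokenizer packages such as RWeka or quanteda do this properly, and tm’s findAssocs() covers term correlations.

    ```r
    ## Minimal bigram counter: split lines into lowercase word tokens,
    ## pair each word with its successor, and tabulate the pairs.
    count_bigrams <- function(lines) {
      words <- unlist(strsplit(tolower(lines), "[^a-z']+"))
      words <- words[words != ""]
      bigrams <- paste(head(words, -1), tail(words, -1))
      sort(table(bigrams), decreasing = TRUE)
    }

    count_bigrams("to be or not to be")
    ## "to be" is counted twice; every other bigram once
    ```

    The same paste(head, tail) trick extends to trigrams by pasting three shifted copies of the word vector.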

    I went back to the course discussion forums and found a bunch of helpful resources, which opened more doors, but of course also meant more learning and reading to do.

    See this post for resources that I found for the capstone course.


  • Coursera Data Science Specialization Capstone Project – thoughts

    Finally, I am at the capstone project. After three years of working on this Coursera specialization on and off, I am finally here.

    The project gives you a set of text documents and asks you to mine the texts and come up with your own model. So far, I am on week 2. I haven’t dived deep enough into the project yet, so I don’t know exactly how I am going to mine the texts or what kind of model I will use. But while preparing last week for our 3-minute “what is your passion” presentation at our Monday team retreat at Leadercast, I came across Maslow’s hierarchy of needs. I think it would be neat to look at the words in each level of the hierarchy and see how frequently people use them in their daily blog posts, tweets, and news.

    Maslow's Hierarchy

    To do this, I need to:

    1. Obtain a dictionary with all words categorized into Maslow’s hierarchy.
    2. Run all words in the files against the dictionary to determine which level each belongs to.
      1. Calculate the frequency of each unique word.
      2. Calculate the frequency of each level.
    3. Look at the frequency of each level in general, then look at the correlations between levels.
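
    The steps above could be sketched roughly like this, assuming a hypothetical lookup table maslow_dict that maps words to hierarchy levels. No such dictionary ships with R, so it would have to be compiled first; the tiny one below is made up purely for illustration.

    ```r
    ## Hypothetical word-to-level dictionary (illustration only)
    maslow_dict <- c(food = "physiological", safety = "safety",
                     friend = "belonging", respect = "esteem",
                     create = "self-actualization")

    ## Tokenize the text, keep words found in the dictionary,
    ## and tabulate how often each Maslow level occurs.
    level_frequencies <- function(lines, dict) {
      words <- unlist(strsplit(tolower(lines), "[^a-z']+"))
      lvls <- dict[words[words %in% names(dict)]]
      sort(table(lvls), decreasing = TRUE)
    }

    level_frequencies("my friend shared food with a friend", maslow_dict)
    ## belonging: 2, physiological: 1
    ```

    Correlations between levels (step 3) could then be computed with cor() on per-document level counts.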
  • Make Your OWN Word Cloud Image

    When making presentations or developing websites, I find it very time consuming to locate images licensed under Creative Commons or in the public domain. After spending hours searching Google with “Labeled for reuse”, Pixabay, and Openclipart, I thought I could contribute a little by showing how to create your own word cloud images.

    The most frequently used website is wordle.net. I used it to create the School Data Analysis image below for one of my presentations. It is now shared in the public gallery so everyone can use it:

    School Data Analysis

    Wordle is easy to use, and it was the very first app of its kind. However, when I tried to create a “Thank You” word cloud in different languages, problems arose:

    Problem 1: Wordle doesn’t work well across different languages. I used this page as the resource and typed in 25 versions of “thank you” in different languages.

    Unfortunately, Wordle wasn’t able to recognize all of them. Many of them showed up as blank square blocks. I tried setting the font to “Chrysanthi Unicode” as instructed in this article, but it didn’t work. I tried all the other fonts; none of them worked for all the languages.

    Problem 2: I wanted more than just randomly piled words/phrases in a meaningless shape. I wanted something more meaningful, something like this, but with the words “Thank You” instead:

    Picture retrieved at http://funzim.com/10-cool-facts-love/

    Wordle doesn’t do this, at least for now.

    So I googled and found this site:

    http://www.tagxedo.com/app.html

    I would say I am very satisfied with the outcome:

    1. It was able to recognize all of the languages.

    2. It gives plenty of cool shapes to frame your words in.

    So the final products I had are these:

    Creative Commons Attribution-Noncommercial-ShareAlike License @ Tagxedo


    There are more variations in Tagxedo. Try it yourself, and you can create many interesting word clouds under a CC license for your own non-commercial presentation use.

  • Big Data and Education

    Recently I have been hearing people talk about “big data”. Supposedly it is a popular concept in the IT field nowadays. So I searched on Coursera and came across the course “Big Data and Education”, offered by Ryan Baker at Columbia University in October 2013. Unfortunately it is no longer offered, but here are the archived course materials: http://www.columbia.edu/~rsb2162/bigdataeducation.html

    I started watching the course videos and have been finding useful information. For example, the largest public data repository for educational software activities is the PSLC DataShop: https://pslcdatashop.web.cmu.edu/

    I think it would be interesting to run some data analysis on the data there and see what can be “mined”.