• The purpose of this post is to show how easy it is to develop a word cloud in R, along with some of the tools available for working with raw text. This is a basic exploratory data analysis of raw text.
  • This is an analysis of a formatted version of the online retail data available at https://archive.ics.uci.edu/ml/datasets/Online+Retail. I’m interested in whether certain words in the product descriptions are more prevalent than others. This could be used to classify items by color or other keywords.
  • The following packages will be used: knitr, tm, and wordcloud. knitr is the brilliant markdown package in R; tm is a text mining package; wordcloud produces the word cloud graphics from a processed dataset. A short setup sketch follows this list.
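
Before running any of the code below, the packages need to be loaded and the data read in. Here is a minimal setup sketch; the file name online_retail.csv and the exact cleanup producing retail_norm are my assumptions (the UCI data ships as an Excel file), not part of the original analysis:

library(knitr)     # report formatting (kable)
library(tm)        # text mining: corpus, transformations, term matrices
library(wordcloud) # word cloud plotting; attaches RColorBrewer for brewer.pal

# Hypothetical file name: assumes the UCI Excel file was exported to CSV
retail <- read.csv("online_retail.csv", stringsAsFactors = FALSE)
# Keep rows with a non-missing, non-empty product description
retail_norm <- retail[!is.na(retail$Description) & retail$Description != "", ]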

The item descriptions are in the following format:

StockCode Description
85123A WHITE HANGING HEART T-LIGHT HOLDER
71053 WHITE METAL LANTERN
84406B CREAM CUPID HEARTS COAT HANGER
84029G KNITTED UNION FLAG HOT WATER BOTTLE
84029E RED WOOLLY HOTTIE WHITE HEART.
22752 SET 7 BABUSHKA NESTING BOXES

 

In the natural language processing literature, the main data structure for text is called a corpus. A corpus represents a collection of text documents. The tm package creates a corpus using the VCorpus function.

item_desc <- VCorpus(VectorSource(retail_norm$Description))

inspect(item_desc[[200]])
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 28
## 
## BABUSHKA LIGHTS STRING OF 10

You can easily edit the corpus with the tm_map function. The processing of a corpus usually involves the following steps:

  1. Convert text to lower case: you have to wrap tolower in the content_transformer function, which applies the transformation to the document content while preserving the metadata format.
  2. Remove stop words: stop words are common words that may not be of particular interest to your analysis.
  3. Remove white space: this removes extra white space that may confuse text look-ups and analysis.
  4. Stem words: stemming is the process of removing plurality, tense, etc. to get at the root of a word. There are many stemming algorithms; the tm package uses the Porter Stemming Algorithm.

I’ve included the code to do this processing below. I’ve also printed the first 50 English stop words to make the purpose of removing them clearer. For edits beyond the built-in tm_map transformations, you can use regular expressions via base R functions such as gsub (see the sketch after the code below).

item_desc <- tm_map(item_desc, content_transformer(tolower))      # convert to lower case
item_desc <- tm_map(item_desc, removeWords, stopwords("english")) # remove stop words
item_desc <- tm_map(item_desc, stripWhitespace)                   # remove extra white space
item_desc <- tm_map(item_desc, stemDocument)                      # stemming
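
As mentioned above, edits beyond the built-in transformations can be expressed with regular expressions. A short sketch, where both the punctuation pattern and the extra stop words are illustrative choices rather than part of the original post:

# Apply gsub to each document's content; content_transformer preserves metadata
item_desc <- tm_map(item_desc, content_transformer(gsub),
                    pattern = "[[:punct:]]", replacement = " ")
# Drop additional domain-specific stop words (illustrative list)
item_desc <- tm_map(item_desc, removeWords, c("small", "large"))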


stopwords('en')[1:50]
##  [1] "i"          "me"         "my"         "myself"     "we"        
##  [6] "our"        "ours"       "ourselves"  "you"        "your"      
## [11] "yours"      "yourself"   "yourselves" "he"         "him"       
## [16] "his"        "himself"    "she"        "her"        "hers"      
## [21] "herself"    "it"         "its"        "itself"     "they"      
## [26] "them"       "their"      "theirs"     "themselves" "what"      
## [31] "which"      "who"        "whom"       "this"       "that"      
## [36] "these"      "those"      "am"         "is"         "are"       
## [41] "was"        "were"       "be"         "been"       "being"     
## [46] "have"       "has"        "had"        "having"     "do"

 

To navigate the corpus, you simply use double brackets to access a specific document.

#Sample item in corpus
item_desc[[200]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 24
str(item_desc[[200]])
## List of 2
##  $ content: chr "babushka light string 10"
##  $ meta   :List of 7
##   ..$ author       : chr(0) 
##   ..$ datetimestamp: POSIXlt[1:1], format: "2017-05-14 22:37:18"
##   ..$ description  : chr(0) 
##   ..$ heading      : chr(0) 
##   ..$ id           : chr "200"
##   ..$ language     : chr "en"
##   ..$ origin       : chr(0) 
##   ..- attr(*, "class")= chr "TextDocumentMeta"
##  - attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
#Sample item in corpus
item_desc[[200]][1]
## $content
## [1] "babushka light string 10"
str(item_desc[[200]][1])
## List of 1
##  $ content: chr "babushka light string 10"

You can create the document term matrix via the DocumentTermMatrix function. The document term matrix is simply a count of the number of times each word appears in each document. You can then pull the terms that appear at least 50 times with the findFreqTerms function and compute overall word counts with the colSums function.

dtm <- DocumentTermMatrix(item_desc)
inspect(dtm[200:205,800:805])
## <<DocumentTermMatrix (documents: 6, terms: 6)>>
## Non-/sparse entries: 0/36
## Sparsity           : 100%
## Maximal term length: 11
## Weighting          : term frequency (tf)
## Sample             :
##      Terms
## Docs  lightbulb lights,funk lilac lili lime line
##   200         0           0     0    0    0    0
##   201         0           0     0    0    0    0
##   202         0           0     0    0    0    0
##   203         0           0     0    0    0    0
##   204         0           0     0    0    0    0
##   205         0           0     0    0    0    0
freq_terms <- findFreqTerms(dtm,50)
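# findFreqTerms returns a character vector of terms occurring at least 50
# times across the corpus; a quick peek at the first few (output not shown):
head(freq_terms, 10)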

freq <- colSums(as.matrix(dtm))   
ord <- order(freq) 

freq_out <- data.frame(Count = freq[tail(ord, n = 10)]) # ten most frequent terms
kable(freq_out)
Term        Count
---------   -----
design        104
christma      117
box           130
bag           136
blue          137
vintag        159
red           166
pink          193
heart         209
set           235

 

You can then call the wordcloud function using the names of the words and their counts. I’ve set the seed in the following code because the word cloud function relies on a random number generator to place some words, and setting the seed lets me recreate the exact image every time. The scale parameter in wordcloud adjusts the relative sizes of the words. The rot.per parameter sets the proportion of words rotated 90 degrees, which allows more words to fit in the cloud.

par(mar=c(0,0,0,0))
set.seed(142)   
wordcloud(names(freq), freq, min.freq=25, 
          scale=c(4, 0.25), random.order = FALSE, rot.per=0.15, colors=brewer.pal(7,"Blues"))
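
If you want to write the cloud to disk rather than the active graphics device, you can wrap the call in a graphics device. A sketch; the file name and dimensions here are my own illustrative choices:

# Hypothetical output file; width and height are in pixels
png("retail_wordcloud.png", width = 800, height = 800)
par(mar = c(0, 0, 0, 0))
set.seed(142)
wordcloud(names(freq), freq, min.freq = 25,
          scale = c(4, 0.25), random.order = FALSE, rot.per = 0.15,
          colors = brewer.pal(7, "Blues"))
dev.off() # close the device to flush the file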