A Data Scientist's blog: Word cloud in R

An update to the wordcloud package (2.2) has been released to CRAN. It includes a number of improvements to the basic wordcloud. Notably, you may now pass it text and Corpus objects directly, as in:

#install.packages(c("wordcloud","tm"),repos="http://cran.r-project.org")
library(wordcloud)
library(tm)

wordcloud("May our children and our children's children to a
thousand generations, continue to enjoy the benefits conferred
upon us by a united country, and have cause yet to rejoice under
those glorious institutions bequeathed us by Washington and his
compeers.",colors=brewer.pal(6,"Dark2"),random.order=FALSE)

data(SOTU)
SOTU <- tm_map(SOTU,function(x)removeWords(tolower(x),stopwords()))
wordcloud(SOTU,colors=brewer.pal(6,"Dark2"),random.order=FALSE)



The biggest improvement in this version, though, is a way to make your 
text plots more readable. A very common type of plot is a scatterplot 
where, instead of plotting points, case labels are plotted. This is 
accomplished with the text function in base R. Here is a simple 
artificial example:


states <- c("Alabama","Alaska","Arizona","Arkansas",
	"California","Colorado","Connecticut","Delaware",
	"Florida","Georgia","Hawaii","Idaho","Illinois",
	"Indiana","Iowa","Kansas","Kentucky","Louisiana",
	"Maine","Maryland","Massachusetts","Michigan",
	"Minnesota","Mississippi","Missouri","Montana",
	"Nebraska","Nevada","New Hampshire","New Jersey",
	"New Mexico","New York","North Carolina","North Dakota",
	"Ohio","Oklahoma","Oregon","Pennsylvania",
	"Rhode Island","South Carolina","South Dakota",
	"Tennessee","Texas","Utah","Vermont","Virginia",
	"Washington","West Virginia","Wisconsin","Wyoming")

library(mvtnorm)  # provides rmvnorm
# 50 correlated bivariate normal points, one per state;
# the mean and covariance values here are illustrative
loc <- rmvnorm(50,c(0,0),matrix(c(1,.7,.7,1),ncol=2))

plot(loc[,1],loc[,2],type="n")
text(loc[,1],loc[,2],states)

Notice how many of the state names are unreadable due to 
overplotting, giving the scatter plot a cloudy appearance. The textplot 
function in wordcloud lets us plot the text without any of the words 
overlapping.


textplot(loc[,1],loc[,2],states)

A big improvement! The only thing still hurting the plot is the fact 
that some of the states are only partially visible. This can be fixed 
by setting x and y limits, which will cause the layout algorithm to 
stay in bounds.


mx <- apply(loc,2,max)
mn <- apply(loc,2,min)
textplot(loc[,1],loc[,2],states,xlim=c(mn[1],mx[1]),ylim=c(mn[2],mx[2]))

Another great thing about this release is that the layout algorithm has 
been exposed so you can create your own beautiful custom plots. Just pass 
your desired coordinates (and word sizes) to wordlayout, and it will 
return bounding boxes close to the originals, but with no overlapping.


plot(loc[,1],loc[,2],type="n")
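nc <- wordlayout(loc[,1],loc[,2],states,cex=50:1/20)
# columns 1-2 are the box corner, columns 3-4 its width and height,
# so adding half of each centers the label in its box
text(nc[,1] + .5*nc[,3],nc[,2] + .5*nc[,4],states,cex=50:1/20)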



Okay, so this one wasn't very creative, but it begs for some further 
thought. Now we have word clouds where not only the size can mean 
something, but also the x/y position (roughly) and color. Done right, 
this could add a whole new layer of statistical richness to the 
visually pleasing but statistically shallow standard wordcloud.







Filed under: R, wordcloud



25 Jan 2012

Beat me to it
So as a follow-up to the last post about the 2010-11 State of the Union addresses, I was going to compare the 2011 speech to the new 2012 speech. It looks like Simply Statistics beat me to it, which is very gratifying. No need to do the work myself.




Filed under: wordcloud



10 Jan 2012

Words in Politics: Some extensions of the word cloud
The word cloud is a commonly used plot to visualize a speech or set 
of documents in a succinct way. I really like them. They can 
be extremely visually pleasing, and you can spend a lot of time perusing 
the words and gaining new insights.


That said, they don't convey a great deal of information. From a 
statistical perspective, a word cloud is equivalent to a bar chart of 
univariate frequencies, but makes it more difficult for the viewer to 
estimate the relative frequency of two words. For example, here is a bar 
chart and word cloud of the State of the Union addresses for 2010 and 
2011 combined.




Bar chart of the state of the union addresses for 2010-11



word cloud of the state of the union addresses for 2010-11

Notice that the bar chart contains more information, with the exact 
frequencies being obtainable by looking at the y axis. Also, in the word 
cloud the size of a word reflects both its frequency and its number of 
characters (with longer words being bigger in the plot). This could lead 
to confusion for the viewer. We can therefore see that, from a 
statistical perspective, the bar chart is superior.


... Except it isn't ....


The word cloud looks better. There is a reason why every infographic 
on the web uses word clouds. It's because they strike a balance of 
presenting the quantitative information, while keeping the reader 
interested with good design. Below I am going to present some extensions
 of the basic word cloud that help visualize the differences and 
commonalities between documents.



The Comparison Cloud
The previous plots both pooled the two speeches together. Using 
standard word clouds, that is as far as we can go. What if we want to 
compare the speeches? Did they talk about different things? If so, are 
certain words associated with those subjects?


This is where the comparison cloud comes in.




Comparison plot

Word size is mapped to the difference between the rates at which a word 
occurs in each document. So we see that Obama was much more concerned 
with economic issues in 2010, and in 2011 focused more on education and 
the future. This can be generalized fairly naturally. The next figure 
shows a comparison cloud for the Republican primary debate in New 
Hampshire.



One thing that you can notice in this plot is that Paul, Perry and Huntsman 
have larger words than the top tier candidates, meaning that they 
deviate from the mean frequencies more. On the one hand this may be due 
to a single-minded focus on a few differentiating issues (..cough.. Ron 
Paul), but it may also reflect that the top tier candidates were asked 
more questions and thus focused on a more diverse set of issues.
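
The size mapping described in this section can be sketched in a few lines of base R. This is only an illustration of the idea (each word's rate in a document compared against its mean rate across documents), not the package's internal code; the toy term-document matrix and its counts are made up for the example.

```r
# Toy term-document matrix: rows are terms, columns are documents.
# The counts are invented purely for illustration.
term.matrix <- matrix(
	c(10, 2,
	   3, 9,
	   5, 5,
	   0, 4),
	ncol = 2, byrow = TRUE,
	dimnames = list(c("economy", "education", "people", "future"),
	                c("doc A", "doc B"))
)

# Rate at which each term occurs within each document (columns sum to 1)
rates <- sweep(term.matrix, 2, colSums(term.matrix), "/")

# Deviation of each document's rate from the mean rate across documents
deviation <- rates - rowMeans(rates)

# Size: the largest deviation; group: the document where it occurs
word.size <- apply(deviation, 1, max)
word.doc <- colnames(term.matrix)[apply(deviation, 1, which.max)]

print(data.frame(size = round(word.size, 3), document = word.doc))
```

With only two documents, the deviation from the mean is simply half the difference between the two rates, which matches the description above. A word used at the same rate everywhere gets a size near zero.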

The Commonality Cloud
Where the comparison cloud highlights differences, the commonality 
cloud highlights words common to all documents/speakers. Here is one for 
the two State of the Union addresses.







Commonality cloud for the 2010-11 SOTU

Here, word size is mapped to its minimum frequency across documents. 
So if a word is missing from any document it has size=0 (i.e. it is not 
shown). We can also do this on the primary debate data...




Republican primary commonality cloud

From this we can infer that what politicians like more than anything else is people.
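
The commonality mapping (size equals the minimum frequency across documents) is easy to check by hand. Here is a minimal base R sketch of that rule; the toy matrix and its counts are invented for the example, and this is not meant to be the package's actual implementation.

```r
# Toy term-document matrix: rows are terms, columns are documents.
# The counts are invented purely for illustration.
term.matrix <- matrix(
	c(10, 2,
	   3, 9,
	   5, 5,
	   0, 4),
	ncol = 2, byrow = TRUE,
	dimnames = list(c("economy", "education", "people", "future"),
	                c("doc A", "doc B"))
)

# Size of each word: its minimum frequency across all documents
common.size <- apply(term.matrix, 1, min)

# Words missing from any document have size 0 and are not shown
shown <- common.size[common.size > 0]
print(sort(shown, decreasing = TRUE))
```

Here "future" drops out because it never appears in the first document, while "people" ends up largest because it is used frequently in both.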






The wordcloud package
Version 2.0 of wordcloud (just released to CRAN) implements these two types of graphs, and the code below reproduces them.


library(wordcloud)
library(tm)
data(SOTU)
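corp <- SOTU
corp <- tm_map(corp, removePunctuation)
# with tm >= 0.6, use content_transformer(tolower) instead of tolower
corp <- tm_map(corp, tolower)
corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, function(x)removeWords(x,stopwords()))

term.matrix <- TermDocumentMatrix(corp)
term.matrix <- as.matrix(term.matrix)
colnames(term.matrix) <- c("SOTU 2010","SOTU 2011")
comparison.cloud(term.matrix,max.words=300,random.order=FALSE)
commonality.cloud(term.matrix,random.order=FALSE)

library(tm)
library(wordcloud)
library(stringr)
library(RColorBrewer)
repub <- paste(readLines("repub_debate.txt"),collapse="\n")
r2 <- strsplit(repub,"GREGORY\\:")[[1]]
# speaker turns are assumed to be marked as an upper-case name plus a colon
splitat <- str_locate_all(repub,
	"(PAUL|HUNTSMAN|GINGRICH|PERRY|SANTORUM|ROMNEY|GREGORY)\\:")[[1]]
speaker <- str_sub(repub,splitat[,1],splitat[,2])
content <- str_sub(repub,splitat[,2]+1,c(splitat[-1,1]-1,nchar(repub)))
names(content) <- speaker
tmp <- list()
for(sp in c("GINGRICH:","ROMNEY:","SANTORUM:","PAUL:","PERRY:","HUNTSMAN:")){
	tmp[[sp]] <- paste(content[sp==speaker],collapse="\n")
}
collected <- unlist(tmp)

rcorp <- Corpus(VectorSource(collected))
rcorp <- tm_map(rcorp, removePunctuation)
rcorp <- tm_map(rcorp, removeNumbers)
rcorp <- tm_map(rcorp, stripWhitespace)
rcorp <- tm_map(rcorp, tolower)
rcorp <- tm_map(rcorp, function(x)removeWords(x,stopwords()))
rterms <- TermDocumentMatrix(rcorp)
rterms <- as.matrix(rterms)
comparison.cloud(rterms,max.words=Inf,random.order=FALSE)
commonality.cloud(rterms)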



Link to republican debate transcript



















Filed under: R, wordcloud