Data Digressions

Data.. Big Data...

Baseball Diamond

Earlier this week, I was trying to think of a nice way to plot a baseball team’s statistics in a single graphic. A heatmap was an obvious choice, and I thought about shaping it into a diamond. Below is my attempt at a Shiny app where you can select a team and a statistic to view across the whole team. Each position is taken as a weighted mean of the players at that position with at least 120 plate appearances in 2017. The value in the heatmap is the percentile of that position’s numbers compared to the rest of the league. Data is from Fangraphs.

Takeaways:

  1. I wanted a heatmap where the fill values were relative to values outside the data used to build the plot (i.e. a player’s values are relative to the league, not just their team). The limits argument of scale_fill_gradientn made this very simple; see the sketch after this list.
  2. Polygons are tricky to draw in R and ggplot2, but after some tweaking I think I was able to get it into a nice format.
  3. Fangraphs is a wonderful resource, but the lack of a nice API is frustrating.
  4. Shiny is so cool and easy to use.
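
On the first point, here is a minimal sketch of pinning the fill scale to league-wide limits. The positions, coordinates, and percentiles are made up, and simple tiles stand in for the hand-drawn diamond polygons mentioned in point 2:

library(ggplot2)

# Hypothetical data: league-wide percentile (0-100) for one team at each position
team_pct <- data.frame(
  position   = c("C", "1B", "2B", "SS", "3B", "LF", "CF", "RF"),
  x          = c(0, 1.6, 0.8, -0.8, -1.6, -2.4, 0, 2.4),
  y          = c(0, 1.6, 2.6, 2.6, 1.6, 3.6, 4.2, 3.6),
  percentile = c(35, 80, 55, 90, 40, 60, 75, 20)
)

# limits = c(0, 100) pins the fill scale to the full league-wide range,
# so colors are comparable across teams instead of being rescaled to one team
ggplot(team_pct, aes(x, y, fill = percentile)) +
  geom_tile(width = 1, height = 1) +
  geom_text(aes(label = position)) +
  scale_fill_gradientn(colours = c("blue", "white", "red"), limits = c(0, 100)) +
  theme_void()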





References:

code to be posted soon

Intro to graphs

I had a fun time working through the networkx package in Python and playing with some company data from Crunchbase. The Jupyter notebook below shows how easy it is to use the pre-built functions in networkx to get information about your graph.

Parallel Computing Example in Python

Easy word cloud of keywords in item description using R

 

  • The purpose of this post is to show how easy it is to build a word cloud in R and to demonstrate some of the tools available for working with raw text. This is a basic exploratory analysis of raw text.
  • This is an analysis of a formatted version of the online retail data available at https://archive.ics.uci.edu/ml/datasets/Online+Retail. I’m interested in whether certain words are more prevalent in the product descriptions. This could be used to classify items by color or other keywords.
  • The following packages will be used: knitr, tm, and wordcloud. knitr is the brilliant report-generation package behind R Markdown, tm is a text mining package, and wordcloud produces the word cloud graphics from a processed dataset.
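
A minimal setup sketch is below. The file name online_retail.csv and the de-duplication step are assumptions on my part, since the formatting of the raw UCI spreadsheet isn’t shown here; the goal is just a data frame named retail_norm with one description per stock code.

library(knitr)        # report generation; kable() tables
library(tm)           # text mining: corpus creation and transformations
library(wordcloud)    # word cloud graphics
library(RColorBrewer) # brewer.pal() color palettes used later

# Assumed local copy of the UCI Online Retail data, exported to CSV
retail_raw  <- read.csv("online_retail.csv", stringsAsFactors = FALSE)

# Keep one row per stock code so repeated transactions don't inflate word counts
retail_norm <- retail_raw[!duplicated(retail_raw$StockCode),
                          c("StockCode", "Description")]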

The item descriptions are in the following format:

StockCode Description
85123A WHITE HANGING HEART T-LIGHT HOLDER
71053 WHITE METAL LANTERN
84406B CREAM CUPID HEARTS COAT HANGER
84029G KNITTED UNION FLAG HOT WATER BOTTLE
84029E RED WOOLLY HOTTIE WHITE HEART.
22752 SET 7 BABUSHKA NESTING BOXES

 

In the natural language processing literature, the main structure for text is called a corpus. A corpus represents a collection of text documents. The tm package creates a corpus using the VCorpus function.

item_desc <- VCorpus(VectorSource(retail_norm$Description))

inspect(item_desc[[200]])
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 28
## 
## BABUSHKA LIGHTS STRING OF 10

You can easily edit the corpus with the tm_map function. Processing a corpus usually involves the following steps:

  1. Convert text to lower case: base functions such as tolower have to be wrapped in content_transformer so that the document metadata is preserved.
  2. Remove stop words: stop words are common words that may not be of particular interest to your analysis.
  3. Remove white space: this removes extra white space that may confuse text look-ups and analysis.
  4. Stem words: stemming is the process of removing plurality, tense, etc. to get at the root of a word. There are many algorithms for stemming; the tm package uses the Porter stemming algorithm.

I’ve included the code for this processing below, along with the first 50 English stop words for more clarity on why those words are removed. For edits beyond the built-in tm_map transformations, you can use regular expressions via functions such as gsub (a short sketch follows the stop word list).

item_desc <- tm_map(item_desc, content_transformer(tolower)) # convert to lower case
item_desc <- tm_map(item_desc, removeWords, stopwords("english")) #removes stopwords
item_desc <- tm_map(item_desc, stripWhitespace) # Remove extra white space
item_desc <- tm_map(item_desc, stemDocument) # Stemming


stopwords('en')[1:50]
##  [1] "i"          "me"         "my"         "myself"     "we"        
##  [6] "our"        "ours"       "ourselves"  "you"        "your"      
## [11] "yours"      "yourself"   "yourselves" "he"         "him"       
## [16] "his"        "himself"    "she"        "her"        "hers"      
## [21] "herself"    "it"         "its"        "itself"     "they"      
## [26] "them"       "their"      "theirs"     "themselves" "what"      
## [31] "which"      "who"        "whom"       "this"       "that"      
## [36] "these"      "those"      "am"         "is"         "are"       
## [41] "was"        "were"       "be"         "been"       "being"     
## [46] "have"       "has"        "had"        "having"     "do"
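
As an example of the regular-expression edits mentioned above, gsub can be wrapped in content_transformer and passed to tm_map. This is only an illustrative sketch and not part of the original analysis (note the pattern would also drop digits, such as the 10 in the babushka example).

# Replace anything that is not a lower-case letter or a space with a space,
# e.g. the trailing period in "red woolly hottie white heart." would be removed
strip_nonletters <- content_transformer(function(x) gsub("[^a-z ]", " ", x))

item_desc <- tm_map(item_desc, strip_nonletters)
item_desc <- tm_map(item_desc, stripWhitespace) # clean up any doubled spaces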

 

To navigate the corpus, you simply use double brackets to access a specific document.

#Sample item in corpus
item_desc[[200]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 24
str(item_desc[[200]])
## List of 2
##  $ content: chr "babushka light string 10"
##  $ meta   :List of 7
##   ..$ author       : chr(0) 
##   ..$ datetimestamp: POSIXlt[1:1], format: "2017-05-14 22:37:18"
##   ..$ description  : chr(0) 
##   ..$ heading      : chr(0) 
##   ..$ id           : chr "200"
##   ..$ language     : chr "en"
##   ..$ origin       : chr(0) 
##   ..- attr(*, "class")= chr "TextDocumentMeta"
##  - attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
#Sample item in corpus
item_desc[[200]][1]
## $content
## [1] "babushka light string 10"
str(item_desc[[200]][1])
## List of 1
##  $ content: chr "babushka light string 10"

You can create the document-term matrix via the DocumentTermMatrix function. The document-term matrix is simply a count of the number of times each word appears in each document. You can then find the terms that appear at least 50 times using the findFreqTerms function and rank words by total count using colSums.

dtm <- DocumentTermMatrix(item_desc)
inspect(dtm[200:205,800:805])
## <<DocumentTermMatrix (documents: 6, terms: 6)>>
## Non-/sparse entries: 0/36
## Sparsity           : 100%
## Maximal term length: 11
## Weighting          : term frequency (tf)
## Sample             :
##      Terms
## Docs  lightbulb lights,funk lilac lili lime line
##   200         0           0     0    0    0    0
##   201         0           0     0    0    0    0
##   202         0           0     0    0    0    0
##   203         0           0     0    0    0    0
##   204         0           0     0    0    0    0
##   205         0           0     0    0    0    0
freq_terms <- findFreqTerms(dtm,50)

freq <- colSums(as.matrix(dtm))   
ord <- order(freq) 

freq_out <- data.frame(cbind(freq[tail(ord, n = 10)]))
dimnames(freq_out)[[2]] <- "Count"
kable(freq_out)
Term       Count
design       104
christma     117
box          130
bag          136
blue         137
vintag       159
red          166
pink         193
heart        209
set          235

 

You can then call the wordcloud function with the word names and their counts. I’ve set the seed in the following code because the wordcloud function relies on a random number generator to place some of the words, and setting the seed lets me recreate the exact image every time. The scale parameter adjusts the relative size of the words, and rot.per sets the proportion of words rotated 90 degrees, which allows more words to fit in the cloud.

par(mar=c(0,0,0,0))
set.seed(142)   
wordcloud(names(freq), freq, min.freq=25, 
          scale=c(4, 0.25), random.order = FALSE, rot.per=0.15, colors=brewer.pal(7,"Blues"))   


 

West LA Housing Inventory

This post looks into the seasonal inventory of housing in West LA. I’m using data from the Trulia API, covering 2010 to the present. I’m also using this post to learn how to build a Shiny app.


The top plot shows the number of listings by month, the middle plot is the 6-month smoothed average trend, and the bottom plot is the seasonal component.
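
The app itself isn’t reproduced here, but below is a rough sketch of how that kind of decomposition could be produced. It assumes a data frame named listings with one row per month and a numeric numberOfProperties column (the name is borrowed from the Trulia pull in a later post); the actual app may have done this differently.

# Monthly inventory counts as a time series starting January 2010
inventory_ts <- ts(listings$numberOfProperties, start = c(2010, 1), frequency = 12)

# 6-month moving average for the smoothed trend panel
trend_6mo <- stats::filter(inventory_ts, rep(1 / 6, 6), sides = 2)

# STL decomposition for the seasonal panel
seasonal <- stl(inventory_ts, s.window = "periodic")$time.series[, "seasonal"]

par(mfrow = c(3, 1))
plot(inventory_ts, ylab = "Listings")
plot(trend_6mo,    ylab = "6-month average")
plot(seasonal,     ylab = "Seasonal component")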



A couple quick takeaways:

  • Inventory has decreased tremendously since 2010.
  • The seasonal component suggests inventory is highest in the late summer/fall and drops sharply toward the new year.
  • Shiny is very easy to learn; I had the app up and running in a couple of hours.


Plotting multiple graphics on a grid in R

I am using the quantmod package to pull data on the S&P 500 (SPY) and the NASDAQ (IXIC). The quantmod package has very nice graphics of its own; I’m just using this data as practice for plotting in R. I’m going to crudely compare percent growth for the two indexes over the last two years using a couple of plots. This post isn’t meant for anything more than exploring plots in R, using both the base R plotting functions and ggplot2. I used a number of stackoverflow.com posts as references.

library(quantmod)
library(ggplot2)
library(gridExtra)


#Pull data into workspace using quantmod
getSymbols("SPY",src="yahoo",warnings=FALSE)
getSymbols("^IXIC",src="yahoo",warnings=FALSE)

#Restricting timeframe of data
IXIC_timeframe<-index(IXIC$IXIC.Close)>as.Date("2013-06-01")
SPY_timeframe<-index(SPY$SPY.Close)>as.Date("2013-06-01")

#Creating dataset with all values in format I want
comp<-data.frame(Date=index(IXIC$IXIC.Close)[IXIC_timeframe],
    IXIC=round(100*IXIC$IXIC.Close[IXIC_timeframe]/as.numeric(IXIC$IXIC.Close[IXIC_timeframe][1])-100,1),
    SPY=round(100*SPY$SPY.Close[SPY_timeframe]/as.numeric(SPY$SPY.Close[SPY_timeframe][1])-100,1))

Below is a stacked plot using the base R plotting functions. I set a wide margin on the right side of the graph and allow drawing outside the plot region using the xpd argument to par. I did this so that I could fit the legend and a second axis label outside the plot region on the right.

par(mar=c(5, 5, 2, 7), xpd=TRUE)

#Draw IXIC (shifted up 50 so the difference series fits below) and set up the plot
plot(comp$Date,
    comp$IXIC+50,
    type="l",ylab="",
    main="",yaxt="n",xlab="Date",ylim=c(0,100),lwd=2)

#Draw SPY lines
lines(comp$Date,
    comp$SPY+50,col=2,lwd=2)
legend("topright",inset=c(-0.15,0),c("IXIC","SPY"),
    col=c(1,2),lty=1,bty="n",cex=0.8,lwd=3)

#Take the difference between the indexes' performance
indeces_diff<-comp$IXIC-comp$SPY

#Plot the difference between the indexes (shifted up 15 to sit below the index lines)
lines(comp$Date,
    indeces_diff+15,
    lwd=2)
#Dashed reference line marking a difference of zero on the shifted scale
lines(comp$Date,
    rep(15,length(comp$Date)),
    lwd=1,lty=2)

#Add the left axis
axis(side=2,at=c(50,70,90),labels=c(0,20,40))
mtext("Perc Change in \nIndexes from 6/1/2013", side=2, line= 2.5, at=70)

#Add the right axis
axis(side=4,at=c(0,15,30),labels=c(-15,0,15))
mtext("Perc Diff Between \nIXIC and SPY", side=4, line= 3, at=15)

#Add a summary statistic to the bottom of the plot frame
mtext(paste("Perc Diff on 6/1/2015: ",indeces_diff[comp$Date=="2015-06-01"],sep=""),
    side=1,line=3,adj=0, cex=0.8)

Below is the stacked plot using ggplot2. The grid.arrange function, from the gridExtra package, allows placing more than one ggplot2 graphic on a single canvas.

#Formatting data for ggplot2
comp2<-data.frame(Date=rep(comp$Date,2),
    Perc=c(comp$SPY,comp$IXIC),
    Indeces=c(rep("SPY",length(comp$SPY)),rep("IXIC",length(comp$IXIC))),
    stringsAsFactors=FALSE)

#Plot first graph
p1<-ggplot(comp2,aes(x=Date,y=Perc,color=Indeces))+
    geom_line(size=1.1)+
    scale_color_manual(values=c("#000000","red"))+
    scale_y_continuous("Perc Change in \nIndexes from 6/1/2013")+
    scale_x_date("")+
    theme_bw()+ 
    theme(legend.justification=c(1,0), legend.position=c(1,0),
        axis.text.x=element_blank(),axis.ticks=element_blank(),
        plot.margin = unit(c(1,1,-0.5,1), "cm"))

#Plot second graph
p2<-ggplot(comp,aes(x=Date,y=IXIC.Close-SPY.Close))+
    geom_line(size=1.1)+
    scale_y_continuous("Perc Diff Between \nIXIC and SPY")+
    theme_bw()+
    theme(plot.margin = unit(c(-0.5,1,1,1), "cm"))
    
#Call grid.arrange to plot both graphs in same frame
grid.arrange(p1,p2,ncol=1,main="Comparison of S&P 500 and NASDAQ (6/1/2013-6/1/2015)")

A next step would be adding further summary statistics in a table format, possibly using tableGrob from the gridExtra package. I’ll save that for another example…
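
For reference, here is a rough sketch of what that could look like; the summary table contents (latest percent change for each index, pulled from comp2) are just an example.

library(gridExtra)

# Hypothetical summary table: most recent percent change for each index
summary_df <- data.frame(
  Index  = c("IXIC", "SPY"),
  Latest = c(tail(comp2$Perc[comp2$Indeces == "IXIC"], 1),
             tail(comp2$Perc[comp2$Indeces == "SPY"], 1)))

# tableGrob turns a data frame into a grid graphic that grid.arrange can place
grid.arrange(p1, p2, tableGrob(summary_df),
             ncol = 1, heights = c(3, 3, 1))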


 

Monthly Housing Prices

So I wanted to look into monthly housing price trends using d3.js in R, but couldn’t get the d3 graphic to load in WordPress because the file was too large. d3.js is a terrific JavaScript library for interactive graphics. The R ports to d3 make it easy to streamline the analysis and generate graphics. Please click here to see a heatmap of West LA prices by month.
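
The post doesn’t say which R port was used, but as one possible sketch, the d3heatmap package (paired with htmlwidgets) can produce a standalone interactive heatmap; the price_by_month matrix, with zip codes as rows and months as columns, is assumed here.

library(d3heatmap)
library(htmlwidgets)

# Interactive d3 heatmap of median price by zip code and month
hm <- d3heatmap(price_by_month, dendrogram = "none", colors = "Blues")

# selfcontained = TRUE embeds all of the JavaScript and data in one HTML file,
# which is convenient but can produce a file too large for some hosts
saveWidget(hm, "west_la_prices.html", selfcontained = TRUE)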

Exploring Housing Data

I pulled some data from Trulia and Zillow to try and understand housing trends in West Los Angeles. I had a couple questions that I may get to in future posts, but first I thought I would compare the different sources of data.

Trulia Data

The Trulia data is available through an API. The data includes, but is not limited to, the median listing price, average listing price, and number of listings by zip code. The data is available from 1/1/2010 to the present. The source of the data is broker listings that try to match the MLS. Below is how to parse the data in R:

##########################################################
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Pull the data from Trulia
#   - Iterates over zips and months
#   - The dataframe has the following variables:
#       - zipcode
#       - begin_date: yyyy-mm-dd
#       - numberOfProperties: # of listings
#       - medianListingPrice: median list prices
#       - averageListingPrice: mean list price
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
##########################################################

library(xml2)
library(httr)
library(XML)

# Pulled the zipcodes for west LA from the City's website
zipcodes_west_la <- c(90045, 90024,90293, 90056, 
                      90302, 90301,90292, 
                      90066, 90291, 90405, 90404, 90401, 
                      90403, 90402,90025, 90064, 90034,
                      90232, 90230, 90016)



housing_monthly <- NULL

time0 <- Sys.time()
for(zip_hold in zipcodes_west_la){
  print(zip_hold)
  for(begin_date in as.character(seq(ISOdate(2010,1,1), by = "month", length.out = 111))){
    
    # Pause between queries
    #  - Trulia only allows 2 queries/second
    Sys.sleep(0.5)
    
    #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    # Format dates and get end of month date
    end_date <- as.character(as.Date(seq(as.POSIXct(begin_date), by="month", length.out = 2)[2])-1)
    begin_date <- as.character(as.Date(begin_date))
    
    #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    #Get url and parse using httr and XML packages
    url <- GET(
      paste("http://api.trulia.com/webservices.php?library=TruliaStats&function=getZipCodeStats&zipCode=",
        zip_hold,"&",
        "startDate=",begin_date,"&",
        "endDate=",end_date,"&","statType=listings&apikey=<API KEY>",sep="")
        )

    xml_url <- content(url, "text")
    xmlfile <- xmlTreeParse(xml_url)
    xmltop <- xmlRoot(xmlfile)
    
    
    # The offensive number of brackets are used to parse xml
    #  - Look into more elegant way to parse xml!
    
    desired_xml_chunk   <- xmltop[[2]][[1]][[2]][[1]][2][[1]][1][[1]]
    numberOfProperties  <- xmlSApply(desired_xml_chunk[2]$numberOfProperties,xmlValue)
    medianListingPrice  <- xmlSApply(desired_xml_chunk[3]$medianListingPrice,xmlValue)
    averageListingPrice <- xmlSApply(desired_xml_chunk[4]$averageListingPrice,xmlValue)
    
    #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    #Append this month's results to the dataset
    housing_monthly <- rbind(housing_monthly,
      data.frame(
        zipcode  = zip_hold,
        begin_date,
        numberOfProperties,
        medianListingPrice,
        averageListingPrice
    ))

}}
time1 <- Sys.time()

time1-time0 #28 mins for 1444 obs

head(housing_monthly)
##   zipcode begin_date numberOfProperties medianListingPrice
## 1   90045 2010-01-01                 78             664000
## 2   90045 2010-02-01                 85             665083
## 3   90045 2010-03-01                100             685429
## 4   90045 2010-04-01                100             673000
## 5   90045 2010-05-01                104             674143
## 6   90045 2010-06-01                121             663900
##   averageListingPrice
## 1              747779
## 2              744044
## 3              752451
## 4              718029
## 5              714914
## 6              727699

Zillow Data

The Zillow data comes as a CSV and covers more years than the Trulia data. The listing price is based on the Zillow Home Value Index and is available from 4/1996 to the present. Below is how I pulled and formatted the data:

##########################################################
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Pull the data from Zillow
#   - Iterates over zips and months
#   - The dataframe has the following variables:
#       - zipcode
#       - city: city location of zipcode
#       - begin_date: yyyy-mm-dd
#       - sqft: price per square foot
#       - median: median zillow estimate
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
##########################################################


sqft_data  <- read.csv("http://files.zillowstatic.com/research/public/Zip/Zip_MedianValuePerSqft_AllHomes.csv",header=T)
index_data <- read.csv("http://files.zillowstatic.com/research/public/Zip/Zip_Zhvi_AllHomes.csv",header=T)


sqft_data_la  <- sqft_data[sqft_data$CountyName == "Los Angeles",]
index_data_la <- index_data[index_data$CountyName == "Los Angeles",]


time0 <- Sys.time()
zillow_data <- NULL
for(indx in 1:length(sqft_data_la$RegionName)){
  print(indx)
  for(time_hold in names(sqft_data_la[,8:249])){
    
    
    median_price_out <- ifelse(sqft_data_la$RegionName[indx] %in% index_data_la$RegionName,
                           index_data_la[index_data_la$RegionName == sqft_data_la$RegionName[indx], time_hold],
                             NA)
    
    
    month_out <- paste(
                    substr(time_hold,2,5),"-",
                    substr(time_hold,7,8),"-01",sep="")
    
    zillow_data <- rbind( zillow_data,
      data.frame(
        zipcode = sqft_data_la$RegionName[indx],
        city    = sqft_data_la$City[indx],
        month   = month_out,
        sqft    = sqft_data_la[indx, time_hold], 
        median  = median_price_out
        
      ))
    
  }
  
}
time1 <- Sys.time()

time1 - time0 #14.3 mins

head(zillow_data)
##   X zipcode      city      month sqft median
## 1 1   90250 Hawthorne 1996-04-01  121 155400
## 2 2   90250 Hawthorne 1996-05-01  121 154600
## 3 3   90250 Hawthorne 1996-06-01  120 153300
## 4 4   90250 Hawthorne 1996-07-01  119 152100
## 5 5   90250 Hawthorne 1996-08-01  118 151500
## 6 6   90250 Hawthorne 1996-09-01  118 151300

Comparisons

Plotting the data against each other shows that the Zillow data is not well suited to looking at seasonal trends, while the Trulia data is not available before the Great Recession.

The plot below is aggregated by date across the West LA zip codes from 1/1/2010 to 4/1/2016.

A zoomed-in view of the years available for the Trulia data:

Initial Impressions

The median difference between the Trulia price and the Zillow price is $39,379 (IQR = $4,371 to $74,048). The variability between the estimates also seems to grow over the years. The Trulia data seems more applicable to some of my interests because it captures seasonality and includes the number of listings. It is nice that the Zillow data is available for so many years, and it’s also easier to access because it comes as a CSV. The Trulia API has some limitations, and reading the data into R can be time intensive. Future posts will explore this data further.
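
For reference, below is a sketch of how the comparison could be computed from the two data frames built earlier in these posts; the exact aggregation used for the numbers above isn’t shown, so this is only an approximation of the approach.

# Align the two sources on zip code and month
both <- merge(housing_monthly, zillow_data,
              by.x = c("zipcode", "begin_date"),
              by.y = c("zipcode", "month"))

# Trulia values were parsed out of XML as text, so convert them to numeric first
price_diff <- as.numeric(as.character(both$medianListingPrice)) - both$median

# Median and interquartile range of the Trulia minus Zillow difference
quantile(price_diff, c(0.25, 0.5, 0.75), na.rm = TRUE)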


 
