Random Forest

posted Dec 12, 2016, 3:37 PM by Fanda   [ updated Dec 12, 2016, 3:52 PM ]
My project looks at price transmission for three types of U.S. dairy products. I intend to use threshold error correction models to study asymmetric price transmission from farm to retail. Because of the sheer size of my dataset, I went to a former Google employee, now a professor in the department, for help. He recommended that I use a data science trick called random forest. The result was amazing!

I started with Nielsen retail scanner data covering the entire period from 2006 to 2014. Nielsen offers weekly price observations for a large number of products across a large number of retail locations. To limit the scope of my project, I picked only the stores in the top 20 metropolitan areas.

To determine what may affect a yogurt product’s price besides the passage of time, I parsed the UPC description of each product. The result is a set of variables that describe the following aspects of a product:

 style: Greek? Regular? Swiss? Russian? A total of 9 styles, numbered from 1 to 9.
 sweetened: Is it artificially sweetened? Binary variable.
 sugar: Is sugar added to the product? Binary variable.
 ECJ: Is evaporated cane juice used as a sweetener? ECJ is usually seen in organic products and commands a premium. Binary variable.
 fat_content: Does the package say the product is made with 1%, skim, whole, or unspecified milk? Numerical variable.
 fruit: Is it a fruit-flavored product? Binary variable.
 lactose: Is the product lactose free? Lactose-free milk usually commands a premium. Binary variable.
 light: Does the package say “light” or “lite”? This indicates a product marketed as low calorie. Binary variable.
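
To make the parsing step concrete, here is a minimal sketch in Python of how a few of these variables could be derived from a description string. It is not the code used for this project: the keyword patterns, the style codes, and the sample descriptions are all assumptions made for illustration, and real UPC descriptions are usually abbreviated in ways these patterns would not catch. Only a subset of the variables is shown.

```python
import re

# Hypothetical keyword patterns; real UPC descriptions would need their own rules.
BINARY_PATTERNS = {
    "ECJ": re.compile(r"\b(ECJ|EVAPORATED CANE JUICE)\b"),
    "fruit": re.compile(r"\b(STRAWBERRY|BLUEBERRY|PEACH|FRUIT)\b"),
    "lactose": re.compile(r"\bLACTOSE[\s-]?FREE\b"),
    "light": re.compile(r"\b(LIGHT|LITE)\b"),
}

# Hypothetical numeric codes for four of the nine styles (1 = regular).
DEFAULT_STYLE = 1
STYLE_CODES = {"GREEK": 2, "SWISS": 3, "RUSSIAN": 4}


def parse_description(description):
    """Turn one product description into a dict of feature values."""
    text = description.upper()
    # Each binary variable is 1 if its keyword pattern appears, else 0.
    features = {name: int(bool(pattern.search(text)))
                for name, pattern in BINARY_PATTERNS.items()}
    # The style is the first style keyword found, or "regular" if none is.
    features["style"] = next(
        (code for word, code in STYLE_CODES.items() if word in text),
        DEFAULT_STYLE,
    )
    return features


example = parse_description("Greek Yogurt Lite Strawberry 5.3 oz")
```

For the made-up description above, the sketch flags fruit and light and assigns the hypothetical Greek style code; the remaining binary variables stay at 0.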

Before feeding the time series data into my model, I was particularly interested in whether the price path of each product clusters by certain variables. For example, are ECJ products more expensive than non-ECJ products most of the time?

Though it is tempting to run a panel regression to determine which of those variables have something to say about prices, that is not practical with the computing power of my office computer. To start with, there are still 5,527,288 price series in the raw dataset even after my “metropolis draft pick”. Each price series consists of weekly price points for one product in one store. After aggregating prices to the product and Metropolitan Statistical Area levels, I was still left with 4,691,635 price observations, or roughly 33,382 price series.
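
The aggregation from store level to product and Metropolitan Statistical Area level can be sketched as follows. This is only an illustration: the row layout is hypothetical, and the simple unweighted mean is an assumption, since a volume-weighted average or another rule may suit scanner data better.

```python
from collections import defaultdict
from statistics import mean


def aggregate_to_msa(rows):
    """Collapse store-level prices into one price per (product, MSA, week).

    rows: iterable of (upc, store_id, msa, week, price) tuples (hypothetical layout).
    Assumption: stores are averaged with equal weight.
    """
    grouped = defaultdict(list)
    for upc, store_id, msa, week, price in rows:
        grouped[(upc, msa, week)].append(price)
    return {key: mean(prices) for key, prices in grouped.items()}


# Made-up rows: two stores in one MSA and one store in another.
store_rows = [
    ("0001", "store_a", "MSA_1", "2009-01-03", 1.00),
    ("0001", "store_b", "MSA_1", "2009-01-03", 1.50),
    ("0001", "store_c", "MSA_2", "2009-01-03", 2.00),
]
msa_prices = aggregate_to_msa(store_rows)
```

Each (product, MSA) pair then forms one price series, which is why the number of series drops so sharply after aggregation.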

After consulting with the professor, I decided to give random forest a try. Because my panel dataset is essentially a collection of time series, stacking observations from different time periods would ignore temporal effects. So I sliced the dataset by time period and ran the random forest on each slice. The results are not disappointing at all.
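
The slice-then-fit workflow can be sketched in Python as below. The width of a slice (one calendar year here), the record layout, and the analysis function are all assumptions for illustration; in the actual project the per-slice analysis was a random forest fit, for which a simple mean price stands in here as a placeholder.

```python
from collections import defaultdict
from statistics import mean


def slice_by_period(observations, period_of):
    """Group observations by the period label returned by period_of."""
    slices = defaultdict(list)
    for obs in observations:
        slices[period_of(obs)].append(obs)
    # Sort by period label so results come out in chronological order.
    return dict(sorted(slices.items()))


def run_per_slice(observations, period_of, analyse):
    """Apply the same analysis independently to every time slice."""
    return {period: analyse(rows)
            for period, rows in slice_by_period(observations, period_of).items()}


def year_of(obs):
    # Assumes ISO-formatted dates, so the first four characters are the year.
    return obs["week"][:4]


def mean_price(rows):
    # Placeholder for fitting a model and extracting variable importances.
    return mean(row["price"] for row in rows)


# Made-up observations spanning two years.
observations = [
    {"week": "2009-01-03", "style": 2, "price": 1.20},
    {"week": "2009-06-06", "style": 1, "price": 0.80},
    {"week": "2010-01-02", "style": 2, "price": 1.40},
]
results = run_per_slice(observations, year_of, mean_price)
```

Because each slice is analysed on its own, the per-slice results can be lined up over time, which is what makes a chart of importance by period possible.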

Below is a chart of each variable’s “significance” level in its contribution to the retail price of the products.


Fanda Yang, random forest, big data, yogurt, price transmission

I used the cforest function from the R “party” package (what a name) to get what’s on the y-axis: a measure called mean decrease in accuracy. My understanding is that it tells how much of the variation in retail prices the model fails to explain once a variable is removed. The magnitude is meant to be compared with that of other variables rather than taken in absolute terms.
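
The idea behind this kind of importance measure can be illustrated with a small permutation experiment in Python: shuffle one variable, and see how much worse the predictions get. This is a sketch of the general technique, not a reimplementation of what the party package computes. A lookup table of group means stands in for the forest, the error measure is mean squared error, and the data are made up so that price depends on style but not on light.

```python
import random
from collections import defaultdict
from statistics import mean


def fit_group_means(X, y):
    """Toy stand-in for a forest: predict the mean price of each feature combination."""
    groups = defaultdict(list)
    for row, target in zip(X, y):
        groups[tuple(row)].append(target)
    table = {key: mean(values) for key, values in groups.items()}
    overall = mean(y)
    # Unseen combinations fall back to the overall mean price.
    return lambda row: table.get(tuple(row), overall)


def mse(predict, X, y):
    """Mean squared prediction error."""
    return mean((predict(row) - target) ** 2 for row, target in zip(X, y))


def permutation_importance(predict, X, y, column, rng):
    """Increase in error after shuffling the values of one column."""
    shuffled = [row[column] for row in X]
    rng.shuffle(shuffled)
    X_perm = [row[:column] + [value] + row[column + 1:]
              for row, value in zip(X, shuffled)]
    return mse(predict, X_perm, y) - mse(predict, X, y)


# Made-up data: columns are [style, light]; price depends on style only.
X = [[style, light] for style in (1, 2) for light in (0, 1) for _ in range(25)]
y = [1.0 + 0.5 * style for style, light in X]

model = fit_group_means(X, y)
rng = random.Random(0)
importances = {
    "style": permutation_importance(model, X, y, 0, rng),
    "light": permutation_importance(model, X, y, 1, rng),
}
```

In this toy setup, shuffling style hurts the predictions while shuffling light leaves them unchanged, so style gets the larger score. As with the chart, only the ranking of the scores matters, not their absolute size.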

From this chart, it is fairly clear that style (red) and fat_content (black) stand out as major price “differentiators”. I showed the chart to two other professors in my department, and both told me it makes sense. At least the stories behind style and fat_content are consistent with what we see in the chart. Around 2009 and 2010, Greek yogurt picked up sales volume and became very popular for its high protein content. The steeply rising fat_content curve from 2013 onward coincides with consumers’ growing awareness of fat content in milk markets. The chart also says retailers have been setting prices based more and more on fat_content over the years, which is something I was not expecting to see from random forest. Bravo! Random Forest!

As for ECJ (yellow) and light (green), it might look like a toss-up at first glance. But upon closer inspection, I noticed that light picks up quite a bit around 2010 and stays relevant for the remaining time periods, whereas ECJ fades away after a peak in the middle of 2009. Given that ECJ is widely associated with organic products, which are not studied in my project, I went for light. (No! I will not stay on the dark side!)

For the purpose of my project, I settled on style, fat_content and light.