Dairy Price Transmission

Jan. 2016 -


How I utilized multi-threading technique to speed up data processing

posted Mar 16, 2017, 12:34 PM by Fanda   [ updated Mar 17, 2017, 9:29 AM ]

I was writing up the data analysis part of this project the other day. It was all fine and dandy until I found a mistake in the code I wrote to detect sales price. Most sales decisions are made on the retailer side and purely for the sake of hiking up sales volume. Since my project deals with the relationship between farm-level price and retail price, sales prices are viewed as noises that should be removed before any empirical analysis begins. The mistake was bad enough that I had to re-design the entire algorithm. Since removing sales prices is literally the first step after dumping raw data into the SQL server, this means I have to re-do all the analyses I have already done after the new routine removes sales prices again. Though everything is highly automated, I don't seem to be able to afford the time it takes to get everything ready again. 

The little program I wrote utilizes 5 out of 6 CPU cores. In the past 4 hours, the program only plowed through a little more than 3 out of 15 sets of price series. The program, as of this writing, has already cleaned up 273,981 series. The problem is, I don't really have an idea of what the total amount of series is at this point. Eventually, I would have to run an SQL query to figure it out. But boy, it will be time consuming. The database currently houses 614,917,889 price points. 

Well since I've got a lot of time to wait, let me just write a little piece about the simple parallelism I engineered myself. My PC has 6 cores. You can think of each core as an employee who works for me. These folks are very fast and good at doing repetitive tasks but never ask for a dime. I let one of them be the manager and the rest 5 be inspectors. Now I am going to just assume the manager is female and the other 5 are all male. (Disclaimer: I support gender equality. It's just easier this way so that when I use a gender-specific pronoun you know which group of them I am referring to.) The manager keeps a folder for each inspector. Each folder has a inspector's name on it. In any of those folders, she always tries to fill up 100 names of the price series that the inspector will work on. At the beginning of the day, she pulls out 100 names out of my database for each inspector, tells them which folder they are assigned to and monitors the her subordinates' work progress. Once an inspector gets a folder, he removes a name from the folder, goes to the database to find all price points under that name, then reads through each point to flag possible sales price. Once he finishes inspecting all price points in a series, he updates the database according to his judgement of which price is a sales price. After that, he removes another name from the folder, inspects, flags, saves, and so on and so forth. While every inspector is trying to clear out his folder, the manager makes sure he never runs out of work before all series are inspected. If the number of price series names in a folder drops below 50, she finds another 50 and puts them back into the folder. This goes on and on until there are no more prices series the manager can find and all 5 inspectors finished the last series in their folder. What does the manager do when there are more than 50 but less than 100 names in a folder? She kicks back and does nothing. (This actually frees the "manager" core so that SQL database engine can use it to do some additional processing. Series names are pre-loaded into RAM so that the manager core does not visit SQL server when she puts new names into a folder. The manager logic is designed so that it does nothing most of the time and puts very little burden on CPU when it does. Each inspector thread also owns a separate data connection. Those threads don't vanish even if their folders are temporarily empty. The manager keeps a tap on how many series are assigned and how many are finished. Until all series are assigned, she does not let go any of her workers. )

Figure 1

Figure 1 above is a screenshot of my program (top) and the Resource Monitor (bottom). If you look at the bottom of my program, you may notice that it reports 5 inspectors had finished 273,981 series and the manager had assigned a grand total of 274,485 series to them. The grand total is roughly 500 series more than the inspectors had collectively finished. Since this is a static picture, you cannot see those numbers in action. But both numbers are constantly increasing and about 500 apart. The chart in Resource Monitor shows disk IO in blue curve and CPU usage in green area. The green "trench" happens when the hard drive clears its buffer, which goes to show disk IO speed is the bottom neck of the entire process. 

Random Forest

posted Dec 12, 2016, 3:37 PM by Fanda   [ updated Dec 12, 2016, 3:52 PM ]

My project looks at the price transmission of three types of U.S. dairy products. I intend to use threshold error correction models to study asymmetric price transmission from farm to retail. Because of the sheer size of my dataset, I went to a formal Google employee and now professor in the department for help. He recommended me to use a data science trick called random forest. The result was amazing! 

I started with Nielson retail scanner data that cover the entire period from 2006 to 2014. Nielson offers weekly price observations for a lot of products in a lot of retail locations. In an effort to limit the scope of my project, I only picked the stores that are in the top 20 metropolitan areas. 

To determine what may or may not affect a yogurt product’s price besides temporal progressions, I parsed UPC description for each product. The result is a set of variables that describe the following aspects of a product:

 style Greek? Regular? Swiss? Russian?
A total of 9 styles, numbered from 1 to 9.
 sweetened Is it artificially sweetened? Binary variable.
 sugar Is sugar added to the product? Binary variable.
 ECJ Is evaporated cane juice used as sweetener. ECJ was usually seen in organic products and dictates a premium. Binary variable.
 fat_content Does the package mention the product is made of 1%, Skim, whole, or unspecified milk? Numerical variable.
 fruit Is it a fruit flavored product? Binary variable.
 lactose Is the product lactose free? Lactose-free milk usually dictates a premium. Binary as well.
 light Does the package say “light” or “lite”? It’s an indication of marketed low calorie product. Binary variable. 

Before feeding time series data into my model, I was particularly interested in understanding if the price progression of each product clusters around certain variables. For example, are ECJ products predominantly more expensive than non-ECJ products in most of the time?

Though it is tempting to run a panel regression to determine which of those variables have something to say about the prices, it is not all that practical with the computing power I have in my office computer. To start with, there are still 5,527,288 price series in the raw dataset even after my “metropolis draft pick”. Each price series consists of weekly price points for one product in one store. After aggregating prices to product and Metropolitan Statistical Area levels, I was still left with 4,691,635 price observations or roughly 33,382 price series. 

After consulting with the professor, I decided to give random forest a try. Because the panel dataset I have are essentially a collection of time series, stacking observations over different time periods seems inappropriately ignorant about temporal effects. So I sliced up the dataset by time periods and ran the random forest on each slice. The results are not disappointing at all.

Below is a chart of each variable’s “significance” level in its contribution to the retail price of the products.


Fanda Yang, random forest, big data, yogurt, price transmission

I used R cforest command from “party” package (what a name) to get what’s on the y-axis. What’s on the Y-axis is a measure called decrease in mean accuracy. My understanding of the measure is that it is basically a telltale of variation in retail prices that are lost by removing a variable. The magnitude is meant to be compared with that of other variables rather than being taken in absolute terms.

From this chart, it is fairly clear that style (red) and fat_content (black) stand out as major price “differentiators”. After showing the chart to two other professors in my department, I was told they believe that this chart makes sense. At least the stories behind style and fat_content seem to vindicate what we see in the chart. Around 2009 and 2010, Greek yogurt picked up sales volume and became very popular for its high protein content. The high rising fat_content curve from 2013 onward coincides with consumer’s awareness of fat content in milk markets. This chart also says retailers are setting price based more and more on fat_content over the years, which is something I was not expecting to see from random forest. Bravo! Random Forest!

As for ECJ (yellow) and light (green), it might be a toss-up at first glance. But upon closer inspection, I noticed that light picks up quite a bit around 2010 and stayed relevant for the remaining time periods, whereas ECJ fades its way after a peak in the middle of 2009. Given that ECJ is widely associated with organic products that are not studied in my project, I went for light. (No! I will not stay on the dark side!)

For the purpose of my project, I settled with style, fat_content and light

Data overview

posted Nov 28, 2016, 6:29 AM by Fanda   [ updated Nov 28, 2016, 6:48 AM ]

I obtained the access to Nielson retailer data around Jan. 2016. Boy! What a humongous data set! After looking at Nielson's data catalog, I decided to download only four data groups: fluid milk (module code: 3625), yogurt (module code: 3603), cheddar cheese (module code: 3550) and mozzarella (module code: 3546). Data were delivered through tsv (tab separated values) files. One for each product-year combo. All those files provide more than 5 million weekly price series. That's the starting point of this project.

1-3 of 3