How I used multi-threading to speed up data processing

posted Mar 16, 2017, 12:34 PM by Fanda   [ updated Mar 17, 2017, 9:29 AM ]
I was writing up the data analysis part of this project the other day. It was all fine and dandy until I found a mistake in the code I wrote to detect sales prices. Most sales decisions are made on the retailer side, purely for the sake of driving up sales volume. Since my project deals with the relationship between farm-level prices and retail prices, sales prices are viewed as noise that should be removed before any empirical analysis begins. The mistake was bad enough that I had to redesign the entire algorithm. Since removing sales prices is literally the first step after dumping raw data into the SQL server, this means re-doing every analysis I had already completed once the new routine finishes removing sales prices. Though everything is highly automated, I can hardly afford the time it takes to get everything ready again.

The little program I wrote uses 5 of my 6 CPU cores. In the past 4 hours, it has plowed through a little more than 3 of the 15 sets of price series. As of this writing, the program has already cleaned up 273,981 series. The problem is, I don't really know the total number of series at this point. Eventually I will have to run an SQL query to figure it out, but boy, will it be time-consuming: the database currently houses 614,917,889 price points.

Well, since I've got a lot of time to wait, let me write a little piece about the simple parallelism I engineered myself. My PC has 6 cores. You can think of each core as an employee who works for me. These folks are very fast, very good at repetitive tasks, and never ask for a dime. I let one of them be the manager and the other 5 be inspectors. I am going to assume the manager is female and the other 5 are all male. (Disclaimer: I support gender equality. It's just easier this way, so that when I use a gender-specific pronoun you know which group I am referring to.)

The manager keeps a folder for each inspector, with that inspector's name on it. In each folder, she always tries to keep up to 100 names of the price series that the inspector will work on. At the beginning of the day, she pulls 100 names out of my database for each inspector, tells the inspectors which folders they are assigned to, and monitors her subordinates' progress. Once an inspector gets a folder, he removes a name from it, goes to the database to find all price points under that name, then reads through each point to flag possible sales prices. Once he finishes inspecting all price points in a series, he updates the database according to his judgment of which prices are sales prices. After that, he removes another name from the folder, inspects, flags, saves, and so on and so forth.

While every inspector is trying to clear out his folder, the manager makes sure he never runs out of work before all series are inspected. If the number of series names in a folder drops below 50, she finds another 50 and puts them back in. This goes on until there are no more price series the manager can find and all 5 inspectors have finished the last series in their folders. What does the manager do when a folder has more than 50 but fewer than 100 names? She kicks back and does nothing.
(This actually frees the "manager" core so the SQL database engine can use it for additional processing. Series names are pre-loaded into RAM, so the manager core does not visit the SQL server when she puts new names into a folder. The manager logic is designed to do nothing most of the time and to put very little burden on the CPU when it does act. Each inspector thread also owns a separate data connection. Those threads don't vanish even when their folders are temporarily empty. The manager keeps tabs on how many series are assigned and how many are finished; until all series are assigned, she does not let go of any of her workers.)
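For readers who want to see the plumbing, here is a minimal sketch of the manager/inspector pattern described above, written in Python. This is not my actual program (which talks to a real SQL Server); the `run` and `process` names, the refill logic, and the timeouts are all stand-ins I made up for illustration. Each "folder" is a thread-safe queue.

```python
import queue
import threading
import time

BATCH_SIZE = 100   # a full folder holds 100 series names
REFILL_AT = 50     # the manager tops a folder up when it drops below 50
REFILL_BATCH = 50  # ...by adding another 50 names
NUM_INSPECTORS = 5

def run(all_names, process):
    """Distribute `all_names` across NUM_INSPECTORS inspector threads.

    `process(name)` stands in for the real work: fetch the series'
    price points, flag sales prices, and write the flags back.
    """
    names = iter(all_names)          # series names pre-loaded into RAM
    folders = [queue.Queue() for _ in range(NUM_INSPECTORS)]
    done = threading.Event()         # set when no names are left to assign

    def manager():
        exhausted = False

        def fill(folder, count):
            nonlocal exhausted
            added = False
            for _ in range(count):
                try:
                    folder.put(next(names))
                    added = True
                except StopIteration:
                    exhausted = True
                    break
            return added

        # Beginning of the day: 100 names per folder.
        for folder in folders:
            fill(folder, BATCH_SIZE)

        # Keep folders topped up; when every folder is above the
        # refill mark, the manager "kicks back and does nothing".
        while not exhausted:
            busy = False
            for folder in folders:
                if folder.qsize() < REFILL_AT:
                    busy = fill(folder, REFILL_BATCH) or busy
            if not busy:
                time.sleep(0.01)
        done.set()

    def inspector(folder):
        while True:
            try:
                name = folder.get(timeout=0.05)
            except queue.Empty:
                if done.is_set():
                    return           # no more work will ever arrive
                continue             # folder temporarily empty; keep waiting
            process(name)            # inspect, flag, save

    threads = [threading.Thread(target=manager)]
    threads += [threading.Thread(target=inspector, args=(f,)) for f in folders]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Because each folder is a queue with exactly one producer (the manager) and one consumer (its inspector), no explicit locking is needed; in the real program, each inspector would additionally open its own database connection, as noted above.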

Figure 1

Figure 1 above is a screenshot of my program (top) and the Resource Monitor (bottom). If you look at the bottom of my program, you may notice it reports that the 5 inspectors have finished 273,981 series and the manager has assigned a grand total of 274,485 series to them. The grand total is roughly 500 more than the inspectors have collectively finished. Since this is a static picture, you cannot see those numbers in action, but both are constantly increasing and stay about 500 apart. The chart in Resource Monitor shows disk IO as a blue curve and CPU usage as a green area. The green "trenches" happen when the hard drive clears its buffer, which goes to show that disk IO speed is the bottleneck of the entire process.