Language Wrangling: Running Google’s Sawzall on Quantcast’s MapReduce Cluster
It’s April 2011. Quantcast is running one of the largest MapReduce clusters out there. Engineers write Java code that gets executed across the whole cluster efficiently on terabytes of data. Wonderful!
Unfortunately, not everyone who needs access to data is an engineer. Also, even engineers don’t feel like writing a new MapReduce job every time they want to take a different look at a data set.
We realized we needed a more productive data analysis tool. The first thought was to get SQL to run on our petabytes of data, but that seemed like a large undertaking: SQL’s data access semantics are fairly sophisticated, which implies a large area of friction with Quantcast’s MapReduce implementation. A simpler solution was to get Google’s recently open-sourced Sawzall to run on Quantcast’s MapReduce cluster. Although Sawzall as a language is not as easy to use as SQL, especially for non-engineers, it’s still much simpler than Java. And it seemed reasonably easy to integrate Sawzall with Quantcast’s MapReduce implementation, as its interface to MapReduce is much narrower and better defined than SQL’s.
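To make the "narrow interface" point concrete, here is a minimal sketch in Python (not Sawzall, and not Quantcast's code) of the programming model Sawzall exposes: a program sees one record at a time and can only emit values into named aggregator tables, which the framework then combines. The names `map_record`, `emit`, and `run`, the table names, and the tab-separated record format are all hypothetical and chosen for illustration only.

```python
from collections import defaultdict


def map_record(record, emit):
    """Per-record logic, playing the role of a Sawzall program body.

    It sees a single record and may only emit into named tables.
    The tab-separated "host<TAB>bytes" format is an assumption.
    """
    host, size = record.split("\t")
    emit("count", "all", 1)                 # count every record
    emit("bytes_by_host", host, int(size))  # sum bytes per host


def run(records):
    """Toy single-process stand-in for the MapReduce harness.

    Every table here behaves like a sum aggregator: values emitted
    under the same key are added together.
    """
    tables = defaultdict(lambda: defaultdict(int))

    def emit(table, key, value):
        tables[table][key] += value

    for record in records:
        map_record(record, emit)

    # Convert nested defaultdicts to plain dicts for easy inspection.
    return {name: dict(table) for name, table in tables.items()}


if __name__ == "__main__":
    print(run(["a.com\t10", "b.com\t5", "a.com\t7"]))
```

Because the per-record function never sees other records or the aggregation step, a harness only needs to feed it records and collect what it emits, which is a much smaller surface than SQL's joins, ordering, and grouping semantics.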
This theoretical “ease of integration with MapReduce” turned out (surprise!) to be harder than expected in practice. First, Sawzall was not open-sourced with a MapReduce harness, but only as a compilation/execution engine plus a command line tool with little practical utility. Second, Sawzall runs best on protocol buffers, but Quantcast stores its data in different binary and text formats. Third, Quantcast’s MapReduce, although based on Hadoop, lacked streaming capability because it had been branched off an old Hadoop version that predated streaming.
We integrated Sawzall execution with our cluster software by writing a generic MapReduce...