Language Wrangling: Running Google’s Sawzall on Quantcast’s MapReduce Cluster

Submitted By rl10

Motivation

It’s April 2011. Quantcast is running one of the largest MapReduce clusters out there. Engineers write Java code that gets executed across the whole cluster efficiently on terabytes of data. Wonderful!

Unfortunately, not everyone who needs access to data is an engineer. Also, even engineers don’t feel like writing a new MapReduce job every time they want to take a different look at a data set.

We realized we needed a more productive data analysis tool. Our first thought was to get SQL running on our petabytes of data, but that looked like a large undertaking: SQL’s data access semantics are fairly sophisticated, which implies a large friction area with Quantcast’s MapReduce implementation. A simpler option was to get Google’s recently open-sourced Sawzall to run on Quantcast’s MapReduce cluster. Although Sawzall as a language is not as easy to use as SQL, especially for non-engineers, it is still much simpler than Java. It also seemed reasonably easy to integrate with Quantcast’s MapReduce implementation, as its interface to MapReduce is much narrower and better defined than SQL’s.
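To make the “narrow interface” point concrete, here is a minimal sketch in Python (not Sawzall, and not Quantcast’s code). It assumes the usual Sawzall model: a program sees one record at a time and can only emit values into named aggregator tables, which maps naturally onto a map phase and a reduce phase. The record layout, the table name `hits_by_country`, and all function names are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def per_record(record, emit):
    """Hypothetical stand-in for a Sawzall-style program.

    It receives exactly one record and may only call emit(table, key, value).
    The tab-separated layout (user id, country) is an assumption.
    """
    fields = record.split("\t")
    emit("hits_by_country", fields[1], 1)


def run_map(records):
    """Map phase: run the per-record program and collect what it emits."""
    emitted = []
    for record in records:
        per_record(record, lambda table, key, value: emitted.append((table, key, value)))
    return emitted


def run_reduce(emitted):
    """Reduce phase: a 'sum' aggregator keyed by (table, key)."""
    tables = defaultdict(lambda: defaultdict(int))
    for table, key, value in emitted:
        tables[table][key] += value
    return {table: dict(rows) for table, rows in tables.items()}


records = ["u1\tUS", "u2\tDE", "u3\tUS"]
result = run_reduce(run_map(records))
print(result)
```

The only contract between the per-record program and the surrounding framework in this sketch is the `emit` call, which is the sense in which such an interface is narrower than the full data access semantics of SQL.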

Challenge

This theoretical “ease of integration with MapReduce” turned out (surprise!) to be harder in practice than expected. First, Sawzall was not open-sourced with a MapReduce harness, only as a compilation/execution engine plus a command-line tool of little practical utility. Second, Sawzall runs best on protocol buffers, but Quantcast stores its data in different binary and text formats. Third, Quantcast’s MapReduce, although based on Hadoop, lacked streaming capability because it was branched off an old Hadoop version that predates streaming.

Big Picture


We integrated Sawzall execution with our cluster software by writing a generic MapReduce…
