Premium Essay

Hadoop Jitter

Submitted By jmukher1
Words 7930
Pages 32
HadoopJitter: The Ghost in the Cloud and How to Tame It
Vivek Kale∗, Jayanta Mukherjee†, Indranil Gupta‡, William Gropp§
Department of Computer Science, University of Illinois at Urbana-Champaign, 201 North Goodwin Avenue, Urbana, IL 61801-2302, USA
Email: ∗vivek@illinois.edu, †mukherj4@illinois.edu, ‡indy@illinois.edu, §wgropp@illinois.edu

Abstract—The small performance variation within each node of a cloud computing infrastructure (i.e., a cloud) can be a fundamental impediment to the scalability of a high-performance application. This performance variation (referred to as jitter) particularly impacts the overall performance of scientific workloads running on a cloud. Studies show that the primary sources of performance variation are disk I/O and the underlying communication network [1]. In this paper, we explore opportunities to improve the performance of high-performance applications running on emerging cloud platforms. Our contributions are (1) the quantification and assessment of performance variation for data-intensive scientific workloads on a small set of homogeneous nodes running Hadoop, and (2) the development of an improved Hadoop scheduler that can improve the performance (and potentially the scalability) of these applications by leveraging the intrinsic performance variation of the system. Using our enhanced scheduler for data-intensive scientific workloads, we obtain more than a 21% performance gain over the default Hadoop scheduler.
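The enhanced scheduler itself is not reproduced in this listing, so the following plain-Java sketch is only a hypothetical illustration of the underlying idea (the class, the measurements, and the weighting heuristic are all assumptions, not the authors' implementation): nodes that currently show lower task latency, i.e. less jitter, are handed a proportionally larger share of the next batch of map tasks.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: weight task placement by observed per-node performance. */
public class JitterAwarePlacement {

    /** Observed mean task time (seconds) per node; in practice this would be
     *  measured continuously, e.g. from recent map-task completion times. */
    static Map<String, Double> meanTaskTime = new HashMap<>();

    /** Plan the next 'tasks' map tasks, favouring nodes with lower observed latency. */
    static Map<String, Integer> assign(int tasks) {
        double totalSpeed = meanTaskTime.values().stream().mapToDouble(t -> 1.0 / t).sum();
        Map<String, Integer> plan = new HashMap<>();
        int assigned = 0;
        List<String> nodes = new ArrayList<>(meanTaskTime.keySet());
        nodes.sort(Comparator.comparingDouble(n -> meanTaskTime.get(n))); // fastest first
        for (String node : nodes) {
            double share = (1.0 / meanTaskTime.get(node)) / totalSpeed;
            int n = (int) Math.round(share * tasks);
            plan.put(node, n);
            assigned += n;
        }
        // Give any rounding remainder to the currently fastest node.
        plan.merge(nodes.get(0), tasks - assigned, Integer::sum);
        return plan;
    }

    public static void main(String[] args) {
        // Example measurements: node2 is currently slower (more jitter from disk/network).
        meanTaskTime.put("node1", 10.0);
        meanTaskTime.put("node2", 14.0);
        meanTaskTime.put("node3", 10.5);
        System.out.println(assign(60)); // prints a per-node task count, biased toward faster nodes
    }
}
```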

I. INTRODUCTION

Certain high-performance applications, such as weather prediction or algorithmic trading, require the analysis and aggregation of large amounts of data geo-spatially distributed across the world in a very short amount of time (i.e. on demand). A traditional supercomputer may be neither a practical nor an economical solution because it is not suitable for handling data that is distributed across the...

Similar Documents

Free Essay

Test

...5 TIMING DIAGRAM OF 8085 5.1 INTRODUCTION A timing diagram displays the initiation of read/write and data-transfer operations under the control of the three status signals IO/M, S1, and S0. Just as a heartbeat is required for the survival of a human being, the CLK is required for the proper operation of the different sections of the microprocessor. All actions in the microprocessor are controlled by either the leading or the trailing edge of the clock. If I ask a man to bring 6 bags of wheat, each weighing 100 kg, he may make 6 trips to fetch them. A stronger man might perform the same task in only 3 trips. Thus, it depends on the strength of the man whether the job is finished quickly or slowly. Here, we can regard both the weaker and the stronger man as machines. The weaker man has taken 6 machine cycles (6 trips, carrying one bag each time) to execute the job, whereas the stronger man has taken only 3 machine cycles for the same job. Similarly, one machine may execute an instruction in as many as 3 machine cycles while another machine takes only one machine cycle to execute the same instruction. Thus, the machine that takes only one machine cycle is more efficient than the one taking 3 machine cycles. Each machine cycle is composed of many clock cycles. Since both data and instructions are stored in memory, the µP performs a fetch operation to read the instruction or data and then executes the instruction. In doing so, the µP may take several cycles to perform...
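Because each machine cycle is itself a fixed number of clock periods (T-states), an instruction's execution time follows directly from the clock frequency. The snippet below is an illustrative calculation only; the 3 MHz clock and the 4- and 3-T-state machine cycles are assumed example values rather than figures taken from the excerpt.

```java
/** Illustrative only: instruction time = T-states x clock period (values assumed). */
public class InstructionTiming {
    public static void main(String[] args) {
        double clockHz = 3_000_000.0;          // assumed 3 MHz clock
        double tStatePeriod = 1.0 / clockHz;   // one clock (T-state) period in seconds

        int fetchTStates = 4;                  // opcode-fetch machine cycle (assumed 4 T-states)
        int memReadTStates = 3;                // memory-read machine cycle (assumed 3 T-states)

        // Example: an instruction needing one opcode fetch plus one memory read.
        int total = fetchTStates + memReadTStates;
        double seconds = total * tStatePeriod;
        System.out.printf("%d T-states at %.0f Hz = %.3f microseconds%n",
                total, clockHz, seconds * 1e6);
    }
}
```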

Words: 3904 - Pages: 16

Free Essay

Rfr Testing

...APPLICATION NOTE 183 RFC 2544: HOW IT HELPS QUALIFY A CARRIER ETHERNET NETWORK Bruno Giguère, Member of Technical Staff, Transport and Datacom Business Unit Service providers worldwide are actively turning up new services based on carrier Ethernet technology in a fierce competition to attract premium subscribers. The need for quality services has never been more important, making comprehensive Ethernet testing immediately at service turn-up vital to ensuring service quality and increasing customer satisfaction. Customer service-level agreements (SLAs) dictate certain performance criteria that must be met, with the majority documenting network availability and mean-time-to-repair values which are easily verified. However, Ethernet performance criteria are more difficult to prove, and demonstrating performance availability, transmission delay, link burstability and service integrity cannot be accomplished accurately by a mere PING command alone. Portable RFC 2544 test equipment enables field technicians, installers and contractors to immediately capture test results and demonstrate that the Ethernet service meets the customer SLA. These tests can also serve as a performance baseline for future reference. What is RFC 2544? The RFC 2544 standard, established by the Internet Engineering Task Force (IETF) standards body, is the de facto methodology that outlines the tests required to measure and prove performance criteria for carrier Ethernet networks. The standard provides...
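As a concrete illustration of one RFC 2544 measurement, the frame-loss test compares the number of frames offered to the device under test with the number it forwards; RFC 2544 defines the loss rate as (offered − forwarded) × 100 / offered. The traffic counts in the sketch below are assumed example values.

```java
/** Illustrative RFC 2544 frame-loss-rate calculation (counts are assumed examples). */
public class FrameLossRate {
    static double frameLossPercent(long offered, long forwarded) {
        // RFC 2544: loss % = (input_count - output_count) * 100 / input_count
        return (offered - forwarded) * 100.0 / offered;
    }

    public static void main(String[] args) {
        long offered = 1_000_000;   // frames sent toward the device under test
        long forwarded = 999_200;   // frames counted on the far side
        System.out.printf("Frame loss rate: %.4f%%%n", frameLossPercent(offered, forwarded));
    }
}
```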

Words: 2020 - Pages: 9

Free Essay

Hadoop

...Cluster.

Hadoop Architecture – Two Components
* Distributed File System
* MapReduce Engine

HDFS Nodes
* Name Node
  * Only one per cluster
  * Manages the file system, namespace, and metadata
  * Single point of failure, but mitigated by writing its state to multiple file systems
* Data Node
  * Many per cluster
  * Manages blocks of data and serves them to nodes
  * Periodically reports to the Name Node the list of blocks it stores

MapReduce Nodes
* Job Tracker
* Task Tracker

PIG – A high-level Hadoop programming language that provides a data-flow language and an execution framework for parallel computation. Created by Yahoo. Acts like a built-in function layer for MapReduce: we write queries in Pig, and the queries get translated into MapReduce programs during execution (a minimal MapReduce job is sketched below).

HIVE – Provides ad hoc SQL-like queries for data aggregation and summarization. Written by Jeff from Facebook. A database on top of Hadoop; HiveQL is the query language and runs like SQL with fewer of SQL's features.

HBASE – A database on top of Hadoop: a real-time distributed database on top of HDFS. It is based on Google's BigTable, a distributed non-RDBMS that can store billions of rows and columns in a single table across multiple servers. Handy for writing output from MapReduce into HBase.

ZOOKEEPER – Maintains the order of all the animals in Hadoop. Created by Yahoo. Helps run distributed applications and maintain them in Hadoop.

SQOOP – Sqoops (imports) data from an RDBMS into Hadoop. Created...
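The MapReduce engine described above is easiest to see in the canonical word-count job. Below is a minimal sketch against the standard org.apache.hadoop.mapreduce API; the input/output paths come from the command line and the job name is arbitrary. A Pig or Hive query over the same data would ultimately be compiled into a job of roughly this shape.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    /** Map: emit (word, 1) for every token in a line of input. */
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    /** Reduce: sum the counts for each word. */
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /user/demo/input
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /user/demo/output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```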

Words: 276 - Pages: 2

Free Essay

Abc Ia S Aresume

...De-Identified Personal Health Care System Using Hadoop. The use of medical big data is increasingly popular in health care services and clinical research. One of the biggest challenges for health care centers is the huge amount of data that flows into their systems daily. Crunching this big data and de-identifying it with traditional data mining tools was problematic. Therefore, to de-identify personal health information, the MapReduce application uses JAR files that contain a combination of MapReduce code and Pig queries. The application also uses user-defined functions (UDFs) to protect the health care dataset. Responsibilities: Moved all personal health care data from the database to HDFS for further processing. Developed Sqoop scripts to handle the interaction between Hive and the MySQL database. Wrote MapReduce code for de-identifying data. Loaded the processed results into Hive tables. Generated test cases using MRUnit. Best Buy – Rehosting of the Web Intelligence project: The purpose of the project is to store terabytes of log information generated by the e-commerce website and extract meaningful information from it. The solution is based on the open-source big data software Hadoop. The data is stored in the Hadoop file system and processed using Pig scripts. This in turn includes getting the raw HTML data from the websites, processing the HTML to obtain product and pricing information, and extracting various reports on product pricing...
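The de-identification logic itself is not shown in the excerpt; the mapper below is a hypothetical sketch of one common approach, in which a direct identifier (here a patient name, under an assumed three-column CSV layout) is replaced with a one-way SHA-256 digest before the record is written back out.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/** Hypothetical sketch: mask a direct identifier with a SHA-256 digest.
 *  Assumes CSV records of the form patientName,dob,diagnosis (column layout is an assumption). */
public class DeIdentifyMapper extends Mapper<LongWritable, Text, NullWritable, Text> {

    private MessageDigest sha256;

    @Override
    protected void setup(Context context) {
        try {
            sha256 = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] cols = value.toString().split(",", -1);
        if (cols.length < 3) return; // skip malformed records

        // Replace the name column with a hex-encoded digest; keep the remaining fields.
        byte[] digest = sha256.digest(cols[0].trim().getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));

        cols[0] = hex.toString();
        context.write(NullWritable.get(), new Text(String.join(",", cols)));
    }
}
```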

Words: 500 - Pages: 2

Premium Essay

Cisco Case Study

...(SLAs) for internal customers using big data analytics services
● Support multiple internal users on the same platform

SOLUTION
● Implemented an enterprise Hadoop platform on Cisco UCS CPA for Big Data – a complete infrastructure solution including compute, storage, connectivity, and unified management
● Automated job scheduling and process orchestration using Cisco Tidal Enterprise Scheduler as an alternative to Oozie

RESULTS
● Analyzed service sales opportunities in one-tenth the time, at one-tenth the cost
● $40 million in incremental service bookings in the current fiscal year as a result of this initiative
● Implemented a multi-tenant enterprise platform while delivering immediate business value

LESSONS LEARNED
● Cisco UCS reduces complexity, improves agility, and radically improves cost of ownership for Hadoop-based applications
● A library of Hive and Pig user-defined functions (UDFs) increases developer productivity (see the sketch below)
● Cisco TES simplifies job scheduling and process orchestration
● Build internal Hadoop skills
● Educate internal users about opportunities to use big data analytics to improve data processing and decision making

NEXT STEPS
● Enable NoSQL database and advanced analytics capabilities on the same platform
● Adopt the platform across different business functions

Enterprise Hadoop architecture, built on Cisco UCS Common Platform Architecture (CPA) for Big Data, unlocks hidden business intelligence.

Challenge: Cisco is the worldwide...
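To illustrate the UDF lesson above, the fragment below is a minimal Hive UDF sketch; the class name and the masking rule are assumptions for the example, not part of Cisco's library.

```java
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

/** Hypothetical example UDF: keep the first and last characters of a string
 *  and mask the rest, e.g. "customer42" -> "c********2". */
public final class MaskMiddle extends UDF {
    public Text evaluate(Text input) {
        if (input == null) return null;
        String s = input.toString();
        if (s.length() <= 2) return new Text(s);
        StringBuilder sb = new StringBuilder();
        sb.append(s.charAt(0));
        for (int i = 1; i < s.length() - 1; i++) sb.append('*');
        sb.append(s.charAt(s.length() - 1));
        return new Text(sb.toString());
    }
}
```

Once built into a JAR, such a class would be registered in HiveQL with CREATE TEMPORARY FUNCTION mask_middle AS 'MaskMiddle'; and then used like any built-in function.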

Words: 3053 - Pages: 13

Free Essay

Big Analytics

...REVOLUTION ANALYTICS WHITE PAPER: Advanced 'Big Data' Analytics with R and Hadoop. 'Big Data' Analytics as a Competitive Advantage: Big Analytics delivers competitive advantage in two ways compared to the traditional analytical model. First, Big Analytics describes the efficient use of a simple model applied to volumes of data that would be too large for the traditional analytical environment. Research suggests that a simple algorithm with a large volume of data is more accurate than a sophisticated algorithm with little data. The algorithm is not the competitive advantage; the ability to apply it to huge amounts of data—without compromising performance—generates the competitive edge. Second, Big Analytics refers to the sophistication of the model itself. Increasingly, analysis algorithms are provided directly by database management system (DBMS) vendors. To pull away from the pack, companies must go well beyond what is provided and innovate by using newer, more sophisticated statistical analysis. Revolution Analytics addresses both of these opportunities in Big Analytics while supporting the following objectives for working with Big Data Analytics: 1. avoid sampling/aggregation; 2. reduce data movement and replication; 3. bring the analytics as close as possible to the data; and 4. optimize computation speed. First, Revolution Analytics delivers optimized statistical algorithms for the three primary data management paradigms being employed to address...

Words: 1996 - Pages: 8

Free Essay

Hadoop Yarn

...Apache Hadoop YARN: Yet Another Resource Negotiator. Vinod Kumar Vavilapalli (h), Mahadev Konar (h), Siddharth Seth (h), Arun C. Murthy (h), Carlo Curino (m), Chris Douglas (m), Jason Lowe (y), Owen O'Malley (h), Sharad Agarwal (i), Hitesh Shah (h), Sanjay Radia (h), Robert Evans (y), Bikas Saha (h), Thomas Graves (y), Benjamin Reed (f), Eric Baldeschwieler (h). Affiliations: h: hortonworks.com, m: microsoft.com, i: inmobi.com, y: yahoo-inc.com, f: facebook.com. Abstract: The initial design of Apache Hadoop [1] was tightly focused on running massive MapReduce jobs to process a web crawl. For increasingly diverse companies, Hadoop has become the data and computational agorá—the de facto place where data and computational resources are shared and accessed. This broad adoption and ubiquitous usage has stretched the initial design well beyond its intended target, exposing two key shortcomings: 1) tight coupling of a specific programming model with the resource management infrastructure, forcing developers to abuse the MapReduce programming model, and 2) centralized handling of jobs' control flow, which resulted in endless scalability concerns for the scheduler. In this paper, we summarize the design, development, and current state of deployment of the next generation of Hadoop's compute platform: YARN. The new architecture we introduced decouples the programming model from the resource management infrastructure, and delegates many scheduling functions (e.g., task fault-tolerance) to per-application components. We provide experimental...

Words: 12006 - Pages: 49

Premium Essay

Big Data

...Big Data is Scaling BI and Analytics How the information surge is changing the way organizations use business intelligence and analytics Information Management Magazine, Sept/Oct 2011 Shawn Rogers The explosive growth in the amount of data created in the world continues to accelerate and surprise us in terms of sheer volume, though experts could see the signposts along the way. Gordon Moore, co-founder of Intel and the namesake of Moore's law, first forecast that the number of transistors that could be placed on an integrated circuit would double year over year. Since 1965, this "doubling principle" has been applied to many areas of computing and has more often than not been proven correct. When applied to data, not even Moore's law seems to keep pace with the exponential growth of the past several years. Recent IDC research on digital data indicates that in 2010, the amount of digital information in the world reached beyond a zettabyte in size. That's one trillion gigabytes of information. To put that in perspective, a blogger at Cisco Systems noted that a zettabyte is roughly the size of 125 billion 8GB iPods fully loaded. As the overall digital universe has expanded, so has the world of enterprise data. The good news for data management professionals is that our working data won't reach zettabyte scale for some...

Words: 2481 - Pages: 10

Premium Essay

Big Data

...Big Data: Big Data and Business Strategy. Businesses have come a long way in how information is delivered to management, from comparing quarterly sales all the way down to viewing how customers interact with the business. With so many new technologies and systems emerging, it has become faster and easier to get any type of information, instead of relying on, for example, a sales processing system that might not capture all the information a manager needs. This is where big data comes into play and how it interacts with businesses. We can begin by explaining what big data is and how it is used. Big data is a term used to describe the exponential growth and availability of data, both unstructured and structured. Back in 2001, Doug Laney (Gartner) gave a definition that ties in more closely with how big data is managed as part of a business strategy, expressed as velocity, volume, and variety. Velocity describes how big data is constantly and rapidly changing over time and how fast companies are able to keep up with it in real time, which is sometimes a challenge for most companies. Volume is also increasing at a high rate, especially with the amount of unstructured data streaming from social media such as Facebook, along with the amount of data being collected from customer information. The final one is variety, which some companies also struggle with in handling the many varieties of structured and unstructured data...

Words: 1883 - Pages: 8

Premium Essay

Integration of Technology

...from good decisions and identify new opportunities to gain a competitive advantage. Hadoop is open-source software designed to provide massive storage and large-scale data processing power. Hadoop can handle many tasks running at the same time, and it has a storage part and a processing part. It works by dividing files into large blocks and distributing them among the nodes (Kozielski & Wrembel, 2014). For processing, it works with MapReduce so that code is shipped to the nodes and the blocks are processed in parallel. By using many nodes, Hadoop makes data manipulation faster and more efficient. It has four main components: Hadoop Common, which contains the required utilities; the Hadoop Distributed File System (HDFS), which is the storage part; Hadoop YARN, which manages compute resources; and Hadoop MapReduce, which is the programming model responsible for processing large-scale data. It can process large amounts of data quickly by using multiple computers (Kozielski & Wrembel, 2014). Hadoop is being turned into a data-processing operating system by large organizations because it allows numerous data manipulations and analytical processes. Other data analysis tools, such as SQL engines, run on Hadoop and perform well on this system. Hadoop's ability to run many programs lowers the cost of data analysis and allows businesses to analyze different amounts of data on products and consumers. Hadoop not only provides an organization with more data to work...
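To make the storage component concrete, the short sketch below uses the HDFS Java client (org.apache.hadoop.fs.FileSystem) to write a small file and read it back; the path and contents are placeholder values, and the cluster configuration is assumed to be available on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Minimal sketch: write a file to HDFS and read it back (path is a placeholder). */
public class HdfsRoundTrip {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/demo/sample.txt");

        // HDFS splits the file into blocks and replicates each block across data nodes.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeUTF("hello from the HDFS client");
        }
        try (FSDataInputStream in = fs.open(path)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}
```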

Words: 948 - Pages: 4

Free Essay

Literature Review

...Big Data and Hadoop. Harshawardhan S. Bhosale (1), Prof. Devendra P. Gadekar (2). (1) Department of Computer Engineering, JSPM's Imperial College of Engineering & Research, Wagholi, Pune, Bhosale.harshawardhan186@gmail.com. (2) Department of Computer Engineering, JSPM's Imperial College of Engineering & Research, Wagholi, Pune, devendraagadekar84@gmail.com. Abstract: The term 'Big Data' describes innovative techniques and technologies to capture, store, distribute, manage, and analyze petabyte- or larger-sized datasets with high velocity and varied structures. Big data can be structured, unstructured, or semi-structured, which makes conventional data management methods inadequate. Data is generated from many different sources and can arrive in the system at various rates. In order to process these large amounts of data in an inexpensive and efficient way, parallelism is used. Big data is data whose scale, diversity, and complexity require new architectures, techniques, algorithms, and analytics to manage it and to extract value and hidden knowledge from it. Hadoop is the core platform for structuring big data, and it solves the problem of making that data useful for analytics. Hadoop is an open-source software project that enables the distributed processing of large data sets across clusters of commodity servers. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. Keywords: Big Data, Hadoop, MapReduce...

Words: 5034 - Pages: 21

Free Essay

Donald Miner Serves as a Solutions Architect at EMC Greenplum,

...MapReduce Design Patterns by Donald Miner and Adam Shook. Copyright © 2013 Donald Miner and Adam Shook. All rights reserved. Printed in the United States of America. Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472. O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://my.safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com. Editors: Andy Oram and Mike Hendrickson. Production Editor: Christopher Hearse. Proofreader: Dawn Carelli. Cover Designer: Randy Comer. Interior Designer: David Futato. Illustrator: Rebecca Demarest. December 2012: First Edition. Revision History for the First Edition: 2012-11-20 First release. See http://oreilly.com/catalog/errata.csp?isbn=9781449327170 for release details. Nutshell Handbook, the Nutshell Handbook logo, and the O'Reilly logo are registered trademarks of O'Reilly Media, Inc. MapReduce Design Patterns, the image of Père David's deer, and related trade dress are trademarks of O'Reilly Media, Inc. Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O'Reilly Media, Inc., was aware of a trademark claim, the...

Words: 63341 - Pages: 254

Free Essay

Big Data

...Literature
◦ What is Big Data?
◦ Why Big Data?
◦ When is Big Data really a problem?

'Big Data' is similar to 'small data', but bigger...
...but having bigger data consequently requires different approaches:
◦ techniques, tools & architectures
...to solve:
◦ new problems...
◦ ...and old problems in a better way
(From "Understanding Big Data" by IBM)

Key enablers for the growth of "Big Data" are:
◦ Increase of storage capacities
◦ Increase of processing power
◦ Availability of data

NoSQL databases: MongoDB, CouchDB, Cassandra, Redis, BigTable, HBase, Hypertable, Voldemort, Riak, ZooKeeper
MapReduce: Hadoop, Hive, Pig, Cascading, Cascalog, mrjob, Caffeine, S4, MapR, Acunu, Flume, Kafka, Azkaban, Oozie, Greenplum
Storage: S3, Hadoop Distributed File System
Servers: EC2, Google App Engine, Elastic Beanstalk, Heroku
Processing: R, Yahoo! Pipes, Mechanical Turk, Solr/Lucene, ElasticSearch, Datameer, BigSheets, Tinkerpop

...when the operations on data are complex:
◦ e.g. simple counting is not a complex problem
◦ modeling and reasoning with data of different kinds can get extremely complex

Good news about big data:
◦ Often, because of the vast amount of data, modeling techniques can get simpler (e.g. smart counting can replace complex model-based analytics)...
◦ ...as long as we deal with the scale

Research areas (such as IR, KDD, ML, NLP, SemWeb, ...) are subcubes within the data cube ...

Words: 754 - Pages: 4

Premium Essay

Cloudera

...Ask Bigger Questions. Cloudera develops open-source software for a world dependent on Big Data. With Cloudera, businesses and other organizations can now interact with the world's largest data sets at the speed of thought — and ask bigger questions in the pursuit of discovering something incredible. Cloudera Enterprise Core provides a single Hadoop storage and management platform that natively combines storage, processing and exploration. Equally important, it takes Hadoop beyond batch processing by providing the foundation for real-time operation, with upgrades to support and manage Apache HBase (Cloudera Enterprise RTD) and open-source Impala technology (Cloudera Enterprise RTQ), so your team can work at the speed of thought. Cloudera Enterprise Core is the most comprehensive solution for Hadoop in the enterprise and includes everything you need to operate Hadoop effectively — and get on the fastest path to repeatable success. Cloudera Enterprise Core: Cloudera Enterprise Core is the leading solution for Apache Hadoop in the enterprise. It includes not only CDH, our 100% open-source, enterprise-ready distribution of Hadoop and related projects, but also a subscription to Cloudera Manager, and technical support for the core components of CDH (all components except HBase and Cloudera Impala). Using...

Words: 289 - Pages: 2

Free Essay

Hadoop Distribution Comparison

...Hadoop Distribution Comparison. Tiange Chen. The three Hadoop distributions discussed here are Apache Hadoop, MapR, and Cloudera. All of them share the same goals of performance, scalability, reliability, and availability. Furthermore, all of them offer the same core advantages: massive storage; great computing power; flexibility (store and process data whenever you want, instead of preprocessing it before storage as traditional relational databases require, and easily access new data sources such as social media and email conversations); fault tolerance (if one node fails, jobs still run on the other nodes because data is replicated across nodes from the start, so the computation does not fail); low cost (commodity hardware is used to store data); and scalability (more nodes mean more storage, with little added administration). Apache Hadoop is the standard Hadoop distribution. It is an open-source project, created and maintained by developers from all around the world. Public access allows many people to test it, and problems are noticed and fixed quickly, so its quality is reliable (Moccio, Grim, 2012). The core components are the Hadoop Distributed File System (HDFS) as the storage part and MapReduce as the processing part. HDFS has a simple and robust coherency model. It is able to store large amounts of information and provides streaming read performance. However, it is not as strong in the areas of easy management and seamless integration...
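The fault tolerance described above comes from block replication, controlled by the dfs.replication property. The fragment below is a small illustration of setting the replication factor for files written by one client and then adjusting it per file; the factor of 3 is simply HDFS's customary default, and the path is a placeholder.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Illustration: control how many copies of each block HDFS keeps (path is a placeholder). */
public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");   // default number of copies per block for this client's files

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/demo/replicated.txt");
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeUTF("each block of this file is stored on several data nodes");
        }
        // Replication can also be changed per file after the fact, e.g. lowered to 2:
        fs.setReplication(path, (short) 2);
        fs.close();
    }
}
```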

Words: 540 - Pages: 3