A Survey of Checkpointing Strategies for Shared-Memory HPC Applications
Ana Gainaru, Aparna Sasidharan, Jorge Mario Cara Carmona
University of Illinois, Urbana-Champaign
April 25, 2012

1 Introduction

Fault tolerant protocols have always been a major research topic for the HPC community. Harnessing microprocessors to solve large computational problems has required the use of many microprocessors in a single system. Whereas today the large server machines in the business sector may have as many as 32 processors, large supercomputers can have thousands or tens of thousands of processors in a single machine. While this approach has proven highly effective in expanding the limits of computational capability, it has also brought to the foreground new challenges that did not arise in smaller systems. Fault tolerance is one such critical challenge. The problem of fault tolerance in modern systems arises from two important HPC trends. First is the rising frequency of faults in systems. Second is the increasing size and running time of the applications that run on these systems, which makes them more vulnerable to these faults. HPC systems are vulnerable to faults for three major reasons. First, whereas older machines were built from custom-made, high-quality components, modern systems use commodity components that were designed and built for a less reliability-aware market. Second, as modern systems are made from more and more components, the probability of one of them failing becomes quite large, even if the individual components are reliable. Finally, as circuit feature sizes become smaller, circuits become increasingly vulnerable to soft, hard and intermittent errors, caused by ambient radiation, temperature fluctuation and other everyday phenomena. The end result is that the largest HPC systems today, such as the ASC systems, have mean times between faults on the order of hours to days. At the same time, many applications being run on these systems have running times on the order of days, weeks and months. At exascale, some predictions are

that the mean time between failures will range between a few hours and one day. Designing a suitable fault tolerance protocol for HPC applications is a challenge because it should be energy efficient without affecting the overall performance of the application. Replication is too costly, since it duplicates the workload, and can therefore be ruled out as an option. Any fault tolerant protocol we design for HPC applications should therefore be based on process checkpointing. While it is possible to design applications to be insensitive to system faults, this requires a significant amount of programmer effort and can be extremely difficult for general algorithms. A more general and simpler solution is rollback restart, a technique in which the state of the application is periodically saved to stable storage. In the event of a system fault, the application is aborted and restarted on functioning hardware from the most recently saved state. While less efficient than application-specific approaches, rollback restart has become popular because its simplicity leads to a lower time-to-solution than competing fault tolerance techniques.

2 Fault-tolerance techniques for Message Passing Applications

A message passing system consists of a fixed number of processes that communicate only through messages. Processes cooperate to execute a distributed application program and interact with the outside world by receiving and sending input and output messages respectively. A global state of a message passing system is a collection of the individual states of all participating processes and of the states of the communication channels. A consistent global state is one that can occur during a correct execution of the application without incurring failures. For example, if a process's state reflects a message receipt, then the state of the corresponding sender reflects sending the message. A rollback recovery protocol for message passing systems must bring the system to a consistent state in the case of a failure. Checkpoint based rollback-recovery techniques can be classified as follows:

• Uncoordinated Checkpointing - Each process takes a checkpoint without coordinating with the other processes, at whatever time is most convenient for it. This may lead to the domino effect [1], in which case all the processes may roll back to the beginning of the computation. A process may also record checkpoints that do not store useful information and will later need to be removed by garbage collection.

• Coordinated Checkpointing - This technique does not lead to the domino

effect and reduces storage overhead, since only one checkpoint needs to be saved. There is no need for garbage collection to remove obsolete checkpoints. The disadvantage is the latency involved in taking a checkpoint, and therefore in committing the output of the execution. Since the checkpoint is global, a barrier needs to be executed before taking a checkpoint, and this may block communication. There are techniques to improve this, for example non-blocking checkpoint coordination [2] and the use of synchronised checkpoint clocks [3].

• Communication induced Checkpointing - In this method, a process can take checkpoints independently (local checkpoints), while it may also have to take certain checkpoints to ensure the progress of the recovery line (forced checkpoints).

[TODO:Add something on message logging protocols] [TODO:Add diagrams]

Coordinated checkpointing is the most popular protocol for HPC right now because recovery is simple and garbage collection is efficient (only the last checkpoint is needed). However, it has severe drawbacks, since all processes have to roll back to their previous checkpoint. Moreover, all processes writing their checkpoints at the same time creates burst accesses to the I/O system that may slow down execution. Uncoordinated checkpointing, on the other hand, does not require any synchronisation between the processes at checkpoint time, and thus avoids the problem caused by burst accesses to the I/O system. One of the grounds on which existing fault tolerant protocols differ is the amount of determinism they assume in applications. Both uncoordinated (without message logging) and coordinated checkpointing work on the assumption that all processes are non-deterministic. On the occurrence of a fault, these protocols have to restart all processes, because the execution after a restart may be entirely different compared to that before the fault. Message logging protocols are a slight improvement in this regard because they assume some amount of determinism. They assume that the code is piecewise deterministic, i.e. the state of a process is influenced only by deterministic local actions and message receptions. However, message logging protocols again have high overheads, making them unsuitable for HPC applications. In short, if the communication in HPC applications were deterministic, we could reduce the overheads of fault tolerant protocols. Uncoordinated checkpointing can be combined with message logging to avoid the domino effect and limit the number of processes that roll back in the event of a failure, if we assume that the application is piecewise deterministic. A new property called send determinism has been shown to be common to many MPI

HPC applications. In a send-deterministic application, given a set of input parameters, the sequence of message emissions of any process is the same in any correct execution [4].
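As a concrete illustration of the blocking coordinated scheme described above, the following minimal C/MPI sketch brackets the state save with barriers. The serialize_state function and the file naming are hypothetical placeholders, and a production protocol would also have to drain in-flight messages rather than rely on barriers alone.

    #include <stdio.h>
    #include <mpi.h>

    /* Hypothetical application hook: writes this process's state to f. */
    void serialize_state(FILE *f);

    /* Blocking coordinated checkpoint: no process resumes computation
     * until every process has finished writing its checkpoint.
     * Assumes the application has no outstanding nonblocking operations. */
    void coordinated_checkpoint(void)
    {
        int rank;
        char path[64];

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        snprintf(path, sizeof path, "ckpt_rank%d.dat", rank);

        MPI_Barrier(MPI_COMM_WORLD);   /* enter the checkpoint together */
        FILE *f = fopen(path, "wb");
        serialize_state(f);
        fclose(f);
        MPI_Barrier(MPI_COMM_WORLD);   /* commit: all writes are complete */
    }

Note that all ranks writing their files at the same time is exactly the burst-I/O drawback of coordinated checkpointing mentioned above.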

3 Fault-tolerance techniques for Shared Memory

In the shared memory programming model, the tasks of an application have direct access to each other's memories, allowing them to interact without explicitly sending and receiving messages. However, since multiple threads simultaneously access the shared address space, additional primitives are needed for thread coordination, such as locks, semaphores, critical sections, critical regions and monitors. Depending on the shared memory API chosen, all or only a subset of these primitives may be available. Shared memory APIs also provide additional functionality, such as the data distribution primitives in High Performance Fortran [5] and the worksharing constructs in OpenMP [6] and UPC [7]. OpenMP, UPC, Co-Array Fortran [8] and Global Arrays [9] are some of the shared memory APIs commonly used in high performance computing. Fault-tolerance techniques for shared memory applications differ from those for message passing ones because of the underlying memory model. Each thread may have a private memory that only it can access, in addition to the common shared address space. Therefore it is important to understand the memory model used by an API.
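As a minimal example of why such coordination primitives are needed, the sketch below protects a shared counter with an OpenMP critical section; without it, the concurrent increments would race and the final value would be unpredictable.

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        long counter = 0;                /* shared by all threads */

        #pragma omp parallel num_threads(4)
        for (int i = 0; i < 100000; i++) {
            #pragma omp critical         /* mutual exclusion on the update */
            counter++;
        }

        printf("%ld\n", counter);        /* 400000, thanks to the critical */
        return 0;
    }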

3.1 Memory Models for Shared Memory

A memory model is a description of how read and write operations executed by the same and different threads behave relative to each other. The sequential memory model is as follows: a read of a given variable must return the value written by the most recent write to that variable. The simplest shared memory model is sequential consistency [10]. The intuition behind sequential consistency is that the execution of the application must correspond to some interleaving of the operations of the different threads, as if they executed on a single processor accessing a single memory. In other words, the results of all reads performed by the application's threads must correspond to some total order on all read and write operations, such that a given read of a variable returns the value of the write to this variable that was most recent according to the total order. However, enforcing sequential consistency on a multi-processor system requires a significant amount of stalling and additional communication and prevents the hardware and the compiler

from employing optimizations such as instruction reordering. This results in reduced performance for systems that guarantee sequential consistency. Since sequential consistency is expensive, a number of weaker memory models have been developed, such as processor consistency [11], weak ordering [12], release consistency [13], total store ordering [14], the Digital Equipment Alpha memory model [15] and the PowerPC memory model [16]. These models are designed to allow for additional hardware and compiler optimizations, such as instruction reordering, the use of non-FIFO networks and having multiple processors simultaneously write to the same variable in their private caches. Since we are dealing with OpenMP, we discuss its memory model in some detail. The OpenMP memory model is most closely related to weak ordering. It separates memory accesses into data and synchronization accesses. No ordering is provided for data accesses: the shared memory implementation is allowed to reorder operations on different variables. Data writes are guaranteed to be atomic: no thread may see the partial result of a data write. Synchronization accesses are provided with much stronger guarantees. First, any interactions between synchronization accesses are guaranteed to be sequentially consistent. Second, all data accesses that precede a given synchronization access in the program's source code must complete before this access may begin. Similarly, data accesses that follow a given synchronization access in the program's source code may not begin until this access has completed. The intuition behind weak ordering is that most of each thread's memory accesses will be to variables that no other thread is touching at the same time. In this case it can use the highly optimized data accesses, which in the absence of multithreaded data races behave as if they were sequentially consistent. However, threads occasionally need to synchronize their executions in order to ensure that data accesses do not race with each other. Synchronization accesses can be used for this task. Although their guarantee of sequential consistency makes synchronization accesses slower, they do not occur frequently in an OpenMP program. We have analysed the NAS and SPEC OpenMP benchmarks for the number of synchronisation points in them. The results we obtained are tabulated below. [TODO:include table]
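The distinction between data and synchronization accesses can be seen in the classic producer/consumer idiom below, written with explicit flushes in the style of the OpenMP specification's examples: data itself is touched only with ordinary (reorderable) data accesses, while the flag together with the flush directives supplies the ordering.

    #include <stdio.h>
    #include <omp.h>

    int data = 0;       /* ordinary data access */
    int ready = 0;      /* flag used for synchronization */

    int main(void)
    {
        #pragma omp parallel num_threads(2)
        {
            if (omp_get_thread_num() == 0) {
                data = 42;                   /* data write */
                #pragma omp flush(data)      /* complete it before raising flag */
                ready = 1;
                #pragma omp flush(ready)
            } else {
                int seen = 0;
                while (!seen) {
                    #pragma omp flush(ready) /* re-read the flag each time */
                    seen = ready;
                }
                #pragma omp flush(data)      /* pick up the producer's write */
                printf("%d\n", data);        /* prints 42 */
            }
        }
        return 0;
    }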

4 Protocols

There has been a variety of work on developing rollback restart solutions for shared memory. The space of possible solutions is strongly influenced by the way shared memory is commonly implemented. Applications are written


to work with some shared memory API and memory model, where one API may support one or more memory models while each memory model may be supported by one or more APIs (while a memory model defines the behavior of reads and writes, a full API provides additional functionality such as synchronization functions and data parallel primitives). Each API is implemented by one or more shared memory implementations in hardware, software or a combination of both. Each implementation may use one or more consistency protocols, and a given protocol may be used by one or more implementations. Shared memory implementations run on one or more network fabrics that at a low level provide a message passing model of communication, which the implementation must use to provide the application with the abstraction of shared memory. Since all shared memory implementations are ultimately based on message passing, it is theoretically possible to use the techniques described in the previous sections for message passing applications to provide rollback restart for any shared memory implementation. However, in practice this can be unnecessarily expensive, since shared memory implementations send many small messages, many of which are not relevant for restart. Moreover, they perform many non-deterministic actions (every read may be a non-deterministic event), further complicating this approach and reducing its efficiency. As such, typical solutions must tailor themselves to the details of a particular implementation, memory model or consistency protocol. The fact that shared memory protocols have the same types of interprocessor interactions as their message passing counterparts suggests that we can organize them into a similar taxonomy. This presentation divides shared memory rollback restart protocols into the following types: coordinated, uncoordinated, quasi-synchronous and message logging. Protocols in each of these families closely resemble their message passing counterparts but focus more tightly on the internal details of their target protocols.

1. Coordinated Checkpointing - SafetyNet [17] is a framework for adding checkpointing to hardware multiprocessors for the purpose of surviving transient faults. It focuses on the sequential consistency memory model and augments existing protocols with checkpointing functionality. SafetyNet relies on the availability of some sort of logical time being maintained by the system's processors, such that no message can arrive at an earlier logical time than the time when it was sent. All processors checkpoint themselves at the same logical time, once every 5,000-1,000,000 processor cycles. A processor's registers are checkpointed directly to a nearby checkpoint buffer while its cache and

memory are saved lazily in a copy-on-write fashion. In order to make sure that the ownership of each memory block is known at the time of a checkpoint and is not in some intermediate state, a given checkpoint is not committed until all processors have determined that all their memory block ownership requests issued before the checkpoint have been completed. SafetyNet shows very low overheads because of its relatively rare (on a processor time scale) checkpoints and the fact that it logs only 0.1% of all cache accesses. However, it is also quite limited in the severity of the faults that it can deal with. In particular, it cannot roll back from processor failures or cache corruptions due to soft errors. Revive [18] is another example of a hardware coordinated checkpointing protocol, except that it uses blocking coordination instead of the non-blocking protocol used by SafetyNet. Revive targets off-the-shelf processors with caches and modifies the networking components of common hardware shared memory implementations, including network cards, directory controllers and memory controllers. To record a checkpoint, all processors synchronize using a 2-phase commit protocol and flush their pipelines and caches, causing the main memories to contain a consistent state of the shared address space. Since main memories are still vulnerable to failure, their state is encoded using a single-bit parity error correcting code, with extra memory modules dedicated to holding the parity bits of the other modules. The memory controllers are modified to ensure that the parity module is always up to date with any changes to the other memory modules. Whenever a memory location is written to for the first time after a checkpoint, its original value is copied out into a special log in a copy-on-write fashion before the write is allowed to proceed. This ensures that the state of the memory at the time of the checkpoint is not lost. On restart, the state of the shared address space is recreated from the current memory state and the overwritten values stored in the log. The result is that Revive has a time overhead of 6% and a memory overhead of 14% while taking 10 checkpoints per second. Coordinated checkpointing is extended to software shared memory by [19], which presents a protocol closely related to sync-and-stop, where processors checkpoint themselves inside application barriers. Because at barrier-time all shared memory state is consistent and there is no in-flight communication, it is possible to simply save the state of all pages and all directories in a checkpoint and directly restore them on restart. Moreover, the fact that the application's own barriers are used

for synchronization means that no additional coordination cost is incurred. [20] presents a software-based checkpointer for SMP systems. In particular, their approach works at the user level, above PThreads. As such, they do not have direct control over the implementation of the API and, in particular, over the operating system's kernel threads. Their checkpointing protocol is initiated by a single master thread that sends a signal to all other threads. Since PThreads does not define what happens when a thread that is blocked on a mutex receives a signal, it is possible for such blocked threads to never be informed that a checkpoint has begun. As such, the master thread, along with any threads that have begun their checkpoint, releases all its mutexes so that blocked threads may acquire them and be informed that a checkpoint has begun. In order to ensure that threads that have been released in this fashion do not continue executing application code, potentially corrupting application state, the master thread sets a prevent-locks flag to inform forcefully released threads that they should not continue executing. When all threads have been informed of the checkpoint, they block and the master thread saves the state of the process to disk. It then informs all the other threads, which reacquire their released mutexes and resume executing application code, which may involve re-blocking on mutexes that they were forcefully released from during the checkpoint. This solution has the advantage of high portability, since it works above the PThreads implementation rather than inside of it. The disadvantage is that at checkpoint time it saves all application state without differentiating whether it belongs to the application or the PThreads implementation. This makes restart difficult, since the PThreads implementation may have pre-restart information that does not remain valid on restart. BLCR [21] takes a different approach toward checkpointing PThreads applications by embedding this functionality inside the Linux kernel. While this approach sacrifices portability, kernel-level solutions have much more power over the PThreads implementation than their user-level counterparts. In BLCR's checkpointing protocol the master thread sends a signal to each application thread. When this signal is delivered to a thread, the signal handler makes a system call, informs the master thread and blocks the thread at a barrier. When the master thread has been informed that all threads are waiting for it, it releases them from the barrier and all threads participate in saving the application's state. A checkpoint is terminated via another barrier, after which point all threads return from the system call and resume

executing application code. The relative simplicity of working at the kernel level has helped to make BLCR more functional but has limited its portability.

2. Uncoordinated Checkpointing - In uncoordinated checkpointing, individual threads are allowed to independently checkpoint their own state without any additional synchronization or communication between them. While this minimizes the regular-execution time overhead of checkpointing, on restart it can cause threads to roll back far into the past via the domino effect, causing both a time overhead and a space overhead for additional checkpoint storage. The extent of the domino effect depends directly on the frequency of communication in a given application. [TODO:Add details about rebound here]

3. Quasi-Synchronous Checkpointing - The approach in [22] focuses on providing fault tolerance for distributed memory implementations of shared memory. The primary feature of such implementations is that each processor has its own memory (not just a cache) that is not accessible by other processors. In particular, it extends invalidation-based protocols to support checkpointing. This protocol uses the disk to hold the contents of the entire application address space, with processors holding copies of some of these pages in their memories. Multiple processors may have a read-only copy of a page in their memory, but if a processor wishes to write to this page it must first invalidate all the read-only copies by sending messages to their host processors. If later on some processor wants to read from or write to this page, it must ask for the modified copy to be sent over and, if it is trying to read it, change the page's status to read-only. In this protocol, checkpointing is achieved in the following manner:

• A processor saves its state to disk.

• It flushes all the dirty pages it has modified since the last checkpoint to disk.

On restart, all of the memory pages of a processor are initially invalid. It can then reacquire any pages it needs by loading them from other processors or the disk whenever the application tries to read from or write to them, via the same shared memory mechanisms that are used for normal read and write requests. One complication is that allowing processors to checkpoint independently can cause cascading rollbacks. These rollbacks may force the system to roll back to the start

of the application's execution. In order to avoid this, they use a quasi-synchronous method where a process is forced to take a checkpoint when another process asks it for a page that it has modified since its last checkpoint.

4. Message Logging - The protocol presented in [23] is based on Home-based Lazy Release Consistency (HLRC) and works as follows. Processors checkpoint themselves independently. Each processor's checkpoint contains:

• the processor's internal state, including any non-shared memory

• the data of any pages that it is home to (every page is assigned to a home processor, which maintains the most recent copy of the page).

Since these checkpoints are uncoordinated, they are vulnerable to cascading rollbacks. This protocol deals with this via sender-based message logging. A message logging protocol must do three things:

• record the outcomes of all non-deterministic events

• record the data of any communication from processors that did not roll back to the one that did (so that it can be re-sent on restart)

• suppress any communication from the rolled back processor to the others until it has finished restarting.

Non-determinism appears in HLRC in the behavior of locks: after a given processor has released a lock, the choice of the processor that will acquire it next is non-deterministic. As such, whenever a processor acquires a lock, both it and the lock's releaser record this event in logs that they keep in their volatile memories. The only forms of communication in HLRC are processors sending page modifications during release operations and home nodes sending page copies to nodes that are performing acquire operations. Both of these communications are logged at their senders. Writer processors log the modifications that they send to home nodes, and home nodes log old versions of their pages. When some processor rolls back, it reloads its state and the states of all of its pages from its checkpoint. It then receives from other processors the details of when it acquired its locks, the data of any pages that it received from other processors in the course of its execution and the updates sent to it by other processors when they

modified the pages that it is home to. It uses this information to locally recompute its pre-rollback state, including the state of the pages that it was home to, without doing any additional communication. This method has overheads caused by the cost of checkpointing and the cost of logging. The protocol tries to minimise the overheads by removing outdated log entries and checkpoints via garbage collection. Applications with considerable synchronisation and load imbalance can still incur significant overheads under this protocol.

5. Some additional protocols - [24] discusses an extension of the Cache-Aided Rollback Error Recovery (CARER) technique. CARER maintains exactly one checkpoint at all times. Processor state is saved by using a set of backup registers into which the values of the regular registers are copied. Memory is checkpointed by flushing all dirty cache lines to main memory. After this has been done, main memory corresponds to its state at checkpoint time, and any writes performed by the processor are kept in dirty cache lines, which are not flushed to main memory until the next checkpoint. In the shared memory context they propose three checkpointing protocols for coherence protocols that work in the following scenario:

• processors have their own caches

• processors are connected by a bus

• they are connected to a single common main memory.

The first protocol has all processors checkpoint at the same time, and when one processor rolls back, so does every other (i.e. this is a coordinated protocol). The second protocol tracks the read/write interactions between processors; when a processor takes a checkpoint, it only forces other processors to checkpoint if it has participated in interactions since its last checkpoint, and only processors that have interacted with others need to checkpoint (i.e. it is a quasi-synchronous protocol). The same logic is used to decide which processors will roll back. The third protocol forces processors to checkpoint whenever they participate in interactions with other processors (again, a quasi-synchronous protocol). The advantage is that when a single processor wants to restart, no other processor is forced to restart as well. Transactional memory is a variant of shared memory where certain pieces of code are identified as transactions. Code in these transactions can access a shared address space, and the underlying runtime system

is responsible for guaranteeing the illusion that these transactions are executed in some serial order. Although this is trivial to guarantee on sequential systems, on parallel systems it is typically done by optimistically running multiple transactions at the same time and dynamically detecting whether one transaction writes to a variable that is accessed by a concurrently executing transaction. If a conflict is detected where two transactions cannot be sequentially ordered relative to each other, one is rolled back and retried. This method of rollback-restart can be used for fault tolerance in the transactional memory model [25,105].
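Several of the protocols above (Revive, the dirty-page flushing in [22], CARER) log or flush the first post-checkpoint write to each memory unit. A user-level software analogue of this copy-on-write logging can be sketched with mprotect and a SIGSEGV handler. This is our own illustration of the idea, not the mechanism of any cited system; a real implementation would validate the faulting address and restrict the handler to async-signal-safe operations.

    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define NPAGES 256
    static char *region;    /* checkpointed memory */
    static char *shadow;    /* log of pre-checkpoint page contents */
    static long pagesz;

    /* The first write to a protected page faults: log the page's old
     * contents, then re-enable writes so the faulting store can retry. */
    static void on_fault(int sig, siginfo_t *si, void *ctx)
    {
        char *page = (char *)((uintptr_t)si->si_addr & ~(uintptr_t)(pagesz - 1));
        memcpy(shadow + (page - region), page, pagesz);  /* copy-on-write log */
        mprotect(page, pagesz, PROT_READ | PROT_WRITE);
    }

    /* At each checkpoint: push the logged pages to stable storage
     * (omitted here), then write-protect everything again. */
    static void checkpoint_commit(void)
    {
        mprotect(region, NPAGES * pagesz, PROT_READ);
    }

    int main(void)
    {
        pagesz = sysconf(_SC_PAGESIZE);
        region = mmap(NULL, NPAGES * pagesz, PROT_READ,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        shadow = mmap(NULL, NPAGES * pagesz, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_sigaction = on_fault;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);

        region[0] = 1;          /* faults once, is logged, then succeeds */
        checkpoint_commit();    /* re-arm protection for the next interval */
        return 0;
    }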

5 Application Level Checkpointing

5.1 Need for Application Level Checkpointing

Despite the great variety of rollback restart protocols for shared memory systems, they are almost uniformly consistent on one point: no protocol applies to more than a small fraction of the shared memory solution space. As shown in the figure below [TODO:add figure], shared memory systems are created via a layering of APIs, memory models, shared memory implementations, consistency protocols and network interfaces. Each protocol focuses on a fraction of this space: a particular memory model, a type of network interface or a type of consistency protocol. As such, no protocol can hope to provide rollback restart functionality for even the majority of applications. One option would be to take a high-performance implementation of shared memory that works on a wide variety of platforms and supports a variety of shared memory APIs and augment it with rollback restart functionality. But the fact is that there is no single shared memory implementation that both works on and offers high performance for the majority of shared memory systems and the majority of shared memory applications. Another option would be to create an application-level rollback restart protocol that would work with any shared memory implementation, consistency protocol and memory model. Such a protocol could then be applied to any shared memory API to create a full rollback restart solution for shared memory applications. As an application-level approach, it would require one round of effort to apply this protocol for each shared memory API of interest. This is in contrast to lower-level solutions, which would require a round of implementation effort for each shared memory implementation, consistency protocol and/or memory model, which are far more numerous than shared memory APIs. We studied the application level checkpointing protocol proposed by Pingali et al. [26] in detail, along with its implementation for OpenMP. The next section

discusses this protocol. We also used BLCR to measure the overheads caused by user-level checkpointing in HPC applications. The benchmarks and the results obtained are shown in section X. Based on these results we propose a suitable checkpointing strategy for these HPC benchmarks in section Y.

5.2 Application level Checkpointing for OpenMP

In this method, the checkpointer sits as a thin layer between the application and the underlying shared memory system. It intercepts all the function calls that the application may issue, for example thread creation and resource acquisition. The job of the checkpointing layer is to coordinate the state saving activities of all threads such that the checkpoint saved on stable storage can be used to restart the application when needed. It is assumed that there exists some way of recording the state of the application. In particular, the authors assume there is a save-state function which, when called by all the threads, is guaranteed to save the private and shared state of all threads to stable storage. The basic protocol used is the following:

Global_Barrier
Save_State
Global_Barrier

The two barriers ensure that the application is not running while the checkpoint is being recorded. Furthermore, the first barrier ensures that any writes to shared data that were initiated before the checkpoint will have completed by the time the first barrier returns. This makes the protocol applicable to sequential consistency as well as to any relaxed memory model, since in virtually any memory model barriers force memory to be consistent at that point in time. Integrating this into an application presents certain difficulties, since the program uses synchronisation to coordinate its own activities; the checkpoint protocol can introduce deadlocks that did not exist earlier. We will look at these issues in detail in the rest of this section. This discussion will bring to light some of the methods for application level checkpointing in shared memory programs. Suppose the shared memory API allows the application to acquire and release mutually exclusive resources. A problem occurs when one thread tries to checkpoint while holding a resource that another thread is waiting to acquire: this can lead to a deadlock. Such a deadlock can be avoided in the following manner. You can force the thread waiting for the resource to participate in the checkpoint and resume

acquiring the resource after the checkpoint has completed. The approach can be summarised as follows:

• Before the checkpoint begins, ensure that all threads are awake and not blocked on resources, by releasing all the resources.

• Before the checkpoint completes, each thread reacquires its released resources.

• Immediately after the checkpoint completes, put each thread that was forcefully woken back into a blocked state by having it try to reacquire the resource it was blocked on.

This strategy can be extended to all the synchronisation constructs of the shared memory API.

1. Locks and critical sections

• Before the checkpoint begins, the thread holding a resource must release it.

• Before the checkpoint completes, each owner thread must reacquire its resource.

• After the checkpoint completes, each thread that was forced to wake up should try to reacquire the resource it was blocked on.

2. Barriers

• Before the checkpoint begins, all threads that are not waiting on a barrier must execute a barrier in order to release the threads blocked at a barrier.

• After the checkpoint completes, each forcefully awoken thread goes back to being blocked at the barrier.

3. Semaphores

• Before the checkpoint begins, the semaphore must be signaled n times, where n > 0, so that any threads blocked on it are woken.

• Before the checkpoint completes, the semaphore's counter is n greater than it was when the checkpoint began, so it must be waited on n times to restore the counter to its precheckpoint state.


• After the checkpoint completes, each forcefully awoken thread tries to wait on the semaphore again.

4. Condition variables

• Before the checkpoint begins, each condition variable must be repeatedly notified for as long as it is not known that all threads have begun their checkpoints.

• After the checkpoint completes, each forcefully awoken thread tries to wait on the condition variable again.

While applying the above protocol to OpenMP, we need to be aware of its additional constructs, such as parallel loops, where multiple threads may execute the iterations of a given for loop, and ordered regions, which work like critical sections except that they must be executed in order. Pingali et al. adapted the protocol to support all features of OpenMP except nested parallelism. The protocol was implemented as part of the Cornell Checkpointing Compiler [27] system. We found this protocol attractive due to its portability and flexibility, since the user can decide the location of potential checkpoints. A sketch of the core barrier protocol, including the release and reacquisition of locks, is given below.
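The following minimal sketch expresses the protocol in C with OpenMP. The bookkeeping of which locks a thread currently holds (held_locks, n_held) is a hypothetical stand-in for what the checkpointing layer would track when intercepting lock calls.

    #include <omp.h>

    /* Hypothetical hooks maintained by the checkpointing layer. */
    extern void save_state(void);          /* saves private + shared state */
    extern omp_lock_t *held_locks[];       /* locks this thread holds      */
    extern int n_held;                     /* per-thread in a real system  */

    /* Blocking checkpoint: barrier / save / barrier, with held locks
     * released around it so that threads blocked on those locks can
     * wake up and join the checkpoint instead of deadlocking. */
    void take_checkpoint(void)
    {
        /* Before the checkpoint begins: release held locks. */
        for (int i = 0; i < n_held; i++)
            omp_unset_lock(held_locks[i]);

        #pragma omp barrier    /* writes initiated earlier have completed */
        save_state();
        #pragma omp barrier    /* nobody resumes until the save is done   */

        /* Before resuming: each owner reacquires its released locks;
         * forcefully awoken threads go back to blocking on their lock. */
        for (int i = 0; i < n_held; i++)
            omp_set_lock(held_locks[i]);
    }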

5.3 Architecture

The architecture of the OpenMP checkpointer is given below. [TODO:include diagram here] The OpenMP API is defined as a set of compiler #pragma directives and several functions that describe how a given program should be parallelized. As such, the only way to insert the protocol layer between the application and the OpenMP implementation is to perform compiler transformations on the source code that convert the original OpenMP directives into modified versions. The C3 pre-compiler reads the source files of the OpenMP application, written in C, and instruments them to execute the checkpoint coordination protocol and to perform application-level saving of shared and private state. The C3 runtime is then placed as a layer between this modified executable and the native OpenMP library, performing additional checkpoint coordination and state saving functionality. This structure makes checkpointing a property of the application rather than of the underlying system, making it available on any platform on which the application is executed. The only modification programmers make to source files is to insert calls to a function called potential_checkpoint at program points where checkpoints may be taken. These points should ideally have minimal live state and need to be executed at least as often as the desired checkpoint interval.


In practice, placing potential_checkpoint locations at the top of the application's main computation loop usually satisfies these conditions. Checkpoints need not be taken at every potential_checkpoint call; instead, the choice can be based on a timer or an adaptive mechanism. The output of the precompiler is compiled with the native compiler on the hardware platform and linked with a library that implements the above protocol for generating consistent snapshots of the state of the computation. This layer sits between the application and the OpenMP runtime layer and intercepts all calls from the instrumented application program to the OpenMP library.
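The sketch below illustrates this placement: a potential_checkpoint call at the top of the main computation loop, with a simple timer deciding whether a checkpoint is actually taken. The function name follows the text; the interval value and the take_checkpoint hook are our own illustrative assumptions.

    #include <omp.h>

    #define CKPT_INTERVAL 600.0   /* seconds between checkpoints (illustrative) */

    extern void take_checkpoint(void);  /* runs the barrier/save/barrier protocol */

    void potential_checkpoint(void)
    {
        static double last = 0.0;       /* time of the last checkpoint */
        /* NOTE: all threads must agree on the decision; a real
         * implementation coordinates this, which is omitted here. */
        if (omp_get_wtime() - last >= CKPT_INTERVAL) {
            take_checkpoint();
            last = omp_get_wtime();
        }
    }

    void solver(int nsteps)
    {
        for (int step = 0; step < nsteps; step++) {
            potential_checkpoint();     /* top of main loop: minimal live state */
            /* ... one timestep of computation ... */
        }
    }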

5.4 What needs to be saved?

OpenMP processes have the following types of state:

• Application state - Includes application variables (global and local), heap objects and the stack activation frames that exist at checkpoint time.

• Hidden state - Any state inside the OpenMP runtime system that must be recreated on restart. This includes the locations of privatised and reduction variables, the binding between stack regions and thread numbers and the scheduling of worksharing constructs.

• Synchronisation state - The state of any synchronisation objects the threads held or were blocked on at checkpoint time. These include barriers, locks, critical sections and ordered regions in worksharing constructs.

Depending on the degree of access a checkpointing protocol has to the various portions of the application state, it may have to use all or some of the following methods to restore the state during restart.

• Restore - State that can be directly manipulated by the application and can therefore be directly saved at checkpoint time and restored on restart. Example: global and local variables.

• Replay - State that cannot be directly manipulated by the application but can be recreated using a sequence of deterministic operations. On restart it can be regenerated by replaying these operations. Example: OpenMP locks, since their state is inaccessible to the application (the implementation details of the omp_lock_t structure are unknown) but can be recreated using lock routines such as omp_init_lock and omp_set_lock.

• Reimplement - State that cannot be recreated by replaying operations. In this case the operations that create and maintain this state need to be reimplemented so that the state can be saved and restored. Example: heap objects in C, since on restart it is not possible to force malloc to place these objects at their original locations.

• Restrict - State that cannot be recreated even by reimplementing the operations. This type of state cannot be supported, and the application must be restricted from using this kind of state, or to using it only in a restricted manner. Example: application-implemented synchronization can cause the checkpointing protocol to deadlock, and thus its use must be restricted.

The authors of C3 have used a combination of these techniques to instrument and transform the original OpenMP application to achieve rollback and restart. The C3 system uses Restore to recreate the values of variables and heap objects, Replay to recreate activation frames and Reimplement to recreate the heap. On restart, it is necessary that a thread be reassigned the same stack region that it had in the original execution. Since OpenMP provides no guarantees about the locations of stacks, they have relaxed this condition and allowed stacks on restart to move relative to the original stacks by a small amount. The stack contents are recreated via replay, by re-entering the function calls and OpenMP directives that were on the stack at checkpoint time. Privatised and reduction variables are allocated by reimplementing their address assignment. Worksharing constructs are handled by a mix of replay and reimplement. On restart, each thread must resume executing the same workunit it had when the checkpoint was taken, and it must also make sure that the remaining unexecuted workunits are scheduled to run on some threads. This is done by rewriting the OpenMP worksharing construct. The OpenMP synchronisation constructs also needed rewriting in order to support explicit acquisition and release of resources at checkpoint time. The Replay technique for locks is sketched below.
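As an illustration of the Replay method for OpenMP locks, the sketch below saves only whether a lock was held at checkpoint time and, on restart, replays omp_init_lock and omp_set_lock to rebuild the opaque lock state. The ckpt_lock_t wrapper is our own hypothetical construction, not part of C3.

    #include <omp.h>

    /* Wrapper tracked by the checkpointing layer: omp_lock_t itself is
     * opaque, so only the fact that it was held is saved and restored. */
    typedef struct {
        omp_lock_t lock;   /* opaque OpenMP lock: not saved directly  */
        int        held;   /* saved at checkpoint time: was it held?  */
    } ckpt_lock_t;

    void ckpt_lock_restore(ckpt_lock_t *l)
    {
        omp_init_lock(&l->lock);     /* replay: recreate the lock           */
        if (l->held)
            omp_set_lock(&l->lock);  /* replay: reacquire it if it was held */
    }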

6 Measurements and Analysis

Although we wanted to measure the cost of checkpointing on the C3 system, it was not available to us, so we used BLCR instead to measure the overheads of checkpointing HPC benchmarks. BLCR is a system-level checkpoint/restart system developed mainly for HPC applications. The checkpointing overheads measured with BLCR will not exactly represent those produced

by C3, but we wanted to study the nature of the benchmarks in order to propose a better solution. The overheads produced by C3 will be at least as large as those generated by BLCR. To simulate the user-initiated checkpointing supported by C3, we used the callback utility of BLCR to initiate checkpoints from within the OpenMP application. Before we proceed to the results, we give a short introduction to BLCR. [TODO:Add note on BLCR] The experiments were conducted on [TODO:add details and architecture of the cluster used] The overheads we measured were the following: [TODO:Add list of overheads measured] [TODO:Add graphs]
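For reference, the sketch below shows roughly how an application registers a checkpoint callback with BLCR's libcr, which is how checkpoints can be triggered around application code. The signatures are quoted from memory of the BLCR documentation and should be treated as assumptions to verify against the installed libcr.h.

    #include <stdio.h>
    #include <libcr.h>    /* BLCR user-level API (signatures assumed) */

    /* BLCR invokes this around every checkpoint request. cr_checkpoint()
     * returns 0 when execution continues after the checkpoint and a
     * positive value when the process is restarting from one. */
    static int ckpt_callback(void *arg)
    {
        /* quiesce application threads here (e.g. at an OpenMP barrier) */
        int rc = cr_checkpoint(0);
        if (rc > 0)
            fprintf(stderr, "restarted from a checkpoint\n");
        /* release application threads here */
        return 0;
    }

    int main(void)
    {
        cr_init();   /* connect this process to the BLCR kernel module */
        cr_register_callback(ckpt_callback, NULL, CR_THREAD_CONTEXT);
        /* ... run the OpenMP application; checkpoints are requested
           externally with the cr_checkpoint utility ... */
        return 0;
    }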

7 Benchmarks

7.1 NAS Parallel Benchmarks

The NAS Parallel Benchmarks (NPB) are derived from CFD codes. They are designed to compare the performance of parallel computers and are widely recognized as a standard indicator of computer performance. NPB consists of five kernels and three simulated CFD applications derived from important classes of aerophysics applications. The five kernels mimic the computational core of five numerical methods used by CFD applications. We analyzed the OpenMP implementations of the following applications:

• BT is a simulated CFD application that uses an implicit algorithm to solve the 3-dimensional (3-D) compressible Navier-Stokes equations. The finite-difference solution to the problem is based on an Alternating Direction Implicit (ADI) approximate factorization that decouples the x, y and z dimensions. The resulting systems are block-tridiagonal with 5x5 blocks and are solved sequentially along each dimension.

• SP is a simulated CFD application that has a structure similar to BT. The finite-difference solution to the problem is based on a Beam-Warming approximate factorization that decouples the x, y and z dimensions. The resulting system has scalar pentadiagonal bands of linear equations that are solved sequentially along each dimension.

• LU is a simulated CFD application that uses the symmetric successive over-relaxation (SSOR) method to solve a seven-block-diagonal system resulting from a finite-difference discretization of the Navier-Stokes equations in 3-D, by splitting it into block lower and upper triangular systems.

• FT contains the computational kernel of a 3-D fast Fourier Transform (FFT)-based spectral method. FT performs three one-dimensional (1-D) FFTs, one for each dimension.

• MG uses a V-cycle multigrid method to compute the solution of the 3-D scalar Poisson equation. The algorithm operates on a hierarchy of grids that range between coarse and fine. It tests both short and long distance data movement.

• CG uses a conjugate gradient method to compute an approximation to the smallest eigenvalue of a large, sparse, unstructured matrix. This kernel tests unstructured grid computations and communications by using a matrix with randomly generated locations of entries.

• EP is an embarrassingly parallel benchmark. It generates pairs of Gaussian random deviates according to a specific scheme. The goal is to establish the reference point for peak performance of a given platform.

• IS is a large integer sort operation testing both integer computation speed and interprocessor communication. This kernel stresses the integer performance of the underlying node. More information about this benchmark suite can be found in [1].

7.2 SPEC OMP Benchmarks

The SPEC OMP benchmark suite was developed by the SPEC High Performance Group for performance testing of shared memory systems. It uses OpenMP versions of the SPEC CPU2000 benchmarks and candidates. The suite has OMPM2001 for medium and OMPL2001 for large compute-intensive applications. OMPM2001 has 9 applications written in Fortran and 2 written in C. The applications are listed below:

• 310.wupwise_m - quantum chromodynamics

• 312.swim_m - shallow water modelling

• 314.mgrid_m - multi-grid solver in a 3D potential field

• 316.applu_m - parabolic/elliptic partial differential equations

• 318.galgel_m - analysis of oscillatory instability (fluid dynamics)

• 320.equake_m - finite element simulation for earthquake modelling

• 324.apsi_m - distribution of various pollutants

• 326.gafort_m - genetic algorithm

• 328.fma3d_m - finite element crash simulation

• 330.art_m - neural network simulation

• 332.ammp_m - computational chemistry

8 Pattern analysis

By investigating the code manually, we extracted possible memory access patterns from all 8 applications and kernels. We measured the number of synchronisation and worksharing constructs in the applications.

1. Loop parallelization - The normal loop parallelizations, for and do, where each thread operates on a subset of the entries of an array. Here the program is non-deterministic within the loop, since the threads may execute in any order. However, the parallel for construct in OpenMP has an implicit barrier at its exit point. This barrier ensures that the results written to shared data are consistent across different iterations of the loop. We intend to explore the behaviour of this construct further. A typical instance of this pattern looks as follows (compute() stands in for an arbitrary loop body):

    #pragma omp parallel for
    for (i = 0; i < n; i++)
        a[i] = compute(i);   /* each thread handles a subset of the iterations */
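To make the role of the implicit barrier concrete, the sketch below (our own illustration, not taken from the benchmarks) has a second parallel loop read values written by the first; this is safe only because the implicit barrier at the end of the first loop guarantees that all writes to a have completed before any thread begins the second loop.

    #include <omp.h>

    #define N 1024
    double a[N], b[N];

    void two_phases(void)
    {
        #pragma omp parallel
        {
            #pragma omp for          /* implicit barrier at loop exit */
            for (int i = 0; i < N; i++)
                a[i] = i * 0.5;

            #pragma omp for          /* may safely read any a[j] */
            for (int i = 0; i < N; i++)
                b[i] = a[(i + 1) % N];
        }
    }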


Words: 19644 - Pages: 79

Premium Essay

Mobile Application Development Strategy

...STRATEGY PAPER: Mobile Application Development Mobile Application Development Strategy Document No: Strategic Initiative: Coverage: Classification:  RTC 01 Version : 1.0 Increasing customer loyalty base and enhancing customer reach through Mobile application Retail SBU Strategic Date Last 17.11.2014 Changed: Strategic High Value: Time Frame: Priority: Long Term High 1.0.Corporate/Business Vision: We are the most admired global energy company leveraging talent and technology. We are the first choice of customers, always We exploit profitable growth opportunities outside energy  2.0.Description of Strategic Initiative: Consumers are spending more time on their apps than ever before. Over the past half-decade, the proliferation of mobile devices has transformed us into an app-driven society, presenting marketers with new opportunities to connect with consumers by creating more interesting and sophisticated apps to command their attention. Mobility offers tremendous opportunities for every business unit in the enterprise. From optimizing processes to real-time collaboration, improved customer experiences to increased revenues, the benefits are big—but so are the challenges. Cisco recently published statistics that demonstrate how user demand is driving every enterprise down the mobile route whether they like it or not. Version 1.0 Page 1 STRATEGY PAPER: Mobile Application Development This explosive growth......

Words: 4889 - Pages: 20

Premium Essay

Marketing Communication Strategies and Applications in the Nutraceutical Industry in Uae

...American University of Sharjah Master of Business Management MBA 509 | Marketing Concepts Spring 2015 Marketing communication strategies and applications in the nutraceutical industry in UAE By Nadia B. Azzam ID 27379 Submitted to Dr. Mohammed Sajid Khan Submission Date May 31, 2015 1 Table of Contents 1.0. Introduction .......................................................................................................................... 3 2.0. Literature review .................................................................................................................. 3 3.0. Application of Promotional strategies in Nutraceutical Dr. Nutrition Centre, UAE ........... 7 4.0. Conclusion ......................................................................................................................... 10 5.0. References .......................................................................................................................... 12 2 1.0. Introduction In today’s society, consumers are bombarded with a massive number of promotional messages delivered to them through different mediums and from different organizations. Whether at home watching TV, listening to Radio, walking in the mall or driving through the city, consumers get exposed to those messages which they process through different channels, and eventually create behavioral responses. However, due to the competitiveness in the......

Words: 2819 - Pages: 12

Premium Essay

Memory

...Memory is a constructive process through which humans organize and shape information received constantly (Huffman, 2010). This process is divided on two major approaches called information-processing model and the traditional three-stage memory model. In this paper, I will talk about three basic operations that are part of the information-processing model. These basic operations are known as encoding, storage and retrieval. The encoding operation is responsible for processing the information received into the memory system (Huffman, 2010). For instance, when a new word is learned, the encoding operation carries it over to the memory system for the next step. The storage operation is then responsible for preserving information over time. The word learned is then kept in our memory system to be used later on thanks to the storage operation. And the retrieval operation recovers all information needed from the memory storage. For example, the word learned now has to be used; the retrieval operation goes and finds it in the memory system for us to apply. One concept that I commonly use to improve my memory is making notes of tasks or information that I need to learn or remember. Making this notes not only makes my information easily remembered, it is also being stored in my brain while I am writing it. There are many useful techniques out there to better help improve our memory. Cuesta College provides students a list of memory tips on their website that will help......

Words: 368 - Pages: 2

Free Essay

Checkpoints

...Checkpoint Filesystem for Parallel Applications John Bent∗† Garth Gibson‡ Gary Grider∗ Ben McClelland∗ , , , , Paul Nowoczynski§ James Nunez∗ Milo Polte† Meghan Wingate∗ , , , ABSTRACT Categories and Subject Descriptors D.4.3 [Operating Systems]: File Systems ManagementFile organization General Terms Performance, Design Keywords High performance computing, parallel computing, checkpointing, parallel file systems and IO ∗ LANL Technical Information Release: 09-02117 Los Alamos National Laboratory ‡ Carnegie Mellon University § Pittsburgh Supercomputing Center † (c) 2009 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by a contractor or affiliate of the U.S. Government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only. SC09 November 14–20, Portland, Oregon, USA. Copyright 2009 ACM 978-1-60558-744-8/09/11 ...$10.00. 100 Speedup (X) Parallel applications running across thousands of processors must protect themselves from inevitable system failures. Many applications insulate themselves from failures by checkpointing. For many applications, checkpointing into a shared single file is most convenient. With such an approach, the size of writes are often small and not aligned with file system boundaries. Unfortunately for these applications, this preferred data......

Words: 12373 - Pages: 50

Premium Essay

Creating Shared Value

...Corporations can create shared value by using their core capabilities in ways that contribute to both social progress and economic success. Some developing countries have experienced phenomenal economic growth, but that growth has not been inclusive. In recent decades, developing countries have experienced a rapid rate of economic growth. Although this has led to higher incomes and better health for many Developing countries, we still have far to go to make this growth truly inclusive. Developing countries are expected to fall short on several Millennium Development Goals: by 2015, it is expected that 40 percent of Developing countrie’s children will remain undernourished, and Developing countries will have progressed only halfway toward its goals for decreasing infant mortality. Inequality, poor public health, and environmental degradation will increasingly constrain their economic growth. Corporations play a critical role in achieving inclusive growth. Government is often seen as the answer to society’s problems, but spending by the Government of Developing countries alone will likely be insufficient to address these critical issues. governments can encourage contributions from the private sector by passing legislation and using its purchasing power to create a supportive, enabling environment. Philanthropy can also help catalyze change, but charity has a limited ability to sustainably achieve scale. The private sector, however, possesses skills and technologies......

Words: 7401 - Pages: 30