NoSQL Databases

INSY 5337 Data Warehousing – Term Paper
NoSQL Databases: An Introduction and Comparison between Dynamo,
MongoDB and Cassandra
Authored by Nitin Shewale Aditya Kashyap Akshay Vadnere Vivek Adithya Aditya Trilok

Abstract
Data volumes have been growing exponentially in recent years, and this growth across all business domains has played a significant part in how data is analyzed and structured. NoSQL databases are becoming popular as more organizations consider them a feasible option because of their schema-less structure and their capability to handle Big Data. In this paper, we discuss the main types of NoSQL databases from an implementation perspective: key-value, columnar and document-oriented stores. The paper gives a consolidated, applied interpretation of NoSQL systems in terms of database features such as security, concurrency control, partitioning, replication and read/write implementation. We also draw comparisons among popular products and recommend a particular NoSQL solution based on these factors.

1. Introduction
Until recently, relational database systems were at the forefront of data storage and management. The advent of mobile applications that require real-time analysis, such as GPS-based services, banking and social media, has led to huge volumes of unstructured data being produced every second. Traditional RDBMS systems have found it difficult to cater to this unstructured data, as an RDBMS mainly stores structured data in tabular format. Mapping unstructured data to a relational database also increases complexity and requires expensive infrastructure to model it. Even when the data model fits into SQL, the broad set of features provided by SQL becomes an overhead. A relational schema becomes a burden on applications that try to store data in multiple forms such as videos, blogs and images.
A new methodology for managing unstructured data was therefore introduced, known as NoSQL (Not Only Structured Query Language).
NoSQL covers the broader topic of data structuring, storage and aggregation through various implementation approaches. It can store unstructured data and provide real-time analysis to back web service applications. To attain flexible data handling, it relaxes conventional database management guarantees such as Atomicity, Consistency, Isolation and Durability (ACID). It also provides built-in data partitioning and replication. Typically, data across business domains is governed by company policies and processes for data control and quality. NoSQL moves away from these restrictions to meet the performance and scalability requirements of particular applications and services [1][2][3][4][10].

2. NoSQL Characteristics
The NoSQL analogue of the ACID properties is BASE, which is derived from the CAP theorem. The CAP theorem concerns the following three properties, of which a distributed system can guarantee at most two at the same time:

Consistency – All parts of the system should see the same data at the same time.

Availability – The data should be available at any time, and every request should receive a response.

Partition Tolerance – Failure of one section or partition of the system should not cause total failure of the system [7][8][9].

NoSQL database systems such as MongoDB and Cassandra have relaxed consistency to attain greater availability and efficient partitioning. This gave rise to systems built on BASE principles:

Basically Available – Data is distributed across various systems, so it remains available from one of them even if another fails.

Soft State – Since the data is distributed, there is no assurance of consistency at any given moment.

Eventually Consistent – The data will become consistent eventually, even if it is not consistent at a given point in time [5][6][10].
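The BASE behaviour described above can be illustrated with a minimal Python sketch. It is a toy model under stated assumptions, not the mechanism of any product discussed in this paper: the class name `EventuallyConsistentStore` and its methods are hypothetical, and asynchronous propagation is modelled as an explicit `sync()` step.

```python
class EventuallyConsistentStore:
    """Toy store with one primary copy and one lagging replica (hypothetical)."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.pending = []  # writes not yet applied to the replica

    def write(self, key, value):
        # The write is acknowledged as soon as the primary copy has it.
        self.primary[key] = value
        self.pending.append((key, value))

    def read_from_replica(self, key):
        # May return stale data (soft state) until sync() has run.
        return self.replica.get(key)

    def sync(self):
        # Asynchronous propagation, modelled here as an explicit step.
        for key, value in self.pending:
            self.replica[key] = value
        self.pending.clear()


store = EventuallyConsistentStore()
store.write("user:1", "Alice")
stale = store.read_from_replica("user:1")  # replica has not caught up yet
store.sync()
fresh = store.read_from_replica("user:1")  # replica is now consistent
```

The store stays available for reads throughout, at the price of a window in which the replica returns stale data.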

3. Features of NoSQL


Flexible Data Models – NoSQL allows horizontal data partitioning across different distributed systems or processors. In contrast to NoSQL, the relational model has a fixed schema. Applications based on NoSQL have data models explicitly designed and optimized for them.

Partial Record Updates – Data models that use NoSQL emphasize column-based processing, which enables data aggregation over more than one attribute or entity.

Optimized MapReduce Processing – MapReduce, a native facility for data movement and mapping, is part of NoSQL.

Horizontal Scalability – NoSQL allows on-the-fly addition of processors with their own resources. Each node is fed a subset of the data to process, which increases the efficiency of the application. Horizontal scalability is easier to achieve in the NoSQL data model than in an RDBMS [1].

4. Types of NoSQL Databases



Key Value -> Key-value data stores reference data using a unique key. The key acts as a link to data that is stored randomly and independently on disk. New data values can be added without conflicting with existing data. Key-value stores are therefore entirely schema-less; the only structure that can be derived from them is the set of key-value pairs. In this paper we discuss our findings on DynamoDB by Amazon.

Document -> Document data stores hold collections of uniquely identified sets of key-value pairs known as documents. Each document is recognized by its own unique ID within the document collection. Document stores allow new documents to be stored with different kinds of attributes. In this paper we discuss our findings on MongoDB.

Column -> These are column-centric data stores, where indexing is done on every column. They provide efficient, high-speed read and write operations. Any modification or addition of data is stored as a timestamped version. We have introduced Cassandra as an example of a column data store [1][10][12].
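The three data models can be contrasted with plain Python dictionaries. This is only an illustrative sketch: the sample keys and values and the helper names `put_column` and `latest` are assumptions made for this example and do not reflect the storage formats of DynamoDB, MongoDB or Cassandra.

```python
# Key-value: an opaque value looked up by a unique key.
key_value_store = {"cart:42": b"serialized-cart-bytes"}

# Document: each document has a unique ID and its own set of attributes.
document_collection = {
    "doc1": {"name": "Alice", "email": "alice@example.com"},
    "doc2": {"name": "Bob", "tags": ["admin"]},  # different attributes
}

# Column: row key -> column name -> list of (timestamp, value) versions.
column_store = {"row1": {"city": [(1, "Dallas")]}}


def put_column(store, row, column, timestamp, value):
    """Store a change as a new timestamped version instead of overwriting."""
    store.setdefault(row, {}).setdefault(column, []).append((timestamp, value))


def latest(store, row, column):
    """Return the value of the version with the highest timestamp."""
    return max(store[row][column], key=lambda version: version[0])[1]


put_column(column_store, "row1", "city", 2, "Austin")
current_city = latest(column_store, "row1", "city")
```

Both versions of the `city` column remain stored; a read simply picks the newest one.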

5. Comparative study between DynamoDB, Cassandra and MongoDB
5. A. DynamoDB
Dynamo was designed to provide a storage system within Amazon's platform that would remain resilient under unforeseen circumstances.
5. A.1 Key Features of Dynamo:


Key-Value Data Model - Data is represented as objects, and each object is identified by a unique key. The operations supported on the data are get and put, each associated with the specified unique key.

Eventual Consistency - The primary objective of Dynamo is to remain resilient under unforeseen circumstances. However, it is a challenge to obtain consistency immediately after a write. Consistency is therefore reached eventually, once all the replicas have been updated.

Symmetry and Decentralization - Every node in Dynamo has the same responsibilities as its peers. The probability of failure is thus very low, and little manual intervention is required [24][26][27][28][29].

5. A.2 Operations:
Dynamo performs the following operations:


Get - To return the object associated with the key.



Put - To associate the object with the specified key.

5. A.3 Security:
Dynamo does not implement a robust security mechanism, which makes it unsuitable for scenarios that require authorization [24][26][27][28][29].

5. A.4 Partitioning:
Dynamo is a highly scalable system that can adapt to varying amounts of data by adding and removing nodes in a flexible manner. To implement partitioning, Dynamo uses a technique called consistent hashing, in which every node is allocated to one or more points on a fixed ring. Each data item is identified by a unique key, and hashing that key yields a point on the ring. The ring is then walked clockwise from that point, and the data item is allocated to the first node encountered. This is an effective partitioning method, because adding or removing a node affects only its immediate neighbours on the ring.
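The ring lookup described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Dynamo's implementation: the class `HashRing`, the helper `ring_position`, the choice of SHA-256 as the hash function and the value of `points_per_node` are all choices made for this example.

```python
import bisect
import hashlib


def ring_position(name: str) -> int:
    """Map a string to a point on the ring (SHA-256 is an illustrative choice)."""
    return int(hashlib.sha256(name.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes=(), points_per_node=3):
        self.points_per_node = points_per_node
        self._ring = []  # sorted list of (position, node) pairs
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Each node is allocated to several points on the ring.
        for i in range(self.points_per_node):
            bisect.insort(self._ring, (ring_position(f"{node}#{i}"), node))

    def remove_node(self, node):
        self._ring = [(p, n) for p, n in self._ring if n != node]

    def node_for(self, key):
        """Walk clockwise from the key's position to the first node point."""
        if not self._ring:
            raise ValueError("ring is empty")
        idx = bisect.bisect_right(self._ring, (ring_position(key), ""))
        return self._ring[idx % len(self._ring)][1]  # wrap around the ring


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"key-{i}" for i in range(100)]
before = {k: ring.node_for(k) for k in keys}
ring.remove_node("node-c")
after = {k: ring.node_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

In this sketch, the only keys that change owner when `node-c` is removed are the keys that `node-c` itself held; every other key stays where it was.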

5. A.5 Replication:
The data is replicated on multiple hosts, which results in high reliability and durability. To implement replication, Dynamo replicates each data object at N nodes, where the value of N is set by the user. Each key K is assigned to a coordinator node, which stores the data associated with K locally. The coordinator also replicates the data to N-1 other nodes on the ring [24][26][27][28][29].
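As a rough illustration of replication at N nodes, the following Python sketch selects the coordinator for a key and the N-1 nodes that follow it clockwise on the ring. The function names, the node names, the single ring point per node and the SHA-256 hash are assumptions made for this example only.

```python
import hashlib


def ring_position(name: str) -> int:
    """Map a string to a point on the ring (illustrative hash choice)."""
    return int(hashlib.sha256(name.encode("utf-8")).hexdigest(), 16)


def preference_list(key, nodes, n):
    """Return the coordinator for the key followed by its N-1 successors."""
    if n > len(nodes):
        raise ValueError("N cannot exceed the number of nodes")
    ring = sorted(nodes, key=ring_position)
    key_pos = ring_position(key)
    # First node at or after the key's position, wrapping to index 0.
    start = next(
        (i for i, node in enumerate(ring) if ring_position(node) >= key_pos),
        0,
    )
    return [ring[(start + i) % len(ring)] for i in range(n)]


nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]
replicas = preference_list("cart:42", nodes, n=3)
coordinator = replicas[0]  # stores the data locally, then replicates it
```

With N = 3, the object is held by the coordinator and two further distinct nodes, so it survives the loss of any two of them.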
5. A.6 Storage:
Every node in Dynamo has its own local persistence engine, which stores the data as binary objects. Persistence engines used by instances include MySQL and Berkeley Database (BDB). The persistence engine is a pluggable component, so users can choose an engine based on their requirements. For instance, BDB handles relatively small objects, whereas MySQL can handle larger objects [24][26][27][28][29].
5. A.7 Read/Write Implementation:
Dynamo implements a protocol with two parameters, R and W, which represent the minimum number of nodes that must participate in a read and a write operation respectively. When the coordinator performs a write operation, it writes the data locally and then sends the write request to the other N-1 replica nodes. If at least W-1 of these nodes respond, the operation is considered a success and the coordinator informs the client.

When the coordinator is asked to perform a read operation, it sends a read request to the other N-1 nodes. Once at least R-1 nodes have responded, the result is returned to the client. If different nodes return different versions of the object, the coordinator sends the client a list of objects rather than a single object [24][26][27][28][29].
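The R/W protocol can be sketched as follows. This is a simplified model, not Dynamo's code: the classes `Replica` and `Coordinator` are hypothetical, node failure is simulated with an `up` flag, and the coordinator is counted as one of the N replicas, so a write succeeds when W replicas in total acknowledge it (the coordinator plus W-1 others).

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.up = True   # simulated availability
        self.data = {}


class Coordinator:
    def __init__(self, replicas, r, w):
        self.replicas = replicas  # N = len(replicas)
        self.r = r
        self.w = w

    def put(self, key, value):
        acks = 0
        for replica in self.replicas:
            if replica.up:
                replica.data[key] = value
                acks += 1
        return acks >= self.w  # success once W replicas acknowledged

    def get(self, key):
        responses = [rep.data.get(key) for rep in self.replicas if rep.up]
        if len(responses) < self.r:
            raise RuntimeError("fewer than R replicas responded")
        versions = []  # distinct versions, in order of first appearance
        for value in responses:
            if value not in versions:
                versions.append(value)
        return versions


a, b, c = Replica("a"), Replica("b"), Replica("c")
coord = Coordinator([a, b, c], r=2, w=2)
first_ok = coord.put("cart", "v1")   # all three replicas store v1
c.up = False
second_ok = coord.put("cart", "v2")  # only a and b store v2
c.up = True                          # c returns, still holding v1
versions = coord.get("cart")
```

Because replica `c` missed the second write, the read returns two versions rather than one, which is the situation in which the client receives a list of objects.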
5. A.8 Concurrency Control:
Multiple clients are allowed to access shared objects concurrently. Write operations return before all replica nodes have been updated, so different versions of the same object may be returned by later reads. Dynamo detects and handles such inconsistencies [24][26][27][28][29].
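The Dynamo paper [28] describes vector clocks as the means of telling whether one version of an object supersedes another or whether the two conflict. The Python sketch below illustrates that idea only; the function names `increment`, `descends` and `reconcile` and the sample node names are hypothetical and are not taken from Dynamo.

```python
def increment(clock, node):
    """Return a copy of the clock with this node's counter advanced."""
    new_clock = dict(clock)
    new_clock[node] = new_clock.get(node, 0) + 1
    return new_clock


def descends(a, b):
    """True if clock a has seen every update recorded in clock b."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def reconcile(versions):
    """Drop versions superseded by another version; keep true conflicts."""
    survivors = []
    for value, clock in versions:
        superseded = any(
            descends(other, clock) and other != clock
            for _, other in versions
        )
        if not superseded and (value, clock) not in survivors:
            survivors.append((value, clock))
    return survivors


v1 = ("cart-1", increment({}, "node-x"))
v2 = ("cart-2", increment(v1[1], "node-y"))  # built on v1 at node-y
v3 = ("cart-3", increment(v1[1], "node-z"))  # built on v1 at node-z
remaining = reconcile([v1, v2, v3])
```

Here `v1` is discarded because both later versions descend from it, while `v2` and `v3` were written independently and are both kept for the client to merge.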

5. B. MongoDB
MongoDB is a document store developed in C++. Indexing in MongoDB is based on the document key structure. It is a schema-less product oriented towards performance and query optimization [12][10].
5. B.1 Features of MongoDB


Flexibility during the initial phases of development and design.

Horizontal scalability as a built-in feature.

User-friendly tools to transfer data between different databases.

Interoperability between implementations in various programming languages [30].

5. B.2 Operations:
MongoDB allows these operations:


Insert – Adds new documents to a collection.



Find – Retrieves documents from a collection.



Update – Updates documents in a collection.



Remove – Removes a document from a collection [23][10].
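The four operations listed above can be mimicked with a small in-memory class. The sketch below is a stand-in written for this paper and is not a MongoDB client: the class `Collection`, its exact-match query format and the integer `_id` values are assumptions made for illustration.

```python
import itertools


class Collection:
    """In-memory stand-in for a document collection (not a MongoDB client)."""

    def __init__(self):
        self._docs = {}
        self._ids = itertools.count(1)

    def insert(self, doc):
        doc_id = next(self._ids)
        self._docs[doc_id] = {**doc, "_id": doc_id}
        return doc_id

    @staticmethod
    def _matches(doc, query):
        # A document matches when every queried field has the given value.
        return all(doc.get(field) == value for field, value in query.items())

    def find(self, query=None):
        query = query or {}
        return [dict(d) for d in self._docs.values() if self._matches(d, query)]

    def update(self, query, changes):
        updated = 0
        for doc in self._docs.values():
            if self._matches(doc, query):
                doc.update(changes)
                updated += 1
        return updated

    def remove(self, query):
        doomed = [i for i, d in self._docs.items() if self._matches(d, query)]
        for doc_id in doomed:
            del self._docs[doc_id]
        return len(doomed)


users = Collection()
users.insert({"name": "Alice", "city": "Dallas"})
users.insert({"name": "Bob", "languages": ["en", "hi"]})  # other attributes
users.update({"name": "Alice"}, {"city": "Austin"})
found = users.find({"city": "Austin"})
removed = users.remove({"name": "Bob"})
```

The two inserted documents carry different attributes, which reflects the schema-less nature of a document store.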

5. B.3 Security
Since MongoDB data files are unencrypted, they are prone to attacks. To mitigate this, the application must encrypt all sensitive information before writing it to the database and must also prevent unauthorized access. As MongoDB uses JavaScript as its internal language, it is also prone to script injection attacks. Authentication is not provided in sharded MongoDB clusters [13].
5. B.4 Partitioning/ Sharding
Sharding distributes data across numerous machines. MongoDB provides automated partitioning as a built-in feature, which allows horizontal scaling across many processors (nodes). Sharding combined with replication yields a highly scalable cluster. For resource-hungry applications, MongoDB creates a cluster of shards and balances the nodes without impacting the original node [1][10][11][13].
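How a shard key can route each document to one machine is sketched below in Python. This is a simplified illustration of range-based routing under stated assumptions, not MongoDB's balancer: the class `ShardRouter`, the split points and the shard names are hypothetical.

```python
import bisect


class ShardRouter:
    """Routes each shard key to the shard that owns its key range."""

    def __init__(self, split_points, shards):
        if len(shards) != len(split_points) + 1:
            raise ValueError("need exactly one more shard than split points")
        self.split_points = sorted(split_points)
        self.shards = list(shards)
        self.data = {name: {} for name in self.shards}

    def shard_for(self, shard_key):
        # Keys below the first split point go to the first shard, and so on.
        return self.shards[bisect.bisect_right(self.split_points, shard_key)]

    def insert(self, shard_key, document):
        self.data[self.shard_for(shard_key)][shard_key] = document

    def find(self, shard_key):
        # Only the owning shard is consulted.
        return self.data[self.shard_for(shard_key)].get(shard_key)


router = ShardRouter(["h", "q"], ["shard0", "shard1", "shard2"])
router.insert("alice", {"city": "Dallas"})
router.insert("mallory", {"city": "Austin"})
router.insert("zed", {"city": "Houston"})
owner = router.shard_for("mallory")
```

Each lookup touches a single shard, which is what allows load to spread across nodes as more shards are added.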
5. B.5 Storage
Data is stored in Binary JSON (BSON) format, with a maximum document size of 16 MB. Data allocation is limited to 2 GB per node on a 32-bit system. Data is mapped in memory to increase performance. By default, data is flushed to disk every minute, an interval that is customizable. When new files are created, data is flushed to disk immediately, which frees up memory [10][11][14].

5. B.6 Replication
In MongoDB, data replication follows a master-slave model organized into replica sets. Data is replicated asynchronously across servers. Read operations can be served by multiple slave servers, whereas write operations are handled by only one server at any given point in time.
A replica set has one master server at any given time, and a new master is elected if the current one fails. Reading from multiple slave servers achieves load balancing but yields only eventual consistency. The client has the ability to enforce write operations on the master server [10][11][14][13].
Replicas can be configured in several ways in MongoDB to cater to different application needs:

Secondary Replicas - These replicas are never given the opportunity to become the master; they only store data.

Hidden Replicas - These replicas are hidden from the application and cannot be elected as the master server. They perform read-only operations and have voting rights to elect a new master server in case of failover.

Delayed - Delayed replicas are not kept in sync with the master and will not have the most recent data.

Arbiters - These replicas act purely as arbitrators: they do nothing except communicate with other members and take part in elections [11][14].

5. B.7 Read/Write Implementation
Indexing in MongoDB allows efficient read operations but negatively affects write and insert operations. MongoDB allows read operations on slave servers, while write operations are controlled by the master server. Slave servers perform reads simultaneously, in an asynchronous manner [14].
5. B.8 Concurrency Control:
Updates are applied immediately on a MongoDB database system, but MongoDB does not support concurrency control and exhibits eventual consistency: data is sent out asynchronously to the slave servers, so its propagation is not controlled [11][12][13][14].

5. C. Cassandra
Cassandra was designed by Facebook to cater to the organization's very large data needs. Cassandra essentially prioritizes two BASE-oriented features, availability and scalability [21]. It brings together the data structure of BigTable and the high availability of Dynamo [25][11].

5. C.1 Features of Cassandra


Cassandra has multiple nodes in a cluster, all identical in terms of their software infrastructure. All the nodes are symmetric, and no master node is needed. This allows linear scalability.

Hashing a new data value does not significantly affect the indexing maintained for other data values.

The interface provided by Cassandra is not easy for developers to use [13].

5. C.2 Operations/Read Write Implementation
1) Write -> When a client executes a write, it is received by one of the nodes in the cluster, chosen at random. This node in turn writes the data to the cluster. The write is then replicated to the other nodes of the cluster according to a replica placement strategy.
2) Append -> After the write has been passed on to the individual nodes, the change in data is then committed.
3) Update -> The update function applies the change to the table structure held in main memory.
4) Read -> The client sends a read request to a random node in Cassandra; this node identifies the node in the cluster that holds the required data and forwards the read request to it [11][13][14].
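The append, update and read steps on a single node can be sketched as below. The sketch assumes that the commit step is an append-only log and that the in-memory table keeps the value with the newest timestamp; the class `Node` and its method names are hypothetical, and replication and on-disk files are left out.

```python
class Node:
    def __init__(self):
        self.commit_log = []  # append-only record of every mutation
        self.memtable = {}    # in-memory table holding the latest values

    def write(self, row_key, column, value, timestamp):
        # Append: record the mutation first.
        self.commit_log.append((timestamp, row_key, column, value))
        # Update: apply it in memory; the newest timestamp wins.
        row = self.memtable.setdefault(row_key, {})
        current = row.get(column)
        if current is None or timestamp >= current[0]:
            row[column] = (timestamp, value)

    def read(self, row_key, column):
        cell = self.memtable.get(row_key, {}).get(column)
        return None if cell is None else cell[1]

    def replay(self):
        """Rebuild the in-memory table from the log, e.g. after a restart."""
        log, self.commit_log = self.commit_log, []
        self.memtable = {}
        for timestamp, row_key, column, value in log:
            self.write(row_key, column, value, timestamp)


node = Node()
node.write("user1", "city", "Dallas", timestamp=1)
node.write("user1", "city", "Austin", timestamp=2)
node.write("user1", "city", "Late arrival", timestamp=1)  # older, ignored
value_before = node.read("user1", "city")
node.memtable = {}  # simulate losing the in-memory state
node.replay()
value_after = node.read("user1", "city")
```

Because every mutation reaches the log before the in-memory table, the table can be rebuilt from the log alone.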

5. C.3 Storage
Column-based storage is the mainstay of the storage system in Cassandra. Cassandra predominantly has one table as its primary operational unit. It also has a distributed multidimensional map that is linked using keys [19][20]. Column families are defined when Cassandra is first launched, and there is no limit on their number. At least some of the column families must be specified. These families are further subdivided into columns and super columns, which can be added to the column families at runtime. Columns are indexed by the name assigned to them, and they store numerous data values in each row. Similarly, super columns are identified by their name and consist of multiple columns that are linked to the super column [11][13][14].
5. C.4 Partitioning
Cassandra runs on symmetrical nodes in a cluster, and data is distributed across all the nodes. Partitioning is done using one of two techniques: order-preserving partitioning and random partitioning. Order-preserving partitioning enables efficient execution of range queries but might cause load-balancing issues. In both techniques, the nodes and their key ranges are distributed around the cluster [11][13][14].
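The trade-off between the two partitioning techniques can be seen in a small Python sketch. It is an illustration under stated assumptions rather than Cassandra's partitioner code: the node names, the split points and the function names are hypothetical, and the random variant uses a simple hash modulo the node count for brevity.

```python
import bisect
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]


def order_preserving_node(key, split_points=("g", "n", "t")):
    """Adjacent keys stay together, so a range scan touches few nodes."""
    return NODES[bisect.bisect_right(split_points, key)]


def random_node(key):
    """Hash the key first, spreading adjacent keys across the cluster."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]


keys = [f"user{i:03d}" for i in range(200)]
ordered_nodes = {order_preserving_node(k) for k in keys}
hashed_nodes = {random_node(k) for k in keys}
```

With order-preserving placement all 200 `user...` keys fall in one key range and land on a single node, which makes a range query cheap but concentrates the load. With hashed placement the same keys spread over the nodes, which balances load but forces a range query to contact many nodes.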
5. C.5 Replication
Data is replicated across the nodes of a cluster, and each data set is assigned to a particular node. A data item is allocated to a position on a node according to its key, with consistent hashing applied to the key to determine that position [24]. Each data item has a coordinator node, which coordinates the replication of that item to other nodes. The client can also choose the number of replicas maintained for a particular data item [8][11].
5. C.6 Concurrency Control
Cassandra provides multi-version concurrency control (MVCC) [8].
5. C.7 Security
Data files and client-database communication are unencrypted, so any user with access to the file system can extract whatever information they want. Intra-cluster communication takes place freely, whereas inter-cluster communication comes with an authentication facility. Security in Cassandra is loosely implemented: the IP addresses of the cluster nodes are the only information needed to sniff into the system [13][11].

6. Conclusion and Recommendation
We have compared three main products, Dynamo, MongoDB and Cassandra, on the basis of the major features that drive the selection of a NoSQL product for any organization. MongoDB and Cassandra are supersets of Dynamo, as they too are essentially built on key-value pair indexing. Dynamo cannot keep related attributes together, as MongoDB can through document linking. Horizontal scalability is also better achieved in MongoDB and Cassandra than in Dynamo.
Overall, we found that MongoDB and Cassandra are better products than Dynamo in terms of partitioning, replication and concurrency control.



For update operations, Cassandra is much faster than MongoDB, and its speed is independent of the size of the data.

Read operations in Cassandra are faster than in MongoDB for medium-sized data sets, although read speed declines as the number of records increases.

Complex queries that combine read and update operations are performed better in Cassandra than in MongoDB.

The symmetric node structure of a Cassandra cluster serves concurrency control better than the master-slave structure of MongoDB.

Security in NoSQL systems is loosely implemented; comparatively, Cassandra provides better authentication and authorization mechanisms than MongoDB [8][11][14].

We recommend Cassandra as the better product overall when compared on the basis of replication, concurrency control, partitioning and read/write implementation. Cassandra is tried and tested and is used by more than 1500 companies [25].

References
[1] Venkat N. Gudivada, Dhana Rao, and Vijay V. Raghavan, "NoSQL Systems for Big Data Management," 2014 IEEE 10th World Congress on Services.
[2] R. Cattell, "Scalable SQL and NoSQL data stores," SIGMOD Rec., vol. 39, no. 4, pp. 12–27, May 2011.
[3] V. Benzaken, G. Castagna, K. Nguyen, and J. Siméon, "Static and dynamic semantics of NoSQL languages," SIGPLAN Not., vol. 48, no. 1, pp. 101–114, Jan. 2013.
[4] F. Cruz, F. Maia, M. Matos, R. Oliveira, J. Paulo, J. Pereira, and R. Vilaça, "MeT: Workload aware elasticity for NoSQL," in Proceedings of the 8th ACM European Conference on Computer Systems (EuroSys '13), Prague, Czech Republic, 2013, pp. 183–196. ACM, New York, NY, USA. ISBN 978-1-4503-1994-2.
[5] Laurent Bonnet, Anne Laurent, Michel Sala, Bénédicte Laurent, and Nicolas Sicard, "REDUCE, YOU SAY: What NoSQL can do for Data Aggregation and BI in Large Repositories," 2011 22nd International Workshop on Database and Expert Systems Applications.
[6] P. A. Bernstein and N. Goodman, "Multiversion concurrency control – theory and algorithms," ACM Trans. Database Syst., 8:465–483, December 1983.
[7] A B M Moniruzzaman and Syed Akhter Hossain, "NoSQL Database: New Era of Databases for Big data Analytics - Classification, Characteristics and Comparison," International Journal of Database Theory and Application, vol. 6, no. 4, 2013.
[8] Veronika Abramova and Jorge Bernardino, "NoSQL Databases: MongoDB vs Cassandra," Polytechnic Institute of Coimbra, ISEC - Coimbra Institute of Engineering, Coimbra, Portugal.
[9] Jing Han; Haihong, E.; Guan Le; Jian Du, "Survey on NoSQL database," Pervasive Computing and Applications (ICPCA), 2011 6th International Conference on, pp. 363–366, 26-28 Oct. 2011. doi:10.1109/ICPCA.2011.6106531.
[10] Robin Hecht and Stefan Jablonski, "NoSQL Evaluation: A Use Case Oriented Survey," 2011 International Conference on Cloud and Service Computing.
[11] Clarence J. M. Tauro, Baswanth Rao Patil, and K. R. Prashanth, "A Comparative Analysis of Different NoSQL Databases on Data Model, Query Model and Replication Model," Christ University, Hosur Road, Bangalore, India.
[12] Alexandru Boicea, Florin Radulescu, and Laura Ioana Agapin, "MongoDB vs Oracle - database comparison," 2012 Third International Conference on Emerging Intelligent Data and Web Technologies.
[13] Lior Okman, Nurit Gal-Oz, Yaron Gonen, Ehud Gudes, and Jenny Abramov, "Security Issues in NoSQL Databases," 2011 International Joint Conference of IEEE TrustCom-11/IEEE ICESS-11/FCST-11.
[14] Veronika Abramova and Jorge Bernardino, "NoSQL Databases: MongoDB vs Cassandra," Polytechnic Institute of Coimbra, ISEC - Coimbra Institute of Engineering, Coimbra, Portugal.
[15] E. Brewer. (2000, Jun.) Towards robust distributed systems. [Online]. Available: http://www.cs.berkeley.edu/ brewer/cs262b-2004/PODCkeynote.pdf
[16] S. Gilbert and N. Lynch, "Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services," SIGACT News, vol. 33, pp. 51–59, June 2002. [Online]. Available: http://doi.acm.org/10.1145/564585.56460
[17] Jing Han; Haihong, E.; Guan Le; Jian Du, "Survey on NoSQL database," Pervasive Computing and Applications (ICPCA), 2011 6th International Conference on, pp. 363–366, 26-28 Oct. 2011. doi:10.1109/ICPCA.2011.6106531.
[18] Tudorica, B.G.; Bucur, C., "A comparison between several NoSQL databases with comments and notes," Roedunet International Conference (RoEduNet), 2011 10th, pp. 1–5, 23-25 June 2011. doi:10.1109/RoEduNet.2011.5993686.
[19] Avinash Lakshman and Prashant Malik, "Cassandra – A Decentralized Structured Storage System," SIGOPS Operating Systems Review, vol. 44, pp. 35–40, April 2010.
[20] Avinash Lakshman, "Cassandra – A structured storage system on a P2P Network," August 2008.
[21] The Apache Software Foundation, The Apache Cassandra Project (2011). http://cassandra.apache.org/, last accessed January 2011.
[22] David Karger, Eric Lehman, Tom Leighton, Rina Panigrahy, Matthew Levine, and Daniel Lewin, "Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the World Wide Web," in Proceedings of the twenty-ninth annual ACM symposium on Theory of computing, STOC '97, pp. 654–663, New York, NY, USA, 1997. ACM.
[23] https://www.mongodb.org/ - https://docs.mongodb.org/manual/
[24] https://aws.amazon.com/dynamodb/
[25] http://cassandra.apache.org/
[26] Grisha Weintraub, "Dynamo and BigTable – Review and Comparison," 2014 IEEE 28th Convention of Electrical and Electronics Engineers in Israel.
[27] Neal Leavitt, "Will NoSQL Databases Live Up to Their Promise?" IEEE Computer 43(2):12-14 (2010).
[28] G. DeCandia et al., "Dynamo: Amazon's highly available key-value store," SOSP 2007:205-220.
[29] Rick Cattell, "Scalable SQL and NoSQL data stores," SIGMOD Record 39(4):12-27 (2010).
[30] Alexandru Boicea, Florin Radulescu, and Laura Ioana Agapin, "MongoDB vs Oracle - database comparison," 2012 Third International Conference on Emerging Intelligent Data and Web Technologies.
