Cassandra and Relational database schema comparison – Query vs relationship modeling
In this next article of the series, we’ll continue our comparison of relational and NoSQL schemas. Part one of the article used MongoDB as our subject for comparison. This second article uses Cassandra as the representative for NoSQL architectures.
We began our analysis in part one by reviewing relational database schemas. We discussed schema design and normalization, reviewed the different data types provided by one of the relational database vendors and finished with an overview on referential integrity and business rule enforcement.
Cassandra Introduction
The innovators behind many of the NoSQL product offerings were super-sized technology players facing a new profile of social and internet retail traffic. Facebook, Amazon, and Google were all contending with exponential increases in concurrent users and user-generated content that did not fit neatly into tabular rows and columns.
Although middle-tier web servers could easily be scaled horizontally to accommodate increased demand, their database server counterparts presented more of a challenge. Relational database clustering had been readily available for years, but it was both costly and complex to administer. These companies also understood that buying progressively more powerful hardware to scale database servers vertically presented its own set of limitations.
Their need to store unstructured data, combined with the requirement for extremely high degrees of scalability, data distribution and availability, was the business driver behind their innovation: innovation that led to the creation of database management systems that did not adhere to the relational model. Many of the NoSQL products are architected to leverage low-cost hardware to provide horizontal scalability and data redundancy at a more affordable price point.
Origins of Cassandra
Facebook originally developed Cassandra as the storage system for their Inbox search feature. Facebook released the software as an open source project in 2008. The foundations for its architecture are Amazon’s Dynamo and Google’s BigTable. In 2009, it became an Apache Incubator project, and in 2010 it was promoted to a top-level project. DataStax, a software company that develops and supports a commercial version of Cassandra, was founded in 2010. Their entire business model focuses on improving, selling and providing training and product support for Cassandra.
Designed for High Performance and Availability
Cassandra is a partitioned, wide-row datastore. A partitioned row store uses a partition key to distribute data to the nodes in the cluster. Cassandra also has the ability to store extremely wide rows (lots of columns) when compared to its relational and NoSQL counterparts. Cassandra’s theoretical limit for the number of columns in a row is 2 billion.
Cassandra is architected as a peer-to-peer distributed system that partitions data across all nodes in the cluster. Its read/write-anywhere design and robust replication mechanism provide extremely high availability. Cassandra automatically replicates data to multiple nodes for fault tolerance. Replication to multiple, geographically disparate data centers ensures Cassandra users that their data is available when they need it. Cassandra administrators are able to leverage low-cost commodity hardware to provide near-linear horizontal scalability at economical price points. Nodes can be easily added to a Cassandra cluster to increase capacity, and failed nodes can be replaced with no downtime.
Cassandra utilizes a masterless ring design to provide horizontal scalability. There is no concept of a master node; in Cassandra all nodes play an identical role. Each node in the ring communicates with the others using a distributed, scalable protocol called “gossip”. Cassandra utilizes a hashing algorithm to generate numerical tokens from each row’s partition key. Cassandra uses the tokens to distribute data to the various nodes in the cluster. When new nodes are added, or existing nodes removed, Cassandra automatically rebalances the data within the cluster to reflect the new configuration.
Cassandra Ring Cluster
Cassandra Replication
A Cassandra cluster consists of one or more keyspaces. A keyspace can be loosely compared to an Oracle schema or a Microsoft SQL Server database. Administrators configure replication on the keyspace, which allows keyspaces to have different replication models. Data can be replicated to multiple nodes in the cluster, which fosters reliability, availability and high-performance I/O operations. The number of row copies stored in the cluster is called the row’s replication factor. If a row has a replication factor of 4, the row is stored 4 times across the cluster.
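As a quick sketch of how that looks in practice (the keyspace and data center names here are illustrative, not from the article’s examples), the replication factor is declared when the keyspace is created:
-- Single data center: 4 copies of each row
CREATE KEYSPACE movies_ks
    WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 4 };
-- Multiple data centers: a per-data-center replication factor
CREATE KEYSPACE movies_multi_dc
    WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'DC1' : 3, 'DC2' : 2 };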
Cassandra’s replication model is fully multi-master. When a node becomes unavailable, surviving members of the cluster are able to satisfy both read and write requests. If multiple nodes fail, depending on their number and system configuration, there is a chance that the data will become unavailable. In master-slave models, a single node of the group is assigned to be the owner for a given piece of data and it is the only node that is allowed to modify it. Multi-master replication allows updates to occur from any client to any node. Conflicts are resolved at the column level. All node members can process both read and write requests to replicated data.
Cassandra Query Language (CQL)
The CQL language became available in Cassandra 1.1. Those with an understanding of the SQL language will be able to quickly pick up CQL syntax. It is important to note that CQL is not intended to be a replacement for SQL. CQL does not provide JOINs, LIKE, subqueries, aggregation or embedded arithmetic calculations.
There is also a fairly robust set of rules limiting which operators (IN, =, >, >=, <=, and so on) can be used in a CQL statement’s WHERE clause. The types of operators allowed depend on whether the column being restricted in the WHERE clause is a partition key, clustering key, secondary index key or part of a collection. For more information, please see DataStax’s excellent article on the CQL WHERE clause.
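As a brief, hypothetical illustration of those rules (the movies_by_year table, with release_year as its partition key and director as a regular, non-indexed column, is made up for this example):
-- Allowed: equality restriction on the partition key
SELECT * FROM movies_by_year WHERE release_year = 1979;
-- Rejected: range operators cannot be applied directly to the partition key
SELECT * FROM movies_by_year WHERE release_year > 1979;
-- Rejected unless the column has a secondary index or ALLOW FILTERING is specified
SELECT * FROM movies_by_year WHERE director = 'Ridley Scott';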
Cassandra Data Objects
Before we continue our discussion on Cassandra schema design, some supporting information is in order:
- Keyspace – A keyspace is a logical container for data tables and indexes. It can be compared to an Oracle Schema or a SQL Server database. Keyspaces also define how the data is replicated to the various nodes
- Table – Like its relational counterparts, Cassandra uses tables as logical storage structures. Unlike relational tables, a Cassandra table can be created with an astronomically high number of column specifications to provide high-speed row inserts and column-level reads. In a future article, we will analyze the fairly significant differences in how Cassandra stores row and column values when compared to relational systems. The intent of this article is to focus on schemas and not the underlying storage architecture
- Column – Can be loosely compared to a relational table column
- Primary Key – Uniquely identifies a row occurrence in a Cassandra table
- Partition Key – The partition key identifies which node in the cluster will store the row. It is responsible for data distribution across the nodes
- Clustering Key – Orders rows within a partition based on the column’s value
- Index – Cassandra does provide secondary indexes. We’ll learn later in this article that their performance is much more sensitive to column cardinality
- Partitioner – A hashing algorithm that generates a hash value token from the partition key. The token is the value used to distribute the data across the various nodes in the cluster. The partitioner’s goal is to assign equal portions of data to each node. Each node in a Cassandra cluster assumes storage ownership for a range of hash values, as illustrated in the sketch after this list
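As a small illustration of the partitioner at work (assuming the default Murmur3 partitioner and the movieorder table defined later in this article), the token() function exposes the hash value that Cassandra derives from each row’s partition key:
-- Show the partition key and the token the partitioner generated from it
SELECT id, token(id) FROM movieorder LIMIT 3;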
Cassandra Schema Design – It’s All About Performance
We learned previously that the goal of normalization is to store a fact in one place to minimize update, delete and insert anomalies. The more times a fact is duplicated in the schema, the more challenging it is to ensure its validity.
But normalized data, depending on how complex the schema becomes, often affects query performance. When a fact is stored only in one place, the programs are often required to access multiple schema objects when retrieving many different but related facts. The more schema objects you have to access to retrieve the related facts, the greater the impact it has on query performance. We normalize to reduce data anomalies and denormalize to improve query performance. The most common activity that occurs during the denormalization process is the merging of attributes that are often accessed together into a single schema object.
Cassandra’s goal is to provide high speed access to large volumes of data. Cassandra administrators will leverage low cost disk storage, Cassandra’s wide row features and fast write performance to create highly denormalized models. Data duplication, consistency and quality are not enforced as strictly by the database when you compare Cassandra to relational systems.
In relational systems, administrators model the data. In Cassandra, administrators design schemas that are based on query patterns.
Denormalization and the duplicated data that results from it will be a fact of life in Cassandra because the primary goal is to provide high-performance access to large data volumes. It’s important to note that although it is much more common to find denormalized data in Cassandra, it isn’t a system requirement. Cassandra designers build schemas based on access patterns, which often results in data being duplicated.
Here’s Where Things Get Different – Cassandra/DataStax Best Practice of One Table Per Query
Administrators attempt to design Cassandra schemas that satisfy a query’s request for data by reading as few partitions as possible, with the ultimate goal being 1 partition. To achieve this, they are often willing to create a new table to satisfy a single query. They are, in effect, pre-building the answers to individual requests for data. DataStax manuals state that a general best practice is to have one table per query.
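A hypothetical sketch of that practice (the table and column names below are my own, not taken from the DataStax documentation): the same movie data is written to two tables, each shaped so that a single query can be answered by reading a single partition.
-- Answers: "show all movies for a given genre"
CREATE TABLE movies_by_genre (
    genre text,
    movie_title text,
    director text,
    PRIMARY KEY (genre, movie_title) );
-- Answers: "show all movies for a given director"
CREATE TABLE movies_by_director (
    director text,
    movie_title text,
    genre text,
    PRIMARY KEY (director, movie_title) );
Each query reads exactly one partition; the cost is that every movie is written to, and must be kept consistent in, both tables.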
Data Types
Cassandra provides a wealth of data types to choose from during table creation. The list below shows a few examples; the data types shown aren’t recommendations and the list is not all inclusive. A small table definition that uses several of these types follows the list.
- text – UTF-8 encoded string
- varchar – UTF-8 encoded string
- ascii – US-ASCII character string
- int – 32-bit signed integer
- smallint – 2 byte integer
- bigint – 64-bit signed long
- decimal – Variable-precision decimal
- double – 64-bit IEEE-754 floating point
- date – Date string, such as 2015-05-03
- time – Time string, such as 13:30:54.234
- timestamp – Date plus time, encoded as 8 bytes since epoch
- uuid – Universally unique identifier
- collections – Set, list and map collection types can be loosely compared to an array of values
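Here is a quick sketch of a table definition that uses several of these types, including a set collection (the table and column names are illustrative):
CREATE TABLE movie_catalog (
    movie_id uuid,
    movie_title text,
    release_date date,
    running_time smallint,
    box_office decimal,
    genres set<text>,
    PRIMARY KEY (movie_id) );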
Primary Keys
Primary keys in Cassandra can be compared to their relational counterparts in that they must be unique and they must contain a value. But, as we will learn later in this article, primary key specifications in Cassandra also provide other functionality. Primary keys can be simple or compound. A primary key with one column in the specification is known as a simple primary key. If the primary key contains multiple columns, it is identified as compound.
Partition Key
Cassandra uses the first column of the primary key as the partition key. The partition key identifies which node in the cluster will store the row. For simple primary keys, one column is both the primary key and the partition key.
In order to better describe how Cassandra stores data using primary and partition keys, let’s deviate from our emp table and use a movie table as our example:
CREATE TABLE movieorder (
id uuid,
genre text,
movie_order int,
movie_id uuid,
movie_title text,
director text,
PRIMARY KEY (id) );
In the example above, the id column is both the table’s primary key and the partition key. The data type for the id column is UUID (universally unique identifier), which is used in distributed systems to uniquely identify information. The partitioner will generate the token based on the id column, and the node that is assigned the token range containing that value assumes storage responsibility for the row. Please note that I’m using UUID as an example and it is not intended to be a blanket recommendation. There is a wealth of information in DataStax’s manuals and web discussions on the best columns to use for partition keys.
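As a quick sketch of populating the table above (the values are made up), the built-in uuid() function can generate the partition key value at insert time:
INSERT INTO movieorder (id, genre, movie_order, movie_id, movie_title, director)
VALUES (uuid(), 'Sci-Fi', 1, uuid(), 'Alien', 'Ridley Scott');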
Compound Primary Key
A compound primary key consists of multiple columns. The first column in the specification is the partition key. We learned previously that the partition key determines which node stores the data; the partitioner will generate the token based on that column. The column or columns specified after the first column are known as clustering columns. Rows are stored in order of the clustering column values.
The clustering order is determined by the position of the columns in the compound primary key specification. Once the data has been distributed to the cluster nodes using the partition key, the clustering columns store the rows on each node in clustering column order, which provides fast read access. Here’s an example of a compound primary key:
CREATE TABLE movieorder (
id uuid,
genre text,
movie_order int,
movie_id uuid,
movie_title text,
director text,
PRIMARY KEY (id, genre) );
In the example above, the primary key consists of the id and genre columns. Cassandra would store rows on the nodes using the id column and order those rows by the values contained in the genre column. The default order is ascending but can be changed to descending by using the “WITH CLUSTERING ORDER BY” clause in the table definition.
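A sketch of the same table redefined with the genre clustering column stored in descending order:
CREATE TABLE movieorder (
    id uuid,
    genre text,
    movie_order int,
    movie_id uuid,
    movie_title text,
    director text,
    PRIMARY KEY (id, genre) )
    WITH CLUSTERING ORDER BY (genre DESC);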
Composite Partition Key
Composite partition keys are used when single column partition keys result in an overload of data on a single node. The multiple column specifications allow administrators to more granularly distribute data to the nodes.
CREATE TABLE movieorder (
id uuid,
genre text,
movie_order int,
movie_id uuid,
movie_title text,
director text,
PRIMARY KEY ( (id, genre), movie_title, director) );
The partitioner will generate the token based on the id and genre columns. Cassandra will attempt to store rows that have the same id and genre on the same node, and rows that have the same id but different genres on different nodes. Within each partition, the rows will be stored in order of movie_title and then director.
Unique Columns
Unlike its relational counterparts, the only specification in Cassandra that can be used to enforce column uniqueness is the primary key clause. Although additional indexes can be created in Cassandra, there is no specification that can be used to guarantee uniqueness on the indexed columns.
Secondary Indexes
The CREATE INDEX command is used to create additional indexes on Cassandra tables. In relational systems, the most beneficial columns on which to create B-TREE indexes are ones with high cardinality. High cardinality columns contain values that are very uncommon or unique; the higher the cardinality, the better the B-TREE index performs. In Cassandra, it is the opposite: columns that have many rows with the same indexed value are the better candidates for secondary indexes. However, extremely low cardinality columns that contain Boolean true/false or yes/no values are not good candidates either. We’ll discuss the use of secondary index and partition key columns in WHERE clauses in a later article.
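A small sketch using the movieorder table from the earlier examples: a secondary index on the director column allows that non-key column to be restricted in a WHERE clause.
CREATE INDEX movieorder_director_idx ON movieorder (director);
-- With the index in place, the non-key column can be queried directly
SELECT * FROM movieorder WHERE director = 'Ridley Scott';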
Referential Integrity and Data Relationships
One of the most powerful relational schema constructs is the foreign key constraint. Relational modelers design schemas that reflect how the data elements are related to each other and often use their specifications to enforce business rules. Cassandra does not provide foreign key functionality inherently as a database feature.
Relational administrators need to be aware that Cassandra has a much different scope of implementation. Cassandra schema design focuses primarily on query patterns. Designers combine Cassandra’s wide row capability with intelligent partition and clustering key selection to group logically related data elements together. Denormalization and data duplication techniques are used intelligently during the query modeling process. Cassandra administrators use its high-speed write performance to overcome the challenges of duplicated data.
Their ultimate goal is data access performance, not strictly adhering to the rules of normalization. Savvy Cassandra administrators understand the update challenges of data duplication. They use denormalization techniques much like their relational DBA counterparts – in an educated fashion to improve performance, as opposed to being lax in their data modeling techniques.
The next article in this series will compare how data is accessed in relational and NoSQL databases. Our discussion will include data retrieval and manipulation, as well as advanced query constructs, including joins (or the lack thereof), complex queries and aggregation.
Thanks for reading.