MongoDB and Relational Database Schema Comparison – Dynamic vs Rigid
In this next article of the series, we’ll analyze the different strategies that the Relational and NoSQL database vendors follow for schema design. This will be a two-part article with the first using MongoDB as our subject for comparison followed by a second discussion using Cassandra as the NoSQL product representative.
You’ll notice that I used the term “vendor” as opposed to “model”. E.F. Codd presented a set of 12 rules (actually 13, because the foundational rule is numbered 0) that define the prerequisites a database management system must meet before he considered it to be relational.
While most relational products do adhere to a subset of Codd’s rules, currently there are no commercially viable offerings that meet all 13. Although there have been several attempts to establish a taxonomy that classifies a product as NoSQL, there is no single set of defining prerequisites that must be met before a database management system is identified as adhering to the NoSQL Model.
Schema Design and Normalization
A schema is a collection of logical structures of data, or schema objects. Before we begin our analysis of Relational and NoSQL database schemas, we need to discuss the role normalization plays in their design. Normalization is the process of storing each fact in its single most appropriate place. No schema design is perfect, but normalization aims to store a fact in one place to minimize update, delete and insert anomalies. The more times a fact is duplicated in the schema, the more challenging it is to ensure its validity.
Here’s a quick example. The graphic below depicts our sample employee table. The DEPTNO column in the row contains the number of the department that the employee is assigned to. But, there may be other attributes that further describe a given department that we want to store in our database. Departments have a department number, but also have department name and location attributes.
Let’s take a look at our employee table below. In a denormalized model, when we need to change the location (DEPTLOC column) for department 10 from “Pittsburgh” to “New York”, we are required to find and update every row that has a DEPTNO=10. The application would be required to change all rows of the employee table that have that DEPTNO value, in this case, FOOT and ALLEN.
The second table is our newly created Department table. If we move all of the related attributes for a department to a separate table, we are able to more easily ensure their quality. When our applications need to update information related to a department, the applications don’t have to update multiple rows, and oftentimes multiple tables depending on schema design, to ensure all information related to a department is consistent.
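To make the difference concrete, here’s a minimal SQL sketch (using the column names from the sample tables above) contrasting the update each design requires when department 10 moves to New York:

-- Denormalized design: the location fact is repeated on every employee row,
-- so the application must change every employee assigned to department 10
UPDATE employee
   SET deptloc = 'New York'
 WHERE deptno  = 10;

-- Normalized design: the location fact lives in a single department row,
-- so one single-row update keeps it consistent everywhere
UPDATE department
   SET deptloc = 'New York'
 WHERE deptno  = 10;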
The normalization process occurs when the administrator models the data and follows a set of industry-standard rules called “Normal Forms”. Depending on how complex the resulting schema becomes, normalization may affect query performance. When a fact is stored in only one place, programs are often required to access multiple schema objects to retrieve many different but related facts. The more schema objects you have to access to retrieve the related facts, the greater the impact on query performance. We normalize to reduce data anomalies and denormalize to improve query performance.
As a result, very few models strictly adhere to all the rules of normalization. The first three are the ones most commonly followed, so let’s take a look at them:
- First Normal Form – All entities are required to have a unique identifier. In the example above, the EMPNO column uniquely identifies an employee row and the DEPTNO column uniquely identifies a department row. To adhere to first normal form, each attribute must contain only a single value and there can be no repeating groups
- Second Normal Form – All attributes of the entity must be dependent upon the entity’s entire key value. In our Department table above, the department name and location are dependent upon the department’s number
- Third Normal Form – Attributes that are not part of the entity’s key cannot be dependent on other non-key attributes
When we moved the DEPTLOC and DEPTNAME facts to the separate Department table, our applications are now required to access multiple objects to retrieve all of the information related to a given employee. In the relational model, this is performed using a JOIN statement. When the number of objects that are joined together increases, so does the chance for a performance degradation to occur.
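For example, a query that returns each employee along with the name and location of their department might look like the following sketch, which joins the two sample tables on the shared department number:

SELECT e.empno,
       e.ename,
       d.deptname,
       d.deptloc
  FROM employee e
  JOIN department d
    ON e.deptno = d.deptno;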
The most common activity that occurs during the denormalization process is the merging of objects that are often joined together by the application’s SQL statements. Administrators are attempting to reduce the number of accesses to schema objects when the application program is retrieving sets of related facts.
NoSQL and Relational Architectures Adherence to Normalization
So, now that we understand the normalization process and the benefits a normalized schema provides, what impact does it have on Relational and NoSQL schema design? Normalization is not tied to any model. It is a process that can be applied to any database management system including NoSQL. Normal forms are not dependent upon a particular database model; they are a set of predefined best practices that can be applied to schema design.
When you compare the two models, it is more common to find denormalized schemas in the NoSQL architectures. This is not because NoSQL offerings, as a class of products, inhibit or impede the normalization process. A more common set of causal factors is the type of data stored by the application, the application’s requirements for a flexible schema and lightning-fast access to large, complex data stores. Because the data being accessed can be quite large, administrators will often denormalize the schema and distribute data amongst multiple servers to provide the application with horizontal scalability.
We’ll learn in future articles that, because of their intended purpose and scope of implementation, several NoSQL product architectures don’t lend themselves to normalized design. These offerings are intended to accomplish a different set of processing objectives that aren’t solved by normalization: their goal is to provide extremely fast data access, which can be impeded by schema normalization, and their target market is customers that require high-performance access to large volumes of data.
Experienced data designers will agree that preventing data anomalies using effective modeling techniques is important to any database design and implementation. But NoSQL designers are willing to transfer more of the responsibilities for data quality and business rule enforcement from the database to the application in order to receive the benefits of performance, scalability, availability and design flexibility.
Terminology Comparison – Relational and NoSQL
Before we begin our analysis of Relational and NoSQL schemas, a brief comparison of the two models’ terminology is warranted. We’ll be using Oracle and SQL Server as examples for the relational model and MongoDB for NoSQL. In an upcoming article, we’ll perform the same comparison using Cassandra as our NoSQL subject.
Relational            NoSQL (MongoDB)
----------------      ------------------------------
Database              Database
Table                 Collection
Row                   Document
Column                Field
Index                 Index
Primary Key           Primary Key
Join                  Reference or Embedded Document
Check Constraint      Validation Rules
Oracle and SQL Server Schema Design
Both Oracle’s and SQL Server’s schema feature sets are extremely robust when compared to their MongoDB NoSQL counterpart. Relational database administrators use a combination of column data types and constraints to enforce data quality. During table creation, they assign a data type to each column to define its attributes, which include the kind of data stored in the column, its length and, if it is numeric, its precision and scale. Administrators also use constraints to further enforce data quality and provide support for business rules.
The key factor is that the administrator uses the column definitions to assign as much ownership for data quality and business rule enforcement to the database as possible. Here’s an example of an Oracle table definition:
CREATE TABLE RDX.employee (
    empno     NUMBER(5) PRIMARY KEY,
    ename     VARCHAR2(15) NOT NULL,
    ssn       NUMBER(9) UNIQUE ENCRYPT,
    job       VARCHAR2(10),
    mgr       NUMBER(5),
    hiredate  DATE DEFAULT (sysdate),
    photo     BLOB,
    sal       NUMBER(7,2) CONSTRAINT check_sal CHECK (sal > 0),
    hrly_rate NUMBER(7,2) GENERATED ALWAYS AS (sal/2080),
    deptno    NUMBER(3) NOT NULL
              CONSTRAINT admin_dept_fkey REFERENCES RDX.department (DEPTNO))
Let’s take a look at a couple of the column definitions used above:
- empno
- NUMBER(5) – Allows numeric values of up to 5 digits
- PRIMARY KEY – A constraint definition that is used to ensure the column’s value uniquely identifies this row occurrence. There must be a value for this column when the row is inserted (can’t be NULL) and it has to be unique
- ssn
- UNIQUE – Constraint definition that states that there can be no duplicate values in this column for any row in the table. Every row must have a different SSN value
- job
- VARCHAR2(10) – Can contain up to 10 characters
- photo
- BLOB – Oracle’s data type for storing unstructured data. Oracle provides several mechanisms to store data that does not neatly fit into a traditional column
- sal
- NUMBER(7,2) – Requires the values to be numeric and also defines their precision and scale (up to 7 significant digits, 2 of them to the right of the decimal point)
- CONSTRAINT check_sal CHECK (sal > 0) – The values must be greater than 0
- hrly_rate
- NUMBER(7,2) GENERATED ALWAYS AS (sal/2080) – This column definition states that the value is to be calculated using another column’s contents combined with an arithmetic computation. In this case, we are calculating the hourly rate to be the yearly salary (sal column) divided by the estimated number of work hours per year. Both Oracle and SQL Server provide a robust specification for calculated columns
The last column definition defines a referential constraint which allows the database to enforce additional relationship rules between the schema objects. If we take a look at our sample tables, we see that we are establishing a parent/child relationship between the two schema objects.
deptno NUMBER(3) NOT NULL
CONSTRAINT admin_dept_fkey REFERENCES RDX.department (DEPTNO))
Now that we have established the relationship, we can leverage the Oracle database’s inherent referential integrity rules. For example, when we add a new employee row, the value for the employee’s DEPTNO column must match a value in a row in our department table. If we update the employee’s department number, once again that new value must match a value in the department number column in our department table.
In addition, we can add referential constraint definitions to further enforce additional business rules using the schema definition. For example, if the application attempts to delete a row in the department table, there is a chance that there are child rows in the employee table that reference it. Without such a constraint, deleting department number 15 from our department table creates a data anomaly: we are left with an employee named Oster that has an invalid (non-existent) department, often called an “orphaned row”.
Using additional referential integrity constraint definitions, we can instruct the database to:
- Prevent the deletion from being performed. Often described as “on delete restrict”
- Cascade the delete operation to also delete related rows in the child table or “cascading delete”
- Set the value of the deptno column to a NULL value, described as “on delete set null”
Depending on the relational database being used, there are other constraint definition alternatives that can be used. The above three examples show the reader how referential integrity constraints can be used to enforce both data quality and business rules.
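Here’s a sketch of how those behaviors map onto Oracle-style foreign key syntax. The three definitions below are alternatives rather than constraints meant to coexist, and the restricting behavior is simply Oracle’s default when no ON DELETE clause is specified:

-- Default (restricting) behavior: the delete of a department row is rejected
-- if employee rows still reference it
ALTER TABLE RDX.employee
  ADD CONSTRAINT admin_dept_fkey FOREIGN KEY (deptno)
      REFERENCES RDX.department (deptno);

-- Cascading delete: removing a department also removes its employee rows
ALTER TABLE RDX.employee
  ADD CONSTRAINT admin_dept_fkey FOREIGN KEY (deptno)
      REFERENCES RDX.department (deptno)
      ON DELETE CASCADE;

-- Set null: removing a department leaves its employees in place with a NULL
-- department number (deptno would need to allow NULLs for this option)
ALTER TABLE RDX.employee
  ADD CONSTRAINT admin_dept_fkey FOREIGN KEY (deptno)
      REFERENCES RDX.department (deptno)
      ON DELETE SET NULL;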
MongoDB Document Model Schema
In this section, we’ll review MongoDB’s document-based schema. In an upcoming article of this series, we’ll use Cassandra as the subject for our comparison. NoSQL products vary greatly in the mechanisms they deploy to store data, so it’s important that we review a couple of different offerings. “Schemaless” is a popular term often used to characterize the structure definitions NoSQL products use to describe stored data. A more accurate comparative terminology would be dynamic (NoSQL) versus rigid (relational) schemas.
MongoDB’s schema definitions could also be described as implicitly defined (versus explicitly defined in the relational model), in that they are shared between the database management system and the application code. As stated previously, MongoDB administrators are often willing to have the application itself assume more ownership of data quality to fully leverage MongoDB’s schema flexibility, as well as its scalability and availability features.
Before we begin, we need to understand JSON and BSON and how MongoDB uses them:
- JSON is an open, standardized, human-readable data-interchange format that, although initially based on the JavaScript programming language, is language independent. JSON was created to facilitate stateful, asynchronous communications between a web application and a server without using plugins or applets.
JSON uses two primary structures (a small example follows this list):
- Field–Value pairs – The name of the field and the value being stored
- Ordered lists of values – An array or list of zero or more values, which can be of any type
- BSON is a binary-encoded format of JSON documents. The term BSON is a combination of the words “binary” and “JSON”
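As a small illustration of the two JSON structures, here’s a document containing field-value pairs and an ordered list of values (the skills field is purely illustrative):

{ "empno": 7388,
  "ename": "Foot",
  "skills": [ "Oracle", "SQL Server", "MongoDB" ] }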
MongoDB stores JSON documents, which are collections of field-value pairs, in the BSON format. JSON is the human-readable representation of the document; BSON is the binary encoding in which its data is actually stored. MongoDB’s shell and language drivers are used to translate BSON to the language-specific document representation. BSON also provides additional data types and is more efficient for storage than its JSON counterpart. Think of BSON as how the stored data is formatted on disk and JSON as the data interchange between BSON and the application.
BSON Data Types
Like its relational counterpart, MongoDB/BSON also provides a wide range of data types. Here are some quick examples:
- String – Set of numeric, alpha or special characters of any size and is the most commonly used data type. Length of string is limited by the maximum document size
- Object – Used to store embedded documents. MongoDB provides the ability to store documents within documents
- Array – Stores multiple values consisting of the same, or different data type
- Boolean – True or False Values
- ObjectID – A 12-byte BSON type. MongoDB uses ObjectID values as default values for “_id” fields to guarantee a document’s uniqueness within a collection (see the shell example after this list)
- Date – Stores date information
- Integer – Numeric values
- Timestamp – An internal MongoDB type used primarily for replication; applications generally use the Date type for date and time values
- Binary Data – Binary set of data. Subtypes are used to indicate what kind of data is in the byte array. Can be compared to a Java Byte Array
- Null – Null value
- Regular Expression – Stored as two cstrings: the first is the regex pattern, the second holds the regex options
- JavaScript – JavaScript code
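As a quick illustration of the ObjectID type, the mongo shell can generate one directly and extract the timestamp portion encoded in its first four bytes (a sketch; the generated value differs on every run):

ObjectId()                  // generates a new 12-byte ObjectId value
ObjectId().getTimestamp()   // returns the creation time encoded in the ObjectId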
Here’s Where Things Get Different
We learned previously that a MongoDB collection can be equated to an Oracle or SQL Server table. MongoDB provides the db.createCollection() method (backed by the create database command) to create a collection explicitly. Its options allow the user to “cap” or limit the size of the collection, assign a storage engine, define document validation rules/levels/actions and set defaults for indexes that will be created on the collection. A short sketch follows.
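Here’s a minimal sketch of an explicit collection creation that caps the collection’s size; the collection name and size values are arbitrary, chosen only for illustration:

db.createCollection( "auditlog",
   { capped: true,    // fixed-size collection; the oldest documents are overwritten first
     size: 1048576,   // maximum collection size in bytes
     max: 5000 } )    // optional cap on the number of documents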
Before we move forward, let’s look at some key points to better frame our discussion:
- A collection does not have to exist before a document is inserted into the system. The operation will create the collection if the collection does not currently exist
- MongoDB provides the ability to create a schema dynamically for a given document when it is first inserted into the system
- Data types are not required to be specified during collection or document creation. MongoDB interprets the values being inserted and stores the data using the appropriate data type. Document validation rules can be created to enforce predefined criteria, including a field’s $type. Document validation rules will be discussed later in this article
- Documents in the same collection can have different schemas. MongoDB collections do not enforce document structure. This allows users to tailor the field-value pairs to more effectively represent the individual entity or object the document is trying to depict. In real-world implementations, documents in a collection will usually share some number of common fields
Here is the syntax we would use to insert a document into our employees collection:
db.employees.insert(
   { empno: 7388,
     ename: 'Foot',
     jobno: 87,
     mgrno: 6599,
     hiredate: new Date('Dec 12, 2015'),
     deptno: 10
   }
)
We can also embed a document containing department information. MongoDB denormalization is discussed later in this article.
{ empno: 7388,
  ename: 'Foot',
  jobno: 87,
  mgrno: 6599,
  hiredate: new Date('Dec 12, 2015'),
  department: {
     deptno: 10,
     deptname: 'sales',
     deptloc: 'Pittsburgh'
  }
}
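Once the department information is embedded, a single query can retrieve the employee together with its department; dot notation reaches into the embedded document (a sketch based on the document above):

db.employees.find( { "department.deptno": 10 } )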
Primary Keys
Each document in a collection requires a unique identifier. MongoDB uses the “_id” field to act as a document’s primary key. Users are able to generate their own keys to use as input values for _id fields. If the client does not specify the _id field during document insertion, MongoDB will add the _id field and use a system-generated ObjectId as its value. The system constructs the ObjectId value by concatenating the seconds since the Unix epoch (timestamp), a machine identifier, a process identifier and a counter that begins with a random value.
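For example, inserting a document without an _id and reading it back shows the system-generated ObjectId; the empno, ename and ObjectId values below are purely illustrative:

db.employees.insert( { empno: 7521, ename: 'Ward' } )
db.employees.findOne( { empno: 7521 } )
// { "_id" : ObjectId("507f1f77bcf86cd799439011"), "empno" : 7521, "ename" : "Ward" }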
Unique Fields and Indexes
Like its relational counterpart, MongoDB will reject documents that contain a duplicate value in a field that is uniquely indexed. In addition, MongoDB admins are able to create compound indexes on multiple fields, and MongoDB will then enforce uniqueness on the combination of values. Here’s an example of how to create a unique index on the empno field:
db.employees.createIndex( { "empno": 1 }, { unique: true } )
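And here’s a sketch of a compound unique index that enforces uniqueness on the combination of two fields (the field names follow the employee examples above):

db.employees.createIndex( { "deptno": 1, "empno": 1 }, { unique: true } )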
Data Quality Constraints – MongoDB Validation Rules
MongoDB does provide the capability to validate incoming data for update and insert operations. Administrators are able to specify Validation Rules for collections by using the validator option. Validation rules are expressions that can use any of MongoDB’s query operators.
The validationLevel option can be set to “strict”, which enforces validation rules for all inserts and updates, while the “moderate” setting applies validation rules to inserts and to updates of existing documents that already fulfill the validation criteria. The example below shows a set of validation rules for the employees collection stating that the name field should contain a string value and the email field should match “@rdx.com”.
{ name: { $type: "string" } }, { email: { $regex: /@rdx.com$/ } }
In addition, we can use the $exists operator in combination with a validationLevel setting of “moderate” so that the validations only execute for documents that contain the field identified in the specification. Documents that do not contain the field will not be checked.
Lastly, MongoDB provides a validationAction setting. If the value is set to “error”, MongoDB will reject any insert or update that violates the validation criteria. When validationAction is set to “warn”, MongoDB will log violations but allow the insert or update to proceed.
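Putting the three pieces together, here’s a sketch of a collection created with a validator, a validation level and a validation action; the rules mirror the example above. For an existing collection, the same settings can be applied with the collMod command.

db.createCollection( "employees",
   { validator: { $and: [
        { name:  { $type: "string" } },        // name must contain a string value
        { email: { $regex: /@rdx.com$/ } }     // email must end with @rdx.com
     ] },
     validationLevel:  "strict",   // apply the rules to all inserts and updates
     validationAction: "error"     // reject documents that fail validation
   } )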
Referential Integrity
One of the most powerful schema constructs is the foreign key constraint. MongoDB does not provide foreign key functionality inherently as a database feature, so developers are required to create and maintain the relationships programmatically.
MongoDB users can denormalize the document model by storing all of the related data in a single document. The earlier example demonstrates how denormalization was used to embed the department information in the employee document:
department: {
   deptno: 10,
   deptname: 'sales',
   deptloc: 'Pittsburgh'
}
There are numerous use cases where it makes more sense from a modeling and application processing perspective to store related information in a separate document that resides in the same, or different collection. Users have two options:
- Programmatically store one document’s primary key field (_id) in another document’s field as a manual reference. The application queries the first document to obtain the stored _id value and then executes a follow-up query using that value to retrieve the related document’s information (see the sketch after this list)
- MongoDB provides DBRefs which provide a common format and type to represent relationships among documents
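Here’s a sketch of the manual reference pattern described in the first option; the departments collection and the department_id field name are illustrative choices:

// Store the department's _id in the employee document as a manual reference
db.departments.insert( { _id: 10, deptname: 'sales', deptloc: 'Pittsburgh' } )
db.employees.insert( { empno: 7388, ename: 'Foot', department_id: 10 } )

// The application performs the "join": read the employee, then use the stored
// reference to fetch the related department document
var emp  = db.employees.findOne( { empno: 7388 } )
var dept = db.departments.findOne( { _id: emp.department_id } )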
As we have learned, relational and NoSQL schemas, although possessing some common traits, are very different in their approach to schema design as well as how the application interacts with them. In my next article, we’ll compare the relational schema to Cassandra/DataStax. We’ll then move forward with the series by analyzing data access and CRUD operations.
Thanks for reading, and if your business is now leveraging the benefits of MongoDB, contact us to discuss how RDX’s MongoDB support can help you get the most out of your databases.