NoSQL Databases: A Guide with Python Code Snippets
Table of contents
- What are NoSQL Databases?
- Why Use NoSQL Databases?
- What is a Document Database?
- What are documents?
- What are the key features of document databases?
- What makes document databases different from relational databases?
- What are the relationships between document databases and other databases?
- Why not just use JSON in a relational database?
- What are the use cases for document databases?
- Using MongoDB with Python
In this lab, we will explore NoSQL databases: what they are, why they are used, and how we can interact with them using Python.
We'll be using MongoDB, a popular NoSQL database, and pymongo, the Python driver for MongoDB.
What are NoSQL Databases?
NoSQL databases (the name is often read as "not only SQL") are a class of databases designed for flexible data models and horizontal scalability, particularly when dealing with large amounts of distributed data. Unlike relational databases, NoSQL databases do not rely on a fixed schema of tables linked by keys.
There are several types of NoSQL databases, including:
Document databases (e.g., MongoDB, CouchDB)
Key-value stores (e.g., Redis, DynamoDB)
Wide-column stores (e.g., Cassandra, HBase)
Graph databases (e.g., Neo4j, Amazon Neptune)
Why Use NoSQL Databases?
Here are some key reasons to use NoSQL databases:
Scalability: Many NoSQL databases are designed to scale horizontally across multiple servers, which helps them keep performing well as data volumes grow.
Flexibility: They let you store data without a fixed, predefined schema, so semi-structured or evolving data is easier to accommodate than in a rigid table layout.
Speed: Each type of NoSQL database is optimized for a particular data model and access pattern, so reads and writes that fit that pattern can be fast.
What is a Document Database?
A document database (also known as a document-oriented database or a document store) is a database that stores information in documents.
Document databases offer a variety of advantages, including:
An intuitive data model that is fast and easy for developers to work with.
A flexible schema that allows for the data model to evolve as application needs change.
The ability to horizontally scale out.
Because of these advantages, document databases are general-purpose databases that can be used in a variety of use cases and industries.
What are documents?
A document is a record in a document database. A document typically stores information about one object and any of its related metadata.
Documents store data in field-value pairs. The values can be a variety of types and structures, including strings, numbers, dates, arrays, or objects. Documents can be stored in formats like JSON, BSON, and XML.
Below is a JSON document that stores information about a user named Tom.
{
    "_id": 1,
    "first_name": "Tom",
    "email": "tom@example.com",
    "cell": "765-555-5555",
    "likes": [
        "fashion",
        "spas",
        "shopping"
    ],
    "businesses": [
        {
            "name": "Entertainment 1080",
            "partner": "Jean",
            "status": "Bankrupt",
            "date_founded": {
                "$date": "2012-05-19T04:00:00Z"
            }
        },
        {
            "name": "Swag for Tweens",
            "date_founded": {
                "$date": "2012-11-01T04:00:00Z"
            }
        }
    ]
}
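As a small sketch of how such a document looks from application code, the snippet below parses a trimmed copy of Tom's document with Python's standard json module and reads a few fields. The variable names are illustrative, and the "$date" wrapper is treated here as a plain nested object rather than being converted to a date type.

```python
import json

# A trimmed copy of the document above, as a JSON string.
raw = """
{
  "_id": 1,
  "first_name": "Tom",
  "likes": ["fashion", "spas", "shopping"],
  "businesses": [
    {"name": "Entertainment 1080", "partner": "Jean", "status": "Bankrupt",
     "date_founded": {"$date": "2012-05-19T04:00:00Z"}},
    {"name": "Swag for Tweens",
     "date_founded": {"$date": "2012-11-01T04:00:00Z"}}
  ]
}
"""

user = json.loads(raw)  # JSON objects become dicts, JSON arrays become lists

# Field-value pairs are read by key; arrays and nested objects nest naturally.
first_like = user["likes"][0]
business_names = [b["name"] for b in user["businesses"]]

# Not every embedded document has every field, so use .get() for optional ones.
partners = [b.get("partner") for b in user["businesses"]]

print(first_like)
print(business_names)
print(partners)
```

Note that the second business has no "partner" field, so the lookup with .get() yields None for it instead of raising an error.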
Collections
A collection is a group of documents. Collections typically store documents that have similar contents.
Not all documents in a collection are required to have the same fields, because document databases have a flexible schema.
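The flexible schema can be pictured with plain Python data. In this sketch a collection is modeled as a list of dictionaries; the documents and field names are made up for illustration and no database is involved.

```python
# A "collection" sketched as a list of dicts; all field names are illustrative.
users = [
    {"_id": 1, "first_name": "Tom", "email": "tom@example.com"},
    {"_id": 2, "first_name": "Ana", "likes": ["hiking"]},      # no email
    {"_id": 3, "first_name": "Raj", "cell": "765-555-0000"},   # no email, has cell
]

# Fields shared by every document versus fields seen in at least one document.
common_fields = set.intersection(*(set(doc) for doc in users))
all_fields = set.union(*(set(doc) for doc in users))

# Reading an optional field safely: missing values come back as None.
emails = [doc.get("email") for doc in users]

print(sorted(common_fields))
print(sorted(all_fields))
print(emails)
```

Code that reads from such a collection usually has to allow for missing fields, which is the trade-off that comes with not enforcing one shape for every document.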
CRUD operations
Document databases typically have an API or query language that allows developers to execute the CRUD (create, read, update, and delete) operations.
Create: Documents can be created in the database. Each document has a unique identifier.
Read: Documents can be read from the database. The API or query language allows developers to query for documents using their unique identifiers or field values. Indexes can be added to the database in order to increase read performance.
Update: Existing documents can be updated — either in whole or in part.
Delete: Documents can be deleted from the database.
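The four operations can be sketched without any database at all. The TinyCollection class below is a hypothetical in-memory stand-in written for this illustration; it is not the API of MongoDB or of any real driver, and it only supports equality matches.

```python
import itertools


class TinyCollection:
    """Hypothetical in-memory stand-in for a collection (not a real driver API)."""

    def __init__(self):
        self._docs = {}                   # maps _id -> document
        self._ids = itertools.count(1)    # simple unique-id generator

    def create(self, doc):
        doc = dict(doc)                   # copy so the caller's dict is untouched
        doc.setdefault("_id", next(self._ids))
        self._docs[doc["_id"]] = doc
        return doc["_id"]

    def read(self, **criteria):
        # Return copies of the documents whose fields equal every given value.
        return [dict(d) for d in self._docs.values()
                if all(d.get(k) == v for k, v in criteria.items())]

    def update(self, _id, changes):
        self._docs[_id].update(changes)   # partial update: only given fields change

    def delete(self, _id):
        return self._docs.pop(_id, None) is not None


people = TinyCollection()
tom_id = people.create({"first_name": "Tom", "email": "tom@example.com"})
people.create({"first_name": "Ana"})

people.update(tom_id, {"email": "tom@new.example.com"})
found = people.read(first_name="Tom")
deleted = people.delete(tom_id)
remaining = people.read()

print(found, deleted, remaining)
```

Keeping documents in a dictionary keyed by "_id" makes lookups by identifier cheap, which is loosely the role an index plays for other fields in a real database.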
What are the key features of document databases?
Document databases have the following key features:
Document model: Data is stored in documents (unlike other databases that store data in structures like tables or graphs). Documents map to objects in most popular programming languages, which allows developers to rapidly develop their applications.
Flexible schema: Document databases have a flexible schema, meaning that not all documents in a collection need to have the same fields. Note that some document databases support schema validation, so the schema can be optionally locked down.
Distributed and resilient: Document databases are distributed, which allows for horizontal scaling (typically cheaper than vertical scaling) and data distribution. Document databases provide resiliency through replication.
Querying through an API or query language: Document databases have an API or query language that allows developers to execute the CRUD operations on the database. Developers have the ability to query for documents based on unique identifiers or field values.
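To illustrate what optional schema validation means in practice, here is a minimal pure-Python sketch. The RULES mapping and the validation_errors function are hypothetical names invented for this example; real document databases express such rules in their own configuration syntax.

```python
# Hypothetical rules: required fields and the Python type each one must have.
RULES = {"first_name": str, "email": str}


def validation_errors(doc, rules=RULES):
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    for field, expected_type in rules.items():
        if field not in doc:
            errors.append(f"missing field: {field}")
        elif not isinstance(doc[field], expected_type):
            errors.append(f"wrong type for {field}")
    return errors


# Extra fields such as "likes" are allowed; only the listed rules are enforced.
ok = validation_errors({"first_name": "Tom", "email": "tom@example.com", "likes": []})
bad = validation_errors({"first_name": "Tom", "email": 42})
missing = validation_errors({"email": "tom@example.com"})

print(ok, bad, missing)
```

The schema stays flexible for every field the rules do not mention, which is the sense in which validation can "optionally lock down" part of a document.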
What makes document databases different from relational databases?
Three key factors differentiate document databases from relational databases:
1. The intuitiveness of the data model: Documents map to the objects in code, so they are much more natural to work with. There is no need to decompose data across tables, run expensive joins, or integrate a separate Object Relational Mapping (ORM) layer. Data that is accessed together is stored together, so developers have less code to write and end users get higher performance.
2. The ubiquity of JSON documents: JSON has become an established standard for data interchange and storage. JSON documents are lightweight, language-independent, and human-readable. Documents are a superset of all other data models so developers can structure data in the way their applications need — rich objects, key-value pairs, tables, geospatial and time-series data, or the nodes and edges of a graph.
3. The flexibility of the schema: A document’s schema is dynamic and self-describing, so developers don’t need to first pre-define it in the database. Fields can vary from document to document. Developers can modify the structure at any time, avoiding disruptive schema migrations. Some document databases offer schema validation so you can optionally enforce rules governing document structures.
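The first point, that data accessed together is stored together, can be sketched in plain Python. The tables, documents, and function names below are made up for illustration; the comparison is about the shape of the data, not about the performance of any particular database.

```python
# Relational-style layout: two "tables" linked by a key (all names illustrative).
customers = [{"id": 1, "name": "Tom"}]
addresses = [
    {"customer_id": 1, "city": "Lafayette"},
    {"customer_id": 1, "city": "Chicago"},
]


def load_customer_relational(customer_id):
    """Reassemble one customer by joining rows from both tables."""
    row = next(c for c in customers if c["id"] == customer_id)
    cities = [a["city"] for a in addresses if a["customer_id"] == customer_id]
    return {"name": row["name"], "cities": cities}


# Document-style layout: the same data embedded in a single document.
customer_doc = {
    "_id": 1,
    "name": "Tom",
    "addresses": [{"city": "Lafayette"}, {"city": "Chicago"}],
}


def load_customer_document(doc):
    """No join needed: the related data is already stored together."""
    return {"name": doc["name"], "cities": [a["city"] for a in doc["addresses"]]}


relational_view = load_customer_relational(1)
document_view = load_customer_document(customer_doc)
print(relational_view == document_view)
```

Both functions produce the same application-level object; the document version simply skips the step of matching rows across tables.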
What are the relationships between document databases and other databases?
The document model is a superset of other data models, including key-value pairs, relational, objects, graph, and geospatial.
Key-value pairs can be modeled with fields and values in a document. Any field in a document can be indexed, providing developers with additional flexibility in how to query the data.
Relational data can be modeled differently (and some would argue more intuitively) by keeping related data together in a single document using embedded documents and arrays. Related data can also be stored in separate documents, and database references can be used to connect the related data.
Documents map to objects in most popular programming languages.
Graph nodes and/or edges can be modeled as documents. Edges can also be modeled through database references. Graph queries can be run using operations like $graphLookup.
Geospatial data can be modeled as arrays in documents.
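The mappings above can be sketched with ordinary Python dictionaries. All identifiers and values here are illustrative. The reachable function is a hand-written traversal that only loosely mirrors what a recursive lookup such as $graphLookup does inside the database; it is not that operator.

```python
# Key-value pair: a document with a key field and a value field.
kv_doc = {"_id": "session:42", "value": "logged_in"}

# Geospatial point: coordinates stored as an array (GeoJSON style, longitude first).
place_doc = {
    "_id": "cafe",
    "location": {"type": "Point", "coordinates": [-86.91, 40.42]},
}

# Graph: each node is a document; edges are references to other nodes' _id values.
nodes = {
    "a": {"_id": "a", "links_to": ["b", "c"]},
    "b": {"_id": "b", "links_to": ["d"]},
    "c": {"_id": "c", "links_to": []},
    "d": {"_id": "d", "links_to": []},
}


def reachable(start):
    """Follow references breadth-first and return every node reachable from start."""
    seen, queue = [], list(nodes[start]["links_to"])
    while queue:
        node_id = queue.pop(0)
        if node_id not in seen:
            seen.append(node_id)
            queue.extend(nodes[node_id]["links_to"])
    return seen


longitude, latitude = place_doc["location"]["coordinates"]
print(kv_doc["value"], (longitude, latitude), reachable("a"))
```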
Why not just use JSON in a relational database?
As document databases have helped developers build faster, most relational databases have added support for JSON. However, simply adding a JSON data type does not bring the benefits of a native document database. Why? Because the relational approach detracts from developer productivity rather than improving it. These are some of the things developers have to deal with:
Proprietary Extensions
Working with documents means using custom, vendor-specific SQL functions which will not be familiar to most developers, and which don’t work with your favorite SQL tools. Add low-level JDBC/ODBC drivers and ORMs and you face complex development processes resulting in low productivity.
Primitive Data Handling
Presenting JSON data as simple strings and numbers rather than the rich data types supported by native document databases such as MongoDB makes computing, comparing, and sorting data complex and error prone.
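The kind of problem described here can be reproduced in generic Python, independent of any particular database. The sketch below shows that plain JSON has no date type and that values kept as text sort differently from the values they represent; the sample values are made up.

```python
import json
from datetime import datetime, timezone

# JSON has no date type, so a datetime cannot be serialized as-is.
founded = datetime(2012, 5, 19, 4, 0, tzinfo=timezone.utc)
try:
    json.dumps({"date_founded": founded})
    serializable = True
except TypeError:
    serializable = False

# Numbers kept as strings compare character by character, not numerically.
text_order = sorted(["9", "10", "100"])
numeric_order = sorted([9, 10, 100])

# Dates kept as day-first strings sort incorrectly too.
dates_as_text = ["19/05/2012", "01/11/2012"]
text_date_order = sorted(dates_as_text)
real_date_order = sorted(
    dates_as_text, key=lambda s: datetime.strptime(s, "%d/%m/%Y")
)

print(serializable)
print(text_order, numeric_order)
print(text_date_order, real_date_order)
```

With typed values, comparison and sorting follow the meaning of the data; with text, every query has to remember to convert first.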
Poor Data Quality & Rigid Tables
Relational databases offer little to validate the schema of documents, so you have few ways to apply quality controls against your JSON data. And you still need to define a schema for your regular tabular data, with all the overhead that comes when you need to alter your tables as your application’s features evolve.
Low Performance
Most relational databases do not maintain statistics on JSON data, preventing the query planner from optimizing queries against documents, and you from tuning your queries.
No native scale-out
Traditional relational databases offer no way for you to partition (“shard”) the database across multiple instances to scale as workloads grow. Instead you have to implement sharding yourself in the application layer, or rely on expensive scale-up systems.
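To give a sense of what implementing sharding in the application layer involves, here is a minimal sketch of hash-based routing. The shard names and the shard_for function are hypothetical; a real setup would also need connection handling, rebalancing, and cross-shard queries, none of which is shown.

```python
import hashlib

# Hypothetical shard names; in practice each would be a separate database instance.
SHARDS = ["shard-0", "shard-1", "shard-2"]


def shard_for(key, shards=SHARDS):
    """Pick a shard by hashing the key, so the same key always maps to the same shard."""
    digest = hashlib.sha256(str(key).encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]


# Route a few illustrative customer ids to shards.
placement = {customer_id: shard_for(customer_id) for customer_id in range(1, 7)}
print(placement)
```

One known weakness of this simple modulo scheme is that changing the number of shards remaps most keys, which is part of why hand-rolled sharding becomes a maintenance burden.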
What are the use cases for document databases?
Document databases are general-purpose databases that serve a variety of use cases for both transactional and analytical applications:
Single view or data hub
Customer data management and personalization
Internet of Things (IoT) and time-series data
Product catalogs and content management
Payment processing
Mobile apps
Mainframe offload
Operational analytics
Real-time analytics
Using MongoDB with Python
To interact with MongoDB using Python, we will be using the pymongo driver. Make sure you have MongoDB installed and running on your machine. You can install pymongo using pip:
pip install pymongo
Now, let's see how we can interact with a MongoDB database:
Establish a Connection:
First, we need to establish a connection to our MongoDB server.
from pymongo import MongoClient
# establish a connection to the MongoDB server
client = MongoClient('localhost', 27017)
Creating a Database and a Collection:
In MongoDB, databases hold collections of documents. You can create a database and a collection as follows:
# create a new database
db = client['example_db']
# create a new collection
collection = db['example_collection']
Inserting Data:
We can insert data into our collection in the form of Python dictionaries.
# sample data
data = {
    'name': 'John Doe',
    'email': 'johndoe@example.com',
    'age': 30
}
# insert the data into the collection
result = collection.insert_one(data)
# the result exposes the _id that MongoDB assigned to the new document
print(result.inserted_id)
Retrieving Data:
We can retrieve data from our collection using various queries:
# retrieve all documents from the collection
for document in collection.find():
    print(document)
# retrieve a specific document
query = {'name': 'John Doe'}
document = collection.find_one(query)
print(document)
Updating Data:
We can also update existing data:
# update query
update_query = {'name': 'John Doe'}
new_values = {"$set": {'email': 'newemail@example.com'}}
# update the document
collection.update_one(update_query, new_values)
Deleting Data:
Finally, we can delete a document that matches a query:
# delete query
delete_query = {'name': 'John Doe'}
# delete the document
collection.delete_one(delete_query)
Join Collections:
In MongoDB, you can perform a join operation using the $lookup stage of the aggregation framework. This operation lets you combine data from multiple collections into a single result.
Suppose we have two collections: orders and products. The orders collection contains a productId field that references documents in the products collection.
Here is how you can join these two collections in Python using pymongo:
from pymongo import MongoClient
# establish a connection to the MongoDB server
client = MongoClient('localhost', 27017)
# create a handle to the 'test' database
db = client['test']
# create handles to the 'orders' and 'products' collections
orders = db['orders']
products = db['products']
# use the aggregation framework to perform the join
pipeline = [
    {
        '$lookup': {
            'from': 'products',         # name of the foreign collection
            'localField': 'productId',  # field from the orders collection
            'foreignField': '_id',      # field from the products collection
            'as': 'productData'         # output array field
        }
    }
]
# execute the aggregation pipeline
joined_data = orders.aggregate(pipeline)
# print the result
for data in joined_data:
    print(data)
This pipeline adds a productData field to each document from the orders collection. The productData field is an array containing the matching documents from the products collection.
Keep in mind that the $lookup stage performs a left outer join: the pipeline returns every document from the orders collection, together with any documents from the products collection that match. If there is no match, productData will be an empty array.
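The left outer join behavior can be imitated in plain Python to see what the output looks like without a running server. The lookup function and the sample data are illustrative only; this sketch ignores details of the real $lookup stage, such as how missing fields are matched.

```python
# Illustrative data; the field names follow the example above.
products = [
    {"_id": 1, "name": "Keyboard"},
    {"_id": 2, "name": "Mouse"},
]
orders = [
    {"_id": 100, "productId": 1},
    {"_id": 101, "productId": 99},   # no matching product
]


def lookup(local_docs, foreign_docs, local_field, foreign_field, as_field):
    """Mimic a left outer join: every local document is kept."""
    joined = []
    for doc in local_docs:
        matches = [f for f in foreign_docs
                   if f.get(foreign_field) == doc.get(local_field)]
        joined.append({**doc, as_field: matches})   # empty list when nothing matches
    return joined


result = lookup(orders, products, "productId", "_id", "productData")
for doc in result:
    print(doc)
```

The order with no matching product still appears in the result, with productData set to an empty list.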
Also, make sure your MongoDB server is running, and replace 'localhost' and 27017 with your server's host and port if it is not running on your local machine.
Remember to always close the connection once you're done:
client.close()