What is a distributed database?
Distributed databases are all around us: they power the world’s most visited websites such as Google and Facebook, and they’re responsible for huge technological changes such as the rise of cloud computing. But what are distributed databases exactly, and how do distributed databases work?
What are distributed databases?
As the name suggests, a distributed database is a database that is distributed across multiple servers or machines. This could mean multiple computers in the same location (e.g. an office), or multiple computers connected over a network.
One reason why distributed databases are necessary is due to the explosive growth of big data. Massive tech companies such as Google and Facebook store a mind-boggling amount of information. In 2017, for example, Facebook revealed that it generates more than 500 terabytes (500,000 gigabytes) of data every day. This is far more than any single machine is capable of storing.
The benefits of distributed databases also include greater scalability and availability. For example, if one of the machines in the distributed system goes down, the data on that machine will be temporarily unavailable, but the data on the other machines can still be accessible while you work to restore the connection.
In most cases, for the convenience of the end user, distributed databases need to behave as though they are actually a single database in a single physical location. We’ll discuss the details of how distributed databases work in the next section.
How do distributed databases work?
Building a distributed database is substantially more complex than a database that runs on a single machine. Users must be able to treat a distributed database as a single logical entity, without worrying about the underlying implementation. In addition, database transactions occurring on one machine must maintain the integrity of the databases on other machines in the system.
To keep the data in the distributed database up-to-date, there are two fundamental processes that underlie every distributed database system. Both of these techniques aim to keep the database contents identical across different machines, but with different methods:
- Replication: Database replication is a time- and effort-intensive process that automatically propagates changes to a database to the other copies of that database.
- Duplication: Database duplication creates an exact copy of the original database (usually the “master” database). Rather than occurring instantaneously, as with replication, duplication usually runs at a set time (e.g. after business hours, to avoid disrupting users).
Distributed databases may use either or both of the following strategies for distributing information across multiple machines:
- Replication: Distributed databases that use replication copy the entire database, making identical replicas in different locations.
- Fragmentation: Distributed databases that use fragmentation divide the database into multiple parts, storing these partitions on different machines. Horizontal fragmentation splits up the database while keeping individual records intact; vertical fragmentation splits up each individual record.
Two other important terms in the world of distributed databases are homogeneous and heterogeneous databases:
- Homogeneous databases use the same database management system (DBMS) and the same operating system on all of the system’s machines.
- Heterogeneous databases may use different DBMS, operating systems, and/or data models on different machines.
Homogeneous databases can further be divided into autonomous and non-autonomous databases. Autonomous homogeneous databases consist of nodes that operate independently and exchange information with each other using message passing. Non-autonomous homogeneous databases are coordinated by a central system.
Meanwhile, heterogeneous databases can also be divided into federated and non-federated databases, similar to the distinction with homogeneous databases. Federated heterogeneous databases consist of nodes that operate independently, while non-federated heterogeneous databases are coordinated by a central system.
Distributed databases with Redis
Redis is an open-source, in-memory data structure store that is used to implement NoSQL key-value databases, caches, and message brokers. One major benefit of Redis is that it’s easy for users to build distributed databases using tools such as Redis Cluster and Redis Sentinel.
Redis Cluster is a distributed implementation of Redis that automatically shards (i.e. partitions) data across multiple Redis nodes. Redis Cluster uses master-slave replication , in which a single “master” node is backed up by multiple secondary “slave” nodes. In the event that the master node fails, one of the slave nodes can become the new master and continue operations without disruption for the end user.
Another tool for building distributed databases with Redis is Redis Sentinel , a high-availability solution for Redis that helps monitor and automatically handle failover events in a master-slave architecture.