Understanding Database Synchronization: An Overview of Change Data Capture
Today's organizations collect large amounts of data in a variety of systems. No single system can handle all data storage, querying, and processing needs. Even modestly complex applications combine different technologies to satisfy their requirements: an OLTP database to serve user requests, a cache to speed up common requests, a full-text search index, and a data warehouse for analytics. In a modern architecture, this list can quickly expand to include many types of databases, services, and platforms. Keeping these diverse components synchronized is crucial for ensuring data integrity, achieving real-time analytics, and providing seamless user experiences.
This article briefly explains why synchronization matters and surveys the main approaches to solving the problem. Let's begin with why synchronization matters.
Why Does Database Synchronization Matter?
With large amounts of data spread across multiple systems, accessing and analyzing it can be challenging. A consistent, up-to-date view of your data is essential to get the maximum value from it.
Synchronization ensures that the same data is available across different databases or different locations within a database. This consistency is vital for maintaining the integrity of the data, ensuring that all users and systems are accessing the same, up-to-date information.
Here are several other reasons why synchronization matters:
Collaboration and Accessibility: Different users or systems may need to work with the same data in a collaborative environment. Synchronization allows for real-time or near-real-time data sharing, ensuring all collaborators can access the most current information.
Performance Optimization: Synchronization can help balance the load among various servers or databases. Distributing data across multiple locations can improve the performance of the database systems, as it allows for faster access and processing of data.
Real-time Analytics and Reporting: For businesses that rely on real-time analytics and reporting, synchronization ensures that the data used for these purposes is current and accurate, leading to better decision-making and more efficient operations.
Scalability: As organizations grow and their data needs expand, synchronization allows for the seamless integration of new databases or data sources. This scalability is essential for organizations to adapt to changing requirements without disrupting their operations.
As mentioned, modern architectures have numerous systems (databases, caches, etc.), and each of these systems has its own copy of the data. As the same or related data appears in several places, the copies must be kept in sync.
Historically, there have been two approaches to solving this problem: batch processing and dual writes. Let's explore each.
We often achieve data synchronization using extract, transform, load (ETL) processes: we take a full copy of a database, transform it, and load it into other systems, databases, or data warehouses. In other words, batch processing.
With batch processing, the ETL pipeline is scheduled to be periodically executed, say every midnight or every hour. ETLs can replicate the entire database or only the new and updated data. These types of replication are called full replication and incremental replication.
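To make the full-replication flavor of batch ETL concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table, column names, and the lowercase "transform" step are illustrative assumptions, not details from the article.

```python
import sqlite3

# Hypothetical source and target databases; the users table and its
# columns are invented for this example.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")

source.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
source.executemany("INSERT INTO users VALUES (?, ?)",
                   [(1, "A@example.com"), (2, "B@example.com")])
target.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")

def full_replication():
    """Extract every row, transform it, and load it into the target."""
    rows = source.execute("SELECT id, email FROM users").fetchall()
    transformed = [(rid, email.lower()) for rid, email in rows]  # transform step
    target.execute("DELETE FROM users")  # discard the previous copy entirely
    target.executemany("INSERT INTO users VALUES (?, ?)", transformed)
    target.commit()

full_replication()
```

Note that every run re-copies the whole table, which is exactly the inefficiency that incremental replication tries to avoid.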
But traditional batch processing has a few general drawbacks:
It is not real time.
Performing full replication is very inefficient.
Repeatedly querying for new or updated data may degrade the source database's performance.
Tracking deletes in the source database is difficult with incremental replication.
Developers can write application code that explicitly writes to each system when data changes: for example, writing to the database and then updating the other systems (such as a search index or cache). This pattern is known as dual writes.
But this approach has a serious problem: race conditions. Here is an example.
In this example, two clients concurrently want to update item X: client 1 wants to set the value to A, and client 2 wants to set it to B. Both clients write the new value to the database and then to the search index. Due to unlucky timing, the requests are interleaved: the database first sees the write from client 1, setting the value to A, then the write from client 2, setting the value to B, so the final value in the database is B. The search index first sees the write from client 2, then client 1, so the final value in the search index is A. The two systems are now inconsistent, even though no error occurred.
Note: The above example is taken from the excellent Designing Data-Intensive Applications book.
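The interleaving described above can be replayed step by step with two in-memory dictionaries standing in for the database and the search index; this is a deliberately simplified simulation of the timing, not real concurrent code.

```python
# Two in-memory stores standing in for the database and the search index.
database = {}
search_index = {}

# Client 1 sets X=A, client 2 sets X=B. Due to unlucky timing, the two
# writes arrive in a different order at each store.
database["X"] = "A"      # client 1 reaches the database first
database["X"] = "B"      # client 2 overwrites: the database ends with B
search_index["X"] = "B"  # client 2 reaches the search index first
search_index["X"] = "A"  # client 1 overwrites: the index ends with A

# The two systems now disagree, even though every write "succeeded".
print(database["X"], search_index["X"])  # prints: B A
```

Note that no write failed and no error was raised; the inconsistency comes purely from the ordering, which is what makes dual-write bugs so hard to detect.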
This approach also brings other problems:
Fault-tolerance: If one write succeeds and the other fails, it can lead to data inconsistencies.
Increased complexity: Developers need to be aware of multiple systems, and error handling can become very complex.
Performance overhead: Writing data to multiple systems increases I/O operations, leading to higher latency and reduced performance.
Change Data Capture (CDC)
Change data capture (CDC) is the process of observing all data changes written to a database and extracting them in a form that can be reflected to other systems so that the systems have an accurate copy. For instance, you can record changes in a database and replicate these changes to a search index. By applying the log of changes in the same sequence, it's expected that the data in the search index will align with the database data. In this scenario, the search index and any other resulting data systems act as recipients or consumers of the change data stream.
CDC is fascinating because changes are made available as a stream immediately as they are written. This opens up several use cases:
Continuous data replication. In batch mode, the source database cannot accept new writes, including schema changes, until the process is complete. If the amount of data is enormous, this can take a lot of time. CDC solves this problem by continuously replicating changed data to the derived data system.
Integration with microservices architecture. As companies move away from large monolithic architectures toward microservices, they must transfer data from the original databases to multiple destination systems. Change Data Capture (CDC) can be employed to keep the source and target data stores in sync.
Cloud adoption. Organizations are increasingly shifting towards cloud-based solutions to enhance their flexibility and adaptability. Utilizing cloud-native services allows businesses to concentrate on developing fresh digital experiences instead of dedicating time and resources to set up, maintain, and oversee their databases and infrastructure.
Change data capture essentially designates one database as the primary source (where changes are recorded) and the rest as secondary or follower databases. A log-based message broker is ideally suited for relaying change events from the source database because it maintains the sequence of messages.
CDC systems monitor the source database and identify insertions, updates, or deletions. Once a change is identified, it replicates the data to target systems or data stores.
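A minimal sketch of the consumer side of this process: a derived system (here, a dictionary standing in for a search index) replays an ordered stream of change events. The event shapes and operation names are illustrative assumptions, not a real CDC wire format.

```python
# A hypothetical ordered stream of change events captured from the
# source database's log; the dict shape is invented for this sketch.
change_stream = [
    {"op": "insert", "key": "X", "value": "A"},
    {"op": "update", "key": "X", "value": "B"},
    {"op": "delete", "key": "X"},
    {"op": "insert", "key": "Y", "value": "C"},
]

search_index = {}  # a derived system consuming the stream

def apply_change(event, store):
    """Apply one change event; replaying events in log order keeps the
    derived store consistent with the source database."""
    if event["op"] in ("insert", "update"):
        store[event["key"]] = event["value"]
    elif event["op"] == "delete":
        store.pop(event["key"], None)

for event in change_stream:
    apply_change(event, search_index)

print(search_index)  # prints: {'Y': 'C'}
```

The key property is ordering: applying the same events in a different sequence could leave the derived store in a different final state, which is why log-based message brokers that preserve message order fit CDC so well.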
Numerous systems use this idea: Debezium, LinkedIn's Databus, Facebook's Wormhole, Yahoo!'s Sherpa, Mongoriver, Oracle's GoldenGate, and many more.
While log-based CDC is the most efficient way to implement CDC, there are other ways to implement it. Here are two alternative methods:
Trigger-based. You have probably heard of database triggers; they can be used to implement change data capture. We do this by registering triggers that observe all changes to data tables and add corresponding entries to a changelog table. The upside of this approach is that most databases support triggers, and they are widely used. However, triggers tend to be fragile, carry significant performance overhead, and require changes to the source database schema.
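A minimal sketch of trigger-based capture, using SQLite (via Python's sqlite3 module) since it supports triggers out of the box. The items table, the changelog schema, and the operation labels are all illustrative assumptions.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT);
-- The changelog table that CDC consumers would read from.
CREATE TABLE changelog (
    change_id INTEGER PRIMARY KEY AUTOINCREMENT,
    op TEXT,
    item_id INTEGER
);
-- One trigger per operation appends an entry to the changelog.
CREATE TRIGGER items_ins AFTER INSERT ON items
BEGIN INSERT INTO changelog (op, item_id) VALUES ('INSERT', NEW.id); END;
CREATE TRIGGER items_upd AFTER UPDATE ON items
BEGIN INSERT INTO changelog (op, item_id) VALUES ('UPDATE', NEW.id); END;
CREATE TRIGGER items_del AFTER DELETE ON items
BEGIN INSERT INTO changelog (op, item_id) VALUES ('DELETE', OLD.id); END;
""")

db.execute("INSERT INTO items (id, name) VALUES (1, 'widget')")
db.execute("UPDATE items SET name = 'gadget' WHERE id = 1")
db.execute("DELETE FROM items WHERE id = 1")

log = db.execute("SELECT op, item_id FROM changelog ORDER BY change_id").fetchall()
print(log)  # prints: [('INSERT', 1), ('UPDATE', 1), ('DELETE', 1)]
```

Note that, unlike the timestamp approach below, the triggers capture DELETEs too, but every write now pays the cost of an extra insert into the changelog, which is the performance overhead mentioned above.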
Timestamp-based. We can add a column, often called modified or updated, to the schema of each table that reflects the time of the most recent change. Other systems can query this column to fetch the records updated since their last execution. The timestamp-based approach is simple to implement, but it can only capture soft deletes, not hard DELETE operations, and it adds load to the source system, since every consumer has to scan the table for recently updated rows.
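A minimal sketch of the timestamp-based approach, again with SQLite; the orders table, its updated_at column, and the timestamps are illustrative assumptions.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, updated_at TEXT)"
)
db.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, "shipped", "2024-01-01T09:00:00"),
    (2, "pending", "2024-01-02T10:30:00"),
    (3, "shipped", "2024-01-02T11:45:00"),
])

last_sync = "2024-01-02T00:00:00"  # timestamp recorded by the previous sync run

# Each run fetches only the rows modified since the last execution.
# A hard DELETE leaves no row behind, so it is invisible to this query.
changed = db.execute(
    "SELECT id, status FROM orders WHERE updated_at > ? ORDER BY updated_at",
    (last_sync,),
).fetchall()
print(changed)  # prints: [(2, 'pending'), (3, 'shipped')]
```

In practice the updated_at column should be indexed; otherwise each sync run scans the full table, which is the overhead on the source system mentioned above.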
Pros of CDC
Timeliness: Changes are captured and replicated in real time, ensuring up-to-date data across systems.
Reduced Load: CDC reduces the need for bulk data load operations, minimizing the impact on source systems.
Flexibility: It can be implemented with various databases and integrates well with modern data architectures.
Cons of CDC
Complexity: Setting up and maintaining CDC can be complex, especially in heterogeneous database environments.
Resource Intensive: Requires additional resources for monitoring and replicating changes.
Potential for Data Loss: If the CDC system fails, changes made during the downtime might not be captured.
Database synchronization in today's multi-system environments is not just a technical necessity but a strategic imperative. That's why understanding database synchronization matters. Change Data Capture (CDC) emerges as a robust solution, offering real-time data replication, integration adaptability, and cloud migration support. While CDC presents its own set of challenges, including complexity and resource demands, its advantages in maintaining data consistency, optimizing performance, and supporting scalable architectures are undeniable. As organizations continue to evolve and their data ecosystems become increasingly complex, the role of CDC in ensuring seamless, accurate, and efficient data synchronization will only grow in importance. This makes understanding and implementing CDC not just a matter of operational efficiency but a cornerstone of successful digital transformation.