Change Data Capture (CDC) is a technique for tracking row-level changes in a database (inserts, updates, and deletes) and notifying other systems of those changes in the order they occurred. In disaster recovery scenarios, CDC is mainly used to keep a primary and a backup database in sync, replicating data from the primary to the secondary database in real time.
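To make the idea concrete, the sketch below shows one possible shape for a row-level change event; the ChangeEvent record and its fields are illustrative assumptions, not SeaTunnel's actual event model.

```java
import java.util.Map;

// Illustrative only: a minimal shape for a row-level change event.
// SeaTunnel's actual event model is richer; these fields are assumptions.
public class ChangeEventSketch {

    enum Op { INSERT, UPDATE, DELETE }

    // op:        kind of row change
    // tableId:   source table, e.g. "db.orders"
    // before:    row image before the change (null for INSERT)
    // after:     row image after the change (null for DELETE)
    // logOffset: position in the database log; preserves the order of events
    record ChangeEvent(Op op, String tableId,
                       Map<String, Object> before, Map<String, Object> after,
                       long logOffset) {}

    public static void main(String[] args) {
        ChangeEvent update = new ChangeEvent(Op.UPDATE, "db.orders",
                Map.of("id", 1, "status", "NEW"),
                Map.of("id", 1, "status", "PAID"),
                42L);
        System.out.println(update);
    }
}
```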
```
source ----------> CDC ----------> sink
```

Apache SeaTunnel CDC

SeaTunnel CDC offers two types of data synchronization: snapshot reading, which synchronizes a database's historical data, and incremental tracking, which captures ongoing changes and synchronizes them in real time.
Snapshot reading is the process of synchronizing a database's historical data. The lock-free nature of this phase is worth emphasizing, because many existing CDC platforms, such as Debezium, may lock tables while synchronizing historical data. The basic flow of the snapshot read is as follows:
```
storage -------------> splitEnumerator ---------- split ----------> reader
                              ^                                       |
                              |                                       |
                              \----------------- report -------------/
```

Split Partitioning
The splitEnumerator (split distributor) partitions the table data into multiple splits based on specified fields (such as the table ID or a unique key) and a defined step size.
Parallel Processing
Each split is assigned to a different reader for parallel reading. A single reader will occupy one connection.
Event Feedback
After completing the read of a split, each reader reports its progress back to the splitEnumerator. A split is described by the following metadata:
```
String splitId                 # Routing ID
TableId tableId                # Table ID
SeatunnelRowType splitKeyType  # The type of the field used for partitioning
Object splitStart              # Start point of the partition
Object splitEnd                # End point of the partition
```

Once the reader receives the split information, it generates the corresponding SQL statements. Before it starts reading, it records the position in the database log that corresponds to the current split.
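A minimal sketch of how a reader might turn this split metadata into a bounded range query is shown below; the class and method names (SnapshotSplit, buildSplitQuery, currentLogPosition, reportToEnumerator) are illustrative assumptions, not SeaTunnel's actual API.

```java
// Illustrative only: a reader turns split metadata into a bounded range query,
// recording the log position before the read (low watermark) and after it (high watermark).
public class SnapshotSplitReaderSketch {

    // Simplified stand-in for the split metadata delivered to the reader.
    record SnapshotSplit(String splitId, String tableId, String splitKey,
                         Object splitStart, Object splitEnd) {}

    // Build a range query that covers exactly this split.
    static String buildSplitQuery(SnapshotSplit split) {
        return "SELECT * FROM " + split.tableId()
                + " WHERE " + split.splitKey() + " >= ?"
                + " AND " + split.splitKey() + " < ?";
    }

    public static void main(String[] args) {
        SnapshotSplit split = new SnapshotSplit("split-0", "db.orders", "id", 0L, 10_000L);
        long lowWatermark = currentLogPosition();          // position before the read starts
        String sql = buildSplitQuery(split);               // bind splitStart/splitEnd as parameters
        System.out.println(sql + "  [low watermark=" + lowWatermark + "]");
        // ... execute the query and buffer the rows ...
        long highWatermark = currentLogPosition();         // position after the read finishes
        reportToEnumerator(split.splitId(), highWatermark);
    }

    // Placeholder: e.g. query the current binlog offset (SHOW MASTER STATUS on MySQL).
    static long currentLogPosition() { return 0L; }

    // Placeholder: send the progress report (splitId + highWatermark) to the splitEnumerator.
    static void reportToEnumerator(String splitId, long highWatermark) { }
}
```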
After completing the current split, the reader reports its progress back to the splitEnumerator with the following data:

```
String splitId          # Split ID
Offset highWatermark    # Log position corresponding to the split, used for later validation
```

Incremental Synchronization

The incremental synchronization phase begins after the snapshot read phase. In this stage, any changes occurring in the source database are captured and synchronized to the backup database in real time by listening to the database log (e.g., the MySQL binlog). Incremental tracking is usually single-threaded to avoid pulling the binlog more than once and to reduce the load on the database, so only one reader is used, occupying a single connection.
```
data log -------------> splitEnumerator ---------- split ----------> reader
                              ^                                        |
                              |                                        |
                              \----------------- report --------------/
```

In the incremental synchronization phase, all splits and tables from the snapshot phase are combined into a single split. The split metadata during this phase is as follows:
```
String splitId
Offset startingOffset      # The lowest log start position among all splits
Offset endingOffset        # Log end position, or "continuous" if the phase is ongoing, as in the incremental phase
List<CompletedSnapshotSplitInfo> completedSnapshotSplitInfos   # Split information from the snapshot phase
```

The CompletedSnapshotSplitInfo fields are as follows:
```
String splitId
TableId tableId
SeatunnelRowType splitKeyType
Object splitStart
Object splitEnd
Offset watermark           # Corresponds to the highWatermark in the progress report
```

The split used in the incremental phase therefore carries the watermark of every split from the snapshot phase, and the minimum watermark among them is selected as the starting point for incremental synchronization.
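A minimal sketch of selecting that starting offset, assuming a simplified CompletedSnapshotSplitInfo with a numeric watermark rather than SeaTunnel's actual types:

```java
import java.util.Comparator;
import java.util.List;

// Illustrative only: the incremental phase starts from the minimum high watermark
// across all completed snapshot splits, so nothing between the earliest and latest
// watermarks can be missed; duplicates in that window are filtered per split later.
public class IncrementalStartSketch {

    record CompletedSnapshotSplitInfo(String splitId, long watermark) {}

    static long incrementalStartingOffset(List<CompletedSnapshotSplitInfo> completedSplits) {
        return completedSplits.stream()
                .map(CompletedSnapshotSplitInfo::watermark)
                .min(Comparator.naturalOrder())
                .orElseThrow(() -> new IllegalStateException("no completed snapshot splits"));
    }

    public static void main(String[] args) {
        List<CompletedSnapshotSplitInfo> splits = List.of(
                new CompletedSnapshotSplitInfo("split-0", 120L),
                new CompletedSnapshotSplitInfo("split-1", 95L));
        System.out.println("incremental start offset = " + incrementalStartingOffset(splits));
    }
}
```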
Exactly-Once Semantics

Whether in the snapshot read phase or the incremental read phase, the database may keep changing while synchronization is in progress. How can exactly-once delivery be guaranteed?
Snapshot Read Phase

In the snapshot read phase, suppose a split is being read while changes are still happening: a row k3 is inserted, k2 is updated, and k1 is deleted. If these concurrent changes are not tracked during the read, the updates could be lost. SeaTunnel handles this as follows:
- Before reading a split, SeaTunnel records the current log position as the low watermark; after the split has been read, it records the position again as the high watermark and compares the two.
- If high = low, the split's data did not change during the read. If (high - low) > 0, changes occurred while the split was being processed. In that case, SeaTunnel will:
- Read the database log between the low and high watermarks to collect the change events that happened during the split read.
- Replay those change events onto the split's in-memory data, so that the rows emitted for the split reflect the state at the high watermark (a sketch of this merge step follows the diagram below).
```
        insert k3          update k2          delete k1
            |                  |                  |
            v                  v                  v
bin log --|---------------------------------------------------|-- log offset
     low watermark                                       high watermark

CDC reads:  k1  k3  k4
                |
             Replays
                v
Real data:  k2  k3' k4
```
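The merge step can be sketched as follows; the event and row types are simplified assumptions, not SeaTunnel's internal representation, and the example mirrors the k1/k2/k3 scenario above.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: merge the change events captured between the low and high
// watermarks into the rows read for a split, so the emitted data reflects the
// state at the high watermark.
public class SnapshotBackfillSketch {

    enum Op { INSERT, UPDATE, DELETE }
    record LogEvent(Op op, Object key, Map<String, Object> row) {}

    // snapshotRows: rows read by the split query, keyed by the split key.
    // logEvents:    events from the log between low and high watermark, limited to this split.
    static Map<Object, Map<String, Object>> merge(Map<Object, Map<String, Object>> snapshotRows,
                                                  List<LogEvent> logEvents) {
        Map<Object, Map<String, Object>> merged = new LinkedHashMap<>(snapshotRows);
        for (LogEvent event : logEvents) {
            switch (event.op()) {
                case INSERT, UPDATE -> merged.put(event.key(), event.row()); // upsert the latest image
                case DELETE -> merged.remove(event.key());                   // drop deleted rows
            }
        }
        return merged; // emitted downstream once, giving exactly-once output for the split
    }

    public static void main(String[] args) {
        Map<Object, Map<String, Object>> snapshot = new LinkedHashMap<>();
        snapshot.put("k1", Map.of("v", 1));
        snapshot.put("k3", Map.of("v", 3));
        snapshot.put("k4", Map.of("v", 4));
        List<LogEvent> events = List.of(
                new LogEvent(Op.INSERT, "k3", Map.of("v", 33)),  // k3'
                new LogEvent(Op.UPDATE, "k2", Map.of("v", 2)),   // k2 appears via replay
                new LogEvent(Op.DELETE, "k1", Map.of()));        // k1 was deleted
        System.out.println(merge(snapshot, events));             // {k3', k4, k2}
    }
}
```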
Incremental Phase

Before starting the incremental phase, SeaTunnel first validates all splits from the previous step. Between splits, data may still change; for instance, new records inserted between split1 and split2 could be missed during the snapshot phase. To recover the data that falls between splits, SeaTunnel follows this approach:

- Start reading the database log from the minimum watermark among all splits.
- For each log entry, check which snapshot split its key belongs to: if the entry's position is earlier than that split's high watermark, it was already merged during the snapshot read and is filtered out; otherwise it is delivered downstream (a sketch of this filtering check follows the diagram below).
```
           |------------------ filter split2 -------------------|
           |---- filter split1 ----|
data log --|-----------------------|------------------|----------------------------------|-- log offset
     min watermark          split1 watermark    split2 watermark                    max watermark
```
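A minimal sketch of that filtering check, assuming numeric split keys and offsets rather than SeaTunnel's actual types:

```java
import java.util.List;

// Illustrative only: decide whether an incremental log event should be emitted or
// filtered because it was already captured during the snapshot phase.
public class IncrementalFilterSketch {

    record CompletedSnapshotSplitInfo(String tableId, long splitStart, long splitEnd, long watermark) {}

    /** Returns true if the event must be emitted downstream, false if it is a duplicate. */
    static boolean shouldEmit(String tableId, long key, long eventOffset,
                              List<CompletedSnapshotSplitInfo> completedSplits) {
        for (CompletedSnapshotSplitInfo split : completedSplits) {
            boolean inRange = split.tableId().equals(tableId)
                    && split.splitStart() <= key
                    && key < split.splitEnd();
            // Events inside a split's key range that happened before that split's high
            // watermark were already merged during the snapshot read; skip them.
            if (inRange && eventOffset < split.watermark()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<CompletedSnapshotSplitInfo> splits = List.of(
                new CompletedSnapshotSplitInfo("db.orders", 0L, 10_000L, 120L),
                new CompletedSnapshotSplitInfo("db.orders", 10_000L, 20_000L, 150L));
        System.out.println(shouldEmit("db.orders", 42L, 100L, splits)); // false: before split1's watermark
        System.out.println(shouldEmit("db.orders", 42L, 200L, splits)); // true: after every watermark
    }
}
```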
Checkpoint and Resume

What about pausing and resuming CDC? SeaTunnel uses a distributed snapshot algorithm (the Chandy-Lamport algorithm).

Assume the system has two processes, p1 and p2, where p1 has three variables X1, Y1, Z1 and p2 has three variables X2, Y2, Z2. The initial states are as follows:
```
 p1        p2
X1:0      X2:4
Y1:0      Y2:2
Z1:0      Z2:3
```

At this point, p1 initiates a global snapshot: p1 first records its own process state, then sends a marker to p2.
Before the marker reaches p2, p2 sends message M to p1.
```
 p1                          p2
X1:0  -------marker------>  X2:4
Y1:0  <---------M---------  Y2:2
Z1:0                        Z2:3
```

Upon receiving the marker, p2 records its state, and p1 receives the message M. Since p1 has already performed its local snapshot, it only needs to record the message M. The final snapshot looks like this:
```
 p1        M        p2
X1:0               X2:4
Y1:0               Y2:2
Z1:0               Z2:3
```

In SeaTunnel CDC, markers are sent to all readers, split enumerators, writers, and other nodes, and each node saves its own in-memory state.
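As a rough illustration of the marker protocol described above, here is a minimal sketch, assuming hypothetical Node and marker types rather than SeaTunnel's actual checkpointing classes:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: Chandy-Lamport style marker handling, mirroring the p1/p2 example.
// A node records its local state when the first marker arrives, forwards the marker,
// and records messages that were still in flight (like message M) as channel state.
public class CheckpointMarkerSketch {

    static class Node {
        final String name;
        final Map<String, Integer> localState = new HashMap<>();   // e.g. the X, Y, Z variables
        final List<Object> recordedInFlight = new ArrayList<>();   // channel state of the snapshot
        boolean snapshotTaken = false;

        Node(String name) { this.name = name; }

        // Called when a checkpoint marker arrives (or when this node initiates the snapshot).
        void onMarker(long checkpointId, List<Node> downstream) {
            if (snapshotTaken) {
                return;                                            // already part of this snapshot
            }
            snapshotTaken = true;
            Map<String, Integer> snapshot = new HashMap<>(localState);   // record local state
            System.out.println(name + " snapshot " + checkpointId + ": " + snapshot);
            downstream.forEach(node -> node.onMarker(checkpointId, List.of()));  // forward marker
        }

        // Called when a normal message arrives; markerSeenOnChannel says whether the marker
        // has already arrived on the same channel as this message.
        void onMessage(Object message, boolean markerSeenOnChannel) {
            if (snapshotTaken && !markerSeenOnChannel) {
                // Like p1 logging message M: the message was in flight when the snapshot
                // started, so it is recorded as part of the checkpoint's channel state.
                recordedInFlight.add(message);
            }
            // ...then apply the message to localState as usual.
        }
    }

    public static void main(String[] args) {
        Node p1 = new Node("p1");
        Node p2 = new Node("p2");
        p1.onMarker(1L, List.of(p2));   // p1 starts the snapshot and forwards the marker to p2
        p1.onMessage("M", false);       // M arrives at p1 before p2's marker: recorded as channel state
    }
}
```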