How ClickPipes Solves the Challenge of Synchronizing Large Tables

Learn how ClickPipes tackles the complex challenge of synchronizing large database tables to ClickHouse with its innovative sharding solution.

In today's data-driven business environment, real-time analytics and decision-making are crucial to enterprise success. Our newly developed product provides customers with a truly real-time cloud data warehouse solution built on two core components:

  1. ClickHouse Cloud: Through this service, users can quickly create an efficient, low-latency cloud data warehouse that responds rapidly to complex queries
  2. Real-time Data Integration Service: Uses CDC (Change Data Capture) to continuously synchronize data from users' existing databases into the cloud data warehouse in real time

This combination allows users to enjoy both real-time data and real-time queries, providing a new data service experience.

To fully leverage ClickHouse's analytical performance, the tables synchronized into the data warehouse are often very large. Ingesting large tables quickly and accurately is therefore a critical capability of ClickPipes. This article describes our large table sharding synchronization solution in detail.

Challenges of Traditional Data Synchronization Methods

Traditional large data table synchronization is typically accomplished through the following methods:

Full End-to-End Synchronization

This method exports all data in the table with a single SQL query, then writes it to the target system. It is the simplest to implement and works well for small tables, but faces the following problems when handling large tables (a minimal sketch follows the list):

  • High source database pressure: A full scan of the entire table puts enormous pressure on the source database, especially in high-concurrency production environments; long-running read pressure may affect users' online business operations.
  • Source database resource exhaustion: To keep a long-running cursor consistent, some databases buffer the data changes that occur during the cursor's lifetime, for example in undo segments (Oracle) or temporary tables (SQL Server). A full table scan touches every record, so this bookkeeping consumes significant extra resources; if they are held for too long, the database eventually raises errors and the synchronization is interrupted.
  • Slow synchronization speed: With huge data volumes, the synchronization process may take hours or even longer.
  • Not resumable: When exporting data with a single SQL query, once the transfer is interrupted and the cursor exits, there is no effective mechanism to resume it; the entire process must start over from the beginning.
  • Poor real-time performance: New data changes must wait until the full synchronization completes, so the data in the target warehouse can lag by several hours.
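
To make the failure mode concrete, here is a minimal sketch of the traditional single-cursor export in Python. The source_conn (a generic DB-API connection), the write_batch sink, and the table name are hypothetical placeholders, not part of any specific product:

    # A minimal sketch of the single-query full export, assuming a
    # generic DB-API connection and a hypothetical write_batch sink.
    def full_export(source_conn, write_batch, batch_size=10_000):
        cursor = source_conn.cursor()
        # One cursor holds a consistent snapshot for the entire export;
        # on a large table it can stay open for hours, which is what
        # exhausts undo/temp resources on the source database.
        cursor.execute("SELECT * FROM orders")  # full table scan
        while True:
            rows = cursor.fetchmany(batch_size)
            if not rows:
                break
            write_batch(rows)  # any interruption restarts from row zero

Everything hinges on one long-lived cursor, which is exactly what the problems above trace back to.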

Kafka Transfer

This method first exports data to a message queue, then consumes it into the data warehouse. Compared to direct end-to-end synchronization, it has the following advantages:

  • Significantly improves data export speed
  • To some extent, alleviates the problem of long-running SQL queries tying up the source database

Although the message queue transfer solution (sketched below) has certain advantages, it does not fundamentally change how the source table is read, it merely softens the problem; and the additional component increases maintenance complexity and operational costs.
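
For illustration, here is a minimal sketch of the transfer pattern in Python using the kafka-python client; the client choice, the topic name, and the write_batch loader are assumptions, not a prescribed setup:

    # Decouple the exporter from the warehouse loader with a Kafka topic,
    # so a slow target no longer holds the source cursor open as long.
    import json
    from kafka import KafkaConsumer, KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda row: json.dumps(row).encode("utf-8"),
    )

    def export_to_kafka(rows, topic="orders_export"):
        for row in rows:
            producer.send(topic, row)  # rows are buffered and sent async
        producer.flush()

    def consume_into_warehouse(write_batch, topic="orders_export"):
        consumer = KafkaConsumer(topic, bootstrap_servers="localhost:9092")
        for message in consumer:
            write_batch(json.loads(message.value))  # hypothetical loader

Note that the export side still reads the source table with the same full-scan query; only the hand-off to the target changes.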

These problems limit traditional synchronization methods in real-time scenarios. Below we introduce our sharding synchronization solution and the difference it makes.

ClickPipes' Sharding Synchronization Solution

To solve the above problems, we designed a HASH-based sharding solution for large table synchronization. It decomposes a single one-shot synchronization into multiple small-scale synchronizations by sharding the source table's data, which not only significantly improves synchronization performance but also avoids many of the problems inherent in one large long-running query.

Technical Implementation Principles

  1. HASH Sharding (see the sketch after this list):
    • First, the data is hashed on the table's primary key or another unique identifier field and taken modulo N, dividing the table into several small datasets.
    • When reading from the source database, the query changes from "SELECT * FROM TABLE" to "SELECT * FROM TABLE WHERE mod(hash(pk), 10) = 0" (and likewise for each remainder), reducing the data volume of a single query.
    • According to the modulo count, the previous single query becomes N smaller query operations, one per shard.
  2. Parallel Processing: Multiple shards can be read, encoded/decoded, and written in parallel.
  3. Checkpoint Recording: During task execution, completed data shards will be recorded in the task status. The next time the task starts, it will skip already completed shards, achieving shard-level checkpoint resumption.
  4. Incremental Data Synchronization by Shard: For scenarios with high real-time requirements, combined with CDC-captured incremental data, data from completed shards can enter the incremental phase first, ensuring the target data is as up-to-date as possible.
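
Putting points 1-3 together, the sketch below condenses the flow in Python. It assumes a DB-API style connect() factory, a write_batch sink, an in-memory set standing in for a persisted checkpoint store, and hash() as a placeholder for the source database's hash function (e.g. ORA_HASH on Oracle); none of these names are ClickPipes internals:

    from concurrent.futures import ThreadPoolExecutor

    NUM_SHARDS = 10

    def shard_query(table, pk, shard_id):
        # Each shard selects only the rows whose hashed key falls in its
        # bucket, turning one huge scan into NUM_SHARDS bounded queries.
        return (f"SELECT * FROM {table} "
                f"WHERE mod(hash({pk}), {NUM_SHARDS}) = {shard_id}")

    def sync_shard(connect, write_batch, checkpoint, table, pk, shard_id):
        if shard_id in checkpoint:    # shard-level resume: skip finished work
            return
        conn = connect()
        try:
            cursor = conn.cursor()
            cursor.execute(shard_query(table, pk, shard_id))
            while True:
                rows = cursor.fetchmany(10_000)
                if not rows:
                    break
                write_batch(rows)
            checkpoint.add(shard_id)  # record completion for later restarts
        finally:
            conn.close()

    def sync_table(connect, write_batch, checkpoint, table="orders", pk="id"):
        # Shards are independent, so several can be read and written at once.
        with ThreadPoolExecutor(max_workers=4) as pool:
            for shard_id in range(NUM_SHARDS):
                pool.submit(sync_shard, connect, write_batch, checkpoint,
                            table, pk, shard_id)

In a real pipeline the checkpoint would be persisted with the task state, so a restart re-runs only the shards that never finished.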

Technical Advantages

  1. Performance Improvement:
    • Sharding synchronization breaks down a single large-scale data query into multiple small-batch queries that can be executed in parallel, fully utilizing the computational capabilities of the source database, improving data reading speed by approximately 3-5 times.
    • The data integration service uses multi-threading technology to encode and decode transmitted data in parallel, which can improve processing speed several times over.
  2. Reduced Impact on Source Database:
    • After large SQL queries are broken into smaller ones, the lifetime of each cursor on the source database is significantly reduced, as is the extra memory and temporary-table consumption the source needs to keep those cursors consistent, greatly decreasing the synchronization's impact on the source database.
    • Sharding synchronization avoids the problem of long-duration cursor locking, greatly reducing the risk of table locking and data conflicts.
  3. Improved Reliability:
    • A failed shard can be retried independently, enhancing the stability of synchronization tasks.
    • When transmission is interrupted due to network failures, hardware failures, database failures, or various other reasons, the task can skip already synchronized shards and continue from incomplete shards the next time it starts, without having to restart everything from the beginning. This achieves checkpoint resumption functionality and reduces the overall time required to complete the task.
  4. Ensuring Real-time Performance: Traditional solutions must wait until the entire full load completes before applying new events, to guarantee consistency. With the table sharded, each shard can begin synchronizing incremental data as soon as its own full load finishes. Data that has already landed in the target therefore becomes up-to-date quickly, without waiting for the remaining shards, which improves overall freshness (sketched below).
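
As a simplified illustration of point 4, the sketch below shows the per-shard phase switch. The shard_of helper, the event shape, and apply_change are hypothetical stand-ins, and Python's built-in hash substitutes for the real hash function:

    NUM_SHARDS = 10
    completed_shards = set()  # filled in as each shard's full load finishes
    pending_changes = []      # CDC events buffered for unfinished shards

    def shard_of(pk_value):
        return hash(pk_value) % NUM_SHARDS  # stand-in for the real hash

    def on_cdc_event(event, apply_change):
        if shard_of(event["pk"]) in completed_shards:
            apply_change(event)            # shard is live: apply immediately
        else:
            pending_changes.append(event)  # replay once the shard completes

    def on_shard_complete(shard_id, apply_change):
        completed_shards.add(shard_id)
        # Flush buffered changes that belong to the newly completed shard.
        remaining = []
        for event in pending_changes:
            if shard_of(event["pk"]) == shard_id:
                apply_change(event)
            else:
                remaining.append(event)
        pending_changes[:] = remaining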

Case Study

A jewelry company stores a historical order table in Oracle. After years of accumulation, the table has grown to 1TB, and running analytical aggregate queries directly in Oracle is very slow. The operations department urgently needs to analyze these orders quickly, and wants the analysis results to be real-time.

Initially, they tried to export the data with simple tools and import it into the data warehouse. But because Oracle performs snapshot (consistent) reads for queries, data changes that occur during the read are kept in the undo tablespace. With the full export running for so long, they inevitably hit errors like "ORA-01555: snapshot too old: rollback segment number 35 with name "_SYSSMU35_2598521919$" too small", and repeated retries could not resolve the issue.

After switching to ClickPipes, the data transmission engine automatically decomposed the single source query into multiple small-range queries, each completing before the undo data could age out, thereby avoiding the ORA-01555 error and accurately delivering the data to the warehouse. At the same time, the automatically started CDC mechanism kept the warehouse data accurate and real-time from then on. The operations department could finally obtain real-time analysis results quickly.

Conclusion

The large table sharding synchronization function is an important cornerstone for building a real-time cloud data warehouse. Through HASH sharding and parallel processing technology, ClickPipes successfully solves the performance bottlenecks and reliability issues of traditional synchronization methods, bringing users a more efficient and real-time data service experience.

If you are also looking for a cloud data warehouse solution that can meet real-time analysis needs, we welcome you to try our product and experience the powerful advantages brought by large table sharding synchronization!