How ClickPipes Solves ClickHouse Data Update Performance Issues
In the world of data analytics, ClickHouse shines with extremely fast queries and excellent analytical capabilities. However, as a database designed for analytical workloads, it shares a weakness common to such systems: poor performance when updating and deleting data.
ClickPipes provides a complete real-time cloud data warehouse solution, covering real-time data integration and ClickHouse-optimized data ingestion. This article describes, from a technical perspective, how we understand ClickHouse's data ingestion problems, how we designed and optimized our solution to address them, and how we keep data in ClickHouse always up to date.
Technical Challenges of ClickHouse Data Ingestion
ClickHouse is an excellent analytical data warehouse, with very high data insertion throughput and aggregation performance. By contrast, its performance on updates and deletes is severely lacking. The reason becomes clear when we examine its default storage engine family, MergeTree.
The MergeTree engine, as the name suggests, consists of "Merge" and "Tree." The "Tree" means that data is organized according to sorting rules established by the primary key, making queries more efficient. This is an indexing method adopted by many databases.
ClickHouse's poor update and delete performance stems from the "Merge" part. In ClickHouse, each batch of written data is stored as one or more small immutable files called "parts." The engine periodically merges these small parts into larger ones according to configured rules, a process known as a "merge."
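To make this concrete, here is a minimal sketch (assuming a simple MergeTree table named events, which is not from the original article) showing how each insert creates a part and how a merge consolidates them:

-- Each INSERT creates its own part on disk
CREATE TABLE events (id UInt64, msg String) ENGINE = MergeTree ORDER BY id;
INSERT INTO events VALUES (1, 'a');
INSERT INTO events VALUES (2, 'b');

-- One row per active part; expect two parts before the merge
SELECT name, rows FROM system.parts WHERE table = 'events' AND active;

-- Force a merge to consolidate the parts into one
OPTIMIZE TABLE events FINAL;
SELECT name, rows FROM system.parts WHERE table = 'events' AND active;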
A merge can deduplicate data, optimize the sort order, and reorganize data by partition. These benefits come at a price: to maximize performance, data in this structure is immutable once written. To perform an update or delete, ClickHouse must locate and rewrite every part containing the affected data, touching not just one record but all rows in those parts, which is fundamentally far less efficient than in traditional OLTP databases.
For updates and deletes, the officially recommended approach is the ALTER command (a so-called mutation), as shown below:
- Update data:
ALTER TABLE table UPDATE value = newValue WHERE XXX
- Delete data:
ALTER TABLE table DELETE WHERE XXX
In our testing, the throughput of ALTER is only about 1/1000 that of INSERT, which cannot meet the update-performance requirements of real-time integration scenarios.
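Part of the reason is that these ALTER statements are mutations: they are queued and rewrite entire parts asynchronously in the background. As a small sketch (reusing the placeholder table name table from this article), pending mutations can be observed in the system.mutations table:

-- Each pending mutation rewrites every part it touches
SELECT mutation_id, command, is_done
FROM system.mutations
WHERE table = 'table' AND is_done = 0;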
Replacing Updates and Deletions with Insertions
To solve the efficiency problem of updates and deletions, we designed a ClickHouse-optimized solution based on the characteristics of MergeTree. It includes three parts: table statement adjustments, update operation adjustments, and delete operation adjustments.
1. Table Statement Adjustments
When creating tables, we chose ReplacingMergeTree as the standard engine, taking advantage of its merge-time deduplication.
In addition to the ordinary primary key, we added several fields:
- version: records a version number, set to the current time at write; this is the basis for merge-time deduplication
- is_deleted: indicates whether the record has been deleted
- delete_time: records the deletion time, providing the basis for automatic data cleanup
Example table creation statement:
CREATE TABLE table (
    id UInt64,
    value String,
    version UInt64 DEFAULT now(),       -- cast to a Unix timestamp; the highest version wins at merge time
    is_deleted UInt8 DEFAULT 0,         -- 1 marks the row as a delete marker
    delete_time DateTime DEFAULT now()  -- drives the TTL-based cleanup below
) ENGINE = ReplacingMergeTree(version)
PRIMARY KEY id
ORDER BY id
TTL delete_time + INTERVAL 1 SECOND DELETE WHERE is_deleted = 1;
2. Converting Update Operations to Insert Operations
To optimize update efficiency, traditional update operations are converted to insert operations. The rule: whenever a record needs to be updated, insert a new row instead, and let the version field ensure the new data supersedes the old.
For example, to update the record id=1, value="A" to id=1, value="B", the previous operation statement was:
ALTER TABLE table UPDATE value = 'B' WHERE id = 1
The optimized operation statement is:
INSERT INTO table (id, value) VALUES (1, 'B')
Since ClickHouse ingests inserts extremely quickly, this operation completes roughly 1,000 times faster.
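As a quick sanity check (a sketch reusing the example table above), successive inserts with the same id collapse into the latest version when read with FINAL, which the query section below explains:

INSERT INTO table (id, value) VALUES (1, 'A');
INSERT INTO table (id, value) VALUES (1, 'B'); -- the "update"

-- FINAL deduplicates by sorting key, keeping the row with the highest version
SELECT id, value FROM table FINAL; -- returns 1, 'B'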
3. Converting Delete Operations to Insert Operations
Similar to updates, delete operations are also converted to inserts. The rule: whenever a record needs to be deleted, insert a new row for it with the is_deleted field set to 1.
For example, to delete the record id=1, value="A", the previous operation statement was:
ALTER TABLE table DELETE WHERE id = 1
The optimized operation statement is:
INSERT INTO table (id, is_deleted) VALUES (1, 1)
In this way, deletes run at the same speed as inserts, preserving overall data processing efficiency.
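Continuing the sketch above, the delete marker supersedes the older versions in the same way, and the TTL clause removes the marker row itself at a later merge:

INSERT INTO table (id, is_deleted) VALUES (1, 1); -- the "delete"

-- Until the TTL fires at a merge, the marker row is still stored;
-- the query-side filter described in the next section hides it from results
SELECT id, is_deleted FROM table FINAL; -- returns 1, 1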
Query Optimization
After using the above optimization solution, query operations need appropriate adjustments. The rules are as follows:
- Append the FINAL modifier after the table name in every query, so results reflect the latest version of each row
- Add WHERE is_deleted = 0 to every query condition to filter out deleted data
For example, before optimization the SQL statement to query the table was:
SELECT * FROM table
After optimization, it needs to be adjusted to:
SELECT * FROM table FINAL WHERE is_deleted = 0
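To avoid repeating FINAL and the filter in every query, one convenient option (our illustration, not part of the original setup) is to wrap them in a view:

CREATE VIEW table_live AS
SELECT * FROM table FINAL WHERE is_deleted = 0;

-- Reads like the original table did before the optimization
SELECT * FROM table_live;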
Principle Analysis
Compared to using ALTER TABLE for updates and deletes, ClickPipes' solution replaces part-rewriting operations with fast inserts, yielding a roughly thousand-fold performance improvement.
For the inserted data: in update scenarios, the version field ensures the newest row survives the next merge; in delete scenarios, the combination of is_deleted and the delete_time-based TTL guarantees that deleted data is removed at the next merge, giving eventual consistency.
During queries, the FINAL keyword combined with the is_deleted = 0 filter ensures that results are correct even before a merge has completed.
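A merge can also be triggered manually to verify the end state (a sketch; in production, background merges handle this on their own):

-- Force deduplication and, once the one-second TTL has elapsed, the cleanup
OPTIMIZE TABLE table FINAL;

-- Superseded versions and expired delete markers are now physically gone,
-- even without FINAL in the query
SELECT count() FROM table WHERE id = 1; -- returns 0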
It's worth noting that for this method to work, ClickPipes must receive the complete set of field values on every update, since each update is written as a whole new row and any missing column would otherwise fall back to its default value. Whether complete rows are available depends on the source database's CDC and log settings. When this feature is enabled, ClickPipes performs a preliminary check to confirm that the source configuration meets the requirements for fast updates; if it does not, ClickPipes suggests staying with the traditional ALTER mode, which guarantees accuracy at a lower rate of data changes.
Conclusion
Through careful table design and by converting updates and deletes into inserts, we have overcome ClickHouse's challenges with efficient data ingestion, bringing update and delete speed in line with insert speed and truly achieving real-time data synchronization and efficient processing. The solution not only improves performance but also gives users a smoother experience.
In the future, we will further optimize the solution and explore more possibilities to fully unleash the potential of ClickHouse and empower real-time data processing.