How ClickPipes Solves ClickHouse Data Update Performance Issues

Learn how ClickPipes addresses the performance challenges of updating and deleting data in ClickHouse while maintaining real-time data accuracy.

In the world of data analytics, ClickHouse shines with its extremely fast query speed and excellent analytical capabilities. However, like most databases designed for analytics, it suffers from a weakness common to the category: poor performance when updating and deleting data.

ClickPipes provides a complete real-time cloud data warehouse solution, including real-time data integration and ClickHouse-optimized data ingestion. This article describes, from a technical perspective, how we understand the problems of ClickHouse data ingestion, how we designed and optimized our solution to address them, and how we ensure that data in ClickHouse stays up to date.

Technical Challenges of ClickHouse Data Ingestion

ClickHouse is an excellent analytical data warehouse with extremely high data insertion efficiency and aggregation performance. By contrast, its performance on data updates and deletions is severely lacking. The reason becomes clear when we examine its standard storage engine, MergeTree.

The MergeTree engine, as the name suggests, consists of "Merge" and "Tree." The "Tree" means that data is organized according to sorting rules established by the primary key, making queries more efficient. This is an indexing method adopted by many databases.

The reason for ClickHouse's poor update and delete performance lies in the "Merge" part. In ClickHouse, each batch of written data is stored as a separate immutable chunk on disk called a "part." The engine periodically merges these small parts into larger ones according to configured rules; this merging process is the "Merge."
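
This structure is easy to observe directly: ClickHouse exposes every table's parts through the system.parts system table. A quick illustrative query (demo_table here refers to the example table created later in this article):

SELECT name, rows, active
FROM system.parts
WHERE database = currentDatabase() AND table = 'demo_table'
ORDER BY name;

Each INSERT batch initially creates its own part; after a background merge, a new larger part appears and the source parts briefly remain with active = 0 before being cleaned up.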

[Image: illustration of how Merge works]

Merging removes duplicate data, optimizes sorting, and reorganizes data by partition. These benefits come at a cost: to maximize performance, data in this structure cannot be changed once written. To perform an update or delete, ClickHouse must locate and rewrite every part containing the affected data. The operation touches not just one record but all data in those parts, which is fundamentally far less efficient than the row-level updates of a traditional transactional database.

For data updates and deletions, the officially recommended approach is the ALTER TABLE mutation syntax, as shown below:

  • Update data: ALTER TABLE <table> UPDATE value = newValue WHERE <condition>
  • Delete data: ALTER TABLE <table> DELETE WHERE <condition>

In actual testing, we found that the throughput of these ALTER mutations is only about 1/1000 that of INSERT, which cannot meet the update performance required by real-time integration scenarios.
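
These statements are executed as asynchronous mutations: ClickHouse accepts them immediately and rewrites the affected parts in the background. Pending mutations can be inspected through the system.mutations table:

SELECT mutation_id, command, parts_to_do, is_done
FROM system.mutations
WHERE is_done = 0;

A long queue of unfinished mutations is a typical symptom of using ALTER for high-frequency updates.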

Replacing Updates and Deletions with Insertions

To solve the efficiency problem of updates and deletions, we designed a ClickHouse-optimized solution based on the characteristics of MergeTree. It includes three parts: table statement adjustments, update operation adjustments, and delete operation adjustments.

1. Table Statement Adjustments

When creating tables, we chose the ReplacingMergeTree engine as the standard engine, taking advantage of its merge-time deduplication to implement updates.

In addition to the ordinary primary key, we added several fields:

  • version field: Records a version number, set to the current time when writing, which is also the basis for merge deduplication
  • is_deleted field: Indicates whether the data has been deleted
  • delete_time field: Specifies the deletion time, providing a basis for automatic data cleanup

Example table creation statement:

CREATE TABLE demo_table (
    id UInt64,
    value String,
    -- deduplication version: at merge time the row with the highest version wins;
    -- millisecond precision reduces ties between rapid successive writes
    version DateTime64(3) DEFAULT now64(3),
    -- 1 marks the row as a deletion tombstone
    is_deleted UInt8 DEFAULT 0,
    -- basis for the TTL that physically removes tombstones
    delete_time DateTime DEFAULT now()
) ENGINE = ReplacingMergeTree(version)
PRIMARY KEY id
ORDER BY id
TTL delete_time + INTERVAL 1 SECOND DELETE WHERE is_deleted = 1;
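
A quick sanity check of this schema (the values are illustrative): rows inserted with only id and value pick up the extra fields from their DEFAULT expressions, which is exactly what the converted update and delete operations below rely on:

INSERT INTO demo_table (id, value) VALUES (1, 'A');

SELECT id, value, version, is_deleted FROM demo_table WHERE id = 1;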

2. Converting Update Operations to Insert Operations

To optimize update efficiency, traditional update operations are converted to insert operations. The rule: whenever data needs to be updated, insert a new record with the same primary key; the version field ensures the new data supersedes the old at merge time.

For example, to update the record id = 1 from value 'A' to value 'B', the previous operation statement was:

ALTER TABLE demo_table UPDATE value = 'B' WHERE id = 1;

The optimized operation statement is:

INSERT INTO demo_table (id, value) VALUES (1, 'B');

Since ClickHouse data insertion is very fast, this completes roughly 1,000 times faster than the equivalent ALTER, in line with the test results above.
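
Until a merge collapses the two rows for id = 1, both physically exist, but ReplacingMergeTree's deduplication can be applied at query time with the FINAL modifier (covered in detail below), so the update is visible immediately:

SELECT id, value FROM demo_table FINAL WHERE id = 1;
-- returns one row: id = 1, value = 'B'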

3. Converting Delete Operations to Insert Operations

Similar to updates, delete operations are also converted to insert operations. The rule: whenever data needs to be deleted, insert a new record with the same primary key and set its is_deleted field to 1.

For example, to delete the record with id = 1, the previous operation statement was:

ALTER TABLE demo_table DELETE WHERE id = 1;

The optimized operation statement is:

INSERT INTO demo_table (id, is_deleted) VALUES (1, 1);

Through this method, deletions run at the same speed as inserts, maintaining overall data processing efficiency.
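
Background merges run on ClickHouse's own schedule, so the tombstone row may linger for a while. For verification (not routine use, since it forces a full rewrite of the table's parts), a merge can be triggered manually:

OPTIMIZE TABLE demo_table FINAL;

SELECT count() FROM demo_table WHERE id = 1;
-- 0 once deduplication has collapsed the rows and the tombstone's TTL has expired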

Query Optimization

After using the above optimization solution, query operations need appropriate adjustments. The rules are as follows:

  1. Add the FINAL modifier after the table name in every query, so that results reflect the latest version of each row
  2. Add is_deleted = 0 to every query's WHERE clause to filter out deleted data

For example, before optimization the SQL statement to query demo_table was:

SELECT * FROM demo_table;

After optimization, it needs to be adjusted to:

SELECT * FROM demo_table FINAL WHERE is_deleted = 0;
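
Because every consumer has to remember both adjustments, one optional convenience (a pattern we suggest here, not something the solution requires) is to wrap them in a view so that applications keep issuing plain SELECTs:

CREATE VIEW demo_table_live AS
SELECT * FROM demo_table FINAL
WHERE is_deleted = 0;

SELECT * FROM demo_table_live;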

Principle Analysis

Compared to using ALTER TABLE for data updates and deletions, ClickPipes' solution replaces part-rewriting operations with fast data insertion, yielding roughly a thousand-fold performance improvement in our tests.

For the inserted data: in update scenarios, the version field ensures the updated row is the one retained at the next merge; in deletion scenarios, the combination of is_deleted and the delete_time-based TTL ensures deleted data is physically removed at the next merge, guaranteeing eventual consistency.

During queries, the combination of the FINAL modifier and the is_deleted = 0 filter ensures that results are correct even when merging has not yet completed.
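
This is easy to verify before any merge has run. Assuming the id = 1 record was inserted, updated, and then deleted as in the examples above, a query without FINAL exposes every surviving version, while FINAL collapses them on the fly:

SELECT id, value, is_deleted FROM demo_table WHERE id = 1;
-- up to three rows: the original, the update, and the deletion tombstone

SELECT id, value, is_deleted FROM demo_table FINAL WHERE id = 1;
-- one row: the tombstone, which the is_deleted = 0 filter then removes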

It is worth noting that for this method to work, ClickPipes must receive the complete set of field values with every update, which depends on the source database's CDC and log settings. When this feature is enabled, ClickPipes runs preliminary checks to verify that the source configuration meets the requirements for fast updates. If it does not, ClickPipes suggests falling back to the traditional ALTER mode, preserving data accuracy at the cost of a lower rate of data changes.
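
What such a source-side requirement looks like varies by database. As one hypothetical example, for a PostgreSQL source the table's replica identity would need to cover all columns so that the CDC stream carries complete rows on updates and deletes (source_table is a placeholder name):

-- PostgreSQL: make logical decoding emit all column values for UPDATE/DELETE
ALTER TABLE source_table REPLICA IDENTITY FULL;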

[Image: configuration detection in the product]

Conclusion

Through careful table schema design and the conversion of update and delete operations into inserts, we have overcome ClickHouse's technical challenges around efficient data ingestion, bringing update and delete speed in line with insert speed and achieving true real-time data synchronization and efficient processing. This solution not only improves performance but also gives users a smoother experience.

In the future, we will further optimize the solution and explore more possibilities to fully unleash the potential of ClickHouse and empower real-time data processing.