Technical Architecture Analysis of ClickPipes Real-Time Cloud Data Warehouse
In the big data era, enterprises place increasingly demanding requirements on data freshness and query performance. To meet these needs, we have developed a brand-new product - a solution designed specifically to provide real-time cloud data warehouse services to customers. This article details the product's technical architecture, helping you understand its core components and how they work.
Product Overview
ClickPipes aims to provide customers with an efficient, real-time data warehouse solution. The product consists of two major capabilities:
- ClickHouse Cloud: Users can quickly create an extremely fast data warehouse on ClickHouse Cloud that meets the demands of high-concurrency, large-scale data queries.
- Data Integration Service: Through CDC (Change Data Capture), data from users' existing databases is synchronized to the data warehouse in real time, ensuring the data is always fresh and immediately queryable.
Together, these two capabilities give users both real-time data updates and fast, efficient queries against the warehouse - a truly real-time data warehouse service.
Image: Add an architecture diagram here
Core Components Introduction
1. Computation Engine
The computation engine is the execution core of the entire data synchronization task, implemented as a single, independent Java program. Its main responsibilities include the following (a minimal sketch follows the list):
- Table Structure Mapping: Mapping source database tables to ClickHouse table structures and creating the target tables automatically
- Full Data Reading: Reading existing data from the source database in concurrent shards
- Incremental Data Reading: Reading incremental change events from the source database via CDC
- Data Processing: Applying common transformations such as table renaming, field renaming, and adding new fields
- Data Writing: Writing the processed data to the target data warehouse
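To make the flow concrete, here is a minimal Java sketch of these stages. All interface and method names here are hypothetical, invented for illustration; the actual ClickPipes internals are not public.

```java
import java.util.List;
import java.util.Map;

/** A sketch of the engine's stages; every name below is hypothetical. */
public class SyncPipelineSketch {

    /** One change captured from the source database. */
    record ChangeEvent(String table, String op, Map<String, Object> row) {}

    interface SchemaMapper    { String toClickHouseDdl(String sourceTable); }
    interface FullReader      { List<Map<String, Object>> readShard(String table, int shard, int totalShards); }
    interface CdcReader       { List<ChangeEvent> poll(); }
    interface Processor       { ChangeEvent apply(ChangeEvent event); }  // renames, added fields, ...
    interface WarehouseWriter { void write(List<ChangeEvent> batch); }

    static void run(SchemaMapper mapper, FullReader full, CdcReader cdc,
                    Processor processor, WarehouseWriter writer) {
        // 1. Map the source schema and auto-create the target table.
        String ddl = mapper.toClickHouseDdl("orders");
        System.out.println("creating target table: " + ddl);

        // 2. Full load: read the table in shards (concurrent in practice,
        //    sequential here for brevity) and write each shard as a batch.
        int totalShards = 4;
        for (int shard = 0; shard < totalShards; shard++) {
            List<ChangeEvent> batch = full.readShard("orders", shard, totalShards).stream()
                    .map(row -> processor.apply(new ChangeEvent("orders", "INSERT", row)))
                    .toList();
            writer.write(batch);
        }

        // 3. Incremental phase: tail CDC events and apply them continuously.
        while (true) {
            writer.write(cdc.poll().stream().map(processor::apply).toList());
        }
    }
}
```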
Image: Add an engine composition architecture diagram here
In addition, the computation engine handles non-data tasks such as log and metric collection and reporting, error retries, task progress saving, checkpoint-based resumption, and processing previews, helping users run the product smoothly and conveniently.
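As an illustration of checkpoint-based resumption, here is a minimal sketch. It is assumption-laden: the real engine reports progress to the task manager, whereas this version persists the position to a local file, and all names are hypothetical.

```java
import java.nio.file.*;
import java.util.Optional;

/** Hypothetical checkpoint persistence; the real engine reports progress
 *  to the task manager rather than a local file. */
public class CheckpointStore {
    private final Path file;

    public CheckpointStore(Path file) { this.file = file; }

    /** Persist the last fully-written CDC position, e.g. "binlog.000042:15230". */
    public void save(String position) throws Exception {
        // Write to a temp file and move atomically, so a crash never
        // leaves a half-written checkpoint behind.
        Path tmp = file.resolveSibling(file.getFileName() + ".tmp");
        Files.writeString(tmp, position);
        Files.move(tmp, file, StandardCopyOption.ATOMIC_MOVE,
                   StandardCopyOption.REPLACE_EXISTING);
    }

    /** On restart, resume from the stored position if one exists. */
    public Optional<String> load() throws Exception {
        return Files.exists(file) ? Optional.of(Files.readString(file))
                                  : Optional.empty();
    }
}
```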
The computation engine supports two deployment methods:
- Cloud-Hosted Deployment: We host the engine in the cloud; users only need to ensure it can reach their source and target databases, with no deployment or maintenance burden on their side.
- Private Network Independent Deployment: If a user's network security policy does not allow exposing database addresses, they can download the computation engine and deploy it independently inside their private network. That engine instance is dedicated to the user, better satisfying security and compliance requirements.
Image: Add a network deployment diagram showing both modes
2. Task Manager
The task management module is responsible for building, scheduling, and managing data synchronization tasks, and mainly consists of two parts:
- Web Frontend: Provides a user-friendly interface for users to create and manage data synchronization tasks
- Backend Module: Includes functions such as task building, task scheduling, metadata management, etc., providing comprehensive support for the frontend and computation engine
The backend module occupies a key position in the product. Toward users, it works through the frontend to capture their intent and to present data and task status; toward engines, it speaks a private protocol to dispatch and schedule tasks, keeps them running in a healthy state, and notifies users by alert and email when errors occur. It also persists all state to a highly available database, ensuring that no configuration is lost. Through the backend module, all of the components work together to form a complete user experience.
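The engine-to-backend protocol is private and undocumented; purely as an illustration of the kind of messages such a control channel might carry, here is a hypothetical sketch in which every field name is invented:

```java
import java.util.Map;

// Hypothetical shapes for an engine<->backend control protocol.
// The real ClickPipes protocol is private; these records are illustrative only.
public class ControlMessages {
    /** Backend -> engine: start, stop, or reconfigure a synchronization task. */
    record DispatchCommand(String taskId, String action, Map<String, Object> taskConfig) {}

    /** Engine -> backend: periodic health and progress report for a running task. */
    record Heartbeat(String engineId, String taskId, String state,
                     long eventsPerSecond, String lastCheckpoint) {}
}
```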
Like the engine, the backend module runs as a single, independent Java process dedicated to data services. It works behind the scenes, in a position users never notice. The backend module always runs as a cloud service; users do not need to deploy it locally.
Image: Add an architecture diagram showing how the backend module works with other modules
3. State Storage
User task configurations, synchronization progress, logs, metrics, alerts, payment information, and other state must not be lost, so a reliable and fast database is needed to store them.
We chose MongoDB as the state database for storing all task configurations and running states. MongoDB's high performance and flexible data model make it efficient in managing a large amount of task information. MongoDB naturally supports replica sets and automatic failover, ensuring system stability and reliability.
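As a minimal sketch of what persisting task state to MongoDB can look like with the official Java sync driver (the database, collection, and field names here are hypothetical, not the real schema):

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.ReplaceOptions;
import org.bson.Document;

public class TaskStateDao {
    public static void main(String[] args) {
        // Connect to a replica set so failover is handled by the driver.
        try (MongoClient client = MongoClients.create(
                "mongodb://db1:27017,db2:27017,db3:27017/?replicaSet=rs0")) {
            MongoCollection<Document> states = client
                    .getDatabase("clickpipes")        // hypothetical names
                    .getCollection("task_states");

            // Upsert the latest progress of a sync task.
            Document state = new Document("taskId", "task-42")
                    .append("status", "RUNNING")
                    .append("checkpoint", "binlog.000042:15230");
            states.replaceOne(Filters.eq("taskId", "task-42"), state,
                    new ReplaceOptions().upsert(true));
        }
    }
}
```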
Image: Add a rough table structure diagram of state storage
4. Cloud Service Manager
The cloud service manager is responsible for user information, registration and subscription data, and other key records. It ensures that each user's resources and permissions are properly allocated and managed, rejects unauthorized requests, and supports dynamic adjustment of subscriptions, improving the user experience.
Image: Add a functional diagram
The cloud service manager is separated from the task manager at the code level, but the two run together in a single process at runtime to reduce the complexity of deploying and maintaining components.
5. Cloud Service Support Components
To keep the cloud service stable, reliable, and secure, several supporting components provide assistance, including:
- K8s Management Group: Provides high-availability services for all nodes except the databases
- K8s Computation Group: Provides deployment and dynamic scaling for cloud-hosted engines, keeping data transmission nodes stable
- WAF: A web application firewall that blocks malicious requests and protects service stability
- Monitor: Service monitoring and alerting that checks process health, resource usage, and core logic interfaces, as well as the status of CDN, domain names, certificates, and other services, so the R&D team is notified promptly when anomalies occur
- CI/CD Service: Builds artifacts from code and performs online service updates and rollbacks
Image: Add a K8s overview diagram
Key Technology Choices
Beyond its runtime components, ClickPipes made the following key design choices:
1. Pluggable Data Sources
Supported data sources are not pre-installed in the computation service. When a data source is needed, the computation service requests and downloads it from the task manager, then caches it locally for later use. When updates are available, they are likewise downloaded and applied automatically, as the sketch below illustrates.
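Here is a minimal Java sketch of the download-then-cache idea. The download URL, cache layout, and connector entry class are all assumptions made for illustration, not the real implementation:

```java
import java.io.InputStream;
import java.net.URI;
import java.net.URLClassLoader;
import java.nio.file.*;

/** Sketch of downloading and caching a pluggable connector (names hypothetical). */
public class ConnectorLoader {
    private final Path cacheDir = Paths.get("connector-cache");

    public Class<?> load(String name, String version) throws Exception {
        Path jar = cacheDir.resolve(name + "-" + version + ".jar");
        if (!Files.exists(jar)) {
            // Cache miss: fetch the connector jar from the task manager.
            Files.createDirectories(cacheDir);
            try (InputStream in = URI.create(
                    "https://manager.example.com/connectors/" + name + "/" + version)
                    .toURL().openStream()) {
                Files.copy(in, jar);
            }
        }
        // Load the connector in its own class loader, isolated from the engine.
        URLClassLoader loader = new URLClassLoader(
                new java.net.URL[]{ jar.toUri().toURL() },
                ConnectorLoader.class.getClassLoader());
        return loader.loadClass("com.example.connector." + name + ".Connector");
    }
}
```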
Image: Add an operational diagram
This design has several benefits:
- Reduces the size of the computation service: If data sources were pre-installed, the installation package would keep growing as more sources are supported, lengthening download times and hurting the customer experience.
- Quick experience of new data sources: When a new data source is released, users can start using it without upgrading their local computation service.
- Fast data source upgrades: When features are added or bugs are fixed in existing data sources, users get the upgraded version immediately, with no manual update step.
With the pluggable data source design, ClickPipes can continuously update and upgrade the data sources in the service, and users can use them immediately without any action, greatly reducing their maintenance costs.
2. Offline Operation Capability
The computation service is able to operate offline. When it cannot connect to the task manager, it keeps its tasks running continuously, providing a stronger guarantee of data accuracy. Offline situations take many forms, for example:
- User network interruption: The user's network environment loses outside access for some reason
- ClickPipes network interruption: The cloud provider has a failure and cannot communicate with the user's computation service
- ClickPipes failure: A bug in the cloud service leaves it unable to serve requests
- Intermediate communication anomaly: A network operator failure degrades communication quality
When any of these problems occurs, the local computation service continues synchronizing data according to the existing task definition and progress, without interruption or error. Because communication is down, monitoring information, logs, and synchronization checkpoints temporarily cannot be reported or persisted; once the network recovers, the backlog is reported.
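A minimal sketch of this buffer-and-replay behavior, with hypothetical names; note that only telemetry passes through this buffer, never the synchronized data itself:

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Queues telemetry that cannot be reported while offline and
 *  flushes it, in order, once connectivity returns. */
public class TelemetryBuffer {
    interface Reporter { boolean tryReport(String payload); }  // false = offline

    private final Deque<String> pending = new ArrayDeque<>();
    private final Reporter reporter;

    public TelemetryBuffer(Reporter reporter) { this.reporter = reporter; }

    public synchronized void report(String payload) {
        // Flush any backlog first so events arrive in order.
        while (!pending.isEmpty() && reporter.tryReport(pending.peekFirst())) {
            pending.removeFirst();
        }
        if (pending.isEmpty() && reporter.tryReport(payload)) return;
        pending.addLast(payload);  // still offline: keep it for later
    }
}
```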
Image: Add an offline operation diagram
The ability to operate offline greatly enhances the system's fault tolerance, ensuring that the transmission of the data itself is unaffected even in extreme circumstances.
3. Using ClickHouse Cloud as a Built-in Data Warehouse
When choosing a data warehouse, we selected one of the best-known analytical databases: ClickHouse. It offers fast data insertion, adequate data update and deletion capabilities, and extremely fast aggregation queries, matching our product's usage scenarios.
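For a sense of how writes into ClickHouse can look, here is a minimal JDBC sketch. It assumes the clickhouse-jdbc driver is on the classpath, and the host, table, and columns are placeholders; ClickHouse performs best with large batched inserts rather than row-by-row writes:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

/** Batched inserts into ClickHouse over JDBC (placeholder host/table/columns). */
public class WarehouseWriterSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:clickhouse://warehouse.example.com:8123/analytics");
             PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO orders (id, amount) VALUES (?, ?)")) {
            for (int i = 0; i < 10_000; i++) {   // accumulate one large batch
                ps.setLong(1, i);
                ps.setDouble(2, i * 1.5);
                ps.addBatch();
            }
            ps.executeBatch();                   // flush the batch in one round trip
        }
    }
}
```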
Between self-hosting ClickHouse and using ClickHouse Cloud directly, we chose the cloud service, which bills by usage and automatically scales compute and storage. This also reduces our maintenance costs and improves service stability for end users.
Users can subscribe to ClickHouse Cloud directly through ClickPipes. In addition to using it as the target database, we invest extra operational effort to keep the service free of anomalies and back up its data regularly, making recovery easy after an operational mistake. You can also subscribe to ClickHouse Cloud yourself, or deploy ClickHouse in a private network, and still use this product; in that case, ClickPipes does not guarantee the availability of your target database.
Conclusion
With the architecture described above, our real-time cloud data warehouse product delivers efficient data synchronization and queries while performing well on stability, flexibility, and maintainability. Whether users want to deploy a data warehouse quickly in the cloud or run it independently in a private network, they can find a suitable option. Going forward, we will continue to optimize the architecture, further improve performance and user experience, and help enterprises realize greater value in the data-driven era.
We hope this walkthrough of the product's technical architecture has given you a comprehensive understanding of this real-time cloud data warehouse. Whether you are a technical leader or a data analyst, ClickPipes can become a powerful tool for real-time, efficient data queries. We look forward to moving toward a data-driven future together.