Jul 18, 2024

Automatic End-to-End Schema Inference: How ClickPipes Makes It Possible

Discover how ClickPipes automatically maps and creates target table structures in ClickHouse without requiring users to write a single line of SQL code.

In the rapidly developing landscape of cloud data warehouses and data integration services, building a truly real-time data warehouse service is a challenge faced by many users. Our newly released real-time cloud data warehouse product, ClickPipes, composed of ClickHouse Cloud service and real-time data integration service, focuses on solving this problem for users.

When performing real-time data synchronization, automatically and accurately creating target tables in ClickHouse Cloud from table structures originating from various data sources is a crucial part of the product usage process. This article will introduce ClickPipes' solution to this problem, allowing users to obtain accurate real-time data tables in the target data warehouse without writing any SQL.

Intermediate Data Type Design

To accomplish automatic mapping between multi-source data tables and ClickHouse, we abstract the data types of any database into a set of independent basic types, including:

Integers with ranges
Decimals with precision
Strings with length and fixed width
Long text
Binary with length
Boolean
Date
Time
Enumeration types

During the development of each data source, its own data types are mapped to intermediate types through configuration. After the data source is declared, the engine understands the conversion from their own data types (such as int) to standard types (such as Number). This provides the prerequisite for the engine to perform automatic inference later. The declaration format is as follows:

{
  "int[($zerofill)]": {
    "to": "Number",
    "bit": 32,
    "precision": 10,
    "value": [
      -2147483648,
      2147483647
    ]
  },
  "float": {
    "to": "Number",
    "precision": [
      1,
      6
    ],
    "scale": [
      0,
      6
    ],
    "fixed": false
  }
}

Table Structure Generation

Before synchronizing data, the computation engine automatically reads the data structure of the source table, converts it into a virtual table composed of standard intermediate types, then parses the type mapping declaration of the target table, and deduces the physical table structure that the target table should have from the virtual table, and then performs automatic table creation.

For example, for the scenario of MySQL synchronizing to ClickHouse, the integer type is declared in MySQL as: int -> Number, and in ClickHouse as: Int8 -> Number. If there is a MySQL table containing an int type, when creating a table in ClickHouse, it will be automatically inferred as an Int8 type.

In the actual process, the engine will automatically infer the type that can accurately store in the target and has the smallest extension range through data type, whether it is filled, whether it is fixed length, data range, and other attributes. Extension range refers to, for example, if a string type can accommodate, then a long text type will not be used, and if a regular integer can accommodate, then a long integer will not be used. This ensures data accuracy while avoiding the use of types that occupy more storage and have more limited usage.

Through this design, when creating synchronization tasks in ClickPipes, target tables can be automatically created without users having to create tables in advance.

ClickPipes will also pre-check whether the target table exists. If users have already created the table themselves, the synchronization process will use it instead of clearing it. If the automatic mapping of the table structure does not meet the user's expectations, users can also create tables according to their expected table structure through manual table creation.

Model Modification in Processors

Before data is synchronized to the target, ClickPipes provides some common data processing capabilities. Among them, some processors will affect the final target table structure, such as: table renaming, field renaming, field deletion, field addition, adding current time field, and other processors.

In addition, ClickPipes supports user-defined processors based on Python. Users can write Python code themselves to perform more complex transformations and processing of data.

When performing data synchronization, the computation engine will calculate and process the model according to different situations:

For table and field processors, it will automatically process the target table structure according to the configuration
For Python processors, the engine will randomly extract several pieces of data from the source data table, process them through the processor, obtain the processed values, and then use these values for the final table structure inference

Through this design, no matter how users configure and use data synchronization tasks, they can automatically complete accurate table structure creation in ClickHouse, avoiding manual operations.

Conclusion

ClickPipes, through abstracting intermediate data types, can quickly and accurately infer the data types in the target table, and automatically calculate and infer table structure changes that occur at processing nodes, completing the automatic creation capability of target tables. This greatly reduces the operational complexity for users, allowing them to obtain accurate real-time data without writing any SQL code.

ClickPipes, A Real Real Data Integration Tool for ClickHouse

ClickPipes connects traditional databases to ClickHouse, providing true real-time data synchronization and lightning-fast query capabilities for modern analytics.

How ClickPipes Solves Limited Network Access to Local Databases

Discover how ClickPipes addresses the challenge of securely and efficiently transferring data from restricted network environments to cloud data warehouses.