Sep 5, 2024

How ClickPipes Solves Limited Network Access to Local Databases

Discover how ClickPipes addresses the challenge of securely and efficiently transferring data from restricted network environments to cloud data warehouses.

A significant challenge in real-time data warehousing is moving local data to the cloud efficiently and reliably for analysis and processing. Network instability, complex network topologies, and data security measures often make this difficult.

ClickPipes solves this problem by letting users download and deploy data collection nodes locally. This article explains the reasoning behind that approach and why we did not choose network proxies, a common alternative.

Sources of Limited Network Access

To synchronize data, the data collection service must be able to reach the user's database. For various reasons, however, that database often cannot be easily exposed:

Database Configuration: Database administrators configure databases to listen only on internal network IPs
Data Security Issues: Some database connections do not use SSL, so sending data directly over the public network risks leakage
Firewall Policies: Network engineers do not allow direct external access to databases
Organizational Barriers: Managers may not approve external access to databases just to meet data warehousing requirements

In general, making a database directly accessible from the public network is hard to arrange.

Common Solution: Network Connectivity Through SSH Reverse Tunnels

Although a database may not be reachable from the public network, reaching it from within the secure network is relatively easy. Staff who need data warehousing can therefore request a separate machine in the secure network, open its SSH port, and use SSH's reverse tunnel capability to expose the internal database. The data collection service can then access the database in the isolated network through the tunnel.
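As a rough illustration of the tunnel described above, the sketch below builds an OpenSSH command line in Python. It assumes one possible arrangement, in which the machine inside the secure network dials out to a publicly reachable relay host. The relay host, user, ports, and database address are hypothetical placeholders, and the function name `reverse_tunnel_command` is ours, not part of ClickPipes or OpenSSH.

```python
from typing import List


def reverse_tunnel_command(
    relay_host: str,
    relay_user: str,
    relay_port: int,
    db_host: str,
    db_port: int,
) -> List[str]:
    """Build the argv for an OpenSSH reverse tunnel.

    Meant to be run from the machine inside the secure network.
    All host names and ports are hypothetical placeholders.
    """
    return [
        "ssh",
        "-N",  # forward ports only, do not run a remote command
        "-R", f"{relay_port}:{db_host}:{db_port}",  # relay_port on the relay -> database
        "-o", "ServerAliveInterval=30",  # send a keepalive every 30 seconds
        "-o", "ServerAliveCountMax=3",  # disconnect after 3 unanswered keepalives
        "-o", "ExitOnForwardFailure=yes",  # exit if the forward cannot be set up
        f"{relay_user}@{relay_host}",
    ]


if __name__ == "__main__":
    argv = reverse_tunnel_command("relay.example.com", "tunnel", 15432, "10.0.0.12", 5432)
    print(" ".join(argv))
```

By default OpenSSH binds a remote forward to the loopback interface of the relay, so in this arrangement the collection service would have to run on that host or be given access to it.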

Because the tunnel runs over the SSH protocol, traffic is encrypted in transit, which partly addresses the data security concern. However, this solution still has significant limitations:

Unstable Long Connections: SSH sessions must stay connected for long periods, and any network interruption breaks the tunnel
Firewall Interference: Some enterprise NAT devices or firewalls drop sessions, especially after long periods without traffic
Complex Deployment Management: Keeping the tunnel reconnecting automatically requires additional management components
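To make the "additional management components" concrete, here is a minimal sketch of the kind of supervisor a tunnel typically needs: it restarts the tunnel whenever it exits, waiting longer after each failure. This is an illustration only. `run_tunnel` is a hypothetical callable that blocks while the tunnel is up, and the delay values are arbitrary assumptions.

```python
import time
from typing import Callable, Iterator


def backoff_delays(base: float = 1.0, factor: float = 2.0, cap: float = 60.0) -> Iterator[float]:
    """Yield an endless sequence of delays that grows by `factor` up to `cap`."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


def supervise(
    run_tunnel: Callable[[], int],
    max_restarts: int,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Start the tunnel, restart it up to `max_restarts` times, return the start count.

    `run_tunnel` is a hypothetical callable that blocks while the tunnel is
    up and returns an exit code when the connection drops.
    """
    delays = backoff_delays()
    starts = 0
    while starts <= max_restarts:
        run_tunnel()
        starts += 1
        if starts <= max_restarts:
            sleep(next(delays))  # wait before the next reconnect attempt
    return starts
```

A production supervisor would also reset the delay after a long healthy period and report failures somewhere, which is part of the operational burden this approach carries.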

Additionally, because the data collection service sits outside the network, it can do little when network problems occur beyond raising error alerts; it cannot apply any fault-tolerant handling. ClickPipes therefore takes a different approach: we run the entire computing service directly in the isolated network for better stability.

Innovative Design: Local Operation of Data Collection Services

The program code of the ClickPipes data integration service is only 300MB in size. Building on this, we designed an architecture that lets users deploy data collection and transmission nodes directly in the same network environment as the database. It works as follows:

After registering for the service, users click the "Deploy Computing Service" button in the product
Users prepare an environment to run the service and make sure it can reach their database instances
Users run the deployment command in that environment, and ClickPipes binds the computing service to their account so that only that user can use it
When creating data sources, users can manually bind each one to a computing service, and different data sources can be bound to different computing services
When a task runs, it is automatically scheduled onto a qualified computing service
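The binding and scheduling steps above can be sketched as a small data model. This is not ClickPipes' actual implementation: the class names, fields, and the rule for what makes a service "qualified" (same owner, bound to the task's data source, currently online) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ComputeService:
    service_id: str
    owner: str           # account the service was bound to at deployment
    online: bool = True  # hypothetical liveness flag


@dataclass
class DataSource:
    name: str
    owner: str
    service_id: str      # compute service chosen when the source was created


@dataclass
class Task:
    name: str
    source: DataSource


def select_service(task: Task, services: List[ComputeService]) -> Optional[ComputeService]:
    """Return the compute service this task may run on, or None if none qualifies.

    Assumed rule: the service must be the one bound to the task's data
    source, belong to the same account, and be online.
    """
    for svc in services:
        if (
            svc.service_id == task.source.service_id
            and svc.owner == task.source.owner
            and svc.online
        ):
            return svc
    return None
```

Under these assumptions, a task whose bound service is offline or owned by another account is simply not scheduled, which mirrors the "only this user can use this service" guarantee in the steps above.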

Compared with relaying traffic through an SSH proxy machine, this design has the following advantages:

1. Simpler Configuration

Collection nodes run directly in the network environment where the database is located, so no additional network relay settings are needed
Users can deploy collection nodes on any device (such as a server or virtual machine) in the network where the database is located, without configuring complex network topologies

2. Lower Cost

Data is processed locally and then pushed to the cloud in batches, which eases pressure on network bandwidth
Data filtering and similar logic run inside the internal network, so filtered-out data never crosses the public network, reducing bandwidth costs
The source, collection service, and target can all be deployed in the same VPC network, which avoids public network bandwidth fees
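The idea of filtering locally and pushing in batches can be illustrated with a short generator. This is a sketch under assumptions: the row shape, the `keep` predicate, and the batch size are hypothetical and say nothing about how ClickPipes batches data internally.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]


def batches(
    rows: Iterable[Row],
    keep: Callable[[Row], bool],
    batch_size: int,
) -> Iterator[List[Row]]:
    """Filter rows locally and group the survivors into batches of `batch_size`.

    Only the yielded batches would need to leave the internal network.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batch: List[Row] = []
    for row in rows:
        if not keep(row):
            continue  # dropped inside the internal network, never transmitted
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch
```

Each yielded batch corresponds to one upload, so fewer and smaller requests cross the network boundary than with row-by-row transfer of unfiltered data.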

3. More Stable Transmission

Removes the dependency on SSH tunnels, greatly reducing the risk of synchronization failures caused by network interruptions or fluctuations
Does not depend on the central service, so data can still be transmitted when the network is unstable or the central service fails
Local storage ensures that data is not lost before transmission completes, making the system more robust
Users can size the hardware to match their data volume, ensuring performance
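One common way to get the "not lost before transmission is completed" property is a local spool that deletes a batch only after the send succeeds. The sketch below shows that general technique. It is an assumption about how such a mechanism could look, not a description of ClickPipes' storage layer, and `LocalSpool` and `send` are hypothetical names.

```python
import json
import os
import tempfile
from typing import Callable, List


class LocalSpool:
    """On-disk buffer: a batch is removed only after `send` reports success.

    Assumes the directory is used exclusively by this spool.
    """

    def __init__(self, directory: str) -> None:
        self.directory = directory
        os.makedirs(directory, exist_ok=True)
        ids = [int(n.split(".")[0]) for n in os.listdir(directory) if n.endswith(".json")]
        self._next_id = max(ids, default=-1) + 1  # continue after leftover batches

    def put(self, batch: list) -> str:
        """Persist a batch durably before it is considered accepted."""
        path = os.path.join(self.directory, f"{self._next_id:012d}.json")
        self._next_id += 1
        tmp = path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            json.dump(batch, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # atomic rename, so readers never see a partial file
        return path

    def pending(self) -> List[str]:
        """Return paths of unsent batches, oldest first."""
        names = sorted(n for n in os.listdir(self.directory) if n.endswith(".json"))
        return [os.path.join(self.directory, n) for n in names]

    def drain(self, send: Callable[[list], bool]) -> int:
        """Send pending batches in order; stop at the first failure."""
        sent = 0
        for path in self.pending():
            with open(path, encoding="utf-8") as f:
                batch = json.load(f)
            if not send(batch):
                break  # keep this batch and all later ones for the next attempt
            os.remove(path)
            sent += 1
        return sent


if __name__ == "__main__":
    spool = LocalSpool(tempfile.mkdtemp())
    spool.put([{"id": 1}])
    print(spool.drain(lambda batch: True))
```

Because a batch is deleted only after a successful send, a crash between sending and deleting can cause a resend, so the target would need to tolerate duplicates under this design.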

4. Better Data Security

Data travels directly from the isolated network to the target without passing through any intermediate relay, making it more secure
The target can be deployed in a zone that only accepts access from the isolated network, making transmission more secure
Each transmission service is dedicated to a single user, providing higher security

Conclusion

Our local data synchronization solution gives users stable, efficient, and secure data transmission through a decentralized collection node design. It is particularly suitable for scenarios with restricted networks or high security requirements, letting enterprises focus on data analysis and business decisions rather than on data transmission.

In the future, we will continue to optimize this solution, providing more convenient management tools and intelligent optimization functions to further enhance the user experience.
