How ClickPipes Solves Limited Network Access to Local Databases

Discover how ClickPipes addresses the challenge of securely and efficiently transferring data from restricted network environments to cloud data warehouses.

A significant challenge in real-time data warehousing is transferring local data to the cloud efficiently and reliably for analysis and processing. Network instability, complex network topologies, and data security measures often make this difficult.

ClickPipes solves this problem by letting users download and deploy data collection nodes locally. This article explains the reasoning behind this approach and why we did not choose a network proxy, the common alternative.

Sources of Limited Network Access

To synchronize data, the data collection service must be able to reach the user's database. For several reasons, however, that database often cannot be exposed easily:

  • Database Configuration: Database administrators configure databases to listen only on internal network IPs
  • Data Security Issues: Some database connections do not use SSL, so sending data directly over the public network risks leakage
  • Firewall Policies: Network engineers do not allow direct external access to databases
  • Organizational Barriers: Managers will not approve external database access just to meet a data warehousing requirement

In general, making a database directly accessible from the public network is hard to achieve.

Common Solution: Network Connectivity Through SSH Reverse Tunnels

Although the database cannot be accessed from the public network, it is relatively easy to reach from within a secure network. Staff with data warehousing needs can therefore request a separate machine in the secure network, open its SSH port, and use SSH's reverse tunnel capability to expose the internal database. The data collection service can then access the database in the isolated network through the tunnel.
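
As a rough illustration of the reverse tunnel idea, the sketch below only builds an `ssh` command line; it does not open a connection. It assumes the internal machine initiates the connection to a publicly reachable relay host. The host names, ports, and user name are hypothetical placeholders, while `-N` and `-R` are standard OpenSSH client options.

```python
import shlex


def reverse_tunnel_command(db_host, db_port, relay_host, relay_port, relay_user):
    """Build an ssh command that exposes db_host:db_port on the relay host."""
    return [
        "ssh",
        "-N",  # forwarding only, do not run a remote command
        "-R", f"{relay_port}:{db_host}:{db_port}",  # remote (reverse) port forward
        f"{relay_user}@{relay_host}",
    ]


# Hypothetical values for illustration only.
cmd = reverse_tunnel_command("10.0.0.5", 5432, "relay.example.com", 15432, "tunnel")
print(shlex.join(cmd))
# prints: ssh -N -R 15432:10.0.0.5:5432 tunnel@relay.example.com
```

With a command like this running on the internal machine, a service that can reach the relay would connect to port 15432 there and be forwarded to the database.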

Because a reverse tunnel runs over the SSH protocol, the traffic is always encrypted, which goes some way toward ensuring data security. However, this solution still has significant limitations:

  1. Unstable Long Connections: An SSH session is a long-lived connection, and any network interruption causes the tunnel to fail
  2. Firewall Interference: Some enterprise NAT devices or firewalls drop sessions, especially after a long period with no traffic
  3. Complex Deployment Management: Keeping the tunnel connected through automatic reconnection requires additional management components
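
To give a sense of the extra management these limitations imply, here is a minimal sketch of a tunnel supervisor with exponential backoff. It is an assumption about how such a component might look, not part of ClickPipes. `run_tunnel` and `sleep` are hypothetical callables injected so the logic can be exercised without a real SSH process; the listed `-o` settings are standard OpenSSH client options commonly used to keep idle sessions alive.

```python
# OpenSSH client options that send keepalive probes on idle sessions
# and make ssh exit if the port forward cannot be established.
KEEPALIVE_OPTIONS = [
    "-o", "ServerAliveInterval=30",
    "-o", "ServerAliveCountMax=3",
    "-o", "ExitOnForwardFailure=yes",
]


def supervise(run_tunnel, sleep, max_attempts=5, base=1.0, cap=60.0):
    """Run the tunnel, restarting it with exponential backoff after failures.

    run_tunnel blocks until the tunnel exits and returns True for a
    deliberate, clean shutdown. Returns the list of delays that were waited.
    """
    delays = []
    for attempt in range(max_attempts):
        if run_tunnel():
            return delays
        if attempt + 1 < max_attempts:
            delay = min(cap, base * 2 ** attempt)  # 1s, 2s, 4s, ... up to cap
            delays.append(delay)
            sleep(delay)
    return delays


# Simulated run: the tunnel drops twice, then shuts down cleanly.
outcomes = iter([False, False, True])
print(supervise(lambda: next(outcomes), sleep=lambda seconds: None))
# prints: [1.0, 2.0]
```

Even this small loop has to be deployed, monitored, and tuned, which is the kind of overhead the third limitation refers to.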

In addition, when network problems occur, a data collection service running outside the network can do nothing beyond raising alerts; it has no way to perform fault-tolerant processing. ClickPipes therefore takes a different approach: we run the entire computing service inside the isolated network to improve stability.

Innovative Design: Local Operation of Data Collection Services

The program code of the ClickPipes data integration service is only 300MB in size, small enough for users to download and run themselves. Building on this, we designed an architecture that lets users deploy data collection and transmission nodes directly in the same network environment as the database. It works as follows:

  1. After registering for the service, the user clicks the "Deploy Computing Service" button in the product
  2. The user prepares an environment to run the service and makes sure it can reach the user's database instances
  3. The user runs the deployment command in that environment; ClickPipes binds the resulting computing service to the user's account, and only that user can use it
  4. When creating data sources, the user can manually bind each one to a computing service; different data sources can be bound to different computing services
  5. When a task runs, it is automatically scheduled onto a qualified computing service
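
The binding and scheduling steps above can be pictured with a small data model. This is only a sketch: the class names, fields, and the definition of "qualified" (same owner, bound to the data source, currently online) are assumptions made for illustration, not the actual ClickPipes scheduler.

```python
from dataclasses import dataclass, field


@dataclass
class ComputeService:
    service_id: str
    owner: str          # account the service was bound to at deployment
    online: bool = True


@dataclass
class DataSource:
    name: str
    owner: str
    bound_service_ids: list = field(default_factory=list)


def qualified_services(source, services):
    """Services owned by the same user, bound to the source, and online."""
    return [
        s for s in services
        if s.online
        and s.owner == source.owner
        and s.service_id in source.bound_service_ids
    ]


def pick_service(source, services):
    """Choose the first qualified service, or None if there is none."""
    candidates = qualified_services(source, services)
    return candidates[0] if candidates else None


# Hypothetical accounts and services.
services = [
    ComputeService("svc-a", owner="alice"),
    ComputeService("svc-b", owner="alice", online=False),
    ComputeService("svc-c", owner="bob"),
]
orders_db = DataSource("orders_db", owner="alice", bound_service_ids=["svc-a", "svc-b"])
chosen = pick_service(orders_db, services)
print(chosen.service_id if chosen else None)  # prints: svc-a
```

Here `svc-b` is skipped because it is offline and `svc-c` because it belongs to another account, mirroring the rule that a computing service can only be used by the user it is bound to.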

Compared with relaying traffic through an SSH proxy machine, this design has the following advantages:

1. Simpler Configuration

  • Collection nodes run directly in the database's network environment, so no additional network relay settings are needed
  • Users can deploy collection nodes on any device (such as a server or virtual machine) in the network where the database is located, without configuring a complex network topology

2. Lower Cost

  • Data processing logic runs locally and the results are pushed to the cloud in batches, which eases network bandwidth pressure
  • Data filtering and similar logic run inside the internal network, so filtered-out data never crosses the public network, reducing bandwidth costs
  • The source, collection service, and target can all be deployed in the same VPC network, avoiding public network bandwidth fees
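
As a simple picture of the first two points, the sketch below filters rows locally and groups the survivors into batches before anything would be uploaded. The row shape, the `keep` predicate, and the batch size are hypothetical; the real processing logic is not described in this article.

```python
from itertools import islice


def batches(rows, size):
    """Yield lists of at most `size` rows from any iterable."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def prepare_upload(rows, keep, batch_size):
    """Filter rows locally, then group the remaining rows into batches."""
    return list(batches((row for row in rows if keep(row)), batch_size))


# Hypothetical rows: only "active" rows need to leave the internal network.
rows = [{"id": i, "status": "active" if i % 2 == 0 else "deleted"} for i in range(10)]
upload = prepare_upload(rows, keep=lambda r: r["status"] == "active", batch_size=2)
print([[r["id"] for r in batch] for batch in upload])
# prints: [[0, 2], [4, 6], [8]]
```

Of the ten source rows, only five would cross the network, in three requests rather than ten.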

3. More Stable Transmission

  • Removes the dependency on SSH tunnels, greatly reducing the risk of synchronization failures caused by network interruptions or fluctuations
  • Does not depend on the central service: data can still be transmitted normally when the network is unstable or the central service fails
  • Local storage mechanisms ensure that data is not lost before transmission completes, making the system more robust
  • Users can provision hardware to match the volume of data transmitted, ensuring performance
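
The local storage point in the list above can be pictured as a spool that keeps each record until the target confirms receipt. The following is a minimal sketch using SQLite from the Python standard library; the `LocalSpool` class, its table layout, and the `send` callable are hypothetical, and the article does not specify how ClickPipes actually stores data locally.

```python
import sqlite3


class LocalSpool:
    """Keep records on local storage until they are confirmed as delivered."""

    def __init__(self, path=":memory:"):  # use a file path for real durability
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS spool ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT NOT NULL)"
        )
        self.db.commit()

    def append(self, payload):
        self.db.execute("INSERT INTO spool (payload) VALUES (?)", (payload,))
        self.db.commit()

    def pending(self):
        return self.db.execute("SELECT id, payload FROM spool ORDER BY id").fetchall()

    def flush(self, send):
        """Send pending records in order; delete each only after send succeeds."""
        delivered = 0
        for row_id, payload in self.pending():
            if not send(payload):
                break  # keep this record and all later ones for the next attempt
            self.db.execute("DELETE FROM spool WHERE id = ?", (row_id,))
            self.db.commit()
            delivered += 1
        return delivered


spool = LocalSpool()
for item in ["a", "b", "c"]:
    spool.append(item)

# Simulate a network failure while sending "b".
first = spool.flush(lambda payload: payload != "b")
print(first, [p for _, p in spool.pending()])   # prints: 1 ['b', 'c']

# The network recovers and the remaining records are delivered.
second = spool.flush(lambda payload: True)
print(second, [p for _, p in spool.pending()])  # prints: 2 []
```

Because a record is deleted only after a successful send, an interruption leaves undelivered data in place for the next attempt instead of losing it.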

4. Better Data Security

  • Data travels directly from the isolated network to the target with no intermediate relay, which makes it more secure
  • The target can be deployed in an area that only accepts access from isolated networks, making transmission more secure
  • Each transmission service is used exclusively by its owner, providing higher security

Conclusion

Through a decentralized collection node design, our new local data synchronization solution gives users stable, efficient, and secure data transmission. The design is particularly suitable for environments with limited network access or high security requirements, letting enterprises focus on data analysis and business decisions rather than on data transmission issues.

We will continue to optimize this solution with more convenient management tools and intelligent optimization features to further improve the user experience.