Today at the Kafka Summit in San Francisco, LinkedIn announced Cruise Control, a new load balancing tool developed to help keep Kafka clusters up and running.
The company developed Kafka, an open source message streaming tool, to make it easier to move massive amounts of data around a network from application to application. It has become so essential that LinkedIn has dedicated 1,800 servers to it, moving over 2 trillion transactions per day through Kafka, Jiangjie Qin, lead software engineer on the Cruise Control project, told TechCrunch.
With that kind of volume, keeping the Kafka clusters running has become mission-critical, so earlier this year the team decided to create a tool that would recognize when a cluster was going to break. Then, based on a set of predefined rules, it would automatically configure the cluster to use the correct number of resources, fix itself and keep running. The tool became Cruise Control.
Prior to creating Cruise Control, engineers had to manually reconfigure a cluster each time one went down, and Qin says this was a tricky proposition because an incorrect configuration could have a cascading impact across clusters. By putting the machine in charge of cluster management with some human oversight, the team greatly simplified the process and was able to scale cluster repair to meet the needs of its growing network in a way that just wasn’t possible when engineers had to do all of the work manually.
At its core, Qin explained, this was a load balancing problem: Did the cluster have the right number of resources to stay running without having a negative impact on other clusters in the network? He said this was a matter of identifying some common configurations and applying a set of goals to each one. The machine can very quickly assess the needs of the cluster and check it against the set of common configurations and the set of goals to choose the correct one.
To make sure it’s on track, it’s possible to put a human check in the workflow, where Cruise Control asks an engineer to review the optimization plan before continuing.
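The approach Qin describes, checking candidate configurations against a set of goals and optionally asking an engineer to sign off on the plan, can be sketched roughly as follows. This is an illustrative Python sketch only, not Cruise Control's actual code or API; every name, threshold and data shape here (`under_capacity`, `balanced`, `choose_config`, `apply_plan`, the broker load figures) is a hypothetical stand-in.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical model: a candidate configuration maps a broker name to its
# projected load, expressed as a fraction of that broker's capacity.
Config = Dict[str, float]
Goal = Callable[[Config], bool]


def under_capacity(limit: float) -> Goal:
    """Goal: no broker is loaded beyond `limit` of its capacity."""
    return lambda cfg: all(load <= limit for load in cfg.values())


def balanced(max_spread: float) -> Goal:
    """Goal: busiest and idlest brokers differ by at most `max_spread`."""
    return lambda cfg: max(cfg.values()) - min(cfg.values()) <= max_spread


def choose_config(candidates: List[Config], goals: List[Goal]) -> Optional[Config]:
    """Return the first candidate that satisfies every goal, or None."""
    for cfg in candidates:
        if all(goal(cfg) for goal in goals):
            return cfg
    return None


def apply_plan(cfg: Optional[Config],
               approve: Optional[Callable[[Config], bool]] = None) -> bool:
    """Report whether the plan would go ahead.

    If an `approve` callback is given, it plays the role of the engineer
    reviewing the plan before anything changes.
    """
    if cfg is None:
        return False
    if approve is not None and not approve(cfg):
        return False
    return True  # a real system would reconfigure the cluster here


candidates = [
    {"broker-1": 0.95, "broker-2": 0.30},  # overloaded and lopsided
    {"broker-1": 0.60, "broker-2": 0.65},  # within both goals
]
goals = [under_capacity(0.80), balanced(0.20)]

plan = choose_config(candidates, goals)
print(plan)
print(apply_plan(plan, approve=lambda cfg: True))
```

Keeping each goal as a small independent check means new rules can be added without touching the selection logic, and the optional approval callback keeps the human review step out of the fully automated path.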
If this seems like a tool that would have been nice to have before this, Qin acknowledges that it is, but it took the scalability issues to drive the company to apply the engineering resources to find a solution to the problem.
It took about half a year of tinkering to find the right solution where the machine could process the changes more efficiently than humans could. The company plans to release the tool to the open source community with the goal of not only improving the way it keeps Kafka clusters in balance, but also applying the same load balancing principles to other distributed systems, which should come in handy for a number of use cases, Qin says.
