Adopting Karpenter enabled us to achieve 25%-40% cost savings through efficient Spot instance management and resource optimization. Let's take a look at how we achieved this.
In the first part of this series, I went through what Karpenter is, what its benefits and limitations are, and what to consider when migrating from cluster-autoscaler (CAS).
The second part of this blog series focuses on real-life experience and examples of the actual cost savings we were able to achieve thanks to Karpenter.
Cost Saving Opportunities
Setting aside the performance and reliability benefits, one of the key differences between CAS and Karpenter is that Karpenter considers cost when provisioning a new node. As explained in Part 1 of this series, Karpenter utilizes the AWS EC2 Fleet API instead of AWS Auto Scaling groups. Using EC2 Fleet, Karpenter is able to obtain the current price for each instance type and take it into consideration along with the other constraints and scheduling factors.
Effectively, Karpenter is able to choose the cheapest node available for your workload at that time.
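To make the idea concrete, here is a minimal sketch of price-aware selection: filter the offerings that satisfy the workload's resource needs, then take the cheapest one. This is not Karpenter's actual code. The `Offering` type, the instance names and the prices are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Offering:
    """Hypothetical instance offering; names and prices are illustrative only."""
    instance_type: str
    cpu: int            # vCPUs
    memory_gib: int     # GiB of memory
    hourly_price: float


def cheapest_fit(offerings: Sequence[Offering], cpu_needed: int,
                 memory_needed_gib: int) -> Optional[Offering]:
    """Return the cheapest offering that satisfies the resource constraints."""
    candidates = [
        o for o in offerings
        if o.cpu >= cpu_needed and o.memory_gib >= memory_needed_gib
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda o: o.hourly_price)


OFFERINGS = [
    Offering("type-small", 2, 4, 0.04),
    Offering("type-medium", 4, 16, 0.09),
    Offering("type-medium-alt", 4, 16, 0.07),
    Offering("type-large", 8, 32, 0.20),
]

choice = cheapest_fit(OFFERINGS, cpu_needed=3, memory_needed_gib=8)
print(choice.instance_type if choice else "no offering fits")
```

The real decision involves many more constraints (zones, architectures, taints and so on), but the principle is the same: price is one of the inputs, not an afterthought.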
Bin Packing
Another layer that affects the cost savings is the bin packing approach that Karpenter uses. When provisioning a new node, Karpenter batches the pending pods together and chooses the node size based on their combined resource requests. It tries to minimize wasted resources in the cluster. Moreover, it is able to reschedule existing workloads if the size of the cluster and the need for capacity change. This means that if smaller nodes can accommodate your existing workload, Karpenter can replace the current nodes in the cluster with them.
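The batching idea can be sketched roughly as follows, assuming a hypothetical list of node sizes ordered from smallest to largest. The sketch sums the requests of the pending pods and picks the smallest node that fits them all. It ignores system overhead, daemonsets and everything else a real scheduler has to account for.

```python
from typing import List, Optional, Tuple

# (cpu_millicores, memory_mib) requested by a pod
PodRequest = Tuple[int, int]

# Hypothetical node sizes, ordered from smallest to largest:
# (name, cpu_millicores, memory_mib)
NODE_SIZES = [
    ("small", 2000, 4096),
    ("medium", 4000, 8192),
    ("large", 8000, 16384),
]


def combined_requests(pods: List[PodRequest]) -> Tuple[int, int]:
    """Sum CPU and memory requests over a batch of pods."""
    return sum(p[0] for p in pods), sum(p[1] for p in pods)


def smallest_node_for(pods: List[PodRequest]) -> Optional[str]:
    """Pick the smallest node size that fits the whole batch, if any."""
    cpu, mem = combined_requests(pods)
    for name, node_cpu, node_mem in NODE_SIZES:
        if node_cpu >= cpu and node_mem >= mem:
            return name
    return None


pending = [(500, 1024), (1500, 2048), (1000, 2048)]
print(smallest_node_for(pending))      # batch needs 3000m CPU and 5120 MiB

# After one pod is scaled away, a smaller node is enough for what remains.
remaining = [(500, 1024), (1000, 2048)]
print(smallest_node_for(remaining))
```

The second call illustrates the consolidation case: once the remaining pods fit on a smaller node, replacing the larger node saves money.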
Spot Instances
This approach proves especially useful if you use Spot instances in your cluster. Karpenter can effectively pick the cheapest, most appropriately sized node at that time. Moreover, thanks to EC2 Fleet, it’s able to consider the interruption chance of a Spot instance and pick instances with a lower chance of being terminated.
- The EC2 fleet API attempts to provision the instance type based on the Price Capacity Optimized allocation strategy. For the on-demand capacity type, this is effectively equivalent to the lowest-price allocation strategy. For the spot capacity type, Fleet will determine an instance type that has both the lowest price combined with the lowest chance of being interrupted. Note that this may not give you the instance type with the strictly lowest price for spot. [1]
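A toy comparison shows why the two strategies can disagree. AWS does not publish how Fleet weighs price against interruption risk, so the scoring function, the `risk_penalty` parameter and all numbers below are assumptions made purely for illustration.

```python
from typing import List, Tuple

# (instance_type, hourly_price, interruption_risk between 0 and 1)
# All values are made up for this example.
SpotPool = Tuple[str, float, float]

SPOT_POOLS: List[SpotPool] = [
    ("type-a", 0.030, 0.20),
    ("type-b", 0.032, 0.05),
    ("type-c", 0.050, 0.03),
]


def lowest_price(pools: List[SpotPool]) -> str:
    """Pick purely by price."""
    return min(pools, key=lambda p: p[1])[0]


def price_and_risk(pools: List[SpotPool], risk_penalty: float = 0.1) -> str:
    """Pick by price plus a penalty proportional to interruption risk.

    This scoring is an assumption, not the real Fleet algorithm.
    """
    return min(pools, key=lambda p: p[1] + risk_penalty * p[2])[0]


print(lowest_price(SPOT_POOLS))
print(price_and_risk(SPOT_POOLS))
```

With these numbers the cheapest pool is also the riskiest one, so a strategy that weighs both factors settles on a slightly more expensive but more stable instance type.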
On-demand Instances
When Spot capacity is lacking, Karpenter is able to fall back to on-demand instances. This is a crucial feature that was hard to achieve with CAS. Spot instances of some instance families, such as the variants of t3, are often in short supply in some regions. We ran into this issue especially in eu-central-1, where we often lacked the capacity to roll out entire clusters during large cluster upgrades. This resulted in paused rollouts and the need to introduce other instance families into our CAS-managed node groups. Karpenter effectively eliminated this issue and gives us the ability to fall back to on-demand instances if needed.
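The fallback behaviour can be sketched as a simple ordered retry. In Karpenter the allowed capacity types are declared through the `karpenter.sh/capacity-type` requirement on a NodePool. Everything else here, including the `launch` function and the `InsufficientCapacity` exception, is a hypothetical stand-in for the cloud provider call.

```python
from typing import FrozenSet, Sequence


class InsufficientCapacity(Exception):
    """Raised when a capacity type cannot be fulfilled (hypothetical)."""


def launch(capacity_type: str, available: FrozenSet[str]) -> str:
    """Stand-in for a provider call; fails if the capacity type is unavailable."""
    if capacity_type not in available:
        raise InsufficientCapacity(capacity_type)
    return f"node-{capacity_type}"


def provision(available: FrozenSet[str],
              preferred: Sequence[str] = ("spot", "on-demand")) -> str:
    """Try each allowed capacity type in order of preference."""
    for capacity_type in preferred:
        try:
            return launch(capacity_type, available)
        except InsufficientCapacity:
            continue  # fall through to the next capacity type
    raise InsufficientCapacity("no capacity in any allowed capacity type")


print(provision(frozenset({"spot", "on-demand"})))  # Spot is available
print(provision(frozenset({"on-demand"})))          # Spot is exhausted
```

The important part is that the rollout keeps going: when Spot is exhausted, the node is still created, just at the on-demand price.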
Customer Examples
We offer Karpenter as part of our cloud platform called LARA. We utilize this platform for almost all of our customers and it gives us the opportunity to maintain and upgrade their environments more efficiently. As part of our regular maintenance cycle, we rolled out Karpenter to their environments. Originally, this was to address the issues with CAS, such as PVC allocation issues and performance. These and other issues were mentioned in Part 1 of this series.
The surprising side effect was how much we were able to lower the cost of running these Kubernetes clusters just by switching to Karpenter. See for yourself.
It turned out that by switching to Karpenter, we achieved around 25% – 40% cost savings. The setup didn’t involve any other changes. We had always been running fully on Spot instances and had already right-sized the workloads running in the clusters. The bin packing algorithm and the ability to pick the cheapest node at that time were enough to produce these savings.
Other Considerations
When you’re using EC2 Savings Plans, it might be a good idea to utilize weighted NodePools and prioritize the instance families that you’ve purchased the Savings Plan for. In our case, this was not really an issue, since we’re running all of the customer workloads on Spot instances.
There’s an open issue for allowing Karpenter to automatically take Savings Plans into account. For now, however, you’ll have to set the weights manually and simply prefer the instance families that you’ve purchased.
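Here is a rough sketch of how weights can express that preference. A NodePool with a higher weight is considered first, and a CPU limit on it lets provisioning spill over to the next pool once the committed capacity is used up. The `Pool` type, the pool names, families and limits are all illustrative assumptions, not our actual configuration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Pool:
    """Simplified, hypothetical view of a weighted NodePool."""
    name: str
    weight: int                 # higher weight is tried first
    families: Tuple[str, ...]   # instance families the pool may use
    cpu_limit: int              # max vCPUs the pool may provision
    cpu_used: int = 0


def choose_pool(pools: List[Pool], cpu_needed: int) -> Optional[Pool]:
    """Return the highest-weight pool that still has room under its limit."""
    for pool in sorted(pools, key=lambda p: p.weight, reverse=True):
        if pool.cpu_used + cpu_needed <= pool.cpu_limit:
            return pool
    return None


pools = [
    Pool("default-spot", weight=10, families=("any",), cpu_limit=1000),
    Pool("savings-plan", weight=100, families=("m5",), cpu_limit=16),
]

first = choose_pool(pools, cpu_needed=8)
print(first.name)            # the Savings Plan pool is preferred

pools[1].cpu_used = 16       # committed capacity is now fully used
second = choose_pool(pools, cpu_needed=8)
print(second.name)           # provisioning spills over to the default pool
```

Sizing the limit of the preferred pool to match the commitment is what keeps you from provisioning more of that family than the Savings Plan covers.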
Conclusion
Although our original motivation for introducing Karpenter was not purely to save costs, it turned out to be a very effective tool for that. Even though our setup was already optimized to use Spot instances and the instance types best suited to the workloads we run, Karpenter introduced more flexibility. It saved around 25% – 40% of the costs in our clusters.
For us, Karpenter has become the standard way of scaling EKS clusters for each of our customers, with little to no side effects. Although we still utilize CAS to run a very small node pool for our system tooling, including Karpenter, all of the application workloads of our customers now run on Karpenter-managed nodes.
It’s worth mentioning that there are other approaches to optimizing your EKS bill, which can lead to further savings. It’s good practice to regularly right-size your workloads and tune the resource requests to the actual usage patterns. Tuning the scaling policies and utilizing custom metrics for scaling, for example the lengths of message queues, can be a great way to save some money as well. Another great tip is to turn off unnecessary workloads, such as non-production environments, during nights and weekends when nobody is using them. But that’s a story for another blog. Stay tuned.