At Pixel Federation, we currently run our infrastructure on a privately hosted OpenStack. After a while, we decided to see whether there were better options for running our workloads and deploying our code.
There were two obvious choices:
◦ Run everything in containers on Kubernetes
◦ See if public clouds make sense
After long research, we concluded that the answer to both questions is yes. So I’m happy to report that we are already running production on Kubernetes on AWS. I will try to answer the following questions about our deployments:
How do we deploy Kubernetes?
What have we learned after running Kubernetes in production for a while, a.k.a. Day Two Kubernetes operations?
How to stay out of trouble with Kubernetes
Kubernetes Deployment
For Kubernetes deployment we use a tool called kops. It has excellent AWS support, and it works great with other providers (GCE, VMware vSphere), too. Kops can be used as is, or it can generate CloudFormation/Terraform templates. We use the Terraform output, and so far it has worked great. Kubernetes is a great tool and you can run almost anything on it.
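As a rough sketch of that workflow (cluster name, state bucket, zones, and node counts below are placeholders, not our production values), kops can emit a Terraform template instead of touching AWS directly:

```shell
# 1. Generate a Terraform template instead of applying changes directly:
kops create cluster \
  --name=k8s.example.com \
  --state=s3://example-kops-state \
  --zones=eu-west-1a,eu-west-1b,eu-west-1c \
  --master-count=3 \
  --node-count=6 \
  --target=terraform \
  --out=.

# 2. Review and apply the generated template with Terraform:
terraform init
terraform plan
terraform apply
```

The nice part of this split is that the cluster changes go through the same plan/review/apply cycle as the rest of our Terraform-managed infrastructure.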
Here are some things we learned:
Make your cluster highly available by using multiple masters/nodes in different AZs and by using autoscaling.
◦ For nodes make sure you have at least 2 nodes per AZ.
◦ Don’t run stateful services inside Kubernetes; it’s best to run databases/Redis/ElastiCache outside as managed services.
◦ Use an ELK/EFK stack for log gathering (it will make any troubleshooting easier).
◦ Deploy Prometheus/Grafana for monitoring.
Use Kubernetes addons to extend its functionality:
◦ External-DNS
◦ Cluster Autoscaler
◦ Ingress controllers
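For example, External-DNS watches Services and Ingresses for a hostname annotation and creates the matching DNS record (in Route 53, in our case). A minimal sketch, with a placeholder name and hostname:

```yaml
# Hypothetical Service: External-DNS sees the annotation below and
# creates/updates the api.example.com record pointing at the ELB.
apiVersion: v1
kind: Service
metadata:
  name: game-api
  annotations:
    external-dns.alpha.kubernetes.io/hostname: api.example.com
spec:
  type: LoadBalancer
  selector:
    app: game-api
  ports:
    - port: 80
      targetPort: 8080
```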
Application Deployment
Deploying applications to Kubernetes might be challenging and we learned a lot by doing it.
- Health checks are really important; make sure your application has them and that they work correctly.
- Use one process per container; use sidecar containers if necessary.
- Make sure you set up HPA (Horizontal Pod Autoscaler) and CA (Cluster Autoscaler) properly.
- Use Helm for application manifest deployment.
- Use the best deployment update method for your application.
- Deployment types integrated in Kubernetes:
  - Rolling Update
  - Recreate
- Deployment types scripted via Helm:
  - Blue/Green
  - A/B
  - Canary
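Several of the points above come together in a single Deployment manifest. This is an illustrative sketch (the app name, image, and probe path are placeholders): health checks via probes, a RollingUpdate strategy that keeps capacity during a rollout, and an HPA targeting the Deployment.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: game-api
spec:
  replicas: 4
  selector:
    matchLabels:
      app: game-api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # never take down more than one pod at a time
      maxSurge: 1         # allow one extra pod during the rollout
  template:
    metadata:
      labels:
        app: game-api
    spec:
      containers:
        - name: game-api
          image: example/game-api:1.0.0   # placeholder image
          ports:
            - containerPort: 8080
          readinessProbe:                 # gates traffic to the pod
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:                  # restarts a wedged container
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 20
---
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: game-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: game-api
  minReplicas: 4
  maxReplicas: 20
  targetCPUUtilizationPercentage: 70
```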
Day Two Kubernetes Operations
Running Kubernetes in production can be challenging, but we have found that following a few simple rules really makes your life easier.
Kubernetes Scheduler
Left alone, it will run your servers into the ground: by default, the scheduler will pack as many resources as possible onto your nodes. That can cause instability and trouble for your apps, so make sure you reserve some resources for the operating system. RAM in particular is important. Currently, we reserve at least 15% of RAM on each node for OS use.
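With kops, such reservations can be expressed in the cluster spec via `kops edit cluster`. The fragment below is a sketch with illustrative amounts, not our production values:

```yaml
# kops cluster-spec fragment: carve resources out of each node's
# allocatable capacity for the kubelet and the OS, so the scheduler
# cannot hand them to pods.
spec:
  kubelet:
    kubeReserved:
      cpu: "200m"
      memory: "1Gi"
    systemReserved:
      cpu: "100m"
      memory: "512Mi"
```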
Cluster Autoscaler
Make sure you configure your app deployments properly, using PodDisruptionBudget and podAntiAffinity to distribute your pods across nodes. Also make sure the Auto Scaling groups running your nodes have the AWS ASG rebalancing (AZRebalance) process disabled.
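These two settings might look like the following sketch (the app name is a placeholder; note the resource is called PodDisruptionBudget). The budget stops evictions from dropping below two ready pods while the Cluster Autoscaler drains a node, and the anti-affinity rule asks the scheduler to spread replicas over different nodes:

```yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: game-api-pdb
spec:
  minAvailable: 2          # keep at least 2 pods up during voluntary evictions
  selector:
    matchLabels:
      app: game-api
---
# Fragment for the pod template's spec: prefer placing replicas
# on different nodes.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              app: game-api
```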
Make sure you set Auto Scaling group limits with some room to grow. Kubernetes can handle huge load, but it still needs compute resources. Be aware of AWS’s own limits, too: there is nothing worse than having a traffic spike and not being able to launch new nodes because you reached your instance limit for a particular region.
Node problems
Detecting problems on Kubernetes nodes as soon as possible is critical. It helps you prevent trouble with misbehaving applications and makes your life easier. For node problem detection we use node-problem-detector.
Application troubles
Make sure you define your application pods’ limits/requests properly. It might be a pain at the beginning, as you have to understand how much RAM/CPU your application needs, but it will help the Kubernetes scheduler make proper decisions.
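In the pod spec that boils down to a fragment like this (the numbers are placeholders; measure your own app first). Requests are what the scheduler uses for placement; limits cap what the container may actually consume:

```yaml
# Container-level resources fragment for a Deployment's pod template.
resources:
  requests:           # used by the scheduler to pick a node
    cpu: "250m"
    memory: "256Mi"
  limits:             # hard cap enforced at runtime
    cpu: "500m"
    memory: "512Mi"
```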
Make sure your health checks work properly and give you the right information. The worst case is a GREEN health check while the application is unable to serve customers.
Container Security
Make sure you check your containers for security issues and update them often. Running three-year-old containers full of security issues is a really BAD practice.
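The post doesn’t name a specific tool, but as one example, a scanner such as Trivy can do this check in CI (the image name below is a placeholder):

```shell
# Scan an image for known CVEs before shipping it.
trivy image example/game-api:1.0.0

# Fail the CI pipeline on high/critical findings:
trivy image --severity HIGH,CRITICAL --exit-code 1 example/game-api:1.0.0
```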
Future
◦ We are currently planning to deploy our first production cluster to China.
◦ We plan to evaluate service mesh networking with Istio.
◦ We would like to test the AWS ALB Ingress Controller.
So this is our approach to Kubernetes. What is yours? Any tips or struggles you would like to share and discuss? Then join our Facebook group called Free to Play Game Developers and feel free to invite your fellow game developers as well.