How to air gap an EKS workload

AUTHOR

Jakub Stehlík

DevOps Engineer

Explore best practices for isolating sensitive workloads, managing private communication, configuring VPC endpoints, and securing applications with firewalls and private artifact repositories.

Introduction

Whether you are deploying sensitive software, need to adhere to strict compliance policies, or just don’t like to expose your traffic to prying eyes, air gapping might be the way to go.

Warning: Completely isolating applications or networks may not always be feasible and definitely isn’t easy. Why? Some components in the air gapped environment may require internet access by design, with no way to reconfigure them. Air gapping gives you more control, which in turn brings more management overhead. You should therefore carefully weigh your needs in terms of functionality, security, time and cost, and be ready to make exceptions where necessary.

In this blog post, we will show you how to isolate sensitive workloads running on AWS EKS, getting as close to a fully air gapped setup as possible, while also pointing out the exceptions you may have to make along the way.

Architecture

The entire architecture is divided into three AWS accounts:

Workload: The account where the sensitive workload lives. Your infrastructure may have multiple workload accounts (e.g. dev, test, prod).
Shared: Houses common tools, CI/CD pipelines, and artifact repositories.
Network: AWS Transit Gateway, firewall, VPNs and other networking tools are deployed here.

3rd parties in AWS as well as internal users want to communicate with the deployed workload. The workload itself may unavoidably need to access the internet.

Requirements

Workload accounts running sensitive applications should be isolated; little to no traffic to the internet from applications or infrastructure components should be allowed, and all traffic to/from the internet should be inspected by the firewall. AWS services should be accessed via internal AWS network (if possible). EKS clusters should be installed privately. Users should access the applications via private connections.
Artifacts should be kept in private artifact repositories. This includes Docker images, Helm charts, software packages and libraries. These artifacts should be checked for vulnerabilities both during the initial download from the internet and also regularly in the internal repositories to prevent supply chain attacks.
The shared account adheres to the same rules as workload accounts, but with more exceptions. For example, the CI/CD pipelines in the shared account validate external artifacts and download them for internal use, and thus require access to the internet.

Basic Steps (TL;DR)

Configure the network components (subnets, Transit Gateway, firewall, VPC endpoints).
Store artifacts in private repositories. This includes Docker images, Helm charts, software packages, libraries etc.
Install and configure EKS.
Install and configure the workload.

Network

Transit Gateway

Communication between AWS accounts can be handled by AWS Transit Gateway, among other options. The Transit Gateway (TGW) can be deployed in any account (in our case, the network account) and shared via AWS Resource Access Manager. Every VPC that connects to the TGW needs a TGW attachment.

Traffic between accounts passing through the TGW can be inspected by the firewall, which adds latency and cost, or controlled more simply, e.g. with security groups.

Subnets

Having a clear subnetting strategy can simplify your routing and firewall configuration. All subnets in workload accounts should be private. For EKS, create separate subnets for the control plane and the data plane (nodes, pods…). Resources of other AWS services (e.g. RDS, EFS…), TGW attachments, and components that handle access to/from 3rd party applications should each be placed in their own subnets.
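
To make the idea concrete, here is a minimal sketch that carves a VPC range into one equally sized block per purpose. The CIDR 10.20.0.0/16, the /20 block size and the purpose names are assumptions made for illustration; a real plan would also split each block across availability zones.

```python
import ipaddress


def plan_subnets(vpc_cidr, purposes, new_prefix):
    """Assign one equally sized block per purpose, in the order given."""
    vpc = ipaddress.ip_network(vpc_cidr)
    blocks = vpc.subnets(new_prefix=new_prefix)
    plan = {}
    for purpose in purposes:
        try:
            plan[purpose] = next(blocks)
        except StopIteration:
            raise ValueError(f"{vpc_cidr} has no free /{new_prefix} left for {purpose}") from None
    return plan


# Hypothetical purposes mirroring the subnet types described above.
PURPOSES = [
    "eks-control-plane",
    "eks-data-plane",
    "aws-services",
    "tgw-attachment",
    "third-party-access",
]

plan = plan_subnets("10.20.0.0/16", PURPOSES, new_prefix=20)
for purpose, subnet in plan.items():
    print(f"{purpose:20} {subnet}")
```

Keeping one contiguous block per purpose means a firewall or routing rule can refer to a single CIDR instead of a list of individual subnets.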

Firewall

Having a firewall is one thing; having it configured correctly is another. The firewall should deny all traffic from workload accounts to the internet, with only a few exceptions:

AWS services that cannot be accessed privately (e.g. the Route 53 API only has public endpoints and doesn’t integrate with PrivateLink)
Public endpoints used by critical add-ons (e.g. AWS pricing API called by Cluster Autoscaler or Karpenter)
Critical 3rd party services called by your workload

Configure your routing so that all traffic to the internet first passes through the firewall – we don’t want anything to slip through unchecked. A SIEM solution can be deployed alongside the firewall to detect anomalous, undesired or downright dangerous network traffic patterns.
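
The deny-by-default logic with a short list of exceptions can be sketched as follows. This is only a model of the decision, not a configuration for any particular firewall product. The leading-dot wildcard is a convention of this sketch, and `.partner.example.com` is a hypothetical 3rd party domain.

```python
def is_egress_allowed(hostname, allowed_domains):
    """Return True only if the hostname matches an explicit allowlist entry."""
    host = hostname.lower().rstrip(".")
    for entry in allowed_domains:
        entry = entry.lower()
        if entry.startswith("."):
            # ".example.com" matches any subdomain, but not the bare domain.
            if host.endswith(entry):
                return True
        elif host == entry:
            return True
    # Anything that is not explicitly listed is denied.
    return False


ALLOWED = [
    "route53.amazonaws.com",                 # Route 53 API
    "api.pricing.us-east-1.amazonaws.com",   # AWS pricing API
    ".partner.example.com",                  # hypothetical critical 3rd party service
]

for host in ["route53.amazonaws.com", "api.partner.example.com", "pypi.org"]:
    print(host, "->", "allow" if is_egress_allowed(host, ALLOWED) else "deny")
```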

Accessing your applications

Your applications may be running in an isolated environment, but you still might want users or maintainers to access them. AWS offers private API Gateways or PrivateLink connections for 3rd parties to connect to your application. Alternatively, you can expose the applications publicly, but in that case, the public endpoint should be given the best protection possible, for example by implementing mTLS, or enabling WAF and DDoS protection.

Internal users should access your apps and infra via VPN. For a fully private connection between your on-premises environment and AWS, consider AWS Direct Connect.

Putting it together

A 3rd party makes a request to your application via a PrivateLink connection. The request reaches its intended destination – a pod running on the EKS cluster. To complete the request, the pod sadly needs to call an external service running on the internet.

The packet first enters the TGW attachment and is transmitted via the Transit Gateway to the network account, where it is routed to the firewall. The firewall validates whether such communication is allowed and, if so, sends the packet to the internet via the NAT Gateway (in a public subnet) and the Internet Gateway. The response follows the same route in reverse. The 3rd party’s request never crosses the public internet, and the only traffic that leaves AWS is the outbound call that the firewall inspected and allowed.

VPC Endpoints Interlude

VPC endpoints can be used to access AWS services privately. Their main downside is cost: most of them are expensive.

Gateway VPC endpoints

S3 and DynamoDB can be accessed via free Gateway VPC endpoints. These endpoints can be deployed in all AWS accounts without incurring any charges. Gateway VPC endpoints only manage traffic from their given VPC (can’t be used for cross-VPC/cross-account routing or access from on-premises).

Interface VPC endpoints

Interface VPC endpoints (powered by AWS PrivateLink) are rather expensive. The monthly price of one VPC endpoint deployed in 3 AZs in the Frankfurt region, excluding traffic costs, is $0.012 per AZ-hour * 3 AZs * 730 hours = $26.28. They should therefore be deployed sparingly: only for services that you use, and only in accounts where you actually intend to call the given AWS services. Here is the official list of AWS services which integrate with AWS PrivateLink: https://docs.aws.amazon.com/vpc/latest/privatelink/aws-services-privatelink-support.html.

Interface VPC endpoints can be deployed in a central AWS account (with traffic from other AWS accounts or on-premises routed to them), or per account. The following diagram shows the monthly price comparison of deploying centrally vs deploying in multiple accounts for different traffic sizes.

Note the break-even points between centralised deployment and each account having its own VPC endpoint. The price hike for the central endpoint is caused by traffic having to pass through TGW to reach the endpoint.
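
The break-even point can be estimated with a few lines of arithmetic. The sketch below reuses the $0.012 hourly price from above; the per-GB prices for endpoint and TGW data processing are assumptions, so check current AWS pricing for your region. It also assumes that the TGW attachments exist anyway and that all traffic to a central endpoint crosses the TGW.

```python
HOURS_PER_MONTH = 730
ENDPOINT_HOURLY_PER_AZ = 0.012  # USD, Frankfurt, from the calculation above
ENDPOINT_PER_GB = 0.01          # USD, assumed endpoint data processing price
TGW_PER_GB = 0.02               # USD, assumed TGW data processing price


def endpoint_fixed_monthly(azs):
    """Fixed monthly cost of one interface endpoint spread over `azs` AZs."""
    return ENDPOINT_HOURLY_PER_AZ * azs * HOURS_PER_MONTH


def per_account_cost(accounts, azs, gb_total):
    """Each account runs its own endpoint; traffic never touches the TGW."""
    return accounts * endpoint_fixed_monthly(azs) + gb_total * ENDPOINT_PER_GB


def central_cost(azs, gb_total):
    """One shared endpoint; every GB additionally pays TGW data processing."""
    return endpoint_fixed_monthly(azs) + gb_total * (ENDPOINT_PER_GB + TGW_PER_GB)


def break_even_gb(accounts, azs):
    """Monthly traffic at which both deployment models cost the same."""
    return (accounts - 1) * endpoint_fixed_monthly(azs) / TGW_PER_GB


print(f"fixed cost per endpoint: ${endpoint_fixed_monthly(3):.2f}")
print(f"break-even for 3 accounts: {break_even_gb(3, 3):.0f} GB per month")
```

Under these assumptions, the central endpoint is cheaper below the break-even volume, and per-account endpoints are cheaper above it.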

VPC endpoint policies

VPC endpoints can be enriched with policies that further limit which AWS resources can be accessed via the endpoint. For example, in a policy for the ECR VPC endpoint, you can specify which ECR repositories can be accessed via the endpoint, and whether the access is read-write or read-only.
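
As an illustration, a read-only policy for the ECR API endpoint could be generated like this. It is a sketch rather than a complete setup: the account ID, region and repository path are placeholders, and the list of pull actions reflects what we would expect an image pull to need, so validate it against your own pulls.

```python
import json

# Actions needed to pull images; no push or delete actions are included.
PULL_ACTIONS = [
    "ecr:BatchCheckLayerAvailability",
    "ecr:BatchGetImage",
    "ecr:GetDownloadUrlForLayer",
]


def ecr_read_only_endpoint_policy(repository_arns):
    """Build an endpoint policy allowing pulls from the given repositories only."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowPullFromApprovedRepositories",
                "Effect": "Allow",
                "Principal": "*",
                "Action": PULL_ACTIONS,
                "Resource": list(repository_arns),
            },
            {
                # Logging in to the registry is not tied to a single repository.
                "Sid": "AllowRegistryLogin",
                "Effect": "Allow",
                "Principal": "*",
                "Action": "ecr:GetAuthorizationToken",
                "Resource": "*",
            },
        ],
    }


# Placeholder ARN: account 111122223333 and the "platform/" prefix are made up.
policy = ecr_read_only_endpoint_policy(
    ["arn:aws:ecr:eu-central-1:111122223333:repository/platform/*"]
)
print(json.dumps(policy, indent=2))
```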

Exceptions

Not all AWS services are reachable via VPC endpoints. The API of Route 53, AWS’s crucial DNS service, is only reachable via a public endpoint. Traffic to/from the Route 53 API should therefore be whitelisted in the firewall.

Artifacts

Before installing EKS and applications, the environment needs to have access to the artifacts that are actually going to be installed. An air gapped environment can’t download them from the internet. Therefore, the artifacts either need to be stored privately (ideal), or installed via a private connection to a trusted source at runtime (not ideal).

Downloading the artifacts

Retrieving artifacts from 3rd party sources and scanning them needs to be automated; the less manual overhead, the better. Example automation:

Provide list of desired artifacts to a pipeline
Pipeline downloads the artifacts and scans them for vulnerabilities
Artifacts get approved by stakeholders
Pipeline pushes the artifacts to their designated private artifact repository
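
The steps above can be sketched as a small driver function. The `download`, `scan`, `approve` and `push` callables are hypothetical stand-ins for your own tooling (a registry client, a scanner such as Trivy, an approval workflow), not real APIs.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IntakeResult:
    artifact: str
    status: str  # "published" or "rejected"
    findings: List[str] = field(default_factory=list)


def run_intake(artifacts, download, scan, approve, push):
    """Download, scan, get approval for, and publish each requested artifact."""
    results = []
    for name in artifacts:
        local_path = download(name)   # fetch from the 3rd party source
        findings = scan(local_path)   # list of vulnerability findings
        if approve(name, findings):   # stakeholders decide with findings in hand
            push(local_path)          # publish to the private repository
            results.append(IntakeResult(name, "published", findings))
        else:
            results.append(IntakeResult(name, "rejected", findings))
    return results
```

Passing the steps in as functions keeps the flow testable without network access, and lets you swap the scanner or the approval mechanism without touching the pipeline logic.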

Scanning for vulnerabilities

Scans in the pipeline can be run with a variety of tools, e.g. Trivy. ECR also offers free basic scanning, limited to one scan per image per day.

Scanning should be performed both during the initial download of a new artifact and regularly on already stored artifacts, as vulnerabilities may appear over time. A patching process should be in place to remedy vulnerabilities found by the scan.

Artifact storage

Different artifacts require different artifact repositories:

For simplicity, the artifact repositories should be kept in a central location, e.g. the shared account. All other accounts can download the artifacts from there via private channels (TGW, ECR VPC endpoints…).

EKS

Installation

The EKS cluster should be installed in the control plane subnet. An appropriate security group should be created and attached to allow access to the API endpoint from VPN or a different trusted network.

The cluster API endpoint needs to be private. You can verify that the cluster endpoint is private by performing a DNS lookup for the endpoint; a public DNS server should return a private IP address.
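
A minimal sketch of that check follows. Note that `socket.getaddrinfo` asks whichever resolver the machine is configured with, so run it from a host that uses a public DNS server. The hostname in the usage comment is a placeholder.

```python
import ipaddress
import socket


def resolve_ips(hostname):
    """Resolve a hostname with the system resolver and return unique IPs."""
    infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def all_private(addresses):
    """True if there is at least one address and every address is private."""
    addresses = list(addresses)
    return bool(addresses) and all(
        ipaddress.ip_address(addr).is_private for addr in addresses
    )


# Usage (placeholder hostname, copy the real one from your cluster details):
# ips = resolve_ips("<cluster-id>.<suffix>.eu-central-1.eks.amazonaws.com")
# print(ips, "private" if all_private(ips) else "NOT private")
```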

Nodes

EKS optimized AMIs are readily available; see https://docs.aws.amazon.com/eks/latest/userguide/eks-optimized-ami.html. The instance may attempt to update packages during spin-up, e.g. via yum. Make sure that the package repository is reachable from the air gapped environment via a firewall rule or VPC endpoint.

For example, the Amazon Linux 2 default package repository is located in S3 (https://aws.amazon.com/premiumsupport/knowledge-center/ec2-al1-al2-update-yum-without-internet/). If you have an S3 VPC endpoint configured in your environment, the instance should have no problem reaching the repository. Alternatively, you can build a golden AMI with all packages already installed.

Nodes should be placed in the EKS data plane subnet (not the control plane subnet). Their respective EC2 instances should be assigned security groups to allow only specified traffic to pass in and out of the instance.

Workload

Now that the EKS cluster is installed, we can run workloads on it. Let’s divide the workloads into two categories:

External: All workloads that you don’t develop yourself (contributing to open-source doesn’t count, sorry): add-ons, observability tools, service meshes, bought products. You can’t change their code, but you can configure them.
Internal: Your own applications. You have complete control over their design, code and configuration.

External workloads

To make external workload installation air gapped, a few things must be ensured:

Use of private artifacts (Helm charts, Docker images…)
Configuration of functionality (if necessary)

Since we don’t have access to the internet, we can’t just download the necessary artifacts from the internet; they have to come from internal artifact repositories. If you deploy workloads via Helm charts, make sure the Helm releases use charts from private repositories.

Careful, some workloads use more than one container. Make sure you have all the images in your private repository before installation, and modify your configurations to pull images from these repositories.
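
One way to catch stray public images before installation is to walk the rendered manifests and list every image that does not come from your private registry. The sketch below works on manifests already parsed into Python dicts. The registry hostname and the image names are placeholders.

```python
def collect_images(obj):
    """Recursively collect every string stored under an "image" key."""
    found = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            if key == "image" and isinstance(value, str):
                found.append(value)
            else:
                found.extend(collect_images(value))
    elif isinstance(obj, list):
        for item in obj:
            found.extend(collect_images(item))
    return found


def images_outside_registry(manifest, private_registry):
    """Return the images that are not pulled from the private registry."""
    prefix = private_registry.rstrip("/") + "/"
    return [img for img in collect_images(manifest) if not img.startswith(prefix)]


# Placeholder registry and images, for illustration only.
REGISTRY = "111122223333.dkr.ecr.eu-central-1.amazonaws.com"
pod_spec = {
    "spec": {
        "initContainers": [{"name": "init", "image": "busybox:1.36"}],
        "containers": [
            {"name": "app", "image": REGISTRY + "/tools/app:1.0"},
            {"name": "sidecar", "image": "quay.io/example/sidecar:2.1"},
        ],
    }
}
print(images_outside_registry(pod_spec, REGISTRY))
```

Because the walk is recursive, init containers and sidecars are covered as well as the main container.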

If you 100% need a workload that must access the internet, and there is no way around it, reconfigure the firewall to allow the necessary traffic to pass. For example, the add-ons Cluster Autoscaler and Karpenter require access to public AWS pricing endpoints. Whitelist them in the firewall.

Some workloads may need extra configuration to work in an air gapped environment. For example, Cluster Autoscaler must be explicitly configured to call regional AWS STS endpoints by setting the environment variable AWS_STS_REGIONAL_ENDPOINTS=regional; otherwise it cannot reach the STS VPC endpoint.

Internal workloads

The upside compared to external workloads is that you are in control. You can design the applications to live in an air gapped environment and change them as you see fit.

Of course, most of what has been said about external workloads applies to internal workloads as well. Their artifacts must be retrieved from private registries, and their code and configuration have to take the lack of internet access into account. Firewall exceptions may also be needed, depending on the application’s communication requirements.

Applications that make calls to AWS services must be configured to use regional endpoints. A good example is the STS endpoint in the AWS SDKs: older SDK versions default to the global STS endpoint instead of the regional one (see https://docs.aws.amazon.com/sdkref/latest/guide/feature-sts-regionalized-endpoints.html).
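
The effect of the setting can be modelled in a few lines. This is a deliberately simplified picture of the older SDK behaviour, assuming the commercial AWS partition; real SDKs handle more cases, and newer versions may already default to regional endpoints.

```python
import os


def sts_endpoint(region, env=None):
    """Return the STS endpoint this simplified model would call."""
    env = os.environ if env is None else env
    # Assumed default of older SDKs: "legacy", i.e. the global endpoint.
    mode = env.get("AWS_STS_REGIONAL_ENDPOINTS", "legacy").strip().lower()
    if mode == "regional":
        # Regional hostname, the one an STS VPC endpoint in that region serves.
        return f"https://sts.{region}.amazonaws.com"
    # Global endpoint, only reachable over the internet.
    return "https://sts.amazonaws.com"


print(sts_endpoint("eu-central-1", {}))
print(sts_endpoint("eu-central-1", {"AWS_STS_REGIONAL_ENDPOINTS": "regional"}))
```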

Further isolating the workload

Both external and internal workloads running on EKS can further be isolated and protected by implementing Kubernetes and EKS networking best practices and tools.

You can set up Kubernetes network policies to control which pods, namespaces or IP ranges the pod is allowed to communicate with. EKS also offers security groups for pods, allowing you to set ingress and egress rules for individual pods instead of having to rely just on the security groups assigned to their nodes/EC2 instances.
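
For instance, a namespace-wide egress restriction could look like the manifest generated below. The sketch emits JSON, which kubectl accepts just like YAML. The namespace name, the policy name and the CIDR are assumptions. Keep in mind that network policies only take effect when the cluster runs a component that enforces them, and that this policy also blocks DNS unless the DNS service falls within an allowed range.

```python
import json


def restrict_egress_policy(namespace, allowed_cidrs):
    """Deny all egress from the namespace except to the listed CIDR ranges."""
    egress_rules = []
    if allowed_cidrs:
        egress_rules.append(
            {"to": [{"ipBlock": {"cidr": cidr}} for cidr in allowed_cidrs]}
        )
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "restrict-egress", "namespace": namespace},
        "spec": {
            "podSelector": {},          # empty selector: all pods in the namespace
            "policyTypes": ["Egress"],  # only egress is restricted here
            "egress": egress_rules,     # an empty list denies all egress
        },
    }


# Hypothetical namespace and VPC range.
manifest = restrict_egress_policy("payments", ["10.20.0.0/16"])
print(json.dumps(manifest, indent=2))
```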

Various 3rd party tools such as Cilium or Calico can be deployed to monitor, secure and limit your EKS workload traffic.

Kubernetes networking and security best practices can be configured according to the CIS Kubernetes Benchmark, or its EKS spin-off, the CIS Amazon EKS Benchmark.

Conclusion

If you need to isolate an EKS workload, set up private communication in your AWS environments, cut all traffic to the internet, pull all artifacts from private repositories, install a private EKS cluster, make sure your workload doesn’t need anything from the internet, and you are good to go.

Depending on your circumstances, you may be forced to implement firewall exceptions, download artifacts from public sources, or allow your workload to call public endpoints.

The initial deployment is just the beginning of the grind. As you develop and update your workload, you will also have to update the air gapped setup itself, whether by changing firewall rules, dealing with the onslaught of new artifact vulnerabilities, or upgrading your tooling.

We trust this post gave you the tools and ideas you need to prepare yourself for deploying and running an air gapped EKS workload.