Building a Terraform EKS cluster requires more than a simple terraform apply. The critical work—the engineering that distinguishes a fragile, high-maintenance cluster from a resilient, production-ready one—is completed before writing any HCL.
Designing Your EKS Cluster Blueprint with Terraform
A comprehensive blueprint is the foundation of a successful EKS deployment. This initial design phase prevents costly refactoring and ensures the cluster is secure, scalable, and reproducible by default.

The first critical decision is defining the network topology. A well-designed Virtual Private Cloud (VPC) is the bedrock of a secure EKS cluster. This involves more than just selecting a CIDR block; it requires strategic network segmentation to achieve both security and high availability.
Architecting the Network Foundation
Your VPC architecture must isolate resources based on their required level of internet exposure. A battle-tested pattern includes:
- Public Subnets: Designated exclusively for internet-facing resources like Application Load Balancers (ALBs) and NAT Gateways. These subnets have a direct route to an Internet Gateway (IGW). No worker nodes or sensitive resources should reside here.
- Private Subnets: A protected zone for EKS worker nodes. These subnets have no direct route to the IGW, shielding container workloads from unsolicited inbound traffic.
- NAT Gateways: To enable private nodes to pull container images from public registries (e.g., Docker Hub, ECR Public), place NAT Gateways in the public subnets. This provides controlled, one-way outbound internet access without exposing nodes to inbound connections.
For high availability, the architecture must span multiple Availability Zones (AZs). Provision at least two pairs of public and private subnets, with each pair distributed across a different AZ. This is a non-negotiable requirement for surviving an AZ failure.
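This layout can be sketched with the community `terraform-aws-modules/vpc` module. The name, CIDR ranges, and AZs below are illustrative assumptions; adjust them to your own address plan:

```hcl
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0" # pin the module version

  name = "eks-vpc"     # illustrative name
  cidr = "10.0.0.0/16" # illustrative CIDR

  azs             = ["us-east-1a", "us-east-1b", "us-east-1c"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]

  enable_nat_gateway = true
  single_nat_gateway = false # one NAT Gateway per AZ for AZ-failure resilience

  # Subnet tags that EKS and the AWS Load Balancer Controller use for discovery
  public_subnet_tags  = { "kubernetes.io/role/elb" = 1 }
  private_subnet_tags = { "kubernetes.io/role/internal-elb" = 1 }
}
```

Setting `single_nat_gateway = false` trades higher cost for per-AZ NAT redundancy; a single gateway is a common cost-saving compromise in non-production environments.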
Defining Critical IAM Roles and Policies
Misconfigured Identity and Access Management (IAM) is a primary source of failure in EKS clusters, leading to issues like nodes failing to join or pods being denied access to AWS services.
Define the necessary IAM roles as code within Terraform to establish a declarative and auditable security posture. The minimum required roles are:
- EKS Control Plane Role: Grants the EKS service permissions to manage AWS resources on your behalf, such as creating network interfaces (ENIs) that connect the control plane to your VPC.
- EKS Node Group Role: Attached to the EC2 worker nodes. It requires essential AWS-managed policies like `AmazonEKSWorkerNodePolicy`, `AmazonEC2ContainerRegistryReadOnly`, and `AmazonEKS_CNI_Policy` to allow nodes to register with the control plane, pull images, and manage pod networking.
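As a sketch, the node group role and its three policy attachments can be declared directly (the role name is an illustrative assumption; the EKS module can also create this role for you):

```hcl
# IAM role that EC2 worker nodes assume
resource "aws_iam_role" "eks_node_group" {
  name = "eks-node-group-role" # illustrative name

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "ec2.amazonaws.com" }
    }]
  })
}

# Attach the three required AWS-managed policies
resource "aws_iam_role_policy_attachment" "node" {
  for_each = toset([
    "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy",
    "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly",
    "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy",
  ])

  role       = aws_iam_role.eks_node_group.name
  policy_arn = each.value
}
```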
Managing these roles and policies as code is superior to manual configuration in the AWS console, which inevitably leads to configuration drift and security vulnerabilities. This Infrastructure as Code (IaC) approach ensures a consistent and auditable security posture.
Choosing the Right Terraform Module
Leveraging a community-vetted Terraform module accelerates development and incorporates best practices. The two most prominent choices represent different architectural philosophies:
| Module Approach | Key Characteristic | Best For |
|---|---|---|
| `terraform-aws-modules/eks` | Flexibility and Control | Teams requiring granular control over every cluster component and who are prepared to manage a comprehensive set of configuration inputs. |
| CloudPosse Modules | Opinionated and Convention-Based | Teams prioritizing rapid deployment and a convention-over-configuration model with pre-configured best practices for a turnkey solution. |
The official terraform-aws-modules/eks module offers extensive configurability at the cost of a steeper learning curve. In contrast, modules from providers like CloudPosse make opinionated design choices about networking and security to deliver a faster path to a production-ready cluster. The selection depends on team expertise and organizational requirements.
Provisioning the Core EKS Control Plane
With the blueprint finalized, the next step is to provision the EKS control plane. This process must prioritize stability, security, and team collaboration from the outset, beginning with a remote backend for Terraform state.
Setting Up a Remote Backend and State Locking
Do not store Terraform state files locally for production infrastructure. Local state is a single point of failure that risks making your infrastructure unmanageable if the file is lost or corrupted.
For AWS, the standard for remote state management is an S3 bucket for state file storage and a DynamoDB table for state locking. The lock is a critical mechanism that prevents concurrent terraform apply operations from corrupting the state file.
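The S3 bucket and DynamoDB table must exist before `terraform init` can use them, so they are typically bootstrapped once, manually or in a separate root module. A minimal sketch, reusing this guide's illustrative names:

```hcl
# State bucket; enable versioning so a corrupted state file can be rolled back
resource "aws_s3_bucket" "tf_state" {
  bucket = "your-company-terraform-eks-state"
}

resource "aws_s3_bucket_versioning" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Lock table; the S3 backend requires a string hash key named "LockID"
resource "aws_dynamodb_table" "tf_lock" {
  name         = "your-company-terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```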
Define this configuration in your root module, typically in a backend.tf file:
```hcl
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }

  backend "s3" {
    bucket         = "your-company-terraform-eks-state"
    key            = "prod/eks/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "your-company-terraform-state-lock"
    encrypt        = true
  }
}
```
This configuration instructs Terraform where to store its state. The encrypt = true parameter is essential; it ensures the state file, which may contain sensitive data, is encrypted at rest in S3.
Instantiating the EKS Cluster Module
With the backend configured, you can instantiate the EKS module. Using a proven module like terraform-aws-modules/eks abstracts away the complexity of provisioning and integrates best practices.
The module requires inputs such as the VPC and subnet IDs from your network configuration and the ARN of the EKS control plane IAM role. This is also where you configure core cluster parameters, including the Kubernetes version. A standard configuration enables both public and private API server endpoints, providing administrative access via kubectl from the internet while ensuring node-to-control-plane communication remains within the VPC. For more details on this integration, refer to our guide on using Kubernetes and Terraform.
Community solutions like Amazon EKS Blueprints for Terraform have significantly streamlined this process. Since 2021, they have helped AWS customers and partners reduce EKS setup time from months to days, making the Terraform-managed EKS cluster a de facto standard. Teams adopting these blueprints commonly report faster CI/CD pipelines and lower costs thanks to optimized add-on management.
Your main module block will reference outputs from your network and IAM modules:
```hcl
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "20.8.4" # ALWAYS pin module versions

  cluster_name    = "my-production-cluster"
  cluster_version = "1.29"

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  cluster_endpoint_public_access  = true
  cluster_endpoint_private_access = true

  eks_managed_node_groups = {
    # Node group configurations defined here
  }

  # ... other configurations
}
```
Technical Best Practice: Always pin the version of your Terraform modules and providers. Using a floating version like `latest` can introduce breaking changes during a routine `terraform init`, leading to unexpected and potentially destructive plans.
After executing terraform apply, the module provisions the control plane. Recent versions of the module no longer emit a kubeconfig directly; instead, generate one with the AWS CLI (`aws eks update-kubeconfig --name my-production-cluster --region us-east-1`) to gain immediate kubectl access to the newly created EKS cluster.
Configuring Node Groups and Essential Add-ons
A live EKS control plane requires a data plane—the worker nodes that execute application workloads. Your choice of compute layer directly impacts cost, performance, and operational overhead.
This is a key decision point in the process. The path you take for team collaboration and state management often points you toward using certain tools and best practices, as this quick decision tree shows.

As you can see, thinking about how your team will work together pushes you toward remote state backends and proven Terraform modules right from the start.
Choosing Your Compute Layer
EKS offers three primary compute options, each catering to different requirements for control, management, and cost.
Amazon EKS Managed Node Groups: The default choice for most use cases, providing a balance of control and automation. AWS manages the node lifecycle, including patching, updates, and graceful termination. You retain control over instance types, scaling policies, and launch templates.
Self-Managed Node Groups: For scenarios requiring maximum control. This option is necessary when using custom AMIs, executing complex bootstrap scripts, or adhering to strict security hardening standards not supported by managed groups. The trade-off is that you assume full responsibility for the entire node lifecycle.
AWS Fargate: A serverless compute engine that abstracts away the underlying nodes entirely. You define pod specifications (vCPU, memory), and Fargate provisions the necessary compute. It is an excellent choice for microservices, event-driven applications, and workloads with unpredictable scaling patterns.
EKS Node Group Comparison
This table provides a concise comparison of the three compute options:
| Feature | Managed Node Groups | Self-Managed Node Groups | AWS Fargate |
|---|---|---|---|
| Management | AWS-managed lifecycle | User-managed lifecycle | Fully serverless |
| Customization | Moderate (AMIs, launch templates) | High (Full EC2 control) | Low (Pod-level only) |
| Best For | General-purpose workloads | Custom security/OS needs | Serverless, bursty apps |
| Pricing | On-Demand, Spot, Savings Plans | On-Demand, Spot, Savings Plans | Per vCPU/memory per second |
The fundamental trade-off is between control and convenience. Increased control necessitates greater operational responsibility.
Implementing a Hybrid Node Strategy
You are not limited to a single compute type. A powerful cost-optimization strategy involves mixing different node types within the same cluster.
For instance, deploy critical, stateful applications on a reliable On-Demand Managed Node Group. For stateless, fault-tolerant workloads like batch processing, create a separate node group that utilizes Spot Instances. Spot can reduce EC2 costs by up to 90%, but instances can be reclaimed with a two-minute notice. This hybrid model provides stability for core services while achieving significant cost savings for eligible workloads.
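A sketch of this hybrid layout via the module's `eks_managed_node_groups` input (group names, sizes, and instance types are illustrative):

```hcl
eks_managed_node_groups = {
  # Stable On-Demand capacity for critical, stateful services
  core = {
    capacity_type  = "ON_DEMAND"
    instance_types = ["m5.large"]
    min_size       = 2
    desired_size   = 2
    max_size       = 4
  }

  # Interruptible Spot capacity for stateless, fault-tolerant work
  batch = {
    capacity_type = "SPOT"
    # Diversify instance types to lower the chance of simultaneous reclaims
    instance_types = ["m5.large", "m5a.large", "m4.large"]
    min_size       = 0
    desired_size   = 3
    max_size       = 10

    labels = { workload = "batch" }
    taints = {
      spot = { key = "spot", value = "true", effect = "NO_SCHEDULE" }
    }
  }
}
```

The taint ensures only workloads that explicitly tolerate Spot interruption are scheduled onto the `batch` group.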
Deploying Essential Add-ons with Terraform
A new EKS cluster is incomplete without essential services for networking, storage, and service discovery. Managing these components as code using Terraform is non-negotiable for a reliable and reproducible environment. The kubernetes and helm providers for Terraform are indispensable for this task.
Many advanced modules integrate this functionality. For example, CloudPosse's terraform-aws-eks-cluster component has been exercised across many production EKS deployments and manages the entire stack, from nodes to critical add-ons. Teams that fully automate their cluster deployments typically see faster release cycles and a lower total cost of ownership.
The minimum required add-ons include:
- AWS VPC CNI Plugin: The core networking component that assigns VPC IP addresses to pods, enabling native communication with each other and other AWS services.
- Amazon EBS CSI Driver: Enables stateful applications to dynamically provision and attach persistent storage using Amazon EBS volumes via the PersistentVolumeClaim (PVC) interface.
- CoreDNS: The cluster's internal DNS service. It facilitates service discovery by allowing applications to resolve other services using stable DNS names instead of ephemeral pod IP addresses.
By defining these add-ons as `helm_release` or `kubernetes_manifest` resources in Terraform, you ensure that every cluster instance (development, staging, production) is an exact, version-controlled replica. This practice eliminates configuration drift and makes the entire EKS stack auditable.
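As a sketch, a pinned `helm_release` for the EBS CSI driver might look like the following, with the `helm` provider wired to the EKS module's outputs (the chart version is an assumption; check the upstream repository for the current release):

```hcl
provider "helm" {
  kubernetes {
    host                   = module.eks.cluster_endpoint
    cluster_ca_certificate = base64decode(module.eks.cluster_certificate_authority_data)

    # Authenticate with short-lived tokens from the AWS CLI
    exec {
      api_version = "client.authentication.k8s.io/v1beta1"
      command     = "aws"
      args        = ["eks", "get-token", "--cluster-name", module.eks.cluster_name]
    }
  }
}

resource "helm_release" "ebs_csi_driver" {
  name       = "aws-ebs-csi-driver"
  repository = "https://kubernetes-sigs.github.io/aws-ebs-csi-driver"
  chart      = "aws-ebs-csi-driver"
  namespace  = "kube-system"
  version    = "2.30.0" # pin chart versions, just like module versions
}
```

Alternatively, VPC CNI, CoreDNS, and the EBS CSI driver are available as EKS managed add-ons via the `aws_eks_addon` resource or the module's `cluster_addons` input.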
Bolting Down and Lighting Up Your Cluster
With the data plane operational, the next phase focuses on security and observability. A new Terraform-managed EKS cluster without robust security and monitoring is an opaque system vulnerable to misconfigurations and threats. This stage transforms the cluster into a transparent and secure platform.

The cornerstone of EKS security is the integration between AWS IAM and Kubernetes Role-Based Access Control (RBAC). This integration allows you to enforce the principle of least privilege for users, service accounts, and applications.
Taming Access with IAM and RBAC
By default, only the IAM principal (user or role) that created the EKS cluster has administrative access. Historically, access for other principals was granted by editing the aws-auth ConfigMap in the kube-system namespace; manual management of this ConfigMap is error-prone and leads to security vulnerabilities.
A declarative approach is to manage this mapping within Terraform. The terraform-aws-modules/eks module provides a structured aws_auth_roles input for this purpose (in module v20+, this moved to a dedicated aws-auth submodule, and EKS access entries are the recommended replacement). It allows you to map IAM roles to Kubernetes user groups, such as the built-in system:masters group or more restrictive custom groups.
Here is an example of granting cluster access to a DevOps team's IAM role:
```hcl
aws_auth_roles = [
  {
    rolearn  = "arn:aws:iam::123456789012:role/DevOpsTeamRole"
    username = "devops:{{SessionName}}"
    groups = [
      "system:masters" # Or a more restrictive custom group
    ]
  }
]
```
With this configuration, any user who assumes the DevOpsTeamRole can authenticate to the cluster using kubectl and will be granted the permissions associated with the system:masters group.
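If you map the role to a custom group (say, `devops`) instead of `system:masters`, you then grant that group permissions with Kubernetes RBAC. A sketch using the Terraform `kubernetes` provider, binding the group to the built-in read-only `view` ClusterRole (the group name is an assumption and must match your IAM-to-group mapping):

```hcl
resource "kubernetes_cluster_role_binding" "devops_view" {
  metadata {
    name = "devops-view"
  }

  role_ref {
    api_group = "rbac.authorization.k8s.io"
    kind      = "ClusterRole"
    name      = "view" # built-in read-only ClusterRole
  }

  subject {
    api_group = "rbac.authorization.k8s.io"
    kind      = "Group"
    name      = "devops" # must match the group in your auth mapping
  }
}
```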
After initial setup, performing a cloud security assessment is recommended to identify and remediate any potential misconfigurations.
Assembling Your Observability Stack
An observability stack is an essential toolset for debugging, performance tuning, and threat detection. It is built upon the "three pillars" of observability: metrics, logs, and traces.
Your strategy should include:
- Metrics Collection: Gathering time-series data from the control plane, nodes, and applications.
- Log Aggregation: Centralizing logs from all containers and system components.
- Visualization: Transforming raw data into actionable dashboards.
For metrics, the de facto open-source standard is Prometheus. It can be deployed via its Helm chart using the Terraform Helm provider. For a detailed walkthrough, see our guide on integrating Prometheus with Kubernetes.
A key insight: Avoid building a comprehensive observability system from the start. Adopt an iterative approach. Begin by collecting metrics from the control plane and nodes. As new services are deployed, instrument them with application-level metrics. This incremental strategy delivers value faster and is more manageable.
Wiring Up Logging and Metrics Pipelines
We will use Terraform to deploy the necessary agents to the cluster. For logging, Fluent Bit is an excellent choice due to its low resource footprint and high performance. Deploy it as a DaemonSet to ensure it runs on every node, collecting container logs and forwarding them to a backend like Amazon CloudWatch Logs.
For metrics, while Prometheus is the standard, managing its storage and scalability can be operationally intensive. AWS Managed Service for Prometheus (AMP) offloads this burden. You can use Terraform to provision an AMP workspace and configure an in-cluster Prometheus server to remote-write all its collected metrics to AMP for long-term storage and querying.
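Provisioning the AMP side of this pipeline is a single resource; the workspace alias is illustrative:

```hcl
resource "aws_prometheus_workspace" "metrics" {
  alias = "eks-production" # illustrative alias
}

output "amp_remote_write_url" {
  # The remote_write endpoint is the workspace endpoint plus api/v1/remote_write
  value = "${aws_prometheus_workspace.metrics.prometheus_endpoint}api/v1/remote_write"
}
```

Point your in-cluster Prometheus server's `remote_write` configuration at this URL, signing requests with SigV4.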
The dominance of this automated approach is reflected in market trends. HashiCorp reports billions of cumulative downloads for the Terraform AWS provider, demonstrating its ubiquity, and downloads of EKS-related modules have grown rapidly alongside it, with solutions like AWS EKS Blueprints at the forefront. This is no longer a niche practice; it is a foundational skill. You can read more about this trend at HashiCorp's blog.
Automating Deployments and Managing the Cluster Lifecycle
The primary value of Infrastructure as Code extends beyond initial provisioning. It lies in creating a hands-off, reproducible system for managing the cluster's entire lifecycle.
This involves integrating your Terraform code into a CI/CD pipeline, establishing a Git-driven workflow where every infrastructure change—from a Kubernetes version upgrade to a node group modification—is managed through a pull request. This provides a transparent, auditable history of your production environment. Mastering the principles of continuous deployment is essential.
Building a CI/CD Pipeline with GitHub Actions
GitHub Actions is an ideal tool for this, as it co-locates your pipeline definition with your infrastructure code. A workflow can be configured to automatically execute terraform plan on every pull request and post the output as a comment, providing an immediate impact analysis.
The following is a functional workflow file (.github/workflows/terraform.yml) for this purpose:
```yaml
name: 'Terraform EKS Cluster CI/CD'

on:
  push:
    branches:
      - main
  pull_request:

jobs:
  terraform:
    name: 'Terraform'
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v3

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v2
        with:
          terraform_version: 1.8.0

      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1

      - name: Terraform Init
        run: terraform init -backend-config="bucket=your-tf-state-bucket" -backend-config="key=eks/${{ github.ref_name }}/terraform.tfstate" -backend-config="region=us-east-1"

      - name: Terraform Plan
        if: github.event_name == 'pull_request'
        run: terraform plan -no-color

      - name: Terraform Apply
        if: github.ref == 'refs/heads/main' && github.event_name == 'push'
        run: terraform apply -auto-approve
```
This workflow triggers on pull requests and pushes to main. It checks out the code, configures AWS credentials from GitHub secrets, and initializes Terraform. Crucially, it only runs terraform plan for pull requests and defers terraform apply until the changes are merged into the main branch.
Managing Multiple Environments
Most organizations operate multiple environments (e.g., development, staging, production). Duplicating Terraform code for each environment is inefficient and error-prone.
Terraform workspaces are the solution.
Workspaces enable you to use a single set of configuration files to manage multiple, distinct state files. By creating dev, staging, and prod workspaces, Terraform will maintain a separate terraform.tfstate file for each in your S3 backend.
The terraform.workspace variable can then be used to parameterize your code for each environment, such as using smaller instance types in development or increasing node counts in production.
```hcl
# locals.tf
locals {
  instance_types = {
    dev  = "t3.medium"
    prod = "m5.large"
  }
}

# main.tf
module "eks" {
  # ...
  eks_managed_node_groups = {
    general = {
      instance_types = [local.instance_types[terraform.workspace]]
      # ... other node group settings
    }
  }
}
```
This technique promotes a DRY (Don't Repeat Yourself) codebase while providing the flexibility required for multi-environment management.
Executing Zero-Downtime EKS Upgrades
Kubernetes version upgrades are an inevitable operational task. Using Terraform enables a controlled, zero-downtime upgrade process.
The upgrade is a two-step procedure:
1. Upgrade the Control Plane: Increment the `cluster_version` argument in your EKS module configuration and run `terraform apply`. AWS will perform an in-place upgrade of the control plane components with no impact on running workloads.
2. Rotate the Worker Nodes: After the control plane upgrade is complete, the worker nodes are still running the old version. For Managed Node Groups, Terraform can orchestrate a rolling update. It will provision new nodes with the updated Kubernetes version, then safely cordon, drain, and terminate the old nodes.
A common failure pattern is attempting to upgrade the control plane and nodes simultaneously. This can cause nodes to fail registration with the cluster. Always perform the upgrade in two distinct phases: control plane first, then nodes.
Advanced Lifecycle Management
To achieve full automation, enhance your CI/CD pipeline with these advanced practices:
- Drift Detection: Configuration drift occurs when manual changes are made to the infrastructure, causing it to deviate from the code. Schedule a daily `terraform plan` job in your CI/CD pipeline and configure alerts to notify you of any detected drift. This serves as a safety net against out-of-band modifications.
- Cost Analysis: Integrate a tool like Infracost into your pipeline. It analyzes the `terraform plan` and posts a comment on pull requests detailing the cost impact of the proposed changes. This makes cost a visible and reviewable part of the development lifecycle, preventing budget overruns.
Common Questions and Roadblocks
Even with a robust plan, managing a Terraform EKS cluster presents challenges. Here are technical answers to frequently encountered issues.
How Do I Handle Kubernetes Secrets in a Git Repository?
Never commit plain-text secrets to a Git repository. This is a critical security vulnerability. The correct practice is to store sensitive data externally in a dedicated secrets management system.
A robust solution is to use AWS Secrets Manager. Pods can then fetch these secrets at runtime using the AWS Secrets & Configuration Provider (ASCP). This controller projects secrets from Secrets Manager into the pod as mounted files or environment variables.
This decouples the secret lifecycle from the infrastructure code lifecycle. Secrets require more frequent rotation and stricter access controls. Using AWS Secrets Manager maintains a declarative infrastructure while ensuring sensitive data remains secure.
Another common pattern, particularly in GitOps workflows, is Sealed Secrets. This involves encrypting Kubernetes Secret manifests with a public key before committing them to Git. A controller running in the cluster holds the corresponding private key and is the only entity capable of decrypting the secrets, ensuring the Git repository contains only encrypted data.
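The Sealed Secrets controller itself can be installed declaratively alongside the rest of the stack. A sketch as a pinned `helm_release`; the chart version below is an assumption, so check the upstream repository for the current release:

```hcl
resource "helm_release" "sealed_secrets" {
  name       = "sealed-secrets"
  repository = "https://bitnami-labs.github.io/sealed-secrets"
  chart      = "sealed-secrets"
  namespace  = "kube-system"
  version    = "2.15.0" # illustrative; pin to a real release
}
```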
What's the Best Way to Tackle EKS Version Upgrades?
A Kubernetes version upgrade with Terraform must be executed as a deliberate, two-phase process to avoid downtime and node registration failures.
First, upgrade the control plane. Increment the cluster_version in your EKS module configuration and apply the change. Wait for AWS to complete the background upgrade process, which does not affect your workloads.
Once the control plane upgrade is complete, rotate the worker nodes, which are still running the previous Kubernetes version. For Managed Node Groups, Terraform automates this via a rolling update, provisioning new nodes, and then gracefully cordoning, draining, and terminating the old ones.
Always validate this process in a pre-production environment. Before initiating any upgrade, thoroughly review the official Kubernetes release notes and the EKS-specific update guide for deprecated APIs or breaking changes that could impact your applications.
Why Does My Terraform Plan Want to Recreate the Whole Cluster?
A plan that proposes to destroy and recreate an EKS cluster is typically caused by changing a resource attribute that Terraform cannot update in-place, forcing a replacement.
The most common attributes that force a cluster replacement are:
- Changing the `name` of the `aws_eks_cluster` resource.
- Modifying the `vpc_id` to which the cluster is attached.
- Altering the `subnet_ids` after initial creation.
To prevent accidental destruction, add the prevent_destroy = true lifecycle block to your primary aws_eks_cluster resource definition. This acts as a safety mechanism, causing Terraform to error out if a plan includes the destruction of the cluster, forcing a manual review. Meticulously review every terraform plan in your CI pipeline before approving an apply to a production environment.
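Note that `lifecycle` blocks cannot be injected into resources owned by a third-party module, so this guard applies when you declare the cluster resource yourself:

```hcl
resource "aws_eks_cluster" "main" {
  # ... name, role_arn, vpc_config, and other arguments ...

  lifecycle {
    # terraform plan/apply will error instead of destroying this resource
    prevent_destroy = true
  }
}
```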
At OpsMoon, we specialize in cloud-native infrastructure engineering. Our team can design, build, and manage your Terraform EKS cluster, integrating best practices for security, automation, and cost optimization from day one. Accelerate your project and bypass the steep learning curve by partnering with us. Start with a free work planning session today at https://opsmoon.com.