Kubernetes Orchestration


Overview

Kubernetes is the dominant container orchestration platform — the infrastructure layer that takes containerised applications and runs them reliably at scale. Where Docker defines how individual containers are built and run, Kubernetes defines how containers are deployed across a cluster of machines, how they are kept running when failures occur, how traffic is routed to them, how they scale in response to load, and how deployments are rolled out without downtime.

The operational problems Kubernetes solves become apparent when running containerised applications in production at any meaningful scale. Running a single container on a single server is straightforward. Running multiple replicas of that container across multiple servers, ensuring that the correct number of healthy replicas is always running, routing traffic only to healthy replicas, replacing failed replicas automatically, deploying new versions without dropping requests, and managing the configuration and secrets that containers need — this is where Kubernetes provides the orchestration that manual container management cannot replicate.

Kubernetes is not appropriate for every application. The operational overhead of a Kubernetes cluster — whether self-managed or cloud-managed — is significant. For applications with modest traffic that can run on a single server or a simple VPS deployment, the complexity of Kubernetes is not justified by the availability and scaling benefits it provides. For applications that need high availability, horizontal scaling, zero-downtime deployments, and the operational infrastructure that large-scale production services require, Kubernetes is the platform that provides these capabilities with the ecosystem maturity that alternatives lack.

We work with Kubernetes for the client projects and infrastructure deployments where its capabilities are genuinely required — designing cluster architecture, writing production-quality manifests, configuring the operational infrastructure (ingress, certificate management, monitoring, logging), and integrating Kubernetes deployment into CI/CD pipelines.


What Kubernetes Orchestration Covers

Cluster architecture and setup. The cluster design decisions that determine how the Kubernetes environment is structured — the node pool configuration, the network plugin, the storage provisioner, and the cluster management approach.

Managed Kubernetes services: Amazon EKS, Google GKE, and Azure AKS provide managed control planes that eliminate the operational burden of running control plane nodes — the hosted control plane is maintained, patched, and scaled by the cloud provider. Node pool configuration for managed clusters: selecting instance types appropriate to the workload profile (compute-optimised for CPU-intensive workloads, memory-optimised for in-memory data processing, general purpose for mixed workloads), configuring node pool autoscaling that adds and removes nodes based on resource demand, and separating workloads across node pools where isolation or different resource profiles are required.

Network plugin selection: CNI (Container Network Interface) plugin configuration — Calico for network policy enforcement, AWS VPC CNI for native VPC networking on EKS, Cilium for eBPF-based networking with advanced observability. Network policy implementation that enforces which pods can communicate with which other pods — the Kubernetes network firewall that prevents lateral movement between application components.
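As a minimal sketch of the network-policy approach described above — the pod names, labels, and namespace here are illustrative — a NetworkPolicy that allows only the application's API pods to reach the database pods might look like:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-to-db
  namespace: production        # illustrative namespace
spec:
  podSelector:
    matchLabels:
      app: postgres            # the policy applies to the database pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: api         # only pods with this label may connect
      ports:
        - protocol: TCP
          port: 5432
```

Because Kubernetes network policies are default-deny once a pod is selected by any policy, applying this resource blocks every other ingress path to the database pods — which is precisely the lateral-movement prevention described above.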

Storage configuration: StorageClass definitions that map persistent volume claims to the underlying storage provisioner — AWS EBS for block storage, AWS EFS for shared file storage, Google Persistent Disk. Storage class performance tiers and reclaim policies.

Workload deployment. The Kubernetes workload resources that run containerised applications.

Deployment configuration: the Deployment manifest that specifies the container image, the number of replicas, the resource requests and limits, the environment variables, the volume mounts, and the update strategy. Rolling update configuration that replaces old pods incrementally — the maxUnavailable and maxSurge settings that control how many pods may be unavailable during the rollout and how many extra pods are created above the desired count. minReadySeconds (how long a new pod must remain ready before it counts as available) and progressDeadlineSeconds (how long a stalled rollout may run before Kubernetes marks it as failed).
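The rolling-update settings above can be sketched in a Deployment manifest — the application name, image, and registry are illustrative:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 4
  minReadySeconds: 10            # a new pod must stay ready this long before counting as available
  progressDeadlineSeconds: 600   # mark the rollout as failed if it stalls for 10 minutes
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1          # at most one pod below the desired count during the rollout
      maxSurge: 1                # at most one extra pod above the desired count
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: registry.example.com/web:1.4.2   # pinned tag, not latest
```

With these settings, a four-replica rollout proceeds roughly one pod at a time: one new pod is surged in, must pass its readiness checks and hold them for ten seconds, and only then is an old pod terminated.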

Pod specification: container image reference with the registry, repository, and tag. Resource requests (the minimum resources the scheduler requires to be available before placing the pod on a node) and resource limits (the maximum resources the container is allowed to consume). Liveness probes that detect and restart containers that are running but not healthy. Readiness probes that gate traffic routing — Kubernetes only routes traffic to pods that pass their readiness probe, preventing requests from reaching pods that are starting up or temporarily overloaded. Startup probes for applications with slow initialisation that would otherwise trigger liveness probe failures during startup.
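The resource and probe configuration described above can be sketched as a container spec — the endpoint paths, port, and resource figures are illustrative assumptions, not prescriptions:

```yaml
containers:
  - name: web
    image: registry.example.com/web:1.4.2
    resources:
      requests:
        cpu: 250m            # the scheduler reserves this much before placing the pod
        memory: 256Mi
      limits:
        cpu: "1"
        memory: 512Mi        # the container is OOM-killed if it exceeds this
    startupProbe:            # up to 30 × 5s for slow initialisation before liveness applies
      httpGet: { path: /healthz, port: 8080 }
      failureThreshold: 30
      periodSeconds: 5
    livenessProbe:           # restart the container if it stops responding
      httpGet: { path: /healthz, port: 8080 }
      periodSeconds: 10
    readinessProbe:          # remove the pod from Service endpoints while this fails
      httpGet: { path: /ready, port: 8080 }
      periodSeconds: 5
```

Keeping liveness and readiness on separate endpoints is a common design choice: a temporarily overloaded pod should fail readiness (stop receiving traffic) without failing liveness (being restarted).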

StatefulSet for stateful workloads: the Kubernetes workload type for applications that require stable network identities and persistent storage that follows the pod across rescheduling — databases, message brokers, and other stateful services. Ordered pod creation and deletion, stable DNS names for each pod, and PersistentVolumeClaim templates that provision storage for each pod instance.

DaemonSet for node-level workloads: the workload type that ensures one pod instance runs on every node — used for log collectors, metrics agents, and network proxies that need to run on every cluster node.

Jobs and CronJobs: batch workloads that run to completion rather than continuously. CronJob for scheduled batch processing — the database backup job, the nightly data processing job, the periodic cleanup task. Job configuration with retry limits and completion criteria.
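A nightly backup of the kind mentioned above might be sketched as follows — the schedule, job name, and backup image are illustrative:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-backup
spec:
  schedule: "0 2 * * *"            # every night at 02:00
  concurrencyPolicy: Forbid        # skip a run if the previous one is still going
  jobTemplate:
    spec:
      backoffLimit: 3              # retry a failed pod up to three times
      template:
        spec:
          restartPolicy: Never     # let the Job controller handle retries
          containers:
            - name: backup
              image: registry.example.com/db-backup:1.0   # assumed backup image
```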

Service discovery and load balancing. Kubernetes Services that provide stable network endpoints for pod sets.

ClusterIP services: internal cluster services accessible by DNS name from other pods within the cluster. The primary service type for inter-service communication — the database service that application pods access by service name rather than by pod IP. DNS-based service discovery: Kubernetes configures cluster DNS so that every service is accessible at service-name.namespace.svc.cluster.local.
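A minimal ClusterIP Service for the database example above — the service name, namespace, and labels are illustrative:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: postgres
  namespace: production
spec:
  type: ClusterIP          # the default; internal-only virtual IP
  selector:
    app: postgres          # traffic is routed to pods carrying this label
  ports:
    - port: 5432           # port the service exposes
      targetPort: 5432     # container port the traffic is forwarded to
```

Application pods in any namespace would then reach the database at postgres.production.svc.cluster.local (or simply postgres from within the same namespace), regardless of which node the database pod is scheduled on.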

NodePort and LoadBalancer services: services that expose workloads outside the cluster. LoadBalancer services on cloud providers provision a cloud load balancer (AWS ELB, Google Cloud Load Balancer) that routes external traffic to the pods.

Headless services: services without a ClusterIP that return DNS records for individual pod IPs — used with StatefulSets where clients need to address individual pod instances directly.

Ingress and external traffic management. The Ingress resource and Ingress controller that manage HTTP and HTTPS routing from outside the cluster to services inside.

Ingress controller deployment: NGINX Ingress Controller as the most widely deployed ingress implementation — the controller that watches Ingress resources and configures NGINX to route traffic according to the defined rules. The AWS Load Balancer Controller (formerly the ALB Ingress Controller) for EKS deployments that use an AWS Application Load Balancer as the ingress layer. Traefik for ingress with built-in Let's Encrypt integration.

Ingress rules: hostname-based routing that directs traffic for different domains to different backend services. Path-based routing that routes requests to different backends based on the URL path prefix. TLS termination at the ingress layer with certificates managed by cert-manager.
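Combining hostname routing, path routing, and TLS, an Ingress for the NGINX controller might be sketched as follows — the hostname, service names, ports, and issuer name are illustrative assumptions:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod   # assumed ClusterIssuer name
spec:
  ingressClassName: nginx
  tls:
    - hosts: [app.example.com]
      secretName: app-example-com-tls                  # cert-manager writes the certificate here
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /api                                 # longest-prefix match wins
            pathType: Prefix
            backend:
              service: { name: api, port: { number: 8080 } }
          - path: /
            pathType: Prefix
            backend:
              service: { name: web, port: { number: 80 } }
```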

Certificate management. cert-manager for automated TLS certificate provisioning and renewal.

cert-manager installation and configuration: the cert-manager deployment that watches Certificate and Issuer resources and automates certificate lifecycle management. ClusterIssuer configuration for Let's Encrypt — ACME HTTP-01 and DNS-01 challenge configuration for certificate issuance. Certificate renewal automation that renews certificates before they expire without manual intervention.

Certificate resources: Certificate manifests that request certificates for specific domains, referencing the appropriate Issuer or ClusterIssuer. Integration with Ingress resources through the cert-manager.io/cluster-issuer annotation that automatically provisions and manages certificates for Ingress TLS configuration.
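A Let's Encrypt ClusterIssuer of the kind described above might be sketched as follows — the issuer name, contact email, and ingress class are illustrative, and the solver syntax shown is the HTTP-01 variant:

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.com                 # expiry and account notices go here
    privateKeySecretRef:
      name: letsencrypt-account-key        # where the ACME account key is stored
    solvers:
      - http01:
          ingress:
            class: nginx                   # answer HTTP-01 challenges via the NGINX ingress
```

Once the ClusterIssuer exists, the cert-manager.io/cluster-issuer annotation on an Ingress is enough for cert-manager to create the Certificate, complete the challenge, and renew automatically before expiry.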

Configuration and secret management. ConfigMaps and Secrets for managing application configuration and sensitive values.

ConfigMap for non-sensitive configuration: application configuration files, environment variable collections, and other configuration data mounted into pods as files or injected as environment variables. ConfigMap immutability for configuration that should not change without pod restarts.

Secret management: Kubernetes Secrets for sensitive values — database passwords, API keys, TLS certificates. External Secrets Operator for integration with external secret stores — AWS Secrets Manager, HashiCorp Vault, Azure Key Vault — which keep secrets outside the cluster and synchronise them into Kubernetes Secrets. This external approach avoids storing sensitive values in version control or relying solely on plain Kubernetes Secrets, which are only base64-encoded rather than encrypted.
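As a sketch of the External Secrets Operator flow — the store name, secret paths, and keys are illustrative, and the API version shown (v1beta1) is the widely deployed one at the time of writing:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
spec:
  refreshInterval: 1h                  # re-sync from the external store hourly
  secretStoreRef:
    name: aws-secrets-manager          # a ClusterSecretStore defined separately
    kind: ClusterSecretStore
  target:
    name: db-credentials               # the Kubernetes Secret the operator creates
  data:
    - secretKey: password              # key in the resulting Kubernetes Secret
      remoteRef:
        key: prod/db                   # secret name in AWS Secrets Manager
        property: password             # field within that secret
```

The ExternalSecret manifest itself contains no sensitive values, so it can live in version control; only the synchronised Secret inside the cluster holds the actual password.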

Sealed Secrets for GitOps: encrypting Kubernetes Secrets into SealedSecret resources that can be safely committed to version control — the approach that enables GitOps workflows where all cluster configuration is in version control without exposing secret values.

Horizontal Pod Autoscaling. HPA resources that automatically scale the number of pod replicas based on observed metrics.

CPU and memory-based autoscaling: HPA configuration that scales replicas up when average CPU or memory utilisation exceeds the configured target and scales down when utilisation falls below it. Minimum and maximum replica counts that bound the autoscaling range. Scale-down stabilisation windows that prevent rapid oscillation between replica counts.
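The CPU-based scaling and stabilisation behaviour described above can be sketched with the autoscaling/v2 API — the target Deployment name and thresholds are illustrative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 3                        # never scale below three pods
  maxReplicas: 20                       # cap on scale-out
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70        # add pods above 70% average CPU
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # wait five minutes before scaling back down
```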

Custom metric autoscaling: HPA configuration based on application-specific metrics — queue depth, request latency, connections per second — using the custom metrics API and a metrics adapter (KEDA for event-driven autoscaling, Prometheus Adapter for Prometheus-based custom metrics). KEDA for scaling to zero — the event-driven autoscaling that scales deployments down to zero replicas when there is no work and back up when work arrives.

Vertical Pod Autoscaler: VPA recommendations for optimal resource requests and limits based on observed resource usage — the tooling that prevents resources being under-requested (causing resource contention) or over-requested (wasting cluster capacity).

Namespace and multi-tenancy. Kubernetes namespaces for workload isolation within a shared cluster.

Namespace structure: the namespace design that organises workloads by environment (development, staging, production), by team, or by application — with the RBAC policies that control which users and service accounts have access to which namespaces. Resource quotas per namespace that limit the total CPU, memory, and object counts that workloads in the namespace can consume.
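A ResourceQuota of the kind described above might look like this — the namespace and limits are illustrative:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: staging
spec:
  hard:
    requests.cpu: "20"       # total CPU requests across all pods in the namespace
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "50"               # object counts can be capped as well
```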

RBAC configuration: Role and ClusterRole definitions that grant specific permissions, RoleBinding and ClusterRoleBinding that assign roles to users and service accounts. The principle of least privilege applied to service account permissions — pods that do not need to interact with the Kubernetes API run with minimal RBAC permissions.
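The least-privilege pattern above can be sketched as a Role and RoleBinding pair — the role name, namespace, and service account are illustrative:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: deployment-reader
  namespace: staging
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch"]      # read-only; no create, update, or delete
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ci-deployment-reader
  namespace: staging
subjects:
  - kind: ServiceAccount
    name: ci-bot                         # assumed CI service account
    namespace: staging
roleRef:
  kind: Role
  name: deployment-reader
  apiGroup: rbac.authorization.k8s.io
```

A Role is namespace-scoped; granting the same permissions cluster-wide would use a ClusterRole and ClusterRoleBinding instead, which is exactly why the namespaced form is preferred wherever the narrower scope suffices.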

Monitoring and observability. Operational visibility into cluster and application health.

kube-state-metrics and metrics-server: the components that expose cluster-level metrics — pod status, resource utilisation, deployment health — to monitoring systems. Prometheus for metrics collection and alerting. Grafana dashboards for cluster and application metrics visualisation. Alertmanager for routing and managing Prometheus alerts.

Log aggregation: the log collection pipeline that captures pod logs and centralises them for search and analysis. Fluent Bit or Fluentd as the log collector DaemonSet that captures logs from container stdout and ships them to a centralised log store — Elasticsearch, AWS CloudWatch, Google Cloud Logging, or Loki.

Distributed tracing: OpenTelemetry collector deployment for distributed tracing data collection and export to Jaeger, Tempo, or commercial APM platforms.

Helm for application packaging. Helm charts as the packaging format for Kubernetes applications — the template system that parameterises Kubernetes manifests for deployment across multiple environments with different configurations.

Chart development: writing Helm charts that encapsulate all the Kubernetes resources an application needs — Deployments, Services, Ingress, ConfigMaps, HPA — with values files that expose the configuration that varies between environments. Named templates for reducing manifest duplication within a chart.

values.yaml management: the per-environment values files that customise the chart for development, staging, and production deployments — different image tags, different resource limits, different replica counts, different ingress hostnames. Helm secrets for managing encrypted values files.
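A per-environment values file of the kind described above might be sketched as follows — the keys assume a chart that exposes these values, and all names and figures are illustrative:

```yaml
# values.production.yaml — overrides layered over the chart's default values.yaml
image:
  tag: "1.4.2"              # pinned release tag, never latest in production
replicaCount: 6
resources:
  requests: { cpu: 500m, memory: 512Mi }
  limits:   { cpu: "1",  memory: 1Gi }
ingress:
  host: app.example.com
autoscaling:
  enabled: true
  minReplicas: 6
  maxReplicas: 24
```

Deployment then selects the environment by file: helm upgrade --install web ./chart -f values.production.yaml, with the staging file substituted for staging releases.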

Helm release management: the Helm release lifecycle — install, upgrade, rollback — and the release history that records every deployed version. Helm chart repositories for storing and distributing charts within an organisation.

GitOps with ArgoCD. ArgoCD for GitOps-based continuous deployment — the declarative deployment model where the desired cluster state is defined in a Git repository and ArgoCD continuously reconciles the actual cluster state to match.

ArgoCD application configuration: Application resources that define the source repository, the target cluster and namespace, and the sync policy. Automated sync that applies changes from the repository to the cluster without manual intervention. Health status tracking that reports whether deployed resources are healthy and in sync with the repository.
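An ArgoCD Application expressing this reconciliation loop might be sketched as follows — the repository URL, path, and namespaces are illustrative:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: web
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/deploy-config.git
    targetRevision: main
    path: apps/web                             # directory of manifests or a Helm chart
  destination:
    server: https://kubernetes.default.svc     # the cluster ArgoCD itself runs in
    namespace: production
  syncPolicy:
    automated:
      prune: true        # delete cluster resources removed from the repository
      selfHeal: true     # revert manual drift back to the repository state
```

With prune and selfHeal enabled, the Git repository is the single source of truth: changes merged to main roll out automatically, and out-of-band kubectl edits are reverted.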


Technologies Used

  • Kubernetes — container orchestration platform
  • Amazon EKS / Google GKE / Azure AKS — managed Kubernetes services
  • Helm — Kubernetes package manager and deployment templating
  • Kustomize — Kubernetes manifest customisation without templating
  • ArgoCD — GitOps continuous deployment
  • NGINX Ingress Controller — HTTP/HTTPS ingress and routing
  • cert-manager — automated TLS certificate management
  • External Secrets Operator — external secret store integration
  • KEDA — event-driven and scale-to-zero autoscaling
  • Prometheus / Grafana — metrics collection, alerting, and visualisation
  • Fluent Bit / Fluentd — log collection and aggregation
  • OpenTelemetry — distributed tracing collection
  • Calico / Cilium — network policy enforcement
  • GitHub Actions / GitLab CI/CD — CI/CD pipeline integration for Kubernetes deployment

Kubernetes When It Is Justified

Kubernetes is genuinely powerful infrastructure — and genuinely complex infrastructure. The value it provides in high availability, horizontal scaling, zero-downtime deployments, and operational consistency is real, but so is the operational overhead it introduces. Kubernetes makes sense when the application's availability and scaling requirements justify that overhead. For a business-critical application that needs to stay available across server failures, that needs to scale horizontally under variable load, and that deploys frequently enough to benefit from zero-downtime rolling deployments — Kubernetes provides infrastructure that simpler approaches cannot match.

For a modest internal tool or a low-traffic application that runs comfortably on a single server — Kubernetes is overhead without corresponding benefit, and a well-configured VPS deployment with Docker and systemd is the more appropriate choice.


Production Kubernetes Infrastructure That Operates Reliably

Kubernetes clusters configured to production standards — with correct resource limits, working health probes, automated certificate renewal, properly scoped RBAC, and the monitoring that surfaces problems before they become outages — provide the reliable, scalable infrastructure that production applications require.