Kubernetes Infrastructure


2025 – 2026 Infrastructure

Overview

This project is a self-designed, self-built Kubernetes homelab running across three desktop PCs and four laptops, interconnected through a single managed switch, a dedicated router handling inter-VLAN routing and NAT, and a wireless radio providing a bridged 5 GHz backhaul for the laptop nodes. The goal was to replicate a production-grade container orchestration environment entirely on consumer hardware: no cloud, no rented rack space, no managed services.

The cluster runs kubeadm-bootstrapped Kubernetes on Ubuntu Server VMs provisioned inside Proxmox, with Calico as the CNI for pod networking and network policy enforcement, MetalLB for bare-metal LoadBalancer services, Longhorn for distributed persistent storage across nodes, and a full observability stack built on Prometheus + Grafana + Loki. Deployments are automated through Bash scripts and Docker Compose for auxiliary services, with GitOps workflows managing application manifests.

Every component — from the physical cable runs and VLAN trunking to the Kubernetes RBAC policies and Grafana dashboards — was planned, configured, and documented by hand. The cluster currently hosts all self-hosted services (Nextcloud, AI model, DNS, email, wiki, search engine, photo server) in production.

Physical Topology

The entire infrastructure runs from a single physical location. Three desktop PCs serve as the primary compute and storage nodes, while four laptops act as lightweight worker nodes and provide redundancy. All wired nodes connect to a managed switch with 802.1Q VLAN trunking. The router handles inter-VLAN routing, DHCP reservation, NAT to the public internet, and port forwarding for externally-facing services. A wireless radio bridges the laptop nodes into the cluster network over a dedicated 5 GHz backhaul link.

graph TD
  subgraph Internet["Public Internet"]
    ISP["ISP Uplink"]
  end
  subgraph Edge["Edge Layer"]
    RTR["Router\nNAT / Firewall / DHCP\nInter-VLAN Routing\nPort Forwarding"]
  end
  subgraph Network["Network Fabric"]
    SW["Managed Switch\n802.1Q VLAN Trunking\nVLAN 10: Management\nVLAN 20: Cluster\nVLAN 30: Storage"]
    RADIO["Wireless Radio\n5 GHz Backhaul\nBridge Mode"]
  end
  subgraph DesktopNodes["Desktop Nodes (Wired)"]
    PC1["PC-01 • Control Plane\n8C/32GB/512GB NVMe\nUbuntu Server 24.04\netcd + API Server"]
    PC2["PC-02 • Worker Node\n6C/16GB/1TB SSD\nUbuntu Server 24.04\nLonghorn Storage"]
    PC3["PC-03 • Worker Node\n6C/16GB/1TB SSD\nUbuntu Server 24.04\nLonghorn Storage"]
  end
  subgraph LaptopNodes["Laptop Nodes (Wireless)"]
    LP1["Laptop-01 • Worker\n4C/8GB/256GB"]
    LP2["Laptop-02 • Worker\n4C/8GB/256GB"]
    LP3["Laptop-03 • Worker\n4C/8GB/512GB"]
    LP4["Laptop-04 • Worker\n4C/16GB/512GB"]
  end
  ISP -->|"WAN"| RTR
  RTR -->|"Trunk"| SW
  SW -->|"Eth VLAN 20"| PC1
  SW -->|"Eth VLAN 20"| PC2
  SW -->|"Eth VLAN 20"| PC3
  SW -->|"Eth VLAN 20"| RADIO
  RADIO -->|"5 GHz Bridge"| LP1
  RADIO -->|"5 GHz Bridge"| LP2
  RADIO -->|"5 GHz Bridge"| LP3
  RADIO -->|"5 GHz Bridge"| LP4
  style ISP fill:#1a1a2e,stroke:#ff4444,color:#e0e0e0
  style RTR fill:#1a1a2e,stroke:#ff8800,color:#e0e0e0
  style SW fill:#1a1a2e,stroke:#00ff88,color:#e0e0e0
  style RADIO fill:#1a1a2e,stroke:#00ff88,color:#e0e0e0
  style PC1 fill:#1a1a2e,stroke:#00d4ff,color:#e0e0e0
  style PC2 fill:#1a1a2e,stroke:#00d4ff,color:#e0e0e0
  style PC3 fill:#1a1a2e,stroke:#00d4ff,color:#e0e0e0
  style LP1 fill:#1a1a2e,stroke:#aa88ff,color:#e0e0e0
  style LP2 fill:#1a1a2e,stroke:#aa88ff,color:#e0e0e0
  style LP3 fill:#1a1a2e,stroke:#aa88ff,color:#e0e0e0
  style LP4 fill:#1a1a2e,stroke:#aa88ff,color:#e0e0e0

Network Architecture

Network segmentation is enforced at the switch level using three VLANs. The router performs inter-VLAN routing, with firewall rules restricting lateral movement between segments. This mirrors enterprise network design, where management, application, and storage traffic are isolated to limit the blast radius of a compromise.

VLAN	ID	Subnet	Purpose
Management	10	10.10.10.0/24	SSH access, Proxmox UI, router admin, switch management, IPMI/BMC
Cluster	20	10.20.20.0/24	Kubernetes API server, pod-to-pod (Calico overlay), service mesh, MetalLB external IPs
Storage	30	10.30.30.0/24	Longhorn replication, NFS mounts, ZFS snapshot sync, backup traffic

graph LR
  subgraph VLAN10["VLAN 10: Management"]
    SSH["SSH Bastion"]
    PMX["Proxmox Web UI"]
    SWMGMT["Switch Mgmt"]
  end
  subgraph VLAN20["VLAN 20: Cluster"]
    API["K8s API Server :6443"]
    CALICO["Calico CNI\nBGP / VXLAN"]
    MLB["MetalLB\n10.20.20.200-250"]
    PODS["Pod Network\n192.168.0.0/16"]
  end
  subgraph VLAN30["VLAN 30: Storage"]
    LH["Longhorn\niSCSI Replication"]
    NFS["NFS Exports"]
    ZFS["ZFS Snapshots"]
  end
  VLAN10 ---|"Firewall Rules"| VLAN20
  VLAN20 ---|"Firewall Rules"| VLAN30
  style SSH fill:#1a1a2e,stroke:#ff8800,color:#e0e0e0
  style PMX fill:#1a1a2e,stroke:#ff8800,color:#e0e0e0
  style SWMGMT fill:#1a1a2e,stroke:#ff8800,color:#e0e0e0
  style API fill:#1a1a2e,stroke:#00d4ff,color:#e0e0e0
  style CALICO fill:#1a1a2e,stroke:#00d4ff,color:#e0e0e0
  style MLB fill:#1a1a2e,stroke:#00d4ff,color:#e0e0e0
  style PODS fill:#1a1a2e,stroke:#00d4ff,color:#e0e0e0
  style LH fill:#181818,stroke:#1e1e1e,color:#888
  style NFS fill:#181818,stroke:#1e1e1e,color:#888
  style ZFS fill:#181818,stroke:#1e1e1e,color:#888

Kubernetes Architecture

graph TD
  subgraph ControlPlane["Control Plane: PC-01"]
    ETCD["etcd\nCluster State Store"]
    APIS["kube-apiserver\n:6443"]
    SCHED["kube-scheduler"]
    CM["kube-controller-manager"]
  end
  subgraph Workers["Worker Nodes: PC-02, PC-03, LP-01..04"]
    KP["kubelet"]
    KPR["kube-proxy\niptables / IPVS"]
    CR["containerd\nContainer Runtime"]
  end
  subgraph Networking["Network Layer"]
    CAL["Calico CNI\nBGP Peering + NetworkPolicy"]
    MLBX["MetalLB\nL2 / ARP Mode\nExternal IP Pool"]
    ING["Nginx Ingress Controller\nTLS Termination"]
  end
  subgraph Storage["Storage Layer"]
    LHN["Longhorn\nDistributed Block Storage\n3x Replication"]
    PV["PersistentVolumes"]
    PVC["PersistentVolumeClaims"]
  end
  subgraph Observability["Observability Stack"]
    PROM["Prometheus\nMetrics Collection"]
    GRAF["Grafana\nDashboards + Alerts"]
    LOKI["Loki\nLog Aggregation"]
    PRMTL["Promtail\nLog Shipping"]
    NX["Node Exporter\nHardware Metrics"]
  end
  APIS --> ETCD
  SCHED --> APIS
  CM --> APIS
  KP -->|"watch/report"| APIS
  KP --> CR
  KPR --> CAL
  CAL --> MLBX
  MLBX --> ING
  LHN --> PV
  PV --> PVC
  NX -->|"metrics"| PROM
  KP -->|"metrics"| PROM
  PROM --> GRAF
  PRMTL -->|"logs"| LOKI
  LOKI --> GRAF
  style ETCD fill:#1a1a2e,stroke:#ff4444,color:#e0e0e0
  style APIS fill:#1a1a2e,stroke:#ff4444,color:#e0e0e0
  style SCHED fill:#1a1a2e,stroke:#ff4444,color:#e0e0e0
  style CM fill:#1a1a2e,stroke:#ff4444,color:#e0e0e0
  style KP fill:#1a1a2e,stroke:#00d4ff,color:#e0e0e0
  style KPR fill:#1a1a2e,stroke:#00d4ff,color:#e0e0e0
  style CR fill:#1a1a2e,stroke:#00d4ff,color:#e0e0e0
  style CAL fill:#1a1a2e,stroke:#00ff88,color:#e0e0e0
  style MLBX fill:#1a1a2e,stroke:#00ff88,color:#e0e0e0
  style ING fill:#1a1a2e,stroke:#00ff88,color:#e0e0e0
  style LHN fill:#181818,stroke:#aa88ff,color:#e0e0e0
  style PV fill:#181818,stroke:#aa88ff,color:#e0e0e0
  style PVC fill:#181818,stroke:#aa88ff,color:#e0e0e0
  style PROM fill:#1a1a2e,stroke:#ffcc00,color:#e0e0e0
  style GRAF fill:#1a1a2e,stroke:#ffcc00,color:#e0e0e0
  style LOKI fill:#1a1a2e,stroke:#ffcc00,color:#e0e0e0
  style PRMTL fill:#1a1a2e,stroke:#ffcc00,color:#e0e0e0
  style NX fill:#1a1a2e,stroke:#ffcc00,color:#e0e0e0

Tech Stack

Layer	Technology	Role
Hardware	3 × Desktop PC, 4 × Laptop	Bare-metal compute and storage nodes
Networking	Managed Switch (802.1Q), Router, 5 GHz Radio	VLAN segmentation, NAT, wireless bridge backhaul
Hypervisor	Proxmox VE 8.x	VM provisioning, snapshots, live migration
OS	Ubuntu Server 24.04 LTS	Minimal server images inside Proxmox VMs
Container Runtime	containerd	CRI-compliant runtime for Kubernetes
Orchestration	Kubernetes (kubeadm)	Cluster bootstrapping, upgrades, node management
CNI	Calico	Pod networking, BGP peering, NetworkPolicy enforcement
Load Balancer	MetalLB (L2 mode)	External IP allocation for bare-metal LoadBalancer services
Ingress	Nginx Ingress Controller	HTTP/HTTPS routing, TLS termination, path-based routing
Storage	Longhorn	Distributed block storage with 3x replication across nodes
TLS	Let’s Encrypt + cert-manager	Automated certificate provisioning and renewal
Monitoring	Prometheus + Node Exporter	Metrics collection from nodes, pods, and Kubernetes internals
Dashboards	Grafana	Visualization, alerting rules, and SLA tracking
Logging	Loki + Promtail	Centralized log aggregation and querying
Automation	Bash, Docker Compose	Cluster provisioning scripts, auxiliary service orchestration
DNS	Pi-hole + Unbound	Internal cluster DNS resolution and ad/threat blocking

Build Process

Hardware Preparation & Network Wiring

Each desktop PC and laptop was prepared with a clean BIOS/UEFI configuration, boot order set to PXE/USB, and hardware diagnostics run to verify RAM, disk health, and thermal performance. Cat6 Ethernet cables were run from each desktop to the managed switch. The switch was configured with three VLANs (10, 20, 30) and trunk ports for the router uplink. The wireless radio was mounted and configured in bridge mode on VLAN 20 to extend the cluster network to the laptop nodes over 5 GHz.

! Switch VLAN configuration (Cisco IOS-style CLI example; "!" starts a comment)
enable
configure terminal

vlan 10
  name MANAGEMENT
vlan 20
  name CLUSTER
vlan 30
  name STORAGE

! Trunk port to router
interface GigabitEthernet0/1
  switchport mode trunk
  switchport trunk allowed vlan 10,20,30

! Trunk ports for desktop nodes: the Proxmox bridge tags VM traffic for VLANs 20 and 30,
! so these ports must carry tagged frames rather than a single access VLAN
interface range GigabitEthernet0/2-4
  switchport mode trunk
  switchport trunk allowed vlan 10,20,30

! Trunk port to wireless radio
interface GigabitEthernet0/5
  switchport mode trunk
  switchport trunk allowed vlan 20

end

Proxmox Installation & VM Provisioning

Proxmox VE 8.x was installed on each desktop PC as the base hypervisor. Ubuntu Server 24.04 LTS VMs were created on each node with resource allocations matched to the hardware. VMs were configured with two virtual NICs: one on VLAN 20 (cluster traffic) and one on VLAN 30 (storage traffic). Cloud-init templates were used to standardize hostname, SSH keys, and network configuration across all VMs.

# Create a cloud-init template VM on Proxmox
qm create 9000 --name ubuntu-cloud --memory 4096 --cores 2 \
  --net0 virtio,bridge=vmbr0,tag=20 \
  --net1 virtio,bridge=vmbr0,tag=30 \
  --scsihw virtio-scsi-single

# Import Ubuntu cloud image
qm set 9000 --scsi0 local-lvm:0,import-from=/var/lib/vz/template/iso/ubuntu-24.04-server-cloudimg-amd64.img
qm set 9000 --ide2 local-lvm:cloudinit
qm set 9000 --boot order=scsi0
qm set 9000 --serial0 socket --vga serial0

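# Configure cloud-init defaults (changeme is a placeholder; set a real password or rely on the SSH keys)
qm set 9000 --ciuser taki --cipassword changeme \
  --sshkeys ~/.ssh/authorized_keys \
  --ipconfig0 ip=dhcp

# Convert the VM to a template, then clone it for each node
qm template 9000
for i in 1 2 3 4 5 6; do
  qm clone 9000 10${i} --name k8s-node-${i} --full
  # Static storage address per VM (example host octets .11-.16; adjust to your addressing plan).
  # No gateway on the storage NIC, so the default route stays on the VLAN 20 interface.
  qm set 10${i} --ipconfig1 ip=10.30.30.1${i}/24
  qm start 10${i}
done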

Kubernetes Cluster Bootstrap with kubeadm

All nodes were prepared with the Kubernetes prerequisites: swap disabled, kernel modules loaded (br_netfilter, overlay), sysctl parameters set for IP forwarding, and containerd installed as the CRI runtime. The control plane was initialized on PC-01, and worker nodes joined using the bootstrap token.

# Run on ALL nodes
cat <<EOF | sudo tee /etc/modules-load.d/k8s.conf
overlay
br_netfilter
EOF
sudo modprobe overlay && sudo modprobe br_netfilter

cat <<EOF | sudo tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables  = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward                 = 1
EOF
sudo sysctl --system

sudo swapoff -a && sudo sed -i '/swap/d' /etc/fstab

# Install containerd
sudo apt install -y containerd
sudo mkdir -p /etc/containerd
containerd config default | sudo tee /etc/containerd/config.toml
sudo sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml
sudo systemctl restart containerd

# Install kubeadm, kubelet, kubectl
sudo apt install -y apt-transport-https ca-certificates curl gpg
sudo mkdir -p -m 755 /etc/apt/keyrings   # already present on Ubuntu 24.04; harmless if it exists
curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.30/deb/Release.key | \
  sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.30/deb/ /' | \
  sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo apt update && sudo apt install -y kubelet kubeadm kubectl
sudo apt-mark hold kubelet kubeadm kubectl

# Initialize control plane (PC-01)
sudo kubeadm init \
  --pod-network-cidr=192.168.0.0/16 \
  --apiserver-advertise-address=10.20.20.10 \
  --control-plane-endpoint=10.20.20.10:6443

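mkdir -p $HOME/.kube
sudo cp /etc/kubernetes/admin.conf $HOME/.kube/config
# admin.conf is owned by root; hand the copy to the current user so kubectl can read it
sudo chown $(id -u):$(id -g) $HOME/.kube/config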

# Join workers (run on each worker node)
sudo kubeadm join 10.20.20.10:6443 \
  --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash>
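
The prerequisites listed above (swap off, kernel modules loaded, sysctl keys set) can be verified before running kubeadm on a node. The following is a minimal sketch rather than one of the project's own scripts: the function names are hypothetical, and the file paths are parameters only so that each check can be pointed at sample files as well as the live system.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helpers for a kubeadm node.

# Succeeds if a /proc/swaps-style file lists at least one active swap area.
swap_active() {
  local swaps_file="${1:-/proc/swaps}"
  # Line 1 is the column header; every further non-empty line is a swap area.
  [ "$(tail -n +2 "$swaps_file" | grep -c .)" -gt 0 ]
}

# Succeeds if the module appears in a /proc/modules-style file.
# (A module compiled into the kernel is not listed there.)
module_loaded() {
  local module="$1" modules_file="${2:-/proc/modules}"
  grep -q "^${module} " "$modules_file"
}

# Succeeds if the key is set to 1 in a sysctl.d-style configuration file.
sysctl_enabled() {
  local key="$1" conf_file="${2:-/etc/sysctl.d/k8s.conf}"
  awk -F= -v k="$key" '
    { gsub(/[ \t]/, "", $1); gsub(/[ \t]/, "", $2) }
    $1 == k && $2 == "1" { found = 1 }
    END { exit !found }
  ' "$conf_file"
}

# Runs every check against the live system; prints one line per failure
# and returns non-zero if any check failed.
preflight() {
  local failed=0 item
  if swap_active; then echo "FAIL: swap is still enabled"; failed=1; fi
  for item in overlay br_netfilter; do
    module_loaded "$item" || { echo "FAIL: module $item not loaded"; failed=1; }
  done
  for item in net.bridge.bridge-nf-call-iptables \
              net.bridge.bridge-nf-call-ip6tables \
              net.ipv4.ip_forward; do
    sysctl_enabled "$item" || { echo "FAIL: $item is not set to 1"; failed=1; }
  done
  return "$failed"
}
```

Calling `preflight` on a node prints nothing and returns 0 when all three prerequisites hold. It checks the configuration file, not the running kernel values, so it assumes `sysctl --system` has already been applied.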

Calico CNI & Network Policy Deployment

Calico was deployed as the CNI plugin for pod networking with BGP peering between nodes and NetworkPolicy enforcement for micro-segmentation. Network policies were written to restrict pod-to-pod traffic: only explicitly allowed communication paths are permitted, following a default-deny ingress posture.

# Install Calico operator and custom resources
kubectl create -f https://raw.githubusercontent.com/projectcalico/calico/v3.28.0/manifests/tigera-operator.yaml
kubectl create -f https://raw.githubusercontent.com/projectcalico/calico/v3.28.0/manifests/custom-resources.yaml

# Verify Calico pods are running
kubectl get pods -n calico-system -w

# Default-deny ingress NetworkPolicy (applied per namespace)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress

---
# Allow only Nginx Ingress to reach web pods
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ingress-to-web
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: web
  ingress:
    - from:
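        # kubernetes.io/metadata.name is set automatically on every namespace;
        # a bare "name" label only matches if it was added by hand.
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx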
      ports:
        - protocol: TCP
          port: 8080
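
Because the default-deny policy is applied per namespace, it is convenient to generate it rather than copy it by hand. The sketch below shows one way to do that; the function names are hypothetical, and the namespace names in the usage note are the ones from the services table, not a confirmed list of where the policy is applied.

```shell
#!/usr/bin/env bash
# Emits a default-deny ingress NetworkPolicy for one namespace on stdout.
default_deny_ingress() {
  local namespace="$1"
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: ${namespace}
spec:
  podSelector: {}
  policyTypes:
    - Ingress
EOF
}

# Emits one policy per namespace as a multi-document YAML stream.
default_deny_all() {
  local namespace
  for namespace in "$@"; do
    echo "---"
    default_deny_ingress "$namespace"
  done
}
```

For example, `default_deny_all production ai mail search media wiki | kubectl apply -f -` would apply the policy to those namespaces in one step. Each namespace then needs its own allow rules before traffic flows again, so it is worth rolling this out one namespace at a time.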

MetalLB & Nginx Ingress Controller

MetalLB was deployed in L2 (ARP) mode to provide external IP addresses for LoadBalancer services on the bare-metal cluster. An IP pool of 10.20.20.200–250 was reserved on VLAN 20. The Nginx Ingress Controller was deployed as a DaemonSet to handle all HTTP/HTTPS traffic with TLS termination via cert-manager and Let’s Encrypt.

# Install MetalLB
kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.14.5/config/manifests/metallb-native.yaml

# Configure IP address pool
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: cluster-pool
  namespace: metallb-system
spec:
  addresses:
    - 10.20.20.200-10.20.20.250

---
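apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: cluster-l2
  namespace: metallb-system
spec:
  # Advertise only the reserved pool; without this list, all pools are advertised
  ipAddressPools:
    - cluster-pool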

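# Install Nginx Ingress Controller. The "cloud" manifest creates a LoadBalancer Service
# that MetalLB can assign an address to; the "baremetal" manifest exposes NodePorts instead.
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v1.10.1/deploy/static/provider/cloud/deploy.yaml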

# Install cert-manager for automated TLS
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.15.0/cert-manager.yaml

# ClusterIssuer for Let's Encrypt
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: <acme-contact-email>
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
      - http01:
          ingress:
            class: nginx
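
The MetalLB pool must not overlap the router's DHCP range, which is easy to check with integer arithmetic before applying the IPAddressPool. This is a small sketch of that check; the function names are hypothetical, and the DHCP range in the usage note is an invented example, since the source does not state the router's actual range.

```shell
#!/usr/bin/env bash
# Converts a dotted-quad IPv4 address to its 32-bit integer value.
ip_to_int() {
  local IFS=. a b c d
  read -r a b c d <<< "$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Succeeds (exit 0) if the inclusive ranges [start1,end1] and [start2,end2]
# share at least one address.
ranges_overlap() {
  local s1 e1 s2 e2
  s1=$(ip_to_int "$1"); e1=$(ip_to_int "$2")
  s2=$(ip_to_int "$3"); e2=$(ip_to_int "$4")
  (( s1 <= e2 && s2 <= e1 ))
}
```

With an assumed DHCP range of 10.20.20.100 to 10.20.20.199, `ranges_overlap 10.20.20.200 10.20.20.250 10.20.20.100 10.20.20.199` returns non-zero, meaning the pool is safe. If the DHCP range were extended to 10.20.20.220, the same call would succeed and flag the conflict.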

Longhorn Distributed Storage

Longhorn was deployed for persistent storage with 3x replication across the desktop nodes. Each desktop contributes its SSD/NVMe storage to the Longhorn pool. Storage traffic is isolated on VLAN 30 so that replication I/O does not contend with cluster API and pod traffic on VLAN 20. Scheduled snapshots and backups to an NFS target provide disaster recovery capability.

# Install Longhorn
kubectl apply -f https://raw.githubusercontent.com/longhorn/longhorn/v1.6.2/deploy/longhorn.yaml

# Set as default StorageClass
kubectl patch storageclass longhorn -p \
  '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'

# Verify volumes and nodes
kubectl -n longhorn-system get pods
kubectl -n longhorn-system get nodes.longhorn.io

# Example PVC using Longhorn
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nextcloud-data
  namespace: production
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn
  resources:
    requests:
      storage: 100Gi
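
Replication multiplies the space each volume consumes, so it helps to estimate how much volume data the pool can hold before sizing PVCs. The sketch below computes a rough ceiling under two assumptions that are not from the source: each replica of a volume sits on a different node, and Longhorn's reserved-space and over-provisioning settings are ignored. The function name is hypothetical, and the capacities in the usage note are the whole-disk sizes from the topology diagram, so real usable space is lower.

```shell
#!/usr/bin/env bash
# Rough upper bound on total volume data (same unit as the inputs) that fits
# with the given replica count when replicas of a volume sit on distinct nodes.
# Usage: replicated_capacity <replicas> <node-capacity>...
replicated_capacity() {
  local replicas="$1"; shift
  local total=0 cap placed
  for cap in "$@"; do total=$(( total + cap )); done
  # Start from the naive bound (raw capacity / replicas) and walk down until
  # the replicas actually fit.
  local v=$(( total / replicas ))
  while (( v > 0 )); do
    placed=0
    # A node holds at most one replica of each volume, so it contributes no
    # more than v even if its disk is larger.
    for cap in "$@"; do placed=$(( placed + (cap < v ? cap : v) )); done
    if (( placed >= replicas * v )); then break; fi
    v=$(( v - 1 ))
  done
  echo "$v"
}
```

For three nodes of 512, 1000, and 1000 GB with three replicas, `replicated_capacity 3 512 1000 1000` prints 512: every node stores a full copy, so the smallest disk sets the limit. The 100Gi claim above would therefore consume about 100 GiB on each of the three nodes.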

Observability Stack — Prometheus + Grafana + Loki

Prometheus and Grafana were deployed using the kube-prometheus-stack Helm chart, with Loki and Promtail installed from the grafana/loki-stack chart. Prometheus scrapes metrics from Node Exporter (hardware), kubelet (pod resources), kube-state-metrics (Kubernetes object states), and application-level exporters. Grafana provides pre-built dashboards for cluster health, node resources, pod performance, and Longhorn storage utilization. Loki with Promtail aggregates logs from all pods and system journals into a queryable interface within Grafana.

# Install Helm
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

# Add repos
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Install kube-prometheus-stack
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set grafana.adminPassword='secureDashboard' \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.storageClassName=longhorn \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi

# Install Loki + Promtail
helm install loki grafana/loki-stack \
  --namespace monitoring \
  --set promtail.enabled=true \
  --set loki.persistence.enabled=true \
  --set loki.persistence.storageClassName=longhorn \
  --set loki.persistence.size=20Gi

# Expose Grafana via Ingress
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana
  namespace: monitoring
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
  tls:
    - hosts: [grafana.tyfsadik.org]
      secretName: grafana-tls
  rules:
    - host: grafana.tyfsadik.org
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: monitoring-grafana
                port:
                  number: 80

Services Running on the Cluster

Service	Namespace	Domain	Replicas
Nextcloud (Cloud Storage)	production	cloud.tyfsadik.org	2
Private AI Model (Ollama)	ai	ai.tyfsadik.org	1
Pi-hole + Unbound DNS	dns	Internal	2
Email Server (Postfix/Dovecot)	mail	tyfsadik.org	1
SearXNG Search Engine	search	search.tyfsadik.org	2
Photo Server	media	photo.tyfsadik.org	1
Wiki Server (Wikipedia Mirror)	wiki	wiki.tyfsadik.org	1
Grafana Dashboards	monitoring	grafana.tyfsadik.org	1
Prometheus + Loki	monitoring	Internal	1

Deployment Workflow

flowchart LR
  A["Developer\ngit push"] -->|"webhook"| B["GitHub Actions\nCI Pipeline"]
  B -->|"build + push"| C["Container Registry\nDocker Hub"]
  C -->|"image pull"| D["Kubernetes\nkubectl apply"]
  D --> E["Rolling Update\nzero-downtime deploy"]
  E --> F["Health Check\nliveness + readiness"]
  F -->|"metrics"| G["Prometheus\nmonitoring"]
  G --> H["Grafana\nalert + dashboard"]
  style A fill:#1a1a2e,stroke:#00d4ff,color:#e0e0e0
  style B fill:#1a1a2e,stroke:#ffcc00,color:#e0e0e0
  style C fill:#1a1a2e,stroke:#00ff88,color:#e0e0e0
  style D fill:#1a1a2e,stroke:#00d4ff,color:#e0e0e0
  style E fill:#1a1a2e,stroke:#00d4ff,color:#e0e0e0
  style F fill:#1a1a2e,stroke:#00ff88,color:#e0e0e0
  style G fill:#1a1a2e,stroke:#ffcc00,color:#e0e0e0
  style H fill:#1a1a2e,stroke:#ffcc00,color:#e0e0e0
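
The rolling-update and health-check stages of this workflow could be scripted along the following lines. This is only a sketch of one possible deploy step, not the project's actual pipeline: the `run` and `deploy_image` helpers, the `DRY_RUN` switch, the 180s timeout, and the deployment and image names in the usage note are all hypothetical.

```shell
#!/usr/bin/env bash
# Prints the command when DRY_RUN=1 (the default here); otherwise executes it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Points a Deployment at a new image, waits for the rollout to finish, and
# rolls back if the new pods do not become ready within the timeout.
deploy_image() {
  local namespace="$1" deployment="$2" container="$3" image="$4"
  run kubectl -n "$namespace" set image "deployment/${deployment}" "${container}=${image}"
  if ! run kubectl -n "$namespace" rollout status "deployment/${deployment}" --timeout=180s; then
    run kubectl -n "$namespace" rollout undo "deployment/${deployment}"
    return 1
  fi
}
```

Running `deploy_image production web web example/web:1.2.3` prints the two kubectl commands without touching the cluster; setting `DRY_RUN=0` executes them. `rollout status` relies on the readiness probes shown in the diagram, so a Deployment without probes reports success as soon as its containers start.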

Challenges & Solutions

Laptop nodes dropping from cluster over Wi-Fi: The laptop worker nodes periodically lost connectivity over the wireless bridge, causing kubelet to miss heartbeats and the control plane to mark them as NotReady. Resolved by deploying a dedicated 5 GHz radio in bridge mode with a fixed channel, disabling power management on the laptop NICs (iw dev wlan0 set power_save off), and increasing the node-monitor-grace-period to 60s to tolerate brief interruptions without evicting pods.
Longhorn replication saturating the network: Initial deployment placed storage replication on the same VLAN as cluster traffic, causing API server latency spikes during large write operations. Resolved by creating a dedicated VLAN 30 for storage traffic and configuring Longhorn to use the VLAN 30 interface for replication.
MetalLB ARP conflicts with router DHCP: MetalLB’s L2 mode ARP responses conflicted with the router’s DHCP leases when the IP pool overlapped with the DHCP range. Fixed by reserving 10.20.20.200–250 as a static range excluded from DHCP and configuring MetalLB to only advertise within that range.
etcd performance on a single control plane node: With only one control plane node, etcd write latency spiked during heavy scheduling. Mitigated by placing etcd data on the NVMe drive (rather than a SATA SSD), tuning heartbeat-interval and election-timeout, and ensuring no other I/O-heavy workloads run on PC-01.
TLS certificate provisioning for multiple subdomains: cert-manager’s HTTP-01 solver required each subdomain to be publicly reachable, which conflicted with internal-only services. Resolved by using DNS-01 challenge validation for internal services and HTTP-01 for public-facing ones.
Resource contention on 8 GB laptop nodes: Laptops with only 8 GB RAM struggled when Longhorn and Prometheus exporters consumed too much memory alongside application pods. Fixed by applying resource limits and requests to all pods, tainting the laptop nodes with node-role=lightweight:PreferNoSchedule, and using node affinity rules to keep heavy workloads on the desktop nodes.
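
The laptop taint described in the last item can be applied to every laptop node in one pass. The sketch below only generates the kubectl commands so they can be reviewed first; the function name is hypothetical, and the node names in the usage note are illustrative, since the source does not give the nodes' registered hostnames.

```shell
#!/usr/bin/env bash
# Prints one kubectl command per node. PreferNoSchedule is a soft taint: the
# scheduler avoids the node when it can, but existing pods are not evicted.
laptop_taint_commands() {
  local node
  for node in "$@"; do
    printf 'kubectl taint nodes %s node-role=lightweight:PreferNoSchedule --overwrite\n' "$node"
  done
}
```

For example, `laptop_taint_commands laptop-01 laptop-02 laptop-03 laptop-04` prints four commands, and piping that output to `bash` applies them. Because the taint is only a preference, heavy workloads still need the node affinity rules mentioned above to stay on the desktop nodes.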

Security Hardening

RBAC: Least-privilege service accounts per namespace; no pods run with cluster-admin
NetworkPolicy: Default-deny ingress on all namespaces; explicit allow rules per service
Pod Security Standards: restricted profile enforced via Pod Security Admission; no privileged containers
Image Scanning: Trivy scans on all container images before deployment; CVE alerts sent to Grafana
Secrets Management: Kubernetes Secrets encrypted at rest with aescbc encryption provider
SSH Hardening: Key-only authentication, fail2ban on all nodes, management access restricted to VLAN 10
Firewall Rules: Router ACLs restrict inter-VLAN traffic; only necessary ports open between segments
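
Encrypting Secrets at rest with the aescbc provider requires an EncryptionConfiguration file that the API server loads through its --encryption-provider-config flag. The following sketch generates such a file; the function names and the key name `key1` are hypothetical, and where the file is stored on the control plane node is left open because the source does not say.

```shell
#!/usr/bin/env bash
# Generates a random 32-byte key, base64-encoded, as the aescbc provider expects.
generate_aescbc_key() {
  head -c 32 /dev/urandom | base64
}

# Emits an EncryptionConfiguration that encrypts Secrets with aescbc and keeps
# the identity provider as a fallback for reading data written before
# encryption was enabled.
encryption_config() {
  local key="$1"
  cat <<EOF
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - aescbc:
          keys:
            - name: key1
              secret: ${key}
      - identity: {}
EOF
}
```

A typical use is `encryption_config "$(generate_aescbc_key)" > encryption-config.yaml`, with the file kept readable by root only. Secrets created before the change stay unencrypted in etcd until they are rewritten, for example with `kubectl get secrets --all-namespaces -o json | kubectl replace -f -`.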

What I Learned

End-to-end Kubernetes cluster lifecycle: bootstrapping, upgrading, scaling, and troubleshooting with kubeadm
Bare-metal networking for Kubernetes: VLAN segmentation, BGP with Calico, ARP-based load balancing with MetalLB
Distributed storage engineering: Longhorn replication, IOPS tuning, and failure recovery across physical nodes
Enterprise-grade observability: Prometheus metric design, Grafana alerting pipelines, Loki log correlation
Physical infrastructure design: cable management, switch VLAN configuration, wireless backhaul for cluster nodes
Security-first architecture: network micro-segmentation, RBAC design, pod security standards, image vulnerability scanning
Resource management on heterogeneous hardware: taints, tolerations, affinity rules, and resource quotas to balance workloads across nodes with different capabilities
The discipline of documentation: every configuration change logged, every decision recorded, every diagram kept current

Kubernetes kubeadm Calico MetalLB Longhorn Prometheus Grafana Loki Proxmox Docker Ubuntu containerd Nginx Ingress cert-manager VLAN Bare-Metal Homelab Self-Hosted