On-Premises Infrastructure Project
Objective
Design and build a complete on-premises multi-server infrastructure in a virtualized lab environment simulating a small enterprise data center. The project integrates all core skills from the Computer Systems Technology program: a gateway router with VLANs and iptables, a three-node MariaDB Galera database cluster with synchronous multi-master replication, web application servers behind a HAProxy load balancer with Keepalived failover, centralized authentication using FreeIPA/LDAP, a monitoring stack (Prometheus + Grafana), automated backup with offsite replication via rsync, and a documented, tested disaster recovery runbook. The infrastructure is designed for scalability, redundancy, and operational resilience.
Tools & Technologies
- Ubuntu Server 22.04 LTS: all nodes (8 VMs total)
- iptables / nftables: gateway firewall and NAT
- HAProxy 2.6: load balancing and health checking
- MariaDB Galera Cluster: synchronous multi-master replication
- FreeIPA: centralized LDAP/Kerberos identity management
- Prometheus + Grafana + Alertmanager: full observability stack
- rsync + SSH: automated backup and offsite replication
- Keepalived (VRRP): virtual IP for HAProxy high availability
- Ansible: configuration management and DR automation
Architecture Overview
Step-by-Step Process
Built a segmented network with five VLANs (Web, DB, Auth, Monitor, Backup), with the gateway VM acting as an inter-VLAN router enforcing an iptables-based security policy.
# Gateway VM: 5 VLANs on a single interface using 802.1Q sub-interfaces
# /etc/netplan/00-gateway.yaml
network:
  version: 2
  ethernets:
    enp0s8:
      dhcp4: false
  vlans:
    enp0s8.10:
      id: 10
      link: enp0s8
      addresses: [10.0.1.1/24]   # Web VLAN
    enp0s8.20:
      id: 20
      link: enp0s8
      addresses: [10.0.2.1/24]   # DB VLAN
    enp0s8.30:
      id: 30
      link: enp0s8
      addresses: [10.0.3.1/24]   # Auth VLAN
    enp0s8.40:
      id: 40
      link: enp0s8
      addresses: [10.0.4.1/24]   # Monitor VLAN
    enp0s8.50:
      id: 50
      link: enp0s8
      addresses: [10.0.5.1/24]   # Backup VLAN
# iptables policy — DB VLAN only accessible from Web VLAN
iptables -A FORWARD -s 10.0.1.0/24 -d 10.0.2.0/24 -p tcp --dport 3306 -j ACCEPT
iptables -A FORWARD -s 10.0.2.0/24 -d 10.0.1.0/24 -m conntrack --ctstate ESTABLISHED -j ACCEPT
iptables -A FORWARD -s 10.0.2.0/24 -d 10.0.1.0/24 -j DROP # block unsolicited DB→Web
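The VLAN sub-interfaces and FORWARD rules above only take effect if the gateway actually forwards packets, which Ubuntu disables by default. A minimal sketch of that missing prerequisite (the filename is an assumption):

```ini
# /etc/sysctl.d/99-gateway.conf (assumed filename) -- enable routing on the gateway
net.ipv4.ip_forward = 1

# apply without a reboot:  sysctl --system
# persist iptables rules:  netfilter-persistent save   (iptables-persistent package)
```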
Deployed a three-node MariaDB Galera cluster with synchronous multi-master replication. Any node can accept writes; all nodes stay in sync via the wsrep replication protocol.
# /etc/mysql/mariadb.conf.d/60-galera.cnf (all 3 nodes)
[mysqld]
binlog_format=ROW
default-storage-engine=innodb
innodb_autoinc_lock_mode=2
bind-address=0.0.0.0
# Galera Provider Configuration
wsrep_on=ON
wsrep_provider=/usr/lib/galera/libgalera_smm.so
# Galera Cluster Configuration
wsrep_cluster_name="CapstoneCluster"
wsrep_cluster_address="gcomm://10.0.2.11,10.0.2.12,10.0.2.13"
# Galera Synchronization Configuration
wsrep_sst_method=rsync
# Node-specific settings (change per node)
wsrep_node_address="10.0.2.11" # This node's IP
wsrep_node_name="db-01"
# Bootstrap first node only
sudo galera_new_cluster # on db-01 only
# Start remaining nodes
sudo systemctl start mariadb # on db-02, db-03
# Verify cluster
mysql -u root -e "SHOW STATUS LIKE 'wsrep_cluster_size';"
# Expected: wsrep_cluster_size = 3
mysql -u root -e "SHOW STATUS LIKE 'wsrep_cluster_status';"
# Expected: Primary
Configured HAProxy on two nodes with a shared virtual IP managed by Keepalived. If the active HAProxy fails, Keepalived promotes the standby node and the virtual IP migrates automatically.
# /etc/haproxy/haproxy.cfg
global
    log /dev/log local0
    maxconn 50000
    user haproxy
    group haproxy

defaults
    log global
    mode http
    option httplog
    timeout connect 5s
    timeout client 30s
    timeout server 30s

frontend http-in
    bind *:80
    default_backend web-servers

backend web-servers
    balance roundrobin
    option httpchk GET /health
    http-check expect status 200
    server web-01 10.0.1.11:80 check inter 2s fall 3 rise 2
    server web-02 10.0.1.12:80 check inter 2s fall 3 rise 2

listen stats
    bind *:8404
    stats enable
    stats uri /stats
    stats refresh 5s
# /etc/keepalived/keepalived.conf (on LB-01, master)
vrrp_instance VI_1 {
    state MASTER
    interface enp0s3
    virtual_router_id 51
    priority 150
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass capstone2025
    }
    virtual_ipaddress {
        10.0.1.100/24
    }
    notify_master "/etc/keepalived/notify.sh MASTER"
    notify_backup "/etc/keepalived/notify.sh BACKUP"
}
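The keepalived configuration calls /etc/keepalived/notify.sh on state transitions, but the hook itself isn't shown above. Here is a minimal sketch of what such a hook could look like; the log path, message format, and UNKNOWN fallback are assumptions, not the project's actual script:

```shell
#!/usr/bin/env bash
# /etc/keepalived/notify.sh -- sketch of a VRRP transition hook (assumed contents).
# keepalived invokes it as: notify.sh MASTER   (or BACKUP / FAULT)

notify() {
    local state="$1"
    # NOTIFY_LOG is overridable for testing; a real deployment would point this
    # at something like /var/log/keepalived-transitions.log
    local log="${NOTIFY_LOG:-/tmp/keepalived-transitions.log}"
    # record timestamp, node name, and the new VRRP state
    printf '%s %s -> %s\n' "$(date -Is)" "$(uname -n)" "$state" >> "$log"
}

notify "${1:-UNKNOWN}"
```

An obvious extension is to page the on-call channel from this hook, or to relabel the node in Prometheus so dashboards show which LB currently holds the virtual IP.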
Implemented automated daily backups using rsync with SSH, hard-link based incremental backups (rsnapshot-style), and weekly offsite replication. A backup verification script confirms restore integrity.
#!/usr/bin/env bash
# /usr/local/bin/backup_infra.sh
# Hard-link incremental backup using rsync
set -euo pipefail

BACKUP_ROOT="/srv/backups"
DATE=$(date +%Y-%m-%d)
HOSTS=("10.0.1.11" "10.0.1.12" "10.0.2.11" "10.0.2.12" "10.0.2.13")
SSH_KEY="/root/.ssh/backup_key"
RETENTION_DAYS=30

for host in "${HOSTS[@]}"; do
    HOST_NAME=$(ssh -i "$SSH_KEY" -o StrictHostKeyChecking=no \
        "backup@$host" "hostname")
    DEST="$BACKUP_ROOT/$HOST_NAME"
    LAST="$DEST/latest"
    TODAY="$DEST/$DATE"
    mkdir -p "$TODAY"

    # --link-dest hard-links unchanged files against the previous snapshot,
    # so each daily directory looks complete but only changed files use space
    rsync -az --delete \
        --link-dest="$LAST" \
        --exclude={'/proc','/sys','/dev','/tmp','/run'} \
        -e "ssh -i $SSH_KEY -o StrictHostKeyChecking=no" \
        "backup@$host:/" \
        "$TODAY/" \
        && ln -sfn "$TODAY" "$LAST" \
        && echo "Backup complete: $HOST_NAME ($DATE)"

    # Remove snapshots older than the retention period
    # (-mindepth 1 keeps find from matching $DEST itself)
    find "$DEST" -mindepth 1 -maxdepth 1 -type d -mtime +"$RETENTION_DAYS" -exec rm -rf {} \;
done

# DB dump backup (Galera safe: single-node consistent snapshot)
# DB_BACKUP_PASS is expected in the environment (e.g. sourced from a root-only file)
mysqldump --all-databases --single-transaction \
    --host=10.0.2.11 --user=backup --password="$DB_BACKUP_PASS" \
    | gzip > "$BACKUP_ROOT/mysql-$DATE.sql.gz"
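The write-up mentions a backup verification script but does not show it. A minimal sketch of what those integrity checks could look like (the script path and function names are assumptions; the "Dump completed" trailer is something mysqldump appends to successful dumps by default):

```shell
#!/usr/bin/env bash
# /usr/local/bin/verify_backup.sh -- sketch of the verification step (assumed contents)
set -euo pipefail

# Check 1: the gzipped SQL dump is intact and ran to completion.
# A successful mysqldump ends with a "-- Dump completed" trailer.
verify_dump() {      # usage: verify_dump /srv/backups/mysql-YYYY-MM-DD.sql.gz
    gzip -t "$1" &&
    zcat "$1" | tail -n 5 | grep -q 'Dump completed'
}

# Check 2: the day's snapshot directory exists and contains at least one file.
verify_snapshot() {  # usage: verify_snapshot /srv/backups/db-01/2025-01-01
    [ -d "$1" ] &&
    [ -n "$(find "$1" -type f -print -quit)" ]
}
```

A fuller version would also restore the dump into a scratch MariaDB instance and run table checksums, which is the only way to prove the backup is actually restorable.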
Documented and tested a full DR procedure: simulating a db-01 failure, verifying the Galera cluster continued with 2 nodes, restoring from backup, and rejoining the recovered node to the cluster.
# DR Test: Simulate db-01 failure
# Step 1: Force-stop db-01 (simulates crash)
# (On the hypervisor, power off db-01 VM)
# Step 2: Verify cluster continues with 2 nodes
mysql -h 10.0.2.12 -u root -e "SHOW STATUS LIKE 'wsrep_cluster_size';"
# Expected: 2 (cluster degraded but operational)
# Step 3: Verify application still serves requests
curl http://10.0.1.100/ # Through virtual IP
# Step 4: Restore db-01 from backup
# Boot db-01 VM
# Restore MySQL data directory from the backup server (mariadb must not be running)
sudo systemctl stop mariadb
rsync -az [email protected]:/srv/backups/db-01/latest/var/lib/mysql/ /var/lib/mysql/
# Step 5: Rejoin recovered node to cluster
# /etc/mysql/mariadb.conf.d/60-galera.cnf
# wsrep_cluster_address="gcomm://10.0.2.12,10.0.2.13" # Other 2 nodes only (no self)
sudo systemctl start mariadb  # Galera pulls the missed writes via IST (or a full SST) on join
# Verify db-01 rejoined
mysql -u root -e "SHOW STATUS LIKE 'wsrep_cluster_size';"
# Expected: 3
# Update cluster address back to all 3 nodes for future restarts.
# gcomm:// is only read when the node (re)joins, so editing the config file
# is enough; no restart or reload is needed now.
# wsrep_cluster_address="gcomm://10.0.2.11,10.0.2.12,10.0.2.13"
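The monitoring stack from the objective ties into this DR procedure naturally: an alert on cluster size would page before anyone reaches for the runbook. A hedged sketch of such a rule, assuming mysqld_exporter scrapes the DB nodes (the metric name follows mysqld_exporter's mysql_global_status_* convention; the file path is an assumption):

```yaml
# /etc/prometheus/rules/galera.yml (assumed path)
groups:
  - name: galera
    rules:
      - alert: GaleraClusterDegraded
        expr: min(mysql_global_status_wsrep_cluster_size) < 3
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Galera cluster size below 3 -- a node has left the cluster"
```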
Complete Workflow
Challenges & Solutions
- Galera cluster split-brain during test failover: during one failover test the two surviving nodes lost contact with each other, and each briefly reported itself as the primary component. Resolved by studying the wsrep quorum rules: with 3 nodes, losing connectivity between the remaining 2 is a quorum loss rather than a true split-brain, since neither side holds a majority. Restarted the cluster with galera_new_cluster from the node with the most current data.
- Keepalived VRRP advertisements blocked by iptables: the gateway was dropping VRRP packets (protocol 112) between the two LB nodes. Added an explicit ACCEPT rule for VRRP multicast (224.0.0.18).
- rsync backup consuming too much disk with hard links: rsync was initially creating full copies instead of hard links because the destination was on a different filesystem than expected. Ensured all backup directories were on the same mount point (a single ext4 partition) so hard-link deduplication worked.
- FreeIPA enrollment failing on web VMs: the web VMs' DNS pointed at the gateway, which wasn't resolving FreeIPA's Kerberos SRV records. Changed the web VMs' DNS to point at the FreeIPA server (10.0.3.10), after which enrollment succeeded.
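The FreeIPA DNS fix above boils down to one resolver change on each web VM. A sketch of the corresponding netplan fragment (the filename and interface name are assumptions):

```yaml
# /etc/netplan/01-web.yaml (assumed filename) -- web VM pointed at FreeIPA for DNS
network:
  version: 2
  ethernets:
    enp0s3:                      # assumed interface name
      addresses: [10.0.1.11/24]
      routes:
        - to: default
          via: 10.0.1.1          # gateway VM, Web VLAN
      nameservers:
        addresses: [10.0.3.10]   # FreeIPA server resolves the Kerberos SRV records
```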
Key Takeaways
- Galera Cluster requires a minimum of three nodes for automatic quorum — with two nodes, losing one requires manual intervention. Three nodes allow one to fail while the cluster remains operational.
- Keepalived VRRP provides sub-second failover for virtual IPs, but requires the network to pass VRRP multicast advertisements — firewalls must explicitly permit protocol 112.
- Hard-link based incremental backups (rsnapshot pattern) provide excellent space efficiency — only changed files consume additional space, while each daily snapshot appears complete.
- Disaster recovery runbooks are only valuable if they are tested regularly — an untested DR procedure will fail under the pressure of a real incident. Testing revealed several gaps that would have extended recovery time.