
Noble lab — Talos cluster build checklist

This document is the exported TODO for the noble Talos cluster (4 nodes). Commands and troubleshooting live in README.md.

Current state (2026-03-28)

  • Talos v1.12.6 (target) / Kubernetes as bundled — four nodes Ready (unless mid-upgrade); verify with talosctl health; use talos/kubeconfig for kubectl (the root kubeconfig may still be a stub). Image Factory (nocloud installer): factory.talos.dev/nocloud-installer/249d9135de54962744e917cfe654117000cba369f9152fbab9d055a00aa3664f:v1.12.6
  • Cilium Helm 1.16.6 / app 1.16.6 (clusters/noble/apps/cilium/, phase 1 values).
  • MetalLB Helm 0.15.3 / app v0.15.3; IPAddressPool noble-l2 + L2Advertisement — pool 192.168.50.210–192.168.50.229.
  • kube-vip DaemonSet 3/3 on control planes; VIP 192.168.50.230 on ens18 (vip_subnet /32 required — a bare 32 breaks parsing). Verified from workstation: kubectl config set-cluster noble --server=https://192.168.50.230:6443 then kubectl get --raw /healthz → ok (talos/kubeconfig; see talos/README.md).
  • metrics-server Helm 3.13.0 / app v0.8.0 — clusters/noble/apps/metrics-server/values.yaml (--kubelet-insecure-tls for Talos); kubectl top nodes works.
  • Still open: Longhorn, Traefik, cert-manager, Argo CD, observability — checklist below.

Inventory

| Host | Role | IP |
| --- | --- | --- |
| helium | worker | 192.168.50.10 |
| neon | control-plane + worker | 192.168.50.20 |
| argon | control-plane + worker | 192.168.50.30 |
| krypton | control-plane + worker | 192.168.50.40 |

Network reservations

| Use | Value |
| --- | --- |
| Kubernetes API VIP (kube-vip) | 192.168.50.230 (see talos/README.md; align with talos/talconfig.yaml additionalApiServerCertSans) |
| MetalLB L2 pool | 192.168.50.210–192.168.50.229 |
| Argo CD LoadBalancer | Pick one IP in the MetalLB pool (e.g. 192.168.50.210) |
| Apps ingress DNS | *.apps.noble.lab.pcenicni.dev |
| ExternalDNS | Pangolin (map to supported ExternalDNS provider when documented) |
| Velero | S3-compatible URL — configure later |
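The MetalLB reservations above translate into two small CRs. A sketch under the assumptions stated in this document (pool name noble-l2, metallb-system namespace, MetalLB's v1beta1 API); the actual manifests live in clusters/noble/apps/metallb/:

```yaml
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: noble-l2
  namespace: metallb-system
spec:
  addresses:
    - 192.168.50.210-192.168.50.229   # keep out of DHCP scope
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: noble-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - noble-l2
```

Remember the ordering rule below: the Helm chart (CRDs + controller) must be installed before these manifests are applied.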

Versions

  • Talos: v1.12.6 — align talosctl client with node image
  • Talos Image Factory (iscsi-tools + util-linux-tools): factory.talos.dev/nocloud-installer/249d9135de54962744e917cfe654117000cba369f9152fbab9d055a00aa3664f:v1.12.6 — same schematic must appear in machine.install.image after talhelper genconfig (bare metal may use metal-installer/ instead of nocloud-installer/)
  • Kubernetes: 1.35.2 on current nodes (bundled with Talos; not pinned in repo)
  • Cilium: 1.16.6 (Helm chart; see clusters/noble/apps/cilium/README.md)
  • MetalLB: 0.15.3 (Helm chart; app v0.15.3)
  • metrics-server: 3.13.0 (Helm chart; app v0.8.0)
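The factory ID above encodes a schematic with the two extensions named in the Image Factory bullet. The schematic itself is not checked into this repo, but it would presumably have been generated from YAML along these lines (standard Image Factory schematic format):

```yaml
customization:
  systemExtensions:
    officialExtensions:
      - siderolabs/iscsi-tools        # required by Longhorn
      - siderolabs/util-linux-tools   # required by Longhorn
```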

Repo paths (this workspace)

| Artifact | Path |
| --- | --- |
| This checklist | talos/CLUSTER-BUILD.md |
| Talos quick start + networking + kubeconfig | talos/README.md |
| talhelper source (active) | talos/talconfig.yaml — may be wipe-phase (no Longhorn volume) during disk recovery |
| Longhorn volume restore | talos/talconfig.with-longhorn.yaml — copy to talconfig.yaml after GPT wipe (see talos/README.md §5) |
| Longhorn GPT wipe automation | talos/scripts/longhorn-gpt-recovery.sh |
| kube-vip (kustomize) | clusters/noble/apps/kube-vip/ (vip_interface e.g. ens18) |
| Cilium (Helm values) | clusters/noble/apps/cilium/values.yaml (phase 1), optional values-kpr.yaml, README.md |
| MetalLB | clusters/noble/apps/metallb/namespace.yaml (PSA privileged), ip-address-pool.yaml, kustomization.yaml, README.md |
| Longhorn Helm values | clusters/noble/apps/longhorn/values.yaml |
| metrics-server (Helm values) | clusters/noble/apps/metrics-server/values.yaml |
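For the metrics-server values file listed above, the one Talos-specific setting this document mentions is the insecure-TLS kubelet flag. A minimal sketch of clusters/noble/apps/metrics-server/values.yaml, assuming the chart's standard args key (the real file may carry more):

```yaml
# Talos kubelet serving certs are not signed for node IPs,
# so metrics-server must skip kubelet TLS verification.
args:
  - --kubelet-insecure-tls
```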

Git vs cluster: manifests and talconfig live in git; talhelper genconfig -o out, bootstrap, Helm, and kubectl run on your LAN. See talos/README.md for workstation reachability (lab LAN/VPN), talosctl kubeconfig vs Kubernetes server: (VIP vs node IP), and --insecure only in maintenance.

Ordering (do not skip)

  1. Talos installed; Cilium (or chosen CNI) before most workloads — with cni: none, nodes stay NotReady / network-unavailable taint until CNI is up.
  2. MetalLB Helm chart (CRDs + controller) before kubectl apply -k on the pool manifests.
  3. clusters/noble/apps/metallb/namespace.yaml applied first (or its PSA labels merged onto the existing metallb-system namespace) so Pod Security does not block the speaker (see apps/metallb/README.md).
  4. Longhorn: Talos user volume + extensions in talconfig.with-longhorn.yaml (when restored); Helm defaultDataPath in clusters/noble/apps/longhorn/values.yaml.
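For step 4, the Helm side is a single setting pointing Longhorn at the Talos user volume. A sketch of clusters/noble/apps/longhorn/values.yaml, assuming the chart's defaultSettings key (mount path taken from Phase B below):

```yaml
defaultSettings:
  # Must match the Talos user volume mount from talconfig.with-longhorn.yaml
  defaultDataPath: /var/mnt/longhorn
```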

Prerequisites (before phases)

  • talos/talconfig.yaml checked in (VIP, API SANs, cni: none, iscsi-tools / util-linux-tools in schematic) — run talhelper validate talconfig talconfig.yaml after edits
  • Workstation on a routable path to node IPs or VIP (same LAN / VPN); talos/README.md §3 if kubectl hits wrong server: or network is unreachable
  • talosctl client matches node Talos version; talhelper for genconfig
  • Node static IPs (helium, neon, argon, krypton)
  • DHCP does not lease 192.168.50.210–229, .230, or node IPs
  • DNS for API and apps as in talos/README.md
  • Git remote ready for Argo CD (argo-cd)
  • talos/kubeconfig from talosctl kubeconfig — root repo kubeconfig is a stub until populated
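The talconfig.yaml settings the first prerequisite asks you to verify could look roughly like this (illustrative talhelper fields only — the real file carries installer image, disks, and all four nodes):

```yaml
clusterName: noble
talosVersion: v1.12.6
endpoint: https://192.168.50.230:6443   # the kube-vip VIP
additionalApiServerCertSans:
  - 192.168.50.230
cniConfig:
  name: none        # Cilium is installed separately in Phase B
nodes:
  - hostname: neon
    ipAddress: 192.168.50.20
    controlPlane: true
    # ...remaining nodes per the inventory table
```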

Phase A — Talos bootstrap + API VIP

  • talhelper gensecret → talhelper genconfig -o out (re-run genconfig after every talconfig edit)
  • apply-config on all nodes (talos/README.md §2 — no --insecure after nodes join; use TALOSCONFIG)
  • talosctl bootstrap once; the remaining control planes and the worker then join
  • talosctl kubeconfig → working kubectl (talos/README.md §3 — override server: if VIP not reachable from workstation)
  • kube-vip manifests in clusters/noble/apps/kube-vip
  • kube-vip healthy; vip_interface matches uplink (talosctl get links); VIP reachable where needed
  • talosctl health (e.g. talosctl health -n 192.168.50.20 with TALOSCONFIG set)
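The vip_subnet note above refers to the kube-vip container env in clusters/noble/apps/kube-vip/. A sketch of that fragment, assuming upstream kube-vip's env-var names and this cluster's VIP/interface (verify against the actual manifest):

```yaml
# Fragment of the kube-vip DaemonSet container spec (illustrative).
env:
  - name: vip_interface
    value: ens18
  - name: vip_subnet
    value: "/32"            # leading slash required — a bare "32" breaks parsing
  - name: address
    value: 192.168.50.230
  - name: vip_arp
    value: "true"
  - name: cp_enable
    value: "true"
  - name: port
    value: "6443"
```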

Phase B — Core platform

Install order: Cilium → metrics-server → Longhorn (Talos disk + Helm) → MetalLB (Helm → pool manifests) → ingress / certs / DNS as planned.

  • Cilium (Helm 1.16.6) — required before MetalLB if cni: none (clusters/noble/apps/cilium/)
  • metrics-server — Helm 3.13.0; values in clusters/noble/apps/metrics-server/values.yaml; verify kubectl top nodes
  • Longhorn — Talos: talconfig.with-longhorn.yaml + talos/README.md §5; Helm: clusters/noble/apps/longhorn/values.yaml (defaultDataPath /var/mnt/longhorn)
  • MetalLB — chart installed; pool + L2 from clusters/noble/apps/metallb/ applied (192.168.50.210–229)
  • Service LoadBalancer test — assign an IP from 210–229 (e.g. a dummy LoadBalancer Service or Traefik)
  • Traefik LoadBalancer for *.apps.noble.lab.pcenicni.dev
  • cert-manager + ClusterIssuer (staging → prod)
  • ExternalDNS (Pangolin-compatible provider)
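The staging ClusterIssuer from the cert-manager step might be sketched as follows (cert-manager's v1 API; the name, email, and Traefik ingress class are placeholders, not values from this repo):

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: admin@example.com          # placeholder
    privateKeySecretRef:
      name: letsencrypt-staging       # account key secret
    solvers:
      - http01:
          ingress:
            ingressClassName: traefik
```

Once staging certificates validate end to end, a second issuer pointing at the production ACME URL replaces it, per the staging → prod note above.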

Phase C — GitOps

  • Argo CD bootstrap (clusters/noble/bootstrap/argocd, root app) — path TBD when added
  • Argo CD server LoadBalancer with dedicated pool IP
  • SSO — later
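Pinning the Argo CD server Service to a dedicated pool IP could be done via Helm values along these lines (argo-cd chart's server.service keys; the IP follows the e.g. from the reservations table and MetalLB's loadBalancerIPs annotation — confirm against the bootstrap manifests once added):

```yaml
server:
  service:
    type: LoadBalancer
    annotations:
      # Request a fixed IP from the noble-l2 pool
      metallb.universe.tf/loadBalancerIPs: 192.168.50.210
```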

Phase D — Observability

  • kube-prometheus-stack (PVCs on Longhorn)
  • Loki + Fluent Bit; Grafana datasource

Phase E — Secrets

  • Sealed Secrets (optional Git workflow)
  • Vault in-cluster on Longhorn + auto-unseal
  • External Secrets Operator + Vault ClusterSecretStore

Phase F — Policy + backups

  • Kyverno baseline policies
  • Velero when S3 is ready; backup/restore drill

Phase G — Hardening

  • RBAC, network policies (Cilium), Alertmanager routes
  • Runbooks: API VIP, etcd, Longhorn, Vault

Quick validation

  • kubectl get nodes — all Ready
  • API via VIP :6443 — kubectl get --raw /healthz → ok with kubeconfig server: https://192.168.50.230:6443
  • Test LoadBalancer receives an IP from 210–229
  • Sample Ingress + cert + ExternalDNS record
  • PVC bound; Prometheus/Loki durable if configured

Keep in sync with talos/README.md and manifests under clusters/noble/.