
# Noble lab — Talos cluster build checklist

This document is the exported TODO for the noble Talos cluster (4 nodes). Commands and troubleshooting live in `README.md`.

## Current state (2026-03-28)

- Talos v1.12.6 (target) / Kubernetes as bundled — four nodes Ready unless upgrading; `talosctl health`; `talos/kubeconfig` for kubectl (root kubeconfig may still be a stub). Image Factory (nocloud installer): `factory.talos.dev/nocloud-installer/249d9135de54962744e917cfe654117000cba369f9152fbab9d055a00aa3664f:v1.12.6`
- Cilium — Helm 1.16.6 / app 1.16.6 (`clusters/noble/apps/cilium/`, phase 1 values).
- MetalLB — Helm 0.15.3 / app v0.15.3; IPAddressPool `noble-l2` + L2Advertisement — pool 192.168.50.210–192.168.50.229.
- kube-vip — DaemonSet 3/3 on control planes; VIP 192.168.50.230 on ens18 (`vip_subnet` `/32` required — bare `32` breaks parsing). Verified from workstation: `kubectl config set-cluster noble --server=https://192.168.50.230:6443` then `kubectl get --raw /healthz` → `ok` (`talos/kubeconfig`; see `talos/README.md`).
- metrics-server — Helm 3.13.0 / app v0.8.0 — `clusters/noble/apps/metrics-server/values.yaml` (`--kubelet-insecure-tls` for Talos); `kubectl top nodes` works.
- Longhorn — Helm 1.11.1 / app v1.11.1 — `clusters/noble/apps/longhorn/` (PSA privileged namespace, defaultDataPath `/var/mnt/longhorn`, preUpgradeChecker enabled); StorageClass `longhorn` (default); `nodes.longhorn.io` all Ready; test PVC Bound on `longhorn`.
- Traefik — Helm 39.0.6 / app v3.6.11 — `clusters/noble/apps/traefik/`; Service LoadBalancer EXTERNAL-IP 192.168.50.211; IngressClass `traefik` (default). Point `*.apps.noble.lab.pcenicni.dev` at 192.168.50.211. MetalLB pool verification was done before replacing the temporary nginx test with Traefik.
- cert-manager — Helm v1.20.0 / app v1.20.0 — `clusters/noble/apps/cert-manager/`; ClusterIssuer `letsencrypt-staging` and `letsencrypt-prod` (HTTP-01, ingress class `traefik`); ACME email `certificates@noble.lab.pcenicni.dev` (edit in manifests if you want a different mailbox).
- Newt — Helm 1.2.0 / app 1.10.1 — `clusters/noble/apps/newt/` (fossorial/newt); Pangolin site tunnel — `newt-pangolin-auth` Secret (`PANGOLIN_ENDPOINT`, `NEWT_ID`, `NEWT_SECRET`). Public DNS is not automated with ExternalDNS: CNAME records at your DNS host per Pangolin's domain instructions, plus Integration API for HTTP resources/targets — see `clusters/noble/apps/newt/README.md`. LAN access to Traefik can still use `*.apps.noble.lab.pcenicni.dev` → 192.168.50.211 (split horizon / local resolver).
- Argo CD — Helm 9.4.17 / app v3.3.6 — `clusters/noble/bootstrap/argocd/`; argocd-server LoadBalancer 192.168.50.210; app-of-apps scaffold under `bootstrap/argocd/apps/` (edit `root-application.yaml` `repoURL` before applying).
- kube-prometheus-stack — Helm chart 82.15.1 — `clusters/noble/apps/kube-prometheus-stack/` (namespace `monitoring`, PSA privileged — node-exporter needs host mounts); Longhorn PVCs for Prometheus, Grafana, Alertmanager. Grafana Ingress: https://grafana.apps.noble.lab.pcenicni.dev (Traefik `ingressClassName: traefik`, `cert-manager.io/cluster-issuer: letsencrypt-prod`). `helm upgrade --install` with `--wait` is silent until done — use `--timeout 30m` (not 5m) and watch `kubectl -n monitoring get pods -w` in another terminal. Grafana admin password: Secret `kube-prometheus-grafana`, keys `admin-user` / `admin-password`.
- Still open: Loki + Fluent Bit + Grafana datasource (Phase D).

## Inventory

| Host | Role | IP |
| --- | --- | --- |
| helium | worker | 192.168.50.10 |
| neon | control-plane + worker | 192.168.50.20 |
| argon | control-plane + worker | 192.168.50.30 |
| krypton | control-plane + worker | 192.168.50.40 |
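In talhelper terms, the inventory corresponds to a `nodes:` block along these lines (a sketch only: field names follow talhelper's talconfig schema, the `installDisk` values are placeholders, and the checked-in `talos/talconfig.yaml` is authoritative):

```yaml
# Sketch: inventory expressed as talhelper nodes. installDisk is a placeholder;
# see talos/talconfig.yaml for the real values (VIP, SANs, patches, schematic).
nodes:
  - hostname: helium
    ipAddress: 192.168.50.10
    controlPlane: false
    installDisk: /dev/sda   # placeholder
  - hostname: neon
    ipAddress: 192.168.50.20
    controlPlane: true
    installDisk: /dev/sda   # placeholder
  - hostname: argon
    ipAddress: 192.168.50.30
    controlPlane: true
    installDisk: /dev/sda   # placeholder
  - hostname: krypton
    ipAddress: 192.168.50.40
    controlPlane: true
    installDisk: /dev/sda   # placeholder
```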

## Network reservations

| Use | Value |
| --- | --- |
| Kubernetes API VIP (kube-vip) | 192.168.50.230 (see `talos/README.md`; align with `talos/talconfig.yaml` `additionalApiServerCertSans`) |
| MetalLB L2 pool | 192.168.50.210–192.168.50.229 |
| Argo CD LoadBalancer | Pick one IP in the MetalLB pool (e.g. 192.168.50.210) |
| Traefik (apps ingress) | 192.168.50.211 — `metallb.io/loadBalancerIPs` in `clusters/noble/apps/traefik/values.yaml` |
| Apps ingress (LAN / split horizon) | `*.apps.noble.lab.pcenicni.dev` → Traefik LB |
| Grafana (Ingress + TLS) | grafana.apps.noble.lab.pcenicni.dev — `grafana.ingress` in `clusters/noble/apps/kube-prometheus-stack/values.yaml` (letsencrypt-prod) |
| Public DNS (Pangolin) | Newt tunnel + CNAME at registrar + Integration API — `clusters/noble/apps/newt/` |
| Velero | S3-compatible URL — configure later |
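For reference, the pool reservation above maps to MetalLB custom resources roughly like this (a sketch using the documented `noble-l2` name and range; the files under `clusters/noble/apps/metallb/` are authoritative):

```yaml
# Sketch of the pool manifests: one IPAddressPool plus an L2Advertisement
# that announces it. Resource names mirror the "noble-l2" pool documented above.
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: noble-l2
  namespace: metallb-system
spec:
  addresses:
    - 192.168.50.210-192.168.50.229
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: noble-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - noble-l2
```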

## Versions

- Talos: v1.12.6 — align talosctl client with node image
- Talos Image Factory (iscsi-tools + util-linux-tools): `factory.talos.dev/nocloud-installer/249d9135de54962744e917cfe654117000cba369f9152fbab9d055a00aa3664f:v1.12.6` — the same schematic must appear in `machine.install.image` after `talhelper genconfig` (bare metal may use `metal-installer/` instead of `nocloud-installer/`)
- Kubernetes: 1.35.2 on current nodes (bundled with Talos; not pinned in repo)
- Cilium: 1.16.6 (Helm chart; see `clusters/noble/apps/cilium/README.md`)
- MetalLB: 0.15.3 (Helm chart; app v0.15.3)
- metrics-server: 3.13.0 (Helm chart; app v0.8.0)
- Longhorn: 1.11.1 (Helm chart; app v1.11.1)
- Traefik: 39.0.6 (Helm chart; app v3.6.11)
- cert-manager: v1.20.0 (Helm chart; app v1.20.0)
- Newt (Fossorial): 1.2.0 (Helm chart; app 1.10.1)
- Argo CD: 9.4.17 (Helm chart `argo/argo-cd`; app v3.3.6)
- kube-prometheus-stack: 82.15.1 (Helm chart `prometheus-community/kube-prometheus-stack`; app v0.89.x bundle)

## Repo paths (this workspace)

| Artifact | Path |
| --- | --- |
| This checklist | `talos/CLUSTER-BUILD.md` |
| Talos quick start + networking + kubeconfig | `talos/README.md` |
| talhelper source (active) | `talos/talconfig.yaml` — may be wipe-phase (no Longhorn volume) during disk recovery |
| Longhorn volume restore | `talos/talconfig.with-longhorn.yaml` — copy to `talconfig.yaml` after GPT wipe (see `talos/README.md` §5) |
| Longhorn GPT wipe automation | `talos/scripts/longhorn-gpt-recovery.sh` |
| kube-vip (kustomize) | `clusters/noble/apps/kube-vip/` (`vip_interface` e.g. ens18) |
| Cilium (Helm values) | `clusters/noble/apps/cilium/values.yaml` (phase 1), optional `values-kpr.yaml`, `README.md` |
| MetalLB | `clusters/noble/apps/metallb/namespace.yaml` (PSA privileged), `ip-address-pool.yaml`, `kustomization.yaml`, `README.md` |
| Longhorn | `clusters/noble/apps/longhorn/values.yaml`, `namespace.yaml` (PSA privileged), `kustomization.yaml` |
| metrics-server (Helm values) | `clusters/noble/apps/metrics-server/values.yaml` |
| Traefik (Helm values) | `clusters/noble/apps/traefik/values.yaml`, `namespace.yaml`, `README.md` |
| cert-manager (Helm + ClusterIssuers) | `clusters/noble/apps/cert-manager/values.yaml`, `namespace.yaml`, `kustomization.yaml`, `README.md` |
| Newt / Pangolin tunnel (Helm) | `clusters/noble/apps/newt/values.yaml`, `namespace.yaml`, `README.md` |
| Argo CD (bootstrap + app-of-apps) | `clusters/noble/bootstrap/argocd/values.yaml`, `root-application.yaml`, `apps/`, `README.md` |
| kube-prometheus-stack (Helm values) | `clusters/noble/apps/kube-prometheus-stack/values.yaml`, `namespace.yaml` |

Git vs cluster: manifests and talconfig live in git; `talhelper genconfig -o out`, bootstrap, Helm, and kubectl run on your LAN. See `talos/README.md` for workstation reachability (lab LAN/VPN), `talosctl kubeconfig` vs Kubernetes `server:` (VIP vs node IP), and `--insecure` only in maintenance mode.

## Ordering (do not skip)

1. Talos installed; Cilium (or chosen CNI) before most workloads — with `cni: none`, nodes stay NotReady with the network-unavailable taint until the CNI is up.
2. MetalLB Helm chart (CRDs + controller) before `kubectl apply -k` on the pool manifests.
3. `clusters/noble/apps/metallb/namespace.yaml` applied before (or its labels merged onto) `metallb-system` so Pod Security does not block the speaker (see `apps/metallb/README.md`).
4. Longhorn: Talos user volume + extensions in `talconfig.with-longhorn.yaml` (when restored); Helm `defaultDataPath` in `clusters/noble/apps/longhorn/values.yaml`.
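Step 3 amounts to Pod Security Admission labels on `metallb-system` similar to the following (a sketch; compare with the checked-in `namespace.yaml`):

```yaml
# Sketch: PSA labels that let the MetalLB speaker (hostNetwork, NET_RAW) run.
apiVersion: v1
kind: Namespace
metadata:
  name: metallb-system
  labels:
    pod-security.kubernetes.io/enforce: privileged
    pod-security.kubernetes.io/audit: privileged
    pod-security.kubernetes.io/warn: privileged
```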

## Prerequisites (before phases)

- `talos/talconfig.yaml` checked in (VIP, API SANs, `cni: none`, iscsi-tools / util-linux-tools in the schematic) — run `talhelper validate talconfig talconfig.yaml` after edits
- Workstation on a routable path to node IPs or VIP (same LAN / VPN); see `talos/README.md` §3 if kubectl hits the wrong `server:` or the network is unreachable
- talosctl client matches node Talos version; talhelper for genconfig
- Node static IPs (helium, neon, argon, krypton)
- DHCP does not lease 192.168.50.210–229, .230, or the node IPs
- DNS for API and apps as in `talos/README.md`
- Git remote ready for Argo CD (argo-cd)
- `talos/kubeconfig` from `talosctl kubeconfig` — root repo kubeconfig is a stub until populated

## Phase A — Talos bootstrap + API VIP

- `talhelper gensecret` → `talhelper genconfig -o out` (re-run genconfig after every talconfig edit)
- apply-config all nodes (`talos/README.md` §2 — no `--insecure` after nodes join; use `TALOSCONFIG`)
- `talosctl bootstrap` once; the other control planes and the worker join
- `talosctl kubeconfig` → working kubectl (`talos/README.md` §3 — override `server:` if VIP not reachable from workstation)
- kube-vip manifests in `clusters/noble/apps/kube-vip/`
- kube-vip healthy; `vip_interface` matches uplink (`talosctl get links`); VIP reachable where needed
- `talosctl health` (e.g. `talosctl health -n 192.168.50.20` with `TALOSCONFIG` set)
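The `vip_subnet` gotcha from the checklist looks like this in the kube-vip DaemonSet env (a fragment only; `address` is an assumption about the env name used in this deployment, so verify against `clusters/noble/apps/kube-vip/`):

```yaml
# Fragment of the kube-vip container env. The value for vip_subnet must be
# "/32" with the leading slash; a bare "32" breaks kube-vip's parsing.
env:
  - name: address          # assumed env name for the VIP in this setup
    value: "192.168.50.230"
  - name: vip_interface
    value: "ens18"
  - name: vip_subnet
    value: "/32"           # leading slash required
```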

## Phase B — Core platform

Install order: Cilium → metrics-server → Longhorn (Talos disk + Helm) → MetalLB (Helm → pool manifests) → ingress / certs / DNS as planned.

- Cilium (Helm 1.16.6) — required before MetalLB if `cni: none` (`clusters/noble/apps/cilium/`)
- metrics-server — Helm 3.13.0; values in `clusters/noble/apps/metrics-server/values.yaml`; verify `kubectl top nodes`
- Longhorn — Talos: user volume + kubelet mounts + extensions (`talos/README.md` §5); Helm 1.11.1; `kubectl apply -k clusters/noble/apps/longhorn`; verify `nodes.longhorn.io` and test PVC Bound
- MetalLB — chart installed; pool + L2 from `clusters/noble/apps/metallb/` applied (192.168.50.210–229)
- Service LoadBalancer / pool check — MetalLB assigns from 210–229 (validated before Traefik; temporary nginx test removed in favor of Traefik)
- Traefik LoadBalancer for `*.apps.noble.lab.pcenicni.dev` — `clusters/noble/apps/traefik/`; 192.168.50.211
- cert-manager + ClusterIssuers (letsencrypt-staging / letsencrypt-prod) — `clusters/noble/apps/cert-manager/`
- Newt (Pangolin tunnel; replaces ExternalDNS for public DNS) — `clusters/noble/apps/newt/`; `newt-pangolin-auth` Secret; CNAME + Integration API per `newt/README.md`
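The staging issuer from the cert-manager item can be sketched like this (field layout per cert-manager's v1 API; the prod issuer differs only in name and ACME server URL, and the repo manifests are authoritative):

```yaml
# Sketch: HTTP-01 ClusterIssuer solving through the traefik ingress class.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: certificates@noble.lab.pcenicni.dev
    privateKeySecretRef:
      name: letsencrypt-staging   # Secret holding the ACME account key
    solvers:
      - http01:
          ingress:
            ingressClassName: traefik
```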

## Phase C — GitOps

- Argo CD bootstrap — `clusters/noble/bootstrap/argocd/` (`helm upgrade --install argocd …`)
- Argo CD server LoadBalancer → 192.168.50.210 (see `values.yaml`)
- App-of-apps — set `repoURL` in `root-application.yaml`, add Application manifests under `bootstrap/argocd/apps/`, apply `root-application.yaml`
- SSO — later
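A minimal sketch of the app-of-apps root Application, assuming the common layout: the `repoURL` is deliberately a placeholder to edit, and the sync policy shown is an assumption, not necessarily what `root-application.yaml` uses.

```yaml
# Sketch: root Application pointing Argo CD at the apps/ directory in git.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.invalid/your/repo.git   # placeholder: edit before applying
    targetRevision: main
    path: clusters/noble/bootstrap/argocd/apps
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:                 # assumption; adjust to taste
    automated:
      prune: true
      selfHeal: true
```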

## Phase D — Observability

- kube-prometheus-stack — `kubectl apply -f clusters/noble/apps/kube-prometheus-stack/namespace.yaml`, then `helm upgrade --install` as in `clusters/noble/apps/kube-prometheus-stack/values.yaml` (chart 82.15.1); PVCs on `longhorn`; `--wait --timeout 30m` recommended; verify `kubectl -n monitoring get pods,pvc`
- Loki + Fluent Bit; Grafana datasource
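The Grafana Ingress described in the current state corresponds to a values fragment along these lines (a sketch of the chart's `grafana.ingress` values; `grafana-tls` is a hypothetical secret name):

```yaml
# Sketch: kube-prometheus-stack values fragment for the Grafana Ingress,
# using the traefik ingress class and the letsencrypt-prod ClusterIssuer.
grafana:
  ingress:
    enabled: true
    ingressClassName: traefik
    annotations:
      cert-manager.io/cluster-issuer: letsencrypt-prod
    hosts:
      - grafana.apps.noble.lab.pcenicni.dev
    tls:
      - secretName: grafana-tls   # hypothetical secret name
        hosts:
          - grafana.apps.noble.lab.pcenicni.dev
```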

## Phase E — Secrets

- Sealed Secrets (optional Git workflow)
- Vault in-cluster on Longhorn + auto-unseal
- External Secrets Operator + Vault ClusterSecretStore

## Phase F — Policy + backups

- Kyverno baseline policies
- Velero when S3 is ready; backup/restore drill

## Phase G — Hardening

- RBAC, network policies (Cilium), Alertmanager routes
- Runbooks: API VIP, etcd, Longhorn, Vault

## Quick validation

- `kubectl get nodes` — all Ready
- API via VIP :6443 — `kubectl get --raw /healthz` → `ok` with kubeconfig `server: https://192.168.50.230:6443`
- Ingress LoadBalancer in pool 210–229 (Traefik → 192.168.50.211)
- Argo CD UI — argocd-server LoadBalancer 192.168.50.210 (initial admin password from `argocd-initial-admin-secret`)
- Sample Ingress + cert (cert-manager ready) + Pangolin resource + CNAME
- PVC Bound on Longhorn (`storageClassName: longhorn`); Prometheus/Loki durable when configured
- `monitoring` — kube-prometheus-stack core workloads Running (Prometheus, Grafana, Alertmanager, operator, kube-state-metrics); PVCs Bound on `longhorn`
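For the PVC check, a throwaway claim like this (hypothetical name; delete it after it reports Bound) exercises the default `longhorn` StorageClass:

```yaml
# Sketch: smoke-test PVC against the longhorn StorageClass. Apply, wait for
# Bound, then delete; the name is hypothetical.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: longhorn-smoke-test
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn
  resources:
    requests:
      storage: 1Gi
```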

Keep in sync with `talos/README.md` and manifests under `clusters/noble/`.