Noble lab — Talos cluster build checklist
This document is the exported TODO for the noble Talos cluster (4 nodes). Commands and troubleshooting live in README.md.
Current state (2026-03-28)
- Talos v1.12.6 (target) / Kubernetes as bundled — four nodes Ready unless upgrading; `talosctl health`; `talos/kubeconfig` for `kubectl` (root `kubeconfig` may still be a stub). Image Factory (nocloud installer): `factory.talos.dev/nocloud-installer/249d9135de54962744e917cfe654117000cba369f9152fbab9d055a00aa3664f:v1.12.6`
- Cilium Helm 1.16.6 / app 1.16.6 (`clusters/noble/apps/cilium/`, phase 1 values)
- MetalLB Helm 0.15.3 / app v0.15.3; IPAddressPool `noble-l2` + L2Advertisement — pool 192.168.50.210–192.168.50.229
- kube-vip DaemonSet 3/3 on control planes; VIP 192.168.50.230 on ens18 (`vip_subnet` `/32` required — bare `32` breaks parsing). Verified from workstation: `kubectl config set-cluster noble --server=https://192.168.50.230:6443`, then `kubectl get --raw /healthz` → `ok` (`talos/kubeconfig`; see `talos/README.md`)
- metrics-server Helm 3.13.0 / app v0.8.0 — `clusters/noble/apps/metrics-server/values.yaml` (`--kubelet-insecure-tls` for Talos); `kubectl top nodes` works
- Still open: Longhorn, Traefik, cert-manager, Argo CD, observability — checklist below
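The kube-vip note above (VIP 192.168.50.230 on ens18, `vip_subnet` must be `/32`) corresponds to environment variables on the kube-vip container. A minimal sketch — the exact env names and values should be checked against the actual manifests in `clusters/noble/apps/kube-vip/`:

```yaml
# Illustrative excerpt of the kube-vip container env (verify against
# clusters/noble/apps/kube-vip/ — names here are assumptions):
env:
  - name: address
    value: "192.168.50.230"
  - name: vip_interface
    value: "ens18"
  - name: vip_subnet
    value: "/32"        # leading slash required; a bare "32" breaks parsing
  - name: cp_enable
    value: "true"       # advertise the control-plane API VIP
```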
Inventory
| Host | Role | IP |
|---|---|---|
| helium | worker | 192.168.50.10 |
| neon | control-plane + worker | 192.168.50.20 |
| argon | control-plane + worker | 192.168.50.30 |
| krypton | control-plane + worker | 192.168.50.40 |
Network reservations
| Use | Value |
|---|---|
| Kubernetes API VIP (kube-vip) | 192.168.50.230 (see talos/README.md; align with talos/talconfig.yaml additionalApiServerCertSans) |
| MetalLB L2 pool | 192.168.50.210–192.168.50.229 |
| Argo CD LoadBalancer | Pick one IP in the MetalLB pool (e.g. 192.168.50.210) |
| Apps ingress DNS | *.apps.noble.lab.pcenicni.dev |
| ExternalDNS | Pangolin (map to supported ExternalDNS provider when documented) |
| Velero | S3-compatible URL — configure later |
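The MetalLB reservation above maps to a pool plus an L2 advertisement (the repo keeps these in `clusters/noble/apps/metallb/ip-address-pool.yaml`). A sketch, assuming the pool name `noble-l2` from the current-state notes:

```yaml
# Sketch of the MetalLB pool manifests (compare with
# clusters/noble/apps/metallb/ip-address-pool.yaml)
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: noble-l2
  namespace: metallb-system
spec:
  addresses:
    - 192.168.50.210-192.168.50.229
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: noble-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - noble-l2
```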
Versions
- Talos: v1.12.6 — align `talosctl` client with node image
- Talos Image Factory (iscsi-tools + util-linux-tools): `factory.talos.dev/nocloud-installer/249d9135de54962744e917cfe654117000cba369f9152fbab9d055a00aa3664f:v1.12.6` — same schematic must appear in `machine.install.image` after `talhelper genconfig` (bare metal may use `metal-installer/` instead of `nocloud-installer/`)
- Kubernetes: 1.35.2 on current nodes (bundled with Talos; not pinned in repo)
- Cilium: 1.16.6 (Helm chart; see `clusters/noble/apps/cilium/README.md`)
- MetalLB: 0.15.3 (Helm chart; app v0.15.3)
- metrics-server: 3.13.0 (Helm chart; app v0.8.0)
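The `--kubelet-insecure-tls` flag mentioned for metrics-server is usually passed through the chart's `args` value. A sketch of the relevant fragment of `clusters/noble/apps/metrics-server/values.yaml` (assuming the chart's standard `args` field — verify against the checked-in file):

```yaml
# Fragment of metrics-server Helm values (illustrative)
args:
  # Skip kubelet serving-cert verification; Talos kubelets commonly
  # serve self-signed certs unless kubelet cert rotation is enabled.
  - --kubelet-insecure-tls
```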
Repo paths (this workspace)
| Artifact | Path |
|---|---|
| This checklist | talos/CLUSTER-BUILD.md |
| Talos quick start + networking + kubeconfig | talos/README.md |
| talhelper source (active) | talos/talconfig.yaml — may be wipe-phase (no Longhorn volume) during disk recovery |
| Longhorn volume restore | talos/talconfig.with-longhorn.yaml — copy to talconfig.yaml after GPT wipe (see talos/README.md §5) |
| Longhorn GPT wipe automation | talos/scripts/longhorn-gpt-recovery.sh |
| kube-vip (kustomize) | clusters/noble/apps/kube-vip/ (vip_interface e.g. ens18) |
| Cilium (Helm values) | clusters/noble/apps/cilium/ — values.yaml (phase 1), optional values-kpr.yaml, README.md |
| MetalLB | clusters/noble/apps/metallb/ — namespace.yaml (PSA privileged), ip-address-pool.yaml, kustomization.yaml, README.md |
| Longhorn Helm values | clusters/noble/apps/longhorn/values.yaml |
| metrics-server (Helm values) | clusters/noble/apps/metrics-server/values.yaml |
Git vs cluster: manifests and talconfig live in git; `talhelper genconfig -o out`, bootstrap, Helm, and `kubectl` run on your LAN. See `talos/README.md` for workstation reachability (lab LAN/VPN), `talosctl kubeconfig` vs Kubernetes `server:` (VIP vs node IP), and `--insecure` only in maintenance mode.
Ordering (do not skip)
- Talos installed; Cilium (or chosen CNI) before most workloads — with `cni: none`, nodes stay NotReady with a network-unavailable taint until the CNI is up
- MetalLB Helm chart (CRDs + controller) before `kubectl apply -k` on the pool manifests
- `clusters/noble/apps/metallb/namespace.yaml` before, or merged onto, `metallb-system` so Pod Security does not block the speaker (see `apps/metallb/README.md`)
- Longhorn: Talos user volume + extensions in `talconfig.with-longhorn.yaml` (when restored); Helm `defaultDataPath` in `clusters/noble/apps/longhorn/values.yaml`
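The `namespace.yaml` item above typically carries the Pod Security Admission labels that let the MetalLB speaker run with host networking. A sketch, assuming the standard PSA label keys (compare with `clusters/noble/apps/metallb/namespace.yaml`):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: metallb-system
  labels:
    # The speaker needs host networking and NET_RAW, which the
    # "baseline"/"restricted" Pod Security levels would reject.
    pod-security.kubernetes.io/enforce: privileged
    pod-security.kubernetes.io/audit: privileged
    pod-security.kubernetes.io/warn: privileged
```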
Prerequisites (before phases)
- `talos/talconfig.yaml` checked in (VIP, API SANs, `cni: none`, `iscsi-tools`/`util-linux-tools` in schematic) — run `talhelper validate talconfig talconfig.yaml` after edits
- Workstation on a routable path to node IPs or VIP (same LAN / VPN); see `talos/README.md` §3 if `kubectl` hits the wrong `server:` or the network is unreachable
- `talosctl` client matches node Talos version; `talhelper` for `genconfig`
- Node static IPs (helium, neon, argon, krypton)
- DHCP does not lease 192.168.50.210–229, .230, or node IPs
- DNS for API and apps as in `talos/README.md`
- Git remote ready for Argo CD (argo-cd)
- `talos/kubeconfig` from `talosctl kubeconfig` — root repo `kubeconfig` is a stub until populated
Phase A — Talos bootstrap + API VIP
- `talhelper gensecret` → `talhelper genconfig -o out` (re-run `genconfig` after every `talconfig` edit)
- `apply-config` all nodes (`talos/README.md` §2 — no `--insecure` after nodes join; use `TALOSCONFIG`)
- `talosctl bootstrap` once; other control planes and worker join
- `talosctl kubeconfig` → working `kubectl` (`talos/README.md` §3 — override `server:` if VIP not reachable from workstation)
- kube-vip manifests in `clusters/noble/apps/kube-vip`
- kube-vip healthy; `vip_interface` matches uplink (`talosctl get links`); VIP reachable where needed
- `talosctl health` (e.g. `talosctl health -n 192.168.50.20` with `TALOSCONFIG` set)
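The talconfig settings this phase depends on (VIP endpoint, API SANs, `cni: none`) sit in `talos/talconfig.yaml`. An illustrative excerpt — field names here are assumptions about the talhelper schema, so verify against the checked-in file:

```yaml
# Illustrative talconfig.yaml excerpt (assumed talhelper field names —
# verify against talos/talconfig.yaml)
clusterName: noble
talosVersion: v1.12.6
endpoint: https://192.168.50.230:6443   # kube-vip API VIP
additionalApiServerCertSans:
  - 192.168.50.230
cniConfig:
  name: none        # Cilium is installed separately in Phase B
```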
Phase B — Core platform
Install order: Cilium → metrics-server → Longhorn (Talos disk + Helm) → MetalLB (Helm → pool manifests) → ingress / certs / DNS as planned.
- Cilium (Helm 1.16.6) — required before MetalLB if `cni: none` (`clusters/noble/apps/cilium/`)
- metrics-server — Helm 3.13.0; values in `clusters/noble/apps/metrics-server/values.yaml`; verify `kubectl top nodes`
- Longhorn — Talos: `talconfig.with-longhorn.yaml` + `talos/README.md` §5; Helm: `clusters/noble/apps/longhorn/values.yaml` (`defaultDataPath` `/var/mnt/longhorn`)
- MetalLB — chart installed; pool + L2 from `clusters/noble/apps/metallb/` applied (192.168.50.210–229)
- Service `LoadBalancer` test — assign an IP from 210–229 (e.g. dummy `LoadBalancer` or Traefik)
- Traefik `LoadBalancer` for `*.apps.noble.lab.pcenicni.dev`
- cert-manager + ClusterIssuer (staging → prod)
- ExternalDNS (Pangolin-compatible provider)
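The Service `LoadBalancer` test in the list above can be a one-off manifest; the name and selector below are illustrative:

```yaml
# Throwaway smoke test: MetalLB should assign an EXTERNAL-IP from
# 192.168.50.210-229 even if the Service has no matching endpoints.
apiVersion: v1
kind: Service
metadata:
  name: lb-smoke-test          # illustrative name
  namespace: default
spec:
  type: LoadBalancer
  selector:
    app: lb-smoke-test         # point at any running pod if you want traffic
  ports:
    - port: 80
      targetPort: 80
```

Check with `kubectl get svc lb-smoke-test` and delete the Service afterwards.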
Phase C — GitOps
- Argo CD bootstrap (`clusters/noble/bootstrap/argocd`, root app) — path TBD when added
- Argo CD server `LoadBalancer` with dedicated pool IP
- SSO — later
Phase D — Observability
- kube-prometheus-stack (PVCs on Longhorn)
- Loki + Fluent Bit; Grafana datasource
Phase E — Secrets
- Sealed Secrets (optional Git workflow)
- Vault in-cluster on Longhorn + auto-unseal
- External Secrets Operator + Vault `ClusterSecretStore`
Phase F — Policy + backups
- Kyverno baseline policies
- Velero when S3 is ready; backup/restore drill
Phase G — Hardening
- RBAC, network policies (Cilium), Alertmanager routes
- Runbooks: API VIP, etcd, Longhorn, Vault
Quick validation
- `kubectl get nodes` — all Ready
- API via VIP :6443 — `kubectl get --raw /healthz` → `ok` with kubeconfig `server:` `https://192.168.50.230:6443`
- Test `LoadBalancer` receives an IP from 210–229
- Sample Ingress + cert + ExternalDNS record
- PVC bound; Prometheus/Loki durable if configured
Keep in sync with talos/README.md and manifests under clusters/noble/.