Noble lab — Talos cluster build checklist
This document is the exported TODO for the noble Talos cluster (4 nodes). Commands and troubleshooting live in README.md.
Current state (2026-03-28)
- Talos v1.12.6 (target) / Kubernetes as bundled — four nodes Ready unless upgrading; `talosctl health`; `talos/kubeconfig` for `kubectl` (root `kubeconfig` may still be a stub). Image Factory (nocloud installer): `factory.talos.dev/nocloud-installer/249d9135de54962744e917cfe654117000cba369f9152fbab9d055a00aa3664f:v1.12.6`
- Cilium Helm 1.16.6 / app 1.16.6 (`clusters/noble/apps/cilium/`, phase 1 values)
- MetalLB Helm 0.15.3 / app v0.15.3; IPAddressPool `noble-l2` + L2Advertisement — pool `192.168.50.210–192.168.50.229`
- kube-vip DaemonSet 3/3 on control planes; VIP `192.168.50.230` on `ens18` (`vip_subnet` `/32` required — a bare `32` breaks parsing). Verified from workstation: `kubectl config set-cluster noble --server=https://192.168.50.230:6443`, then `kubectl get --raw /healthz` → `ok` (`talos/kubeconfig`; see `talos/README.md`)
- metrics-server Helm 3.13.0 / app v0.8.0 — `clusters/noble/apps/metrics-server/values.yaml` (`--kubelet-insecure-tls` for Talos); `kubectl top nodes` works
- Longhorn Helm 1.11.1 / app v1.11.1 — `clusters/noble/apps/longhorn/` (PSA privileged namespace, `defaultDataPath` `/var/mnt/longhorn`, `preUpgradeChecker` enabled); StorageClass `longhorn` (default); `nodes.longhorn.io` all Ready; test PVC `Bound` on `longhorn`
- Traefik Helm 39.0.6 / app v3.6.11 — `clusters/noble/apps/traefik/`; Service `LoadBalancer` EXTERNAL-IP `192.168.50.211`; IngressClass `traefik` (default). Point `*.apps.noble.lab.pcenicni.dev` at `192.168.50.211`. The MetalLB pool was verified before the temporary nginx test was replaced with Traefik
- cert-manager Helm v1.20.0 / app v1.20.0 — `clusters/noble/apps/cert-manager/`; ClusterIssuers `letsencrypt-staging` and `letsencrypt-prod` (HTTP-01, ingress class `traefik`); ACME email `certificates@noble.lab.pcenicni.dev` (edit in the manifests if you want a different mailbox)
- Newt Helm 1.2.0 / app 1.10.1 — `clusters/noble/apps/newt/` (fossorial/newt); Pangolin site tunnel — `newt-pangolin-auth` Secret (`PANGOLIN_ENDPOINT`, `NEWT_ID`, `NEWT_SECRET`). Public DNS is not automated with ExternalDNS: create CNAME records at your DNS host per Pangolin's domain instructions, plus Integration API calls for HTTP resources/targets — see `clusters/noble/apps/newt/README.md`. LAN access to Traefik can still use `*.apps.noble.lab.pcenicni.dev` → `192.168.50.211` (split horizon / local resolver)
- Argo CD Helm 9.4.17 / app v3.3.6 — `clusters/noble/bootstrap/argocd/`; `argocd-server` LoadBalancer `192.168.50.210`; app-of-apps scaffold under `bootstrap/argocd/apps/` (edit `root-application.yaml` `repoURL` before applying)
- kube-prometheus-stack — Helm chart 82.15.1 — `clusters/noble/apps/kube-prometheus-stack/` (namespace `monitoring`, PSA privileged — node-exporter needs host mounts); Longhorn PVCs for Prometheus, Grafana, Alertmanager. Grafana Ingress: `https://grafana.apps.noble.lab.pcenicni.dev` (Traefik `ingressClassName: traefik`, `cert-manager.io/cluster-issuer: letsencrypt-prod`). `helm upgrade --install` with `--wait` is silent until done — use `--timeout 30m` (not `5m`) and watch `kubectl -n monitoring get pods -w` in another terminal. Grafana admin password: Secret `kube-prometheus-grafana`, keys `admin-user` / `admin-password` (see the sketch after this list)
- Still open: Loki + Fluent Bit + Grafana datasource (Phase D)
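To read the Grafana credentials without opening the manifests, pull them straight from the Secret. A minimal sketch, using the Secret name, key names, and `monitoring` namespace listed above:

```bash
# Read the Grafana admin credentials from the kube-prometheus-grafana Secret.
kubectl -n monitoring get secret kube-prometheus-grafana \
  -o jsonpath='{.data.admin-user}' | base64 -d; echo
kubectl -n monitoring get secret kube-prometheus-grafana \
  -o jsonpath='{.data.admin-password}' | base64 -d; echo
```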
Inventory
| Host | Role | IP |
|---|---|---|
| helium | worker | 192.168.50.10 |
| neon | control-plane + worker | 192.168.50.20 |
| argon | control-plane + worker | 192.168.50.30 |
| krypton | control-plane + worker | 192.168.50.40 |
Network reservations
| Use | Value |
|---|---|
| Kubernetes API VIP (kube-vip) | 192.168.50.230 (see talos/README.md; align with talos/talconfig.yaml additionalApiServerCertSans) |
| MetalLB L2 pool | 192.168.50.210–192.168.50.229 |
| Argo CD LoadBalancer | Pick one IP in the MetalLB pool (e.g. 192.168.50.210) |
| Traefik (apps ingress) | 192.168.50.211 — metallb.io/loadBalancerIPs in clusters/noble/apps/traefik/values.yaml |
| Apps ingress (LAN / split horizon) | *.apps.noble.lab.pcenicni.dev → Traefik LB |
| Grafana (Ingress + TLS) | grafana.apps.noble.lab.pcenicni.dev — grafana.ingress in clusters/noble/apps/kube-prometheus-stack/values.yaml (letsencrypt-prod) |
| Public DNS (Pangolin) | Newt tunnel + CNAME at registrar + Integration API — clusters/noble/apps/newt/ |
| Velero | S3-compatible URL — configure later |
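For reference, the pool and advertisement that `clusters/noble/apps/metallb/` expresses look roughly like the sketch below. This is a hedged reconstruction from the values above (pool name `noble-l2`, range `192.168.50.210–229`, namespace `metallb-system`); the L2Advertisement name is an assumption and the repo manifests remain the source of truth.

```bash
# Sketch only; the canonical manifests live in clusters/noble/apps/metallb/.
cat <<'EOF' | kubectl apply -f -
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: noble-l2
  namespace: metallb-system
spec:
  addresses:
    - 192.168.50.210-192.168.50.229
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: noble-l2          # name assumed, not copied from the repo
  namespace: metallb-system
spec:
  ipAddressPools:
    - noble-l2
EOF
```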
Versions
- Talos: v1.12.6 — align the `talosctl` client with the node image
- Talos Image Factory (iscsi-tools + util-linux-tools): `factory.talos.dev/nocloud-installer/249d9135de54962744e917cfe654117000cba369f9152fbab9d055a00aa3664f:v1.12.6` — the same schematic must appear in `machine.install.image` after `talhelper genconfig` (bare metal may use `metal-installer/` instead of `nocloud-installer/`)
- Kubernetes: 1.35.2 on current nodes (bundled with Talos; not pinned in the repo)
- Cilium: 1.16.6 (Helm chart; see `clusters/noble/apps/cilium/README.md`)
- MetalLB: 0.15.3 (Helm chart; app v0.15.3)
- metrics-server: 3.13.0 (Helm chart; app v0.8.0)
- Longhorn: 1.11.1 (Helm chart; app v1.11.1)
- Traefik: 39.0.6 (Helm chart; app v3.6.11)
- cert-manager: v1.20.0 (Helm chart; app v1.20.0)
- Newt (Fossorial): 1.2.0 (Helm chart; app 1.10.1)
- Argo CD: 9.4.17 (Helm chart `argo/argo-cd`; app v3.3.6)
- kube-prometheus-stack: 82.15.1 (Helm chart `prometheus-community/kube-prometheus-stack`; app v0.89.x bundle)
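A few read-only checks to confirm what is actually running against this table (standard commands; release names are whatever was used at install time):

```bash
talosctl version -n 192.168.50.20   # client vs node Talos version
kubectl version                     # client + bundled Kubernetes server version
helm list -A                        # deployed chart versions per release
```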
Repo paths (this workspace)
| Artifact | Path |
|---|---|
| This checklist | talos/CLUSTER-BUILD.md |
| Talos quick start + networking + kubeconfig | talos/README.md |
| talhelper source (active) | talos/talconfig.yaml — may be wipe-phase (no Longhorn volume) during disk recovery |
| Longhorn volume restore | talos/talconfig.with-longhorn.yaml — copy to talconfig.yaml after GPT wipe (see talos/README.md §5) |
| Longhorn GPT wipe automation | talos/scripts/longhorn-gpt-recovery.sh |
| kube-vip (kustomize) | clusters/noble/apps/kube-vip/ (vip_interface e.g. ens18) |
| Cilium (Helm values) | clusters/noble/apps/cilium/ — values.yaml (phase 1), optional values-kpr.yaml, README.md |
| MetalLB | clusters/noble/apps/metallb/ — namespace.yaml (PSA privileged), ip-address-pool.yaml, kustomization.yaml, README.md |
| Longhorn | clusters/noble/apps/longhorn/ — values.yaml, namespace.yaml (PSA privileged), kustomization.yaml |
| metrics-server (Helm values) | clusters/noble/apps/metrics-server/values.yaml |
| Traefik (Helm values) | clusters/noble/apps/traefik/ — values.yaml, namespace.yaml, README.md |
| cert-manager (Helm + ClusterIssuers) | clusters/noble/apps/cert-manager/ — values.yaml, namespace.yaml, kustomization.yaml, README.md |
| Newt / Pangolin tunnel (Helm) | clusters/noble/apps/newt/ — values.yaml, namespace.yaml, README.md |
| Argo CD (bootstrap + app-of-apps) | clusters/noble/bootstrap/argocd/ — values.yaml, root-application.yaml, apps/, README.md |
| kube-prometheus-stack (Helm values) | clusters/noble/apps/kube-prometheus-stack/ — values.yaml, namespace.yaml |
Git vs cluster: manifests and talconfig live in git; `talhelper genconfig -o out`, bootstrap, Helm, and `kubectl` run on your LAN. See `talos/README.md` for workstation reachability (lab LAN/VPN), `talosctl kubeconfig` vs the kubeconfig `server:` entry (VIP vs node IP), and `--insecure` only in maintenance mode. A workstation environment sketch follows.
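A minimal sketch of the workstation environment this implies. The `out/` directory comes from `talhelper genconfig -o out`; the exact talosconfig filename inside it is an assumption, so adjust to what genconfig actually writes:

```bash
# Point talosctl at the generated Talos client config and kubectl at talos/kubeconfig.
export TALOSCONFIG="$PWD/out/talosconfig"   # assumption: genconfig output filename
export KUBECONFIG="$PWD/talos/kubeconfig"   # populated by `talosctl kubeconfig` (Phase A)
talosctl config info   # sanity check: endpoints/nodes resolve over the lab LAN / VPN
```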
Ordering (do not skip)
- Talos installed; Cilium (or chosen CNI) before most workloads — with `cni: none`, nodes stay NotReady with a network-unavailable taint until the CNI is up.
- MetalLB Helm chart (CRDs + controller) before `kubectl apply -k` on the pool manifests.
- `clusters/noble/apps/metallb/namespace.yaml` before the chart, or merged onto `metallb-system`, so Pod Security does not block the speaker (see `apps/metallb/README.md` and the namespace sketch after this list).
- Longhorn: Talos user volume + extensions in `talconfig.with-longhorn.yaml` (when restored); Helm `defaultDataPath` in `clusters/noble/apps/longhorn/values.yaml`.
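The Pod Security point above boils down to labels on the `metallb-system` namespace. A minimal sketch of what the namespace needs to carry, using the standard Pod Security Admission label keys; this is not a copy of the repo file:

```bash
# Sketch of the PSA labels metallb-system needs; the repo version is
# clusters/noble/apps/metallb/namespace.yaml (apply that one, not this).
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Namespace
metadata:
  name: metallb-system
  labels:
    pod-security.kubernetes.io/enforce: privileged
    pod-security.kubernetes.io/audit: privileged
    pod-security.kubernetes.io/warn: privileged
EOF
```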
Prerequisites (before phases)
- `talos/talconfig.yaml` checked in (VIP, API SANs, `cni: none`, `iscsi-tools`/`util-linux-tools` in the schematic) — run `talhelper validate talconfig talconfig.yaml` after edits (pre-flight sketch after this list)
- Workstation on a routable path to node IPs or VIP (same LAN / VPN); `talos/README.md` §3 if `kubectl` hits the wrong `server:` or the network is unreachable
- `talosctl` client matches the node Talos version; `talhelper` for `genconfig`
- Node static IPs (helium, neon, argon, krypton)
- DHCP does not lease `192.168.50.210–229`, `.230`, or node IPs
- DNS for API and apps as in `talos/README.md`
- Git remote ready for Argo CD (argo-cd)
- `talos/kubeconfig` from `talosctl kubeconfig` — the root repo `kubeconfig` is a stub until populated
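A quick pre-flight pass over these prerequisites from the workstation (read-only; nothing here changes cluster state):

```bash
# Run from the talos/ directory.
talhelper validate talconfig talconfig.yaml   # re-run after every talconfig edit
talosctl version --client                     # should match the node image (v1.12.6)
for ip in 192.168.50.10 192.168.50.20 192.168.50.30 192.168.50.40; do
  ping -c1 -W1 "$ip" >/dev/null && echo "$ip reachable" || echo "$ip UNREACHABLE"
done
```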
Phase A — Talos bootstrap + API VIP
- `talhelper gensecret` → `talhelper genconfig -o out` (re-run `genconfig` after every `talconfig` edit) — full command sequence sketched after this list
- `apply-config` all nodes (`talos/README.md` §2 — no `--insecure` after nodes join; use `TALOSCONFIG`)
- `talosctl bootstrap` once; other control planes and the worker join
- `talosctl kubeconfig` → working `kubectl` (`talos/README.md` §3 — override `server:` if the VIP is not reachable from the workstation)
- kube-vip manifests in `clusters/noble/apps/kube-vip`
- kube-vip healthy; `vip_interface` matches the uplink (`talosctl get links`); VIP reachable where needed
- `talosctl health` (e.g. `talosctl health -n 192.168.50.20` with `TALOSCONFIG` set)
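The Phase A bullets collapse into roughly the sequence below, run from `talos/` with node IPs from the inventory table. Generated filenames under `out/` are assumptions; use whatever `talhelper genconfig -o out` actually writes (see `talos/README.md` §2–3 for the authoritative steps):

```bash
talhelper gensecret > talsecret.sops.yaml        # once; keep it encrypted with sops
talhelper genconfig -o out                       # re-run after every talconfig.yaml edit
export TALOSCONFIG="$PWD/out/talosconfig"
talosctl apply-config --insecure -n 192.168.50.20 -f out/noble-neon.yaml   # filename assumed; maintenance mode only
# ...repeat for argon (.30), krypton (.40), helium (.10); drop --insecure on later re-applies...
talosctl bootstrap -n 192.168.50.20              # exactly once, against one control plane
talosctl kubeconfig kubeconfig -n 192.168.50.20  # writes talos/kubeconfig when run from talos/
kubectl get nodes                                # NotReady until the CNI lands in Phase B
```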
Phase B — Core platform
Install order: Cilium → metrics-server → Longhorn (Talos disk + Helm) → MetalLB (Helm → pool manifests) → ingress / certs / DNS as planned.
- Cilium (Helm 1.16.6) — required before MetalLB if `cni: none` (`clusters/noble/apps/cilium/`)
- metrics-server — Helm 3.13.0; values in `clusters/noble/apps/metrics-server/values.yaml`; verify `kubectl top nodes`
- Longhorn — Talos: user volume + kubelet mounts + extensions (`talos/README.md` §5); Helm 1.11.1; `kubectl apply -k clusters/noble/apps/longhorn`; verify `nodes.longhorn.io` and a test PVC `Bound`
- MetalLB — chart installed; pool + L2 from `clusters/noble/apps/metallb/` applied (`192.168.50.210–229`)
- Service `LoadBalancer` / pool check — MetalLB assigns from `210–229` (validated before Traefik; the temporary nginx test was removed in favor of Traefik)
- Traefik `LoadBalancer` for `*.apps.noble.lab.pcenicni.dev` — `clusters/noble/apps/traefik/`; `192.168.50.211`
- cert-manager + ClusterIssuers (`letsencrypt-staging` / `letsencrypt-prod`) — `clusters/noble/apps/cert-manager/`
- Newt (Pangolin tunnel; replaces ExternalDNS for public DNS) — `clusters/noble/apps/newt/` — `newt-pangolin-auth`; CNAME + Integration API per `newt/README.md`. The Helm sequence for this phase is sketched below.
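A hedged sketch of the Phase B Helm sequence. Chart repositories and chart names are the upstream defaults; release names, namespaces, and any flags beyond the values files are assumptions, and the values files under `clusters/noble/apps/` remain the source of truth:

```bash
helm repo add cilium https://helm.cilium.io
helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
helm repo add longhorn https://charts.longhorn.io
helm repo add metallb https://metallb.github.io/metallb
helm repo add traefik https://traefik.github.io/charts
helm repo add jetstack https://charts.jetstack.io
helm repo update

helm upgrade --install cilium cilium/cilium -n kube-system --version 1.16.6 \
  -f clusters/noble/apps/cilium/values.yaml
helm upgrade --install metrics-server metrics-server/metrics-server -n kube-system --version 3.13.0 \
  -f clusters/noble/apps/metrics-server/values.yaml
kubectl apply -k clusters/noble/apps/longhorn            # PSA-labelled namespace
helm upgrade --install longhorn longhorn/longhorn -n longhorn-system --version 1.11.1 \
  -f clusters/noble/apps/longhorn/values.yaml
kubectl apply -f clusters/noble/apps/metallb/namespace.yaml   # PSA namespace before the chart
helm upgrade --install metallb metallb/metallb -n metallb-system --version 0.15.3
kubectl apply -k clusters/noble/apps/metallb             # pool + L2Advertisement after the chart
helm upgrade --install traefik traefik/traefik -n traefik --create-namespace --version 39.0.6 \
  -f clusters/noble/apps/traefik/values.yaml
helm upgrade --install cert-manager jetstack/cert-manager -n cert-manager --create-namespace \
  --version v1.20.0 -f clusters/noble/apps/cert-manager/values.yaml   # CRDs via values or --set crds.enabled=true
kubectl apply -k clusters/noble/apps/cert-manager        # ClusterIssuers
```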
Phase C — GitOps
- Argo CD bootstrap — `clusters/noble/bootstrap/argocd/` (`helm upgrade --install argocd …`; sketched below)
- Argo CD server LoadBalancer — `192.168.50.210` (see `values.yaml`)
- App-of-apps — set `repoURL` in `root-application.yaml`, add `Application` manifests under `bootstrap/argocd/apps/`, apply `root-application.yaml`
- SSO — later
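A hedged sketch of the Argo CD bootstrap, assuming the upstream `argo` Helm repository and the `argocd` release/namespace name implied by the bullets above:

```bash
helm repo add argo https://argoproj.github.io/argo-helm && helm repo update
helm upgrade --install argocd argo/argo-cd -n argocd --create-namespace --version 9.4.17 \
  -f clusters/noble/bootstrap/argocd/values.yaml

# Initial admin password (rotate after first login):
kubectl -n argocd get secret argocd-initial-admin-secret \
  -o jsonpath='{.data.password}' | base64 -d; echo

# App-of-apps: set repoURL in root-application.yaml first, then apply it.
kubectl apply -f clusters/noble/bootstrap/argocd/root-application.yaml
```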
Phase D — Observability
- kube-prometheus-stack — `kubectl apply -f clusters/noble/apps/kube-prometheus-stack/namespace.yaml`, then `helm upgrade --install` as in `clusters/noble/apps/kube-prometheus-stack/values.yaml` (chart 82.15.1); PVCs on `longhorn`; `--wait --timeout 30m` recommended; verify `kubectl -n monitoring get pods,pvc` (install sketched below)
- Loki + Fluent Bit; Grafana datasource
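A hedged sketch of the kube-prometheus-stack install as described above. The release name is an assumption inferred from the `kube-prometheus-grafana` Secret name in the current-state notes; everything else follows the bullet:

```bash
kubectl apply -f clusters/noble/apps/kube-prometheus-stack/namespace.yaml
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts && helm repo update
helm upgrade --install kube-prometheus prometheus-community/kube-prometheus-stack \
  -n monitoring --version 82.15.1 \
  -f clusters/noble/apps/kube-prometheus-stack/values.yaml \
  --wait --timeout 30m          # silent while waiting; watch pods in a second terminal
kubectl -n monitoring get pods,pvc
```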
Phase E — Secrets
- Sealed Secrets (optional Git workflow)
- Vault in-cluster on Longhorn + auto-unseal
- External Secrets Operator + Vault `ClusterSecretStore`
Phase F — Policy + backups
- Kyverno baseline policies
- Velero when S3 is ready; backup/restore drill
Phase G — Hardening
- RBAC, network policies (Cilium), Alertmanager routes
- Runbooks: API VIP, etcd, Longhorn, Vault
Quick validation
- `kubectl get nodes` — all Ready
- API via VIP `:6443` — `kubectl get --raw /healthz` → `ok` with kubeconfig `server:` set to `https://192.168.50.230:6443`
- Ingress `LoadBalancer` in pool `210–229` (Traefik → `192.168.50.211`)
- Argo CD UI — `argocd-server` LoadBalancer `192.168.50.210` (initial `admin` password from `argocd-initial-admin-secret`)
- Sample Ingress + cert (cert-manager ready) + Pangolin resource + CNAME
- PVC `Bound` on Longhorn (`storageClassName: longhorn`); Prometheus/Loki durable when configured
- `monitoring` — kube-prometheus-stack core workloads Running (Prometheus, Grafana, Alertmanager, operator, kube-state-metrics); PVCs Bound on `longhorn` (single read-only command pass below)
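The list above as a single read-only pass (release and namespace names as used elsewhere in this checklist; adjust if yours differ):

```bash
kubectl get nodes -o wide
kubectl get --raw /healthz                         # expect "ok"; use the VIP kubeconfig server
kubectl get svc -A | grep LoadBalancer             # Traefik 192.168.50.211, Argo CD 192.168.50.210
kubectl get clusterissuers                         # letsencrypt-staging / letsencrypt-prod Ready
kubectl get pvc -A --no-headers | grep -v Bound    # anything listed here is not Bound yet
kubectl -n monitoring get pods
```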
Keep in sync with talos/README.md and manifests under clusters/noble/.