# Noble lab — Talos cluster build checklist
This document is the **exported TODO** for the **noble** Talos cluster (4 nodes). Commands and troubleshooting live in [`README.md`](./README.md).
## Current state (2026-03-28)
Lab stack is **up** on-cluster for bootstrap through **Phase D** (observability) and **Phase E** (Sealed Secrets, External Secrets, **Vault** Helm install), with manifests matching this repo. **Next focus:** **Vault** `operator init` / unseal, optional **`unseal-cronjob.yaml`**, Kubernetes auth + **`ClusterSecretStore`**, optional Pangolin/sample Ingress validation, Velero when S3 exists.
- **Talos** v1.12.6 (target) / **Kubernetes** as bundled — four nodes **Ready** unless upgrading; **`talosctl health`**; **`talos/kubeconfig`** is **local only** (gitignored — never commit; regenerate with `talosctl kubeconfig` per `talos/README.md`). **Image Factory (nocloud installer):** `factory.talos.dev/nocloud-installer/249d9135de54962744e917cfe654117000cba369f9152fbab9d055a00aa3664f:v1.12.6`
- **Cilium** Helm **1.16.6** / app **1.16.6** (`clusters/noble/apps/cilium/`, phase 1 values).
- **MetalLB** Helm **0.15.3** / app **v0.15.3**; **IPAddressPool** `noble-l2` + **L2Advertisement** — pool **`192.168.50.210`–`192.168.50.229`**.
- **kube-vip** DaemonSet **3/3** on control planes; VIP **`192.168.50.230`** on **`ens18`** (`vip_subnet` **`/32`** required — bare **`32`** breaks parsing). **Verified from workstation:** `kubectl config set-cluster noble --server=https://192.168.50.230:6443` then **`kubectl get --raw /healthz`** → **`ok`** (`talos/kubeconfig`; see `talos/README.md`).
- **metrics-server** Helm **3.13.0** / app **v0.8.0** — `clusters/noble/apps/metrics-server/values.yaml` (`--kubelet-insecure-tls` for Talos); **`kubectl top nodes`** works.
- **Longhorn** Helm **1.11.1** / app **v1.11.1** — `clusters/noble/apps/longhorn/` (PSA **privileged** namespace, `defaultDataPath` `/var/mnt/longhorn`, `preUpgradeChecker` enabled); **StorageClass** `longhorn` (default); **`nodes.longhorn.io`** all **Ready**; test **PVC** `Bound` on `longhorn`.
- **Traefik** Helm **39.0.6** / app **v3.6.11** — `clusters/noble/apps/traefik/`; **`Service`** **`LoadBalancer`** **`EXTERNAL-IP` `192.168.50.211`**; **`IngressClass`** **`traefik`** (default). Point **`*.apps.noble.lab.pcenicni.dev`** at **`192.168.50.211`**. MetalLB pool verification was done before replacing the temporary nginx test with Traefik.
- **cert-manager** Helm **v1.20.0** / app **v1.20.0** — `clusters/noble/apps/cert-manager/`; **`ClusterIssuer`** **`letsencrypt-staging`** and **`letsencrypt-prod`** (HTTP-01, ingress class **`traefik`**); ACME email **`certificates@noble.lab.pcenicni.dev`** (edit in manifests if you want a different mailbox).
- **Newt** Helm **1.2.0** / app **1.10.1** — `clusters/noble/apps/newt/` (**fossorial/newt**); Pangolin site tunnel — **`newt-pangolin-auth`** Secret (**`PANGOLIN_ENDPOINT`**, **`NEWT_ID`**, **`NEWT_SECRET`**). Prefer a **SealedSecret** in git (`kubeseal` — see `clusters/noble/apps/sealed-secrets/examples/`) after rotating credentials if they were exposed. **Public DNS** is **not** automated with ExternalDNS: **CNAME** records at your DNS host per Pangolin's domain instructions, plus **Integration API** for HTTP resources/targets — see **`clusters/noble/apps/newt/README.md`**. LAN access to Traefik can still use **`*.apps.noble.lab.pcenicni.dev`** → **`192.168.50.211`** (split horizon / local resolver).
- **Argo CD** Helm **9.4.17** / app **v3.3.6** — `clusters/noble/bootstrap/argocd/`; **`argocd-server`** **`LoadBalancer`** **`192.168.50.210`**; app-of-apps scaffold under **`bootstrap/argocd/apps/`** (edit **`root-application.yaml`** `repoURL` before applying).
- **kube-prometheus-stack** — Helm chart **82.15.1** — `clusters/noble/apps/kube-prometheus-stack/` (**namespace** `monitoring`, PSA **privileged** — **node-exporter** needs host mounts); **Longhorn** PVCs for Prometheus, Grafana, Alertmanager; **node-exporter** DaemonSet **4/4**. **Grafana Ingress:** **`https://grafana.apps.noble.lab.pcenicni.dev`** (Traefik **`ingressClassName: traefik`**, **`cert-manager.io/cluster-issuer: letsencrypt-prod`**). **Loki** datasource in Grafana: ConfigMap **`clusters/noble/apps/grafana-loki-datasource/loki-datasource.yaml`** (sidecar label **`grafana_datasource: "1"`**) — not via **`grafana.additionalDataSources`** in the chart. **`helm upgrade --install` with `--wait` is silent until done** — use **`--timeout 30m`**; Grafana admin: Secret **`kube-prometheus-grafana`**, keys **`admin-user`** / **`admin-password`**.
- **Loki** + **Fluent Bit** — **`grafana/loki` 6.55.0** SingleBinary + **filesystem** on **Longhorn** (`clusters/noble/apps/loki/`); **`loki.auth_enabled: false`**; **`chunksCache.enabled: false`** (no memcached chunk cache). **`fluent/fluent-bit` 0.56.0** → **`loki-gateway.loki.svc:80`** (`clusters/noble/apps/fluent-bit/`); **`logging`** PSA **privileged**. **Grafana Explore:** **`kubectl apply -f clusters/noble/apps/grafana-loki-datasource/loki-datasource.yaml`** then **Explore → Loki** (e.g. `{job="fluent-bit"}`).
- **Sealed Secrets** Helm **2.18.4** / app **0.36.1** — `clusters/noble/apps/sealed-secrets/` (namespace **`sealed-secrets`**); **`kubeseal`** on client should match controller minor (**README**); back up **`sealed-secrets-key`** (see README).
- **External Secrets Operator** Helm **2.2.0** / app **v2.2.0**`clusters/noble/apps/external-secrets/`; Vault **`ClusterSecretStore`** in **`examples/vault-cluster-secret-store.yaml`** (**`http://`** to match Vault listener — apply after Vault **Kubernetes auth**).
- **Vault** Helm **0.32.0** / app **1.21.2** — `clusters/noble/apps/vault/` — standalone **file** storage, **Longhorn** PVC; **HTTP** listener (`global.tlsDisable`); optional **CronJob** lab unseal **`unseal-cronjob.yaml`**; **not** initialized in git — run **`vault operator init`** per **`README.md`**.
- **Still open:** Vault **Kubernetes auth** + **`ClusterSecretStore`** apply + KV for ESO; **Phase F–G**; optional **sample Ingress + cert + Pangolin** end-to-end; **Velero** when S3 is ready; **Argo CD SSO**.
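
As a concrete reference, the `noble-l2` pool and L2 advertisement described above would look roughly like this — a sketch only; the checked-in manifests under `clusters/noble/apps/metallb/` are authoritative:

```yaml
# Sketch of the noble-l2 pool; see clusters/noble/apps/metallb/ip-address-pool.yaml.
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: noble-l2
  namespace: metallb-system
spec:
  addresses:
    - 192.168.50.210-192.168.50.229
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: noble-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - noble-l2
```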
## Inventory
| Host | Role | IP |
|------|------|-----|
| helium | worker | `192.168.50.10` |
| neon | control-plane + worker | `192.168.50.20` |
| argon | control-plane + worker | `192.168.50.30` |
| krypton | control-plane + worker | `192.168.50.40` |
## Network reservations
| Use | Value |
|-----|--------|
| Kubernetes API VIP (kube-vip) | `192.168.50.230` (see `talos/README.md`; align with `talos/talconfig.yaml` `additionalApiServerCertSans`) |
| MetalLB L2 pool | `192.168.50.210`–`192.168.50.229` |
| Argo CD `LoadBalancer` | **Pick one IP** in the MetalLB pool (e.g. `192.168.50.210`) |
| Traefik (apps ingress) | `192.168.50.211` — **`metallb.io/loadBalancerIPs`** in `clusters/noble/apps/traefik/values.yaml` |
| Apps ingress (LAN / split horizon) | `*.apps.noble.lab.pcenicni.dev` → Traefik LB |
| Grafana (Ingress + TLS) | **`grafana.apps.noble.lab.pcenicni.dev`** — `grafana.ingress` in `clusters/noble/apps/kube-prometheus-stack/values.yaml` (**`letsencrypt-prod`**) |
| Public DNS (Pangolin) | **Newt** tunnel + **CNAME** at registrar + **Integration API** — `clusters/noble/apps/newt/` |
| Velero | S3-compatible URL — configure later |
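
A tiny hypothetical helper (not in the repo) to sanity-check that an assigned `EXTERNAL-IP` falls inside the reserved MetalLB range:

```shell
#!/usr/bin/env bash
# Hypothetical helper: check that an address sits in the reserved
# MetalLB pool 192.168.50.210-192.168.50.229.
in_pool() {
  local ip=$1 last=${1##*.}        # last octet
  [[ $ip == 192.168.50.* ]] && (( last >= 210 && last <= 229 ))
}

for ip in 192.168.50.211 192.168.50.230; do
  if in_pool "$ip"; then echo "$ip: in pool"; else echo "$ip: OUTSIDE pool"; fi
done
```

Here `192.168.50.211` (Traefik) reports in pool, while `192.168.50.230` (the API VIP) reports outside — exactly the separation the reservations table encodes.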
## Versions
- Talos: **v1.12.6** — align `talosctl` client with node image
- Talos **Image Factory** (iscsi-tools + util-linux-tools): **`factory.talos.dev/nocloud-installer/249d9135de54962744e917cfe654117000cba369f9152fbab9d055a00aa3664f:v1.12.6`** — same schematic must appear in **`machine.install.image`** after `talhelper genconfig` (bare metal may use `metal-installer/` instead of `nocloud-installer/`)
- Kubernetes: **1.35.2** on current nodes (bundled with Talos; not pinned in repo)
- Cilium: **1.16.6** (Helm chart; see `clusters/noble/apps/cilium/README.md`)
- MetalLB: **0.15.3** (Helm chart; app **v0.15.3**)
- metrics-server: **3.13.0** (Helm chart; app **v0.8.0**)
- Longhorn: **1.11.1** (Helm chart; app **v1.11.1**)
- Traefik: **39.0.6** (Helm chart; app **v3.6.11**)
- cert-manager: **v1.20.0** (Helm chart; app **v1.20.0**)
- Newt (Fossorial): **1.2.0** (Helm chart; app **1.10.1**)
- Argo CD: **9.4.17** (Helm chart `argo/argo-cd`; app **v3.3.6**)
- kube-prometheus-stack: **82.15.1** (Helm chart `prometheus-community/kube-prometheus-stack`; app **v0.89.x** bundle)
- Loki: **6.55.0** (Helm chart `grafana/loki`; app **3.6.7**)
- Fluent Bit: **0.56.0** (Helm chart `fluent/fluent-bit`; app **4.2.3**)
- Sealed Secrets: **2.18.4** (Helm chart `sealed-secrets/sealed-secrets`; app **0.36.1**)
- External Secrets Operator: **2.2.0** (Helm chart `external-secrets/external-secrets`; app **v2.2.0**)
- Vault: **0.32.0** (Helm chart `hashicorp/vault`; app **1.21.2**)
## Repo paths (this workspace)
| Artifact | Path |
|----------|------|
| This checklist | `talos/CLUSTER-BUILD.md` |
| Talos quick start + networking + kubeconfig | `talos/README.md` |
| talhelper source (active) | `talos/talconfig.yaml` — may be **wipe-phase** (no Longhorn volume) during disk recovery |
| Longhorn volume restore | `talos/talconfig.with-longhorn.yaml` — copy to `talconfig.yaml` after GPT wipe (see `talos/README.md` §5) |
| Longhorn GPT wipe automation | `talos/scripts/longhorn-gpt-recovery.sh` |
| kube-vip (kustomize) | `clusters/noble/apps/kube-vip/` (`vip_interface` e.g. `ens18`) |
| Cilium (Helm values) | `clusters/noble/apps/cilium/` — `values.yaml` (phase 1), optional `values-kpr.yaml`, `README.md` |
| MetalLB | `clusters/noble/apps/metallb/` — `namespace.yaml` (PSA **privileged**), `ip-address-pool.yaml`, `kustomization.yaml`, `README.md` |
| Longhorn | `clusters/noble/apps/longhorn/` — `values.yaml`, `namespace.yaml` (PSA **privileged**), `kustomization.yaml` |
| metrics-server (Helm values) | `clusters/noble/apps/metrics-server/values.yaml` |
| Traefik (Helm values) | `clusters/noble/apps/traefik/` — `values.yaml`, `namespace.yaml`, `README.md` |
| cert-manager (Helm + ClusterIssuers) | `clusters/noble/apps/cert-manager/` — `values.yaml`, `namespace.yaml`, `kustomization.yaml`, `README.md` |
| Newt / Pangolin tunnel (Helm) | `clusters/noble/apps/newt/` — `values.yaml`, `namespace.yaml`, `README.md` |
| Argo CD (bootstrap + app-of-apps) | `clusters/noble/bootstrap/argocd/` — `values.yaml`, `root-application.yaml`, `apps/`, `README.md` |
| kube-prometheus-stack (Helm values) | `clusters/noble/apps/kube-prometheus-stack/` — `values.yaml`, `namespace.yaml` |
| Grafana Loki datasource (ConfigMap; no chart change) | `clusters/noble/apps/grafana-loki-datasource/loki-datasource.yaml` |
| Loki (Helm values) | `clusters/noble/apps/loki/` — `values.yaml`, `namespace.yaml` |
| Fluent Bit → Loki (Helm values) | `clusters/noble/apps/fluent-bit/` — `values.yaml`, `namespace.yaml` |
| Sealed Secrets (Helm) | `clusters/noble/apps/sealed-secrets/` — `values.yaml`, `namespace.yaml`, `README.md` |
| External Secrets Operator (Helm + Vault store example) | `clusters/noble/apps/external-secrets/` — `values.yaml`, `namespace.yaml`, `README.md`, `examples/vault-cluster-secret-store.yaml` |
| Vault (Helm + optional unseal CronJob) | `clusters/noble/apps/vault/` — `values.yaml`, `namespace.yaml`, `unseal-cronjob.yaml`, `README.md` |
**Git vs cluster:** manifests and `talconfig` live in git; **`talhelper genconfig -o out`**, bootstrap, Helm, and `kubectl` run on your LAN. See **`talos/README.md`** for workstation reachability (lab LAN/VPN), **`talosctl kubeconfig`** vs Kubernetes `server:` (VIP vs node IP), and **`--insecure`** only in maintenance.
## Ordering (do not skip)
1. **Talos** installed; **Cilium** (or chosen CNI) **before** most workloads — with `cni: none`, nodes stay **NotReady** / **network-unavailable** taint until CNI is up.
2. **MetalLB Helm chart** (CRDs + controller) **before** `kubectl apply -k` on the pool manifests.
3. **`clusters/noble/apps/metallb/namespace.yaml`** before or merged onto `metallb-system` so Pod Security does not block speaker (see `apps/metallb/README.md`).
4. **Longhorn:** Talos user volume + extensions in `talconfig.with-longhorn.yaml` (when restored); Helm **`defaultDataPath`** in `clusters/noble/apps/longhorn/values.yaml`.
5. **Loki → Fluent Bit → Grafana datasource:** deploy **Loki** (`loki-gateway` Service) before **Fluent Bit**; apply **`clusters/noble/apps/grafana-loki-datasource/loki-datasource.yaml`** after **Loki** (sidecar picks up the ConfigMap — no kube-prometheus values change for Loki).
6. **Vault:** **Longhorn** default **StorageClass** before **`clusters/noble/apps/vault/`** Helm (PVC **`data-vault-0`**); **External Secrets** **`ClusterSecretStore`** after Vault is initialized, unsealed, and **Kubernetes auth** is configured.
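
The ordering above can be sketched as a dry-run script (release names, namespaces, and flags below are illustrative; each app's README has the exact invocation):

```shell
#!/usr/bin/env bash
# Dry-run sketch of the install ordering; steps are recorded, then printed.
# Swap step() for 'run() { "$@"; }' to execute for real.
set -euo pipefail
plan=()
step() { plan+=("$*"); }

step helm upgrade --install cilium cilium/cilium -n kube-system -f clusters/noble/apps/cilium/values.yaml
step kubectl apply -f clusters/noble/apps/metallb/namespace.yaml        # PSA label first
step helm upgrade --install metallb metallb/metallb -n metallb-system   # CRDs + controller
step kubectl apply -k clusters/noble/apps/metallb                       # pool manifests after CRDs
step kubectl apply -k clusters/noble/apps/longhorn
step helm upgrade --install loki grafana/loki -n loki -f clusters/noble/apps/loki/values.yaml
step helm upgrade --install fluent-bit fluent/fluent-bit -n logging -f clusters/noble/apps/fluent-bit/values.yaml
step kubectl apply -f clusters/noble/apps/grafana-loki-datasource/loki-datasource.yaml
step helm upgrade --install vault hashicorp/vault -n vault -f clusters/noble/apps/vault/values.yaml

printf '+ %s\n' "${plan[@]}"
```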
## Prerequisites (before phases)
- [x] `talos/talconfig.yaml` checked in (VIP, API SANs, `cni: none`, `iscsi-tools` / `util-linux-tools` in schematic) — run `talhelper validate talconfig talconfig.yaml` after edits
- [x] Workstation on a **routable path** to node IPs or VIP (same LAN / VPN); `talos/README.md` §3 if `kubectl` hits wrong `server:` or `network is unreachable`
- [x] `talosctl` client matches node Talos version; `talhelper` for `genconfig`
- [x] Node static IPs (helium, neon, argon, krypton)
- [x] DHCP does not lease `192.168.50.210`–`229`, `230`, or node IPs
- [x] DNS for API and apps as in `talos/README.md`
- [x] Git remote ready for Argo CD (argo-cd)
- [x] **`talos/kubeconfig`** from `talosctl kubeconfig` — root repo `kubeconfig` is a stub until populated
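
Since `talos/kubeconfig` must never be committed, the ignore rule can be exercised with `git check-ignore`; this self-contained demo uses a throwaway repo (in the real repo, just run `git check-ignore -v talos/kubeconfig`):

```shell
#!/usr/bin/env bash
# Demo of verifying a .gitignore rule in a scratch repo.
set -euo pipefail
tmp=$(mktemp -d)
git init -q "$tmp"
printf 'talos/kubeconfig\n' > "$tmp/.gitignore"
mkdir -p "$tmp/talos" && touch "$tmp/talos/kubeconfig"
git -C "$tmp" check-ignore -q talos/kubeconfig && echo "talos/kubeconfig is ignored"
rm -rf "$tmp"
```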
## Phase A — Talos bootstrap + API VIP
- [x] `talhelper gensecret` → `talhelper genconfig -o out` (re-run `genconfig` after every `talconfig` edit)
- [x] `apply-config` all nodes (`talos/README.md` §2 — **no** `--insecure` after nodes join; use `TALOSCONFIG`)
- [x] `talosctl bootstrap` once; other control planes and worker join
- [x] `talosctl kubeconfig` → working `kubectl` (`talos/README.md` §3 — override `server:` if VIP not reachable from workstation)
- [x] **kube-vip manifests** in `clusters/noble/apps/kube-vip`
- [x] kube-vip healthy; `vip_interface` matches uplink (`talosctl get links`); VIP reachable where needed
- [x] `talosctl health` (e.g. `talosctl health -n 192.168.50.20` with `TALOSCONFIG` set)
## Phase B — Core platform
**Install order:** **Cilium** → **metrics-server** → **Longhorn** (Talos disk + Helm) → **MetalLB** (Helm → pool manifests) → ingress / certs / DNS as planned.
- [x] **Cilium** (Helm **1.16.6**) — **required** before MetalLB if `cni: none` (`clusters/noble/apps/cilium/`)
- [x] **metrics-server** — Helm **3.13.0**; values in `clusters/noble/apps/metrics-server/values.yaml`; verify `kubectl top nodes`
- [x] **Longhorn** — Talos: user volume + kubelet mounts + extensions (`talos/README.md` §5); Helm **1.11.1**; `kubectl apply -k clusters/noble/apps/longhorn`; verify **`nodes.longhorn.io`** and test PVC **`Bound`**
- [x] **MetalLB** — chart installed; **pool + L2** from `clusters/noble/apps/metallb/` applied (`192.168.50.210`–`229`)
- [x] **`Service` `LoadBalancer`** / pool check — MetalLB assigns from `210`–`229` (validated before Traefik; temporary nginx test removed in favor of Traefik)
- [x] **Traefik** `LoadBalancer` for `*.apps.noble.lab.pcenicni.dev` — `clusters/noble/apps/traefik/`; **`192.168.50.211`**
- [x] **cert-manager** + ClusterIssuer (**`letsencrypt-staging`** / **`letsencrypt-prod`**) — `clusters/noble/apps/cert-manager/`
- [x] **Newt** (Pangolin tunnel; replaces ExternalDNS for public DNS) — `clusters/noble/apps/newt/` — **`newt-pangolin-auth`**; CNAME + **Integration API** per **`newt/README.md`**
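
The ClusterIssuers above follow the standard cert-manager HTTP-01 shape; a minimal sketch of the staging issuer (the `privateKeySecretRef` name is illustrative; `clusters/noble/apps/cert-manager/` is authoritative):

```yaml
# Sketch matching the settings documented above.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    email: certificates@noble.lab.pcenicni.dev
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: letsencrypt-staging-account-key   # illustrative name
    solvers:
      - http01:
          ingress:
            ingressClassName: traefik
```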
## Phase C — GitOps
- [x] **Argo CD** bootstrap — `clusters/noble/bootstrap/argocd/` (`helm upgrade --install argocd …`)
- [x] Argo CD server **LoadBalancer** — **`192.168.50.210`** (see `values.yaml`)
- [x] **App-of-apps** — set **`repoURL`** in **`root-application.yaml`**, add **`Application`** manifests under **`bootstrap/argocd/apps/`**, apply **`root-application.yaml`**
- [ ] SSO — later
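
The app-of-apps root is a single `Application` pointing at the `apps/` directory; a sketch with a placeholder `repoURL` (the checked-in `root-application.yaml` is authoritative, and its `repoURL` must be set before applying):

```yaml
# Sketch of the app-of-apps root Application.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/your/home-server.git   # placeholder — set your remote
    targetRevision: HEAD
    path: clusters/noble/bootstrap/argocd/apps
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```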
## Phase D — Observability
- [x] **kube-prometheus-stack** — `kubectl apply -f clusters/noble/apps/kube-prometheus-stack/namespace.yaml` then **`helm upgrade --install`** as in `clusters/noble/apps/kube-prometheus-stack/values.yaml` (chart **82.15.1**); PVCs **`longhorn`**; **`--wait --timeout 30m`** recommended; verify **`kubectl -n monitoring get pods,pvc`**
- [x] **Loki** + **Fluent Bit** + **Grafana Loki datasource** — **order:** **`kubectl apply -f clusters/noble/apps/loki/namespace.yaml`** → **`helm upgrade --install loki`** `grafana/loki` **6.55.0** `-f clusters/noble/apps/loki/values.yaml` → **`kubectl apply -f clusters/noble/apps/fluent-bit/namespace.yaml`** → **`helm upgrade --install fluent-bit`** `fluent/fluent-bit` **0.56.0** `-f clusters/noble/apps/fluent-bit/values.yaml` → **`kubectl apply -f clusters/noble/apps/grafana-loki-datasource/loki-datasource.yaml`**. Verify **Explore → Loki** in Grafana; **`kubectl -n loki get pods,pvc`**, **`kubectl -n logging get pods`**
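
The sidecar-discovered datasource works roughly like this (sketch; the checked-in `clusters/noble/apps/grafana-loki-datasource/loki-datasource.yaml` is authoritative — note the ConfigMap name here is illustrative):

```yaml
# Sketch of the ConfigMap the Grafana sidecar picks up via its label.
apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-datasource            # illustrative name
  namespace: monitoring
  labels:
    grafana_datasource: "1"        # the sidecar watches for this label
data:
  loki-datasource.yaml: |
    apiVersion: 1
    datasources:
      - name: Loki
        type: loki
        access: proxy
        url: http://loki-gateway.loki.svc:80
        isDefault: false
```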
## Phase E — Secrets
- [x] **Sealed Secrets** (optional Git workflow) — `clusters/noble/apps/sealed-secrets/` (Helm **2.18.4**); **`kubeseal`** + key backup per **`README.md`**
- [x] **Vault** in-cluster on Longhorn + **auto-unseal** — `clusters/noble/apps/vault/` (Helm **0.32.0**); **Longhorn** PVC; **OSS** “auto-unseal” = optional **`unseal-cronjob.yaml`** + Secret (**README**); init/unseal/Kubernetes auth for ESO still **to do** on cluster
- [x] **External Secrets Operator** + Vault `ClusterSecretStore` — operator **`clusters/noble/apps/external-secrets/`** (Helm **2.2.0**); apply **`examples/vault-cluster-secret-store.yaml`** after Vault (**`README.md`**)
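
Sealing the Newt credentials follows the usual kubeseal flow; a sketch with placeholder values (the namespace `newt` and all credentials below are illustrative — see `clusters/noble/apps/sealed-secrets/examples/` and the Newt README):

```shell
#!/usr/bin/env bash
# Sketch of the Sealed Secrets workflow for newt-pangolin-auth.
set -euo pipefail

# 1. Render the plain Secret locally — never commit this form.
cat > newt-pangolin-auth.plain.yaml <<'EOF'
apiVersion: v1
kind: Secret
metadata:
  name: newt-pangolin-auth
  namespace: newt                                   # illustrative namespace
stringData:
  PANGOLIN_ENDPOINT: https://pangolin.example.com   # placeholder
  NEWT_ID: "<newt-id>"                              # placeholder
  NEWT_SECRET: "<newt-secret>"                      # placeholder
EOF

# 2. Seal it against the in-cluster controller (needs kubeseal + kubeconfig):
#    kubeseal --controller-namespace sealed-secrets --format yaml \
#      < newt-pangolin-auth.plain.yaml > newt-pangolin-auth.sealed.yaml
# 3. Commit only the sealed file; delete the plain one.
echo "wrote newt-pangolin-auth.plain.yaml (seal it, then delete it)"
```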
## Phase F — Policy + backups
- [ ] **Kyverno** baseline policies
- [ ] **Velero** when S3 is ready; backup/restore drill
## Phase G — Hardening
- [ ] RBAC, network policies (Cilium), Alertmanager routes
- [ ] Runbooks: API VIP, etcd, Longhorn, Vault
## Quick validation
- [x] `kubectl get nodes` — all **Ready**
- [x] API via VIP `:6443` — **`kubectl get --raw /healthz`** → **`ok`** with kubeconfig **`server:`** `https://192.168.50.230:6443`
- [x] Ingress **`LoadBalancer`** in pool `210`–`229` (**Traefik** → **`192.168.50.211`**)
- [x] **Argo CD** UI — **`argocd-server`** **`LoadBalancer`** **`192.168.50.210`** (initial **`admin`** password from **`argocd-initial-admin-secret`**)
- [ ] Sample Ingress + cert (cert-manager ready) + Pangolin resource + CNAME
- [x] PVC **`Bound`** on **Longhorn** (`storageClassName: longhorn`); Prometheus/Loki durable when configured
- [x] **`monitoring`** — **kube-prometheus-stack** core workloads **Running** (Prometheus, Grafana, Alertmanager, operator, kube-state-metrics, node-exporter); PVCs **Bound** on **longhorn**
- [x] **`loki`** — **Loki** SingleBinary + **gateway** **Running**; **`loki`** PVC **Bound** on **longhorn** (no chunks-cache by design)
- [x] **`logging`** — **Fluent Bit** DaemonSet **Running** on all nodes (logs → **Loki**)
- [x] **Grafana****Loki** datasource from **`grafana-loki-datasource`** ConfigMap (**Explore** works after apply + sidecar sync)
- [x] **`sealed-secrets`** — controller **Deployment** **Running** in **`sealed-secrets`** (install + **`kubeseal`** per **`apps/sealed-secrets/README.md`**)
- [x] **`external-secrets`** — controller + webhook + cert-controller **Running** in **`external-secrets`**; apply **`ClusterSecretStore`** after Vault **Kubernetes auth**
- [x] **`vault`** — **StatefulSet** **Running**, **`data-vault-0`** PVC **Bound** on **longhorn**; **`vault operator init`** + unseal per **`apps/vault/README.md`**
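
The ClusterSecretStore mentioned above might look like this once Vault's Kubernetes auth exists — a sketch only; apply the repo's `examples/vault-cluster-secret-store.yaml`, and treat the role and ServiceAccount names below as illustrative:

```yaml
# Sketch of the Vault-backed store for ESO (apply after init/unseal + k8s auth).
apiVersion: external-secrets.io/v1
kind: ClusterSecretStore
metadata:
  name: vault
spec:
  provider:
    vault:
      server: http://vault.vault.svc:8200   # http:// to match the TLS-disabled listener
      path: secret
      version: v2
      auth:
        kubernetes:
          mountPath: kubernetes
          role: external-secrets            # illustrative role name
          serviceAccountRef:
            name: external-secrets          # illustrative SA
            namespace: external-secrets
```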
---
*Keep in sync with `talos/README.md` and manifests under `clusters/noble/`.*