Remove deprecated Argo CD application configurations and related files for noble cluster, including root-application.yaml, kustomization.yaml, and individual application manifests for argocd, cilium, longhorn, kube-vip, and monitoring components. Update kube-vip daemonset.yaml to enhance deployment strategy and environment variables for improved configuration.
talos/CLUSTER-BUILD.md (new file, 138 lines)
@@ -0,0 +1,138 @@
# Noble lab — Talos cluster build checklist

This document is the **exported TODO** for the **noble** Talos cluster (4 nodes). Commands and troubleshooting live in [`README.md`](./README.md).

## Current state (2026-03-28)

- **Talos** v1.12.6 (target) / **Kubernetes** as bundled — four nodes **Ready** unless upgrading; check with **`talosctl health`**; use **`talos/kubeconfig`** for `kubectl` (the root `kubeconfig` may still be a stub). **Image Factory (nocloud installer):** `factory.talos.dev/nocloud-installer/249d9135de54962744e917cfe654117000cba369f9152fbab9d055a00aa3664f:v1.12.6`
- **Cilium** Helm **1.16.6** / app **1.16.6** (`clusters/noble/apps/cilium/`, phase 1 values).
- **MetalLB** Helm **0.15.3** / app **v0.15.3**; **IPAddressPool** `noble-l2` + **L2Advertisement** — pool **`192.168.50.210`–`192.168.50.229`**.
- **kube-vip** DaemonSet **3/3** on control planes; VIP **`192.168.50.230`** on **`ens18`** (`vip_subnet` must be **`/32`** — a bare **`32`** breaks parsing). **Verified from workstation:** `kubectl config set-cluster noble --server=https://192.168.50.230:6443` then **`kubectl get --raw /healthz`** → **`ok`** (uses `talos/kubeconfig`; see `talos/README.md`).
- **metrics-server** Helm **3.13.0** / app **v0.8.0** — `clusters/noble/apps/metrics-server/values.yaml` (`--kubelet-insecure-tls` for Talos); **`kubectl top nodes`** works.
- **Still open:** Longhorn, Traefik, cert-manager, Argo CD, observability — see the checklist below.

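The `vip_subnet` gotcha above is a DaemonSet environment variable. A sketch of the relevant env fragment, using the values noted above (the full manifest lives in `clusters/noble/apps/kube-vip/`; field names follow kube-vip's env-var convention and the exact checked-in file is authoritative):

```yaml
# Fragment of the kube-vip container env (illustrative, not the full DaemonSet).
env:
  - name: address
    value: "192.168.50.230"   # API VIP
  - name: vip_interface
    value: "ens18"            # must match the node uplink (talosctl get links)
  - name: vip_subnet
    value: "/32"              # leading slash required; a bare "32" breaks parsing
```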
## Inventory

| Host | Role | IP |
|------|------|-----|
| helium | worker | `192.168.50.10` |
| neon | control-plane + worker | `192.168.50.20` |
| argon | control-plane + worker | `192.168.50.30` |
| krypton | control-plane + worker | `192.168.50.40` |

## Network reservations

| Use | Value |
|-----|-------|
| Kubernetes API VIP (kube-vip) | `192.168.50.230` (see `talos/README.md`; align with `talos/talconfig.yaml` `additionalApiServerCertSans`) |
| MetalLB L2 pool | `192.168.50.210`–`192.168.50.229` |
| Argo CD `LoadBalancer` | **Pick one IP** in the MetalLB pool (e.g. `192.168.50.210`) |
| Apps ingress DNS | `*.apps.noble.lab.pcenicni.dev` |
| ExternalDNS | Pangolin (map to a supported ExternalDNS provider when documented) |
| Velero | S3-compatible URL — configure later |

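The MetalLB reservation above is realized by the pool manifests. A minimal sketch consistent with the table (the checked-in file is `clusters/noble/apps/metallb/ip-address-pool.yaml`, which is authoritative; the `L2Advertisement` name here is illustrative):

```yaml
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: noble-l2
  namespace: metallb-system
spec:
  addresses:
    - 192.168.50.210-192.168.50.229   # keep in sync with the reservation table
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: noble-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - noble-l2
```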
## Versions

- Talos: **v1.12.6** — align the `talosctl` client with the node image
- Talos **Image Factory** (iscsi-tools + util-linux-tools): **`factory.talos.dev/nocloud-installer/249d9135de54962744e917cfe654117000cba369f9152fbab9d055a00aa3664f:v1.12.6`** — the same schematic must appear in **`machine.install.image`** after `talhelper genconfig` (bare metal may use `metal-installer/` instead of `nocloud-installer/`)
- Kubernetes: **1.35.2** on current nodes (bundled with Talos; not pinned in repo)
- Cilium: **1.16.6** (Helm chart; see `clusters/noble/apps/cilium/README.md`)
- MetalLB: **0.15.3** (Helm chart; app **v0.15.3**)
- metrics-server: **3.13.0** (Helm chart; app **v0.8.0**)

## Repo paths (this workspace)

| Artifact | Path |
|----------|------|
| This checklist | `talos/CLUSTER-BUILD.md` |
| Talos quick start + networking + kubeconfig | `talos/README.md` |
| talhelper source (active) | `talos/talconfig.yaml` — may be **wipe-phase** (no Longhorn volume) during disk recovery |
| Longhorn volume restore | `talos/talconfig.with-longhorn.yaml` — copy to `talconfig.yaml` after GPT wipe (see `talos/README.md` §5) |
| Longhorn GPT wipe automation | `talos/scripts/longhorn-gpt-recovery.sh` |
| kube-vip (kustomize) | `clusters/noble/apps/kube-vip/` (`vip_interface` e.g. `ens18`) |
| Cilium (Helm values) | `clusters/noble/apps/cilium/` — `values.yaml` (phase 1), optional `values-kpr.yaml`, `README.md` |
| MetalLB | `clusters/noble/apps/metallb/` — `namespace.yaml` (PSA **privileged**), `ip-address-pool.yaml`, `kustomization.yaml`, `README.md` |
| Longhorn Helm values | `clusters/noble/apps/longhorn/values.yaml` |
| metrics-server (Helm values) | `clusters/noble/apps/metrics-server/values.yaml` |

**Git vs cluster:** manifests and `talconfig` live in git; **`talhelper genconfig -o out`**, bootstrap, Helm, and `kubectl` run on your LAN. See **`talos/README.md`** for workstation reachability (lab LAN/VPN), **`talosctl kubeconfig`** vs the Kubernetes `server:` field (VIP vs node IP), and **`--insecure`** (maintenance mode only).

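The Longhorn values file referenced in the table mainly carries the data path noted elsewhere in this checklist. A sketch, assuming the standard Longhorn chart layout (the checked-in `clusters/noble/apps/longhorn/values.yaml` is authoritative):

```yaml
# Longhorn Helm values (sketch): point data onto the Talos user volume.
defaultSettings:
  defaultDataPath: /var/mnt/longhorn
```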
## Ordering (do not skip)

1. **Talos** installed; **Cilium** (or chosen CNI) **before** most workloads — with `cni: none`, nodes stay **NotReady** with the **network-unavailable** taint until the CNI is up.
2. **MetalLB Helm chart** (CRDs + controller) **before** `kubectl apply -k` on the pool manifests.
3. **`clusters/noble/apps/metallb/namespace.yaml`** applied before (or merged onto) `metallb-system` so Pod Security does not block the speaker (see `apps/metallb/README.md`).
4. **Longhorn:** Talos user volume + extensions in `talconfig.with-longhorn.yaml` (when restored); Helm **`defaultDataPath`** in `clusters/noble/apps/longhorn/values.yaml`.

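Step 3's namespace manifest amounts to Pod Security Admission labels on `metallb-system`. A sketch (the checked-in `clusters/noble/apps/metallb/namespace.yaml` is authoritative):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: metallb-system
  labels:
    # privileged PSA so the MetalLB speaker (hostNetwork + NET_RAW) is admitted
    pod-security.kubernetes.io/enforce: privileged
    pod-security.kubernetes.io/audit: privileged
    pod-security.kubernetes.io/warn: privileged
```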
## Prerequisites (before phases)

- [x] `talos/talconfig.yaml` checked in (VIP, API SANs, `cni: none`, `iscsi-tools` / `util-linux-tools` in the schematic) — run `talhelper validate talconfig talconfig.yaml` after edits
- [x] Workstation on a **routable path** to node IPs or the VIP (same LAN / VPN); see `talos/README.md` §3 if `kubectl` hits the wrong `server:` or reports `network is unreachable`
- [x] `talosctl` client matches the node Talos version; `talhelper` installed for `genconfig`
- [x] Node static IPs (helium, neon, argon, krypton)
- [x] DHCP does not lease `192.168.50.210`–`229`, `.230`, or node IPs
- [x] DNS for API and apps as in `talos/README.md`
- [x] Git remote ready for Argo CD (argo-cd)
- [x] **`talos/kubeconfig`** generated via `talosctl kubeconfig` — the root repo `kubeconfig` is a stub until populated

## Phase A — Talos bootstrap + API VIP

- [x] `talhelper gensecret` → `talhelper genconfig -o out` (re-run `genconfig` after every `talconfig` edit)
- [x] `apply-config` on all nodes (`talos/README.md` §2 — **no** `--insecure` after nodes join; use `TALOSCONFIG`)
- [x] `talosctl bootstrap` once; other control planes and the worker join
- [x] `talosctl kubeconfig` → working `kubectl` (`talos/README.md` §3 — override `server:` if the VIP is not reachable from the workstation)
- [x] **kube-vip manifests** in `clusters/noble/apps/kube-vip`
- [x] kube-vip healthy; `vip_interface` matches the uplink (`talosctl get links`); VIP reachable where needed
- [x] `talosctl health` (e.g. `talosctl health -n 192.168.50.20` with `TALOSCONFIG` set)

## Phase B — Core platform

**Install order:** **Cilium** → **metrics-server** → **Longhorn** (Talos disk + Helm) → **MetalLB** (Helm → pool manifests) → ingress / certs / DNS as planned.

- [x] **Cilium** (Helm **1.16.6**) — **required** before MetalLB if `cni: none` (`clusters/noble/apps/cilium/`)
- [x] **metrics-server** — Helm **3.13.0**; values in `clusters/noble/apps/metrics-server/values.yaml`; verify with `kubectl top nodes`
- [ ] **Longhorn** — Talos: `talconfig.with-longhorn.yaml` + `talos/README.md` §5; Helm: `clusters/noble/apps/longhorn/values.yaml` (`defaultDataPath` `/var/mnt/longhorn`)
- [x] **MetalLB** — chart installed; **pool + L2** from `clusters/noble/apps/metallb/` applied (`192.168.50.210`–`229`)
- [ ] **`Service` `LoadBalancer`** test — confirm an IP is assigned from `210`–`229` (e.g. a dummy `LoadBalancer` or Traefik)
- [ ] **Traefik** `LoadBalancer` for `*.apps.noble.lab.pcenicni.dev`
- [ ] **cert-manager** + ClusterIssuer (staging → prod)
- [ ] **ExternalDNS** (Pangolin-compatible provider)

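The `Service` `LoadBalancer` test item above needs nothing more than a throwaway Service; no backing pods are required just to see MetalLB assign an `EXTERNAL-IP`. A sketch (name and selector are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: lb-smoke-test
  namespace: default
spec:
  type: LoadBalancer
  selector:
    app: lb-smoke-test   # no matching pods needed for IP assignment
  ports:
    - port: 80
      targetPort: 80
```

`kubectl get svc lb-smoke-test` should show an `EXTERNAL-IP` in `192.168.50.210`–`229`; delete the Service afterwards.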
## Phase C — GitOps

- [ ] **Argo CD** bootstrap (`clusters/noble/bootstrap/argocd`, root app) — path TBD when added
- [ ] Argo CD server **LoadBalancer** with a dedicated pool IP
- [ ] SSO — later

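Pinning the Argo CD server to a dedicated pool IP can be done with MetalLB's Service annotation. A sketch, assuming `192.168.50.210` per the reservations table (Service name, port mapping, and selector are illustrative; in practice this would likely be set through the Argo CD Helm values rather than a raw Service):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: argocd-server
  namespace: argocd
  annotations:
    # MetalLB annotation to request a specific pool IP (IP is illustrative)
    metallb.universe.tf/loadBalancerIPs: 192.168.50.210
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: argocd-server
  ports:
    - name: https
      port: 443
      targetPort: 8443
```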
## Phase D — Observability

- [ ] **kube-prometheus-stack** (PVCs on Longhorn)
- [ ] **Loki** + **Fluent Bit**; Grafana datasource

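Putting the Prometheus PVC on Longhorn is a values-file concern in kube-prometheus-stack. A sketch, assuming the chart's standard value layout (storage size is illustrative):

```yaml
# kube-prometheus-stack values fragment (sketch): persist Prometheus on Longhorn.
prometheus:
  prometheusSpec:
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: longhorn
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 20Gi   # illustrative size
```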
## Phase E — Secrets

- [ ] **Sealed Secrets** (optional Git workflow)
- [ ] **Vault** in-cluster on Longhorn + **auto-unseal**
- [ ] **External Secrets Operator** + Vault `ClusterSecretStore`

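The `ClusterSecretStore` item above would look roughly like the following, assuming Vault runs in-cluster with Kubernetes auth (server address, mount path, KV path, and role name are all placeholders to be decided when Vault is deployed):

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: vault
spec:
  provider:
    vault:
      server: https://vault.vault.svc:8200   # placeholder in-cluster address
      path: secret                           # KV mount (placeholder)
      version: v2
      auth:
        kubernetes:
          mountPath: kubernetes
          role: external-secrets             # placeholder Vault role
```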
## Phase F — Policy + backups

- [ ] **Kyverno** baseline policies
- [ ] **Velero** when S3 is ready; backup/restore drill

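For a feel of what a "baseline" Kyverno policy looks like, a minimal audit-mode sketch (policy name and label choice are illustrative, not part of this repo yet):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-app-label
spec:
  validationFailureAction: Audit   # start in audit; switch to Enforce later
  rules:
    - name: check-app-label
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Pods should set an app.kubernetes.io/name label."
        pattern:
          metadata:
            labels:
              app.kubernetes.io/name: "?*"
```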
## Phase G — Hardening

- [ ] RBAC, network policies (Cilium), Alertmanager routes
- [ ] Runbooks: API VIP, etcd, Longhorn, Vault

## Quick validation

- [x] `kubectl get nodes` — all **Ready**
- [x] API via VIP `:6443` — **`kubectl get --raw /healthz`** → **`ok`** with kubeconfig **`server:`** set to `https://192.168.50.230:6443`
- [ ] Test `LoadBalancer` receives an IP from `210`–`229`
- [ ] Sample Ingress + cert + ExternalDNS record
- [ ] PVC bound; Prometheus/Loki durable if configured

---

*Keep in sync with `talos/README.md` and manifests under `clusters/noble/`.*