Remove deprecated Argo CD application configurations and related files for the noble cluster, including root-application.yaml, kustomization.yaml, and the individual application manifests for argocd, cilium, longhorn, kube-vip, and the monitoring components. Update the kube-vip daemonset.yaml deployment strategy and environment variables.
talos/README.md
@@ -1,544 +1,182 @@
# Talos deployment (4 nodes)
# Talos — noble lab

This directory contains a `talhelper` cluster definition for a 4-node Talos
cluster:
- **Cluster build checklist (exported TODO):** [CLUSTER-BUILD.md](./CLUSTER-BUILD.md)

- 3 hybrid control-plane/worker nodes: `noble-cp-1..3`
- 1 worker-only node: `noble-worker-1`
- `allowSchedulingOnControlPlanes: true`
- CNI: `none` (for Cilium via GitOps)

## Versions
## 1) Update values for your environment

Align with [CLUSTER-BUILD.md](./CLUSTER-BUILD.md): Talos **v1.12.6**; the `talosctl` client should match the installed node image.

Edit `talconfig.yaml`:

## DNS (prerequisites)

- `endpoint` (Kubernetes API VIP or LB IP)
- **`additionalApiServerCertSans`** / **`additionalMachineCertSans`**: must include the
  **same VIP** (and DNS name, if you use one) that clients and `talosctl` use —
  otherwise TLS to `https://<VIP>:6443` fails because the cert only lists node
  IPs by default. This repo sets **`192.168.50.230`** (and
  **`kube.noble.lab.pcenicni.dev`**) to match kube-vip.
- each node `ipAddress`
- each node `installDisk` (for example `/dev/sda`, `/dev/nvme0n1`)
- `talosVersion` / `kubernetesVersion` if desired

| Name | Points to |
|------|-----------|
| `noble.lab`, `kube.noble.lab` (API SANs) | `192.168.50.230` (kube-vip) |
| `*.apps.noble.lab.pcenicni.dev` | Traefik `LoadBalancer` IP from the MetalLB pool (`192.168.50.210`–`229`) once ingress is up |

After changing SANs, run **`talhelper genconfig`**, re-run **`apply-config`** on all
**control-plane** nodes (certs are regenerated), then refresh **`talosctl kubeconfig`**.
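A quick way to confirm the regenerated API serving certificate really carries the VIP and DNS SANs (a sketch; assumes `openssl` is available on your workstation and the VIP answers on 6443):

```bash
# Print the SubjectAltName extension of the certificate served on the VIP.
# Expect 192.168.50.230 and kube.noble.lab.pcenicni.dev in the output.
echo | openssl s_client -connect 192.168.50.230:6443 \
  -servername kube.noble.lab.pcenicni.dev 2>/dev/null \
  | openssl x509 -noout -ext subjectAltName
```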
## 2) Generate cluster secrets and machine configs
## 1. Secrets and generated configs

From this directory:

```bash
talhelper gensecret > talsecret.sops.yaml
talhelper genconfig
talhelper gensecret > talsecret.yaml
# Encrypt for git if desired: sops -e -i talsecret.sops.yaml (see talhelper docs)

talhelper genconfig -o out
```

Generated machine configs are written to `clusterconfig/`.
`out/` is ignored via repo root `.gitignore` (`talos/out/`). Do not commit `talsecret.yaml` or generated machine configs.
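Before committing, it can be worth double-checking that the generated secret material is actually ignored (a sketch; run from the repository root, and adjust the paths if your `.gitignore` rules differ):

```bash
# Each ignored path should print the .gitignore rule that matches it;
# no output for a path means it is NOT ignored and could be committed.
git check-ignore -v talos/out talos/talsecret.yaml
# Nothing secret should show up as staged or untracked under talos/.
git status --porcelain -- talos
```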
## 3) Apply Talos configs

**After any `talconfig.yaml` edit, run `genconfig` again** before `apply-config`. Stale `out/*.yaml` is easy to apply by mistake. Quick check: `grep -A8 'kind: UserVolumeConfig' out/noble-neon.yaml` should match what you expect (e.g. Longhorn `volumeType: disk`, not `grow`/`maxSize` on a partition).

Apply each node file to the matching node IP from `talconfig.yaml`:

## 2. Apply machine config

Order: **§1 `genconfig` → apply all nodes → §3 bootstrap** (not the reverse). Use the same `talsecret` / `out/` generation for the life of the cluster; rotating secrets without reinstalling nodes breaks client trust.

**A) First install — node still in maintenance mode** (no Talos OS on disk yet, or explicitly in maintenance):

```bash
talosctl apply-config --insecure -n 192.168.50.20 -f clusterconfig/noble-noble-cp-1.yaml
talosctl apply-config --insecure -n 192.168.50.30 -f clusterconfig/noble-noble-cp-2.yaml
talosctl apply-config --insecure -n 192.168.50.40 -f clusterconfig/noble-noble-cp-3.yaml
talosctl apply-config --insecure -n 192.168.50.10 -f clusterconfig/noble-noble-worker-1.yaml
talosctl apply-config --insecure -n 192.168.50.20 --file out/noble-neon.yaml
# repeat for each node; TALOSCONFIG is not required for the --insecure maintenance API
```
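If you prefer a single pass over all four nodes, a minimal loop over the same IP-to-file pairs listed above (a sketch; only valid while the nodes are still in maintenance mode):

```bash
# Apply each maintenance-mode node's generated config in order.
for pair in \
  "192.168.50.20 clusterconfig/noble-noble-cp-1.yaml" \
  "192.168.50.30 clusterconfig/noble-noble-cp-2.yaml" \
  "192.168.50.40 clusterconfig/noble-noble-cp-3.yaml" \
  "192.168.50.10 clusterconfig/noble-noble-worker-1.yaml"; do
  set -- $pair   # split "IP FILE" into $1 and $2
  talosctl apply-config --insecure -n "$1" -f "$2"
done
```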
## 4) Bootstrap the cluster

After all nodes are up (bootstrap once, from any control-plane node):

**B) Node already installed / cluster already bootstrapped** (`tls: certificate required` if you use `--insecure` here):

```bash
talosctl bootstrap -n 192.168.50.20 -e 192.168.50.230
talosctl kubeconfig -n 192.168.50.20 -e 192.168.50.230 .
export TALOSCONFIG="${TALOSCONFIG:-$(pwd)/out/talosconfig}"
talosctl apply-config -n 192.168.50.20 --file out/noble-neon.yaml
```

## 5) Validate

**Do not pass `--insecure` for (B).** With `--insecure`, `talosctl` does not use client certificates from `TALOSCONFIG`, so the node still responds with `tls: certificate required`. The flag means “maintenance API only,” not “skip server verification.”

**Wrong (what triggers the error):**

```bash
talosctl -n 192.168.50.20 -e 192.168.50.230 health
kubectl get nodes -o wide
export TALOSCONFIG="$(pwd)/out/talosconfig"
talosctl apply-config --insecure -n 192.168.50.20 --file out/noble-neon.yaml  # still broken on joined nodes
```
### `kubectl` errors: `lookup https: no such host` or `https://https/...`
## 3. Bootstrap and kubeconfig

That means the **active** kubeconfig has a broken `cluster.server` URL (often a
**double** `https://` or **duplicate** `:6443`). Kubernetes then tries to resolve
the hostname `https`, which fails.

Inspect what you are using:
Bootstrap **once** on the first control plane **after** configs are applied (example: neon):

```bash
kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}{"\n"}'
export TALOSCONFIG="${TALOSCONFIG:-$(pwd)/out/talosconfig}"
talosctl bootstrap -n 192.168.50.20
```

It must be a **single** valid URL, for example:

- `https://192.168.50.230:6443` (API VIP from `talconfig.yaml`), or
- `https://kube.noble.lab.pcenicni.dev:6443` (if DNS points at that VIP)

Fix the cluster entry (replace `noble` with your context’s cluster name if
different):

```bash
kubectl config set-cluster noble --server=https://192.168.50.230:6443
```

Or point `kubectl` at this repo’s kubeconfig (known-good server line):
After the API is up (direct node IP first; use VIP after kube-vip is healthy):

```bash
export TALOSCONFIG="${TALOSCONFIG:-$(pwd)/out/talosconfig}"
talosctl kubeconfig ./kubeconfig -n 192.168.50.20 -e 192.168.50.230 --merge=false
export KUBECONFIG="$(pwd)/kubeconfig"
kubectl cluster-info
```

Avoid pasting `https://` twice when running `kubectl config set-cluster ... --server=...`.
### `kubectl apply` fails: `localhost:8080` / `openapi` connection refused

`kubectl` is **not** using a real cluster config; it falls back to the default
`http://localhost:8080` (no `KUBECONFIG`, empty file, or wrong file).

Fix:

```bash
cd talos
export KUBECONFIG="$(pwd)/kubeconfig"
kubectl config current-context
kubectl cluster-info
```

Then run `kubectl apply` from the **repository root** (parent of `talos/`) in
the same shell. Do **not** use a literal `cd /path/to/...` — that was only a
placeholder. Example (adjust to where you cloned this repo):

```bash
export KUBECONFIG="${HOME}/Developer/home-server/talos/kubeconfig"
```

`kubectl config set-cluster noble ...` only updates the file **`kubectl` is
actually reading** (often `~/.kube/config`). It does nothing if `KUBECONFIG`
points at another path.
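When in doubt about which file that is, print it before editing anything (a sketch; the jsonpath is the same one used earlier in this README):

```bash
# Show the kubeconfig kubectl will read, then the server URL it resolves to.
echo "KUBECONFIG=${KUBECONFIG:-<unset, kubectl falls back to ~/.kube/config>}"
kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}{"\n"}'
```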
## 6) GitOps-pinned Cilium values

The Cilium settings that worked for this Talos cluster are now persisted in:

- `clusters/noble/apps/cilium/helm-values.yaml`
- `clusters/noble/apps/cilium/application.yaml` (Helm chart + `valueFiles` from this repo)

That Argo CD `Application` pins chart `1.16.6` and uses the same values file
for API host/port, cgroup settings, IPAM CIDR, and security capabilities.

### Cilium before Argo CD (`cni: none`)

This cluster uses **`cniConfig.name: none`** in `talconfig.yaml` so Talos does
not install a CNI. **Argo CD pods cannot schedule** until some CNI makes nodes
`Ready` (otherwise the `node.kubernetes.io/not-ready` taint blocks scheduling).

Install Cilium **once** with Helm from your workstation (same chart and values
Argo will manage later), **then** bootstrap Argo CD:
```bash
helm repo add cilium https://helm.cilium.io/
helm repo update
helm upgrade --install cilium cilium/cilium \
  --namespace kube-system \
  --version 1.16.6 \
  -f clusters/noble/apps/cilium/helm-values.yaml \
  --wait --timeout 10m
kubectl get nodes
kubectl wait --for=condition=Ready nodes --all --timeout=300s
```

If **`helm upgrade --install`** seems stuck after “Installing it now”, it is usually still
pulling images (`quay.io/cilium/...`) or waiting for pods to become Ready. In
another terminal run `kubectl get pods -n kube-system -w` and check for
`ImagePullBackOff`, `Pending`, or `CrashLoopBackOff`. To avoid blocking on
Helm’s wait logic, install without `--wait`, confirm Cilium pods, then continue:
Adjust `-n` / `-e` if your bootstrap node or VIP differ.
**Reachability (same idea for Talos and Kubernetes):**

| Command | What it connects to |
|---------|---------------------|
| `talosctl … -e <addr>` | Talos **apid** on `<addr>:50000` (not 6443) |
| `kubectl` / Helm | Kubernetes API on `https://<addr>:6443` from kubeconfig |

If your Mac shows **`network is unreachable`** to `192.168.50.230`, fix **L2/L3** first (same **LAN** as the nodes, **VPN**, or routing). **`talosctl kubeconfig -e 192.168.50.20`** only chooses **which Talos node** fetches the admin cert; the **`server:`** URL inside kubeconfig still comes from **`cluster.controlPlane.endpoint`** in Talos config (here **`https://192.168.50.230:6443`**). So `kubectl` can still dial the **VIP** even when `-e` used a node IP.

After a successful `talosctl kubeconfig`, **point kubectl at a reachable control-plane IP** (same as bootstrap node until kube-vip works from your network):
```bash
helm upgrade --install cilium cilium/cilium \
  --namespace kube-system \
  --version 1.16.6 \
  -f clusters/noble/apps/cilium/helm-values.yaml
kubectl get pods -n kube-system -l app.kubernetes.io/part-of=cilium -w
export TALOSCONFIG="${TALOSCONFIG:-$(pwd)/out/talosconfig}"
talosctl kubeconfig ./kubeconfig -n 192.168.50.20 -e 192.168.50.20 --merge=false
export KUBECONFIG="$(pwd)/kubeconfig"
# Kubeconfig still says https://192.168.50.230:6443 — override if VIP is unreachable from this machine:
kubectl config set-cluster noble --server=https://192.168.50.20:6443
kubectl get nodes
```
`helm-values.yaml` sets **`operator.replicas: 1`** so the chart default (two
operators with hard anti-affinity) cannot deadlock `helm --wait` when only one
node can take the operator early in bootstrap.

If **`helm upgrade` fails** with server-side apply conflicts and
**`argocd-controller`**, Argo already synced Cilium and **owns those fields**
on live objects. Clearing **`syncPolicy`** on the Application does **not**
remove that ownership; Helm still conflicts until you **take over** the fields
or only use Argo.

**One-shot CLI fix** (Helm 3.13+): add **`--force-conflicts`** so SSA wins the
disputed fields:
One-liner alternative (macOS/BSD `sed -i ''`; on Linux use `sed -i`):
```bash
helm upgrade --install cilium cilium/cilium \
  --namespace kube-system \
  --version 1.16.6 \
  -f clusters/noble/apps/cilium/helm-values.yaml \
  --force-conflicts
sed -i '' 's|https://192.168.50.230:6443|https://192.168.50.20:6443|g' kubeconfig
```
Typical conflicts: Secret **`hubble-server-certs`** (`.data` TLS) and
Deployment **`cilium-operator`** (`.spec.replicas`,
`.spec.strategy.rollingUpdate.maxUnavailable`). The **`cilium` Application**
lists **`ignoreDifferences`** for those paths plus **`RespectIgnoreDifferences`**
so later Argo syncs do not keep overwriting them. Apply the manifest after you
change it: **`kubectl apply -f clusters/noble/apps/cilium/application.yaml`**.
Quick check from your Mac: `nc -vz 192.168.50.20 50000` (Talos) and `nc -vz 192.168.50.20 6443` (Kubernetes).

After bootstrap, prefer syncing Cilium **only through Argo** (from Git) instead
of ad hoc Helm, unless you suspend the **`cilium`** Application first.
**`dial tcp 192.168.50.230:6443` on nodes:** Host-network components (including **Cilium**) cannot use the in-cluster `kubernetes` Service; they otherwise follow **`cluster.controlPlane.endpoint`** (the VIP). Talos **KubePrism** on **`127.0.0.1:7445`** (default) load-balances to healthy apiservers. Ensure the CNI Helm values set **`k8sServiceHost: "127.0.0.1"`** and **`k8sServicePort: "7445"`** — see [`clusters/noble/apps/cilium/values.yaml`](../clusters/noble/apps/cilium/values.yaml). Also confirm **kube-vip**’s **`vip_interface`** matches the uplink (`talosctl -n <ip> get links` — e.g. **`ens18`** on these nodes). A bare **`curl -k https://192.168.50.230:6443/healthz`** often returns **`401 Unauthorized`** because no client cert was sent — that still means TLS to the VIP worked.
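A quick way to confirm the deployed Cilium really ended up pointing at KubePrism (a sketch; assumes the standard `cilium-config` ConfigMap that the chart renders in `kube-system`):

```bash
# Should print: 127.0.0.1 7445
kubectl -n kube-system get configmap cilium-config \
  -o jsonpath='{.data.k8s-service-host}{" "}{.data.k8s-service-port}{"\n"}'
```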
Shell tip: a line like **`# comment`** must start with **`#`**; if the shell
reports **`command not found: #`**, the character is not a real hash or the
line was pasted wrong — run **`kubectl apply ...`** as its own command without a
leading comment in the same paste block.

If nodes were already `Ready`, you can skip straight to section 7.
## 7) Argo CD app-of-apps bootstrap

This repo includes an app-of-apps structure for cluster apps:

- Root app: `clusters/noble/root-application.yaml`
- Child apps index: `clusters/noble/apps/kustomization.yaml`
- Argo CD app: `clusters/noble/apps/argocd/application.yaml`
- Cilium app: `clusters/noble/apps/cilium/application.yaml`

Bootstrap once from your workstation:
**Verify the VIP with `kubectl` (copy as-is):** use a real kubeconfig path (not `/path/to/…`). From the **repository root**:
```bash
kubectl apply -k clusters/noble/bootstrap/argocd
kubectl wait --for=condition=Established crd/appprojects.argoproj.io --timeout=120s
kubectl apply -f clusters/noble/bootstrap/argocd/default-appproject.yaml
kubectl apply -f clusters/noble/root-application.yaml
export KUBECONFIG="${KUBECONFIG:-$(pwd)/talos/kubeconfig}"
kubectl config set-cluster noble --server=https://192.168.50.230:6443
kubectl get --raw /healthz
```
If the first command errors on `AppProject` (“no matches for kind `AppProject`”), the CRDs were not ready yet; run the `kubectl wait` and `kubectl apply -f .../default-appproject.yaml` lines, then continue.
Expect a single line: **`ok`**. If you see **`The connection to the server localhost:8080 was refused`**, `KUBECONFIG` was missing or wrong (e.g. typo **`.export`** instead of **`export`**, or a path that does not exist). Do not put **`#` comments** on the same line as `kubectl config set-cluster` when pasting — some shells copy the comment into the command.

After this, Argo CD continuously reconciles all applications under
`clusters/noble/apps/`.
**`kubectl` → `localhost:8080` / connection refused:** `talosctl kubeconfig` did **not** write a valid kubeconfig (often because the step above failed). Fix Talos/API reachability first; do not trust `kubectl` until `talosctl kubeconfig` completes without error.
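Once the root Application is in, a quick sanity check that the app-of-apps is actually reconciling (a sketch; the root app name `noble-root` is the one referenced later in this README):

```bash
# The root app plus its child apps should appear here with Sync/Health columns.
kubectl -n argocd get applications.argoproj.io
kubectl -n argocd get application noble-root \
  -o jsonpath='{.status.sync.status}{" "}{.status.health.status}{"\n"}'
```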
## 8) kube-vip API VIP (`192.168.50.230`)
## 4. Platform manifests (this repo)

HAProxy has been removed in favor of `kube-vip` running on control-plane nodes.

| Component | Apply |
|-----------|--------|
| Cilium | **Before** kube-vip/MetalLB scheduling: Helm from [`clusters/noble/apps/cilium/README.md`](../clusters/noble/apps/cilium/README.md) (`values.yaml`) |
| kube-vip | `kubectl apply -k ../clusters/noble/apps/kube-vip` |
| MetalLB pool | After MetalLB controller install: `kubectl apply -k ../clusters/noble/apps/metallb` |
| Longhorn PSA + Helm | `kubectl apply -k ../clusters/noble/apps/longhorn` then Helm from §5 below |

Manifests are in:
Set `vip_interface` in `clusters/noble/apps/kube-vip/vip-daemonset.yaml` if it does not match the control-plane uplink (`talosctl -n <cp-ip> get links`).
- `clusters/noble/apps/kube-vip/application.yaml`
- `clusters/noble/apps/kube-vip/vip-rbac.yaml`
- `clusters/noble/apps/kube-vip/vip-daemonset.yaml`

## 5. Longhorn (Talos)

The DaemonSet advertises `192.168.50.230` in ARP mode and fronts the Kubernetes
API on port `6443`.

Apply manually (or let Argo CD sync from the root app):
1. **Machine image:** `talconfig.yaml` includes the `iscsi-tools` and `util-linux-tools` extensions. After `talhelper genconfig`, **upgrade each node** so the running installer image matches (extensions are in the image, not applied live by config alone). If `longhorn-manager` logs errors about **`iscsiadm` / `open-iscsi`**, the node image does not include the extension yet.
2. **Pod Security + path:** Run `kubectl apply -k ../clusters/noble/apps/longhorn` (privileged `longhorn-system`). The Helm chart host-mounts **`/var/lib/longhorn`**; `talconfig` adds a kubelet **bind** from `/var/mnt/longhorn` → `/var/lib/longhorn` so that path matches the dedicated XFS volume.
3. **Data path:** From the **repository root** (not `talos/`), run Helm with a real release and chart name — not a literal `...`:
```bash
kubectl apply -k clusters/noble/apps/kube-vip
helm repo add longhorn https://charts.longhorn.io && helm repo update
helm upgrade --install longhorn longhorn/longhorn -n longhorn-system --create-namespace \
  -f clusters/noble/apps/longhorn/values.yaml
```
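To confirm the system extensions from item 1 are actually present on a node after the image upgrade (a sketch; assumes `TALOSCONFIG` is exported as in the earlier sections):

```bash
# iscsi-tools and util-linux-tools should both appear in this list.
talosctl -n 192.168.50.20 get extensions
```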
Validate:
If Longhorn defaults to `/var/lib/longhorn`, you get **wrong format** / **no space** errors on the Talos root filesystem.
4. **Disk device:** The second disk is often `/dev/vdb` under Proxmox virtio; `talconfig` selects `sdb` or `vdb`. Confirm with `talosctl get disks -n <ip>`.
5. **`filesystem type mismatch: gpt != xfs` on `volumeType: disk`:** The data disk still has a **GPT** from an older partition attempt. Whole-disk XFS needs a **raw** disk. Talos cannot `wipe disk` while `u-longhorn` claims the device.
```bash
kubectl -n kube-system get pods -l app.kubernetes.io/name=kube-vip-ds -o wide
nc -vz 192.168.50.230 6443
```

**Repo layout:** `talconfig.yaml` = **wipe-phase** (no Longhorn volume / no kubelet bind). `talconfig.with-longhorn.yaml` = restore after wipes.
If **`kube-vip-ds` pods are `CrashLoopBackOff`**, logs usually show
`could not get link for interface '…'`. kube-vip binds the VIP to
**`vip_interface`**; on Talos the uplink is often **`eno1`**, **`enp0s…`**, or
**`enx…`**, not **`eth0`**. On a control-plane node IP from `talconfig.yaml`:

**Order matters.** `blockdevice "sdb" is in use by volume "u-longhorn"` means you tried to **wipe before** the running nodes received the wipe-phase machine config. You must **`talosctl apply-config`** (wipe YAML) on **every** node first, **reboot** if `u-longhorn` still appears, **then** `talosctl wipe disk`.
```bash
talosctl -n 192.168.50.20 get links
```

**Automated (recommended):** from `talos/` after `talhelper genconfig -o out`:
Do **not** paste that command’s **table output** back into the shell: zsh runs
each line as a command (e.g. `192.168.50.20` → `command not found`), and a line
starting with **`NODE`** can be mistaken for the **`node`** binary and try to
load a file like **`NAMESPACE`** in the current directory. Also avoid pasting
the **prompt** (`(base) … %`) together with the command (duplicate prompt →
parse errors).
```bash
cd talos && talhelper genconfig -o out && export TALOSCONFIG="$(pwd)/out/talosconfig"
./scripts/longhorn-gpt-recovery.sh phase1   # apply wipe config to all 4 nodes; reboot cluster if needed
./scripts/longhorn-gpt-recovery.sh phase2   # wipe disk, restore Longhorn talconfig, genconfig, apply all nodes
```
Set **`vip_interface`** in `clusters/noble/apps/kube-vip/vip-daemonset.yaml` to
that link’s **`metadata.id`**, commit, sync (or `kubectl apply -k
clusters/noble/apps/kube-vip`), and confirm pods go **`Running`**.
Use `DISK=vdb ./scripts/longhorn-gpt-recovery.sh phase2` if the second disk is `vdb`.
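After changing `vip_interface`, a short verification pass (a sketch; assumes the DaemonSet is named `kube-vip-ds`, matching the pod label used above):

```bash
# The rollout should complete, the logs should mention the new interface,
# and the VIP should answer on 6443 again.
kubectl -n kube-system rollout status ds/kube-vip-ds
kubectl -n kube-system logs ds/kube-vip-ds --tail=20
nc -vz 192.168.50.230 6443
```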
## 9) Argo CD via DNS host (no port)
**Manual:** same sequence, but do not paste comment lines into zsh as commands (`#` lines can error if the copy-paste breaks them).

Argo CD is exposed through a kube-vip managed LoadBalancer Service:
6. **“Error fetching pod status”** in the Longhorn UI is often API connectivity (VIP/DNS), `longhorn-manager` / CSI pods not ready, or RBAC. Check `kubectl get pods -n longhorn-system` and `kubectl logs -n longhorn-system -l app=longhorn-manager --tail=50` from a working kubeconfig.

- `argo.noble.lab.pcenicni.dev`
## Troubleshooting

Manifests:

### `user=apiserver-kubelet-client` / `verb=get` / `resource=nodes` (authorization error)

- `clusters/noble/bootstrap/argocd/argocd-server-lb.yaml`
- `clusters/noble/apps/kube-vip/vip-daemonset.yaml` (`svc_enable: "true"`)

That identity is the **client cert the kube-apiserver uses when talking to kubelets** (logs, exec, node metrics, etc.). Audit logs often show it when the apiserver checks **Node** access before proxying. It is **not** your human `kubectl` user.

After syncing manifests, create a Pi-hole DNS A record:
- If **`kubectl get nodes`** and normal workloads work, treat log noise as **informational** unless something user-facing breaks (`kubectl logs`, `kubectl exec`, **metrics-server** node metrics, **HorizontalPodAutoscaler**).
- If **logs/exec/metrics** fail cluster-wide, check default RBAC still exists (nothing should delete `system:*` ClusterRoles):

- `argo.noble.lab.pcenicni.dev` -> `192.168.50.231`

```bash
kubectl get clusterrole system:kubelet-api-admin system:node-proxier 2>&1
```
## 10) Longhorn storage and extra disks
- If you **customized** `authorization-config` / RBAC on the API server, revert or align with [kubelet authentication/authorization](https://kubernetes.io/docs/reference/access-authn-authz/kubelet-authn-authz/) expectations.

Longhorn is deployed from:

- `clusters/noble/apps/longhorn/application.yaml`

Monitoring apps are configured to use `storageClassName: longhorn`, so you can
persist Prometheus/Alertmanager/Loki data once Longhorn is healthy.
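A quick check that Longhorn can actually back those claims once it is up (a sketch; assumes the Helm values register the StorageClass as `longhorn`, as referenced above):

```bash
# The StorageClass should exist; monitoring PVCs should move from Pending to Bound.
kubectl get storageclass longhorn
kubectl get pvc -A
```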
### Argo CD: `longhorn` OutOfSync, Health **Missing**, no `longhorn-role`

**Missing** means nothing has been applied yet, or a sync never completed. The
Helm chart creates `ClusterRole/longhorn-role` on a successful install.

1. See the failure reason:
```bash
kubectl describe application longhorn -n argocd
```

Check **Status → Conditions** and **Status → Operation State** for the error
(for example a Helm render error, a CRD apply failure, or the repo-server cannot reach
`https://charts.longhorn.io`).

2. Trigger a sync (Argo CD UI **Sync**, or CLI):

```bash
argocd app sync longhorn
```

3. After a good sync, confirm:

```bash
kubectl get clusterrole longhorn-role
kubectl get pods -n longhorn-system
```
### Extra drive layout (this cluster)

Each node uses:

- `/dev/sda` — Talos install disk (`installDisk` in `talconfig.yaml`)
- `/dev/sdb` — dedicated Longhorn data disk

`talconfig.yaml` includes a global patch that partitions `/dev/sdb` and mounts it
at `/var/mnt/longhorn`, which matches Longhorn `defaultDataPath` in the Argo
Helm values.

After editing `talconfig.yaml`, regenerate and apply configs:
```bash
cd talos
talhelper genconfig
# apply each node’s YAML from clusterconfig/ with talosctl apply-config
```

Then reboot each node once so the new disk layout is applied.
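One way to do that reboot node by node (a sketch; same node IPs as in §3, waiting for each node to come back healthy before moving on):

```bash
for ip in 192.168.50.20 192.168.50.30 192.168.50.40 192.168.50.10; do
  talosctl -n "$ip" reboot
  talosctl -n "$ip" health --wait-timeout 10m
done
```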
### `talosctl` TLS errors (`unknown authority`, `Ed25519 verification failure`)

`talosctl` **does not** automatically use `talos/clusterconfig/talosconfig`. If you
omit it, the client falls back to **`~/.talos/config`**, which is usually a
**different** cluster CA — you then get TLS handshake failures against the noble
nodes.

**Always** set this in the shell where you run `talosctl` (use an absolute path
if you change directories):
```bash
cd talos
export TALOSCONFIG="$(pwd)/clusterconfig/talosconfig"
export ENDPOINT=192.168.50.230
```

Sanity check (should print Talos and Kubernetes versions, not TLS errors):

```bash
talosctl -e "${ENDPOINT}" -n 192.168.50.20 version
```

Then use the same shell for `apply-config`, `reboot`, and `health`.
If it **still** fails after `TALOSCONFIG` is set, the running cluster was likely
bootstrapped with **different** secrets than the ones in your current
`talsecret.sops.yaml` / regenerated `clusterconfig/`. In that case you need the
**original** `talosconfig` that matched the cluster when it was created, or you
must align secrets and cluster state (recovery / rebuild is a larger topic).

Keep **`talosctl`** roughly aligned with the node Talos version (for example
`v1.12.x` clients for `v1.12.5` nodes).

**Paste tip:** run **one** command per line. Pasting `...cp-3.yaml` and
`talosctl` on the same line breaks the filename and can confuse the shell.
### More than one extra disk per node

If you add a third disk later, extend `machine.disks` in `talconfig.yaml` (for
example `/dev/sdc` → `/var/mnt/longhorn-disk2`) and register that path in
Longhorn as an additional disk for that node.
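After the regenerated config is applied and the node rebooted, one way to confirm the new disk and mount are visible from Talos (a sketch; the node IP is illustrative):

```bash
# The new disk should be listed, and the longhorn mount path(s) should include it.
talosctl -n 192.168.50.20 get disks
talosctl -n 192.168.50.20 mounts | grep -i longhorn
```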
Recommended:

- use one dedicated filesystem per Longhorn disk path
- avoid using the Talos system disk for heavy Longhorn data
- spread replicas across nodes for resiliency
## 11) Upgrade Talos to `v1.12.x`

This repo now pins:

- `talosVersion: v1.12.5` in `talconfig.yaml`

### Regenerate configs

From `talos/`:
```bash
talhelper genconfig
```

### Rolling upgrade order

Upgrade one node at a time, waiting for it to return healthy before moving on.

1. Control plane nodes (`noble-cp-1`, then `noble-cp-2`, then `noble-cp-3`)
2. Worker node (`noble-worker-1`)

Example commands (adjust node IP per step):
```bash
talosctl --talosconfig ./clusterconfig/talosconfig -n 192.168.50.20 upgrade --image ghcr.io/siderolabs/installer:v1.12.5
talosctl --talosconfig ./clusterconfig/talosconfig -n 192.168.50.20 reboot
talosctl --talosconfig ./clusterconfig/talosconfig -n 192.168.50.20 health
```

After all nodes are upgraded, verify:

```bash
talosctl --talosconfig ./clusterconfig/talosconfig version
kubectl get nodes -o wide
```
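If you want the whole rolling order in one pass, a minimal loop over the same image and node IPs (a sketch; it still upgrades strictly one node at a time and waits for health in between):

```bash
# Control planes first (cp-1, cp-2, cp-3), then the worker, per the order above.
for ip in 192.168.50.20 192.168.50.30 192.168.50.40 192.168.50.10; do
  talosctl --talosconfig ./clusterconfig/talosconfig -n "$ip" \
    upgrade --image ghcr.io/siderolabs/installer:v1.12.5
  talosctl --talosconfig ./clusterconfig/talosconfig -n "$ip" health --wait-timeout 15m
done
```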
## 12) Destroy the cluster and rebuild from scratch

Use this when Kubernetes / etcd / Argo / Longhorn state is corrupted and you want a
**clean** cluster. This **wipes cluster state on the nodes** (etcd, workloads,
Longhorn data on cluster disks). Plan for **downtime** and **back up** anything
you must keep off-cluster first.

### 12.1 Reset every Talos node (Kubernetes is destroyed)

From `talos/` with a working **`talosconfig`** that matches the machines (same
`TALOSCONFIG` / `ENDPOINT` guidance as elsewhere in this README):
```bash
cd talos
export TALOSCONFIG="$(pwd)/clusterconfig/talosconfig"
export ENDPOINT=192.168.50.230
```

Reset **one node at a time**, waiting for each to reboot before the next. Order:
**worker first**, then **non-bootstrap control planes**, then the **bootstrap**
control plane **last** (`noble-cp-1` → `192.168.50.20`).
```bash
talosctl -e "${ENDPOINT}" -n 192.168.50.10 reset --graceful=false
talosctl -e "${ENDPOINT}" -n 192.168.50.30 reset --graceful=false
talosctl -e "${ENDPOINT}" -n 192.168.50.40 reset --graceful=false
talosctl -e "${ENDPOINT}" -n 192.168.50.20 reset --graceful=false
```

If the API VIP is already unreachable, target the **node IP** as endpoint for that
node, for example:
`talosctl -e 192.168.50.10 -n 192.168.50.10 reset --graceful=false`.

Your workstation **`kubeconfig`** will not work for the old cluster after this;
that is expected until you bootstrap again.
### 12.2 (Optional) New cluster secrets

For a fully fresh identity (new cluster CA and `talosconfig`):

```bash
cd talos
talhelper gensecret > talsecret.sops.yaml
# encrypt / store talsecret as you usually do, then:
talhelper genconfig
```

If you **keep** the existing `talsecret.sops.yaml`, still run **`talhelper genconfig`**
so `clusterconfig/` matches what you will apply.
### 12.3 Apply configs, bootstrap, kubeconfig

Repeat **§3 Apply Talos configs** and **§4 Bootstrap the cluster** (and **§5
Validate**) from the top of this README: `apply-config` each node, then
`talosctl bootstrap`, then `talosctl kubeconfig` into `talos/kubeconfig`.

### 12.4 Redeploy GitOps (Argo CD + apps)

From your workstation (repo root), with `KUBECONFIG` pointing at the new
`talos/kubeconfig`:
```bash
# Set REPO to the directory that contains both talos/ and clusters/ (not a literal "path/to")
REPO="${HOME}/Developer/home-server"
export KUBECONFIG="${REPO}/talos/kubeconfig"
cd "${REPO}"
kubectl apply -k clusters/noble/bootstrap/argocd
kubectl apply -f clusters/noble/root-application.yaml
```

Resolve **Argo CD admin** login (secret / password reset) as needed; then let
`noble-root` sync `clusters/noble/apps/`.
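On a fresh install, one common way to get that admin password is the initial-admin secret (a sketch; assumes the stock `argocd-initial-admin-secret` still exists in the `argocd` namespace and has not yet been deleted or rotated):

```bash
kubectl -n argocd get secret argocd-initial-admin-secret \
  -o jsonpath='{.data.password}' | base64 -d; echo
```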
## 13) Mid-rebuild issues: etcd, bootstrap, and `apply-config`

### `tls: certificate required` when using `apply-config --insecure`

After a node has **joined** the cluster, the Talos API expects **client
certificates** from your `talosconfig`. `--insecure` only applies to **maintenance**
mode (before join / after a reset).

**Do one of:**

- Apply config **with** `talosconfig` (no `--insecure`):

```bash
cd talos
export TALOSCONFIG="$(pwd)/clusterconfig/talosconfig"
export ENDPOINT=192.168.50.230
talosctl -e "${ENDPOINT}" apply-config -n 192.168.50.30 -f clusterconfig/noble-noble-cp-2.yaml
```

- Or **`talosctl reset`** that node first (see §12.1), then use
  `apply-config --insecure` again while it is in maintenance.
### `bootstrap`: `etcd data directory is not empty`

The bootstrap node (`192.168.50.20`) already has a **previous etcd** on disk (a failed
or partial bootstrap). Kubernetes will not bootstrap again until that state is
**wiped**.

**Fix:** run **`talosctl reset --graceful=false`** on the **control plane nodes**
(at minimum the bootstrap node; often **all four nodes** is cleaner). See §12.1.
Then re-apply machine configs and run **`talosctl bootstrap` exactly once**.

### etcd unhealthy / “Preparing” on some control planes

Usually means **split or partial** cluster state. The reliable fix is the same
**full reset** (§12.1), then a single ordered bring-up: apply all configs →
bootstrap once → `talosctl health`.
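Before committing to a full reset, it can help to see what etcd itself thinks (a sketch; run against a control-plane node with `TALOSCONFIG` set as above):

```bash
# Service state of etcd on one control plane, then the member list it reports.
talosctl -n 192.168.50.20 service etcd
talosctl -n 192.168.50.20 etcd members
```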
## Kubeconfig from running nodes

The repo root `kubeconfig` may be incomplete until you merge credentials; prefer generating `talos/kubeconfig` with the commands in §3 above.