Enhance monitoring configurations by enabling persistence for Loki and updating storage settings for Prometheus and Alertmanager to use Longhorn. Add Longhorn application to kustomization.yaml for improved storage management.

This commit is contained in:
Nikholas Pcenicni
2026-03-27 16:27:58 -04:00
parent 036f8ef37e
commit 8cacf5f5de
7 changed files with 299 additions and 6 deletions


@@ -125,3 +125,117 @@ After syncing manifests, create a Pi-hole DNS A record:
- `argo.noble.lab.pcenicni.dev` -> `192.168.50.231`
## 10) Longhorn storage and extra disks
Longhorn is deployed from:
- `clusters/noble/apps/longhorn/application.yaml`
Monitoring apps are configured to use `storageClassName: longhorn`, so you can
persist Prometheus/Alertmanager/Loki data once Longhorn is healthy.
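As an illustration of what those charts request, a standalone claim against the same storage class would look like this (the PVC name, namespace, and size are assumptions, not values from this repo):

```yaml
# Hypothetical PVC showing the storageClassName the monitoring
# charts are configured to use once Longhorn is healthy.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: loki-data        # example name
  namespace: monitoring  # assumed namespace
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn
  resources:
    requests:
      storage: 10Gi      # example size
```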
### Extra drive layout (this cluster)
Each node uses:
- `/dev/sda` — Talos install disk (`installDisk` in `talconfig.yaml`)
- `/dev/sdb` — dedicated Longhorn data disk
`talconfig.yaml` includes a global patch that partitions `/dev/sdb` and mounts it
at `/var/mnt/longhorn`, which matches the Longhorn `defaultDataPath` in the Argo
Helm values.
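The patch follows the Talos `machine.disks` schema; a hedged sketch of what that global patch looks like (field names are from the Talos machine config, exact values are assumptions):

```yaml
# Sketch of a talconfig.yaml global patch partitioning the data disk.
machine:
  disks:
    - device: /dev/sdb
      partitions:
        - mountpoint: /var/mnt/longhorn  # must match Longhorn defaultDataPath
```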
After editing `talconfig.yaml`, regenerate and apply configs:
```bash
cd talos
talhelper genconfig
# apply each node's YAML from clusterconfig/ with talosctl apply-config
```
Then reboot each node once so the new disk layout is applied.
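The per-node reboot can be scripted as a sketch (the node IPs are assumptions for this cluster; the loop only prints each command so you can run them one at a time):

```shell
# Assumed node IPs; adjust to match clusterconfig/.
NODES="192.168.50.20 192.168.50.21 192.168.50.22 192.168.50.23"

for n in $NODES; do
  # Print the command; drop the leading `echo` to actually run it.
  echo talosctl --talosconfig ./clusterconfig/talosconfig -n "$n" reboot
done
```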
### `talosctl` TLS errors (`unknown authority`, `Ed25519 verification failure`)
`talosctl` **does not** automatically use `talos/clusterconfig/talosconfig`. If you
omit it, the client falls back to **`~/.talos/config`**, which is usually a
**different** cluster CA — you then get TLS handshake failures against the noble
nodes.
**Always** set this in the shell where you run `talosctl` (use an absolute path
if you change directories):
```bash
cd talos
export TALOSCONFIG="$(pwd)/clusterconfig/talosconfig"
export ENDPOINT=192.168.50.230
```
Sanity check (should print Talos and Kubernetes versions, not TLS errors):
```bash
talosctl -e "${ENDPOINT}" -n 192.168.50.20 version
```
Then use the same shell for `apply-config`, `reboot`, and `health`.
If it **still** fails after `TALOSCONFIG` is set, the running cluster was likely
bootstrapped with **different** secrets than the ones in your current
`talsecret.sops.yaml` / regenerated `clusterconfig/`. In that case you need the
**original** `talosconfig` that matched the cluster when it was created, or you
must align secrets and cluster state (recovery / rebuild is a larger topic).
Keep **`talosctl`** roughly aligned with the node Talos version (for example
`v1.12.x` clients for `v1.12.5` nodes).
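A quick way to check that alignment from the shell (the `same_minor` helper is illustrative, not part of `talosctl`):

```shell
# Compare two "vX.Y.Z" version strings on major.minor only.
same_minor() {
  a="${1#v}"; b="${2#v}"     # strip a leading "v" if present
  [ "${a%.*}" = "${b%.*}" ]  # drop the patch number, compare "X.Y"
}

same_minor v1.12.3 v1.12.5 && echo "client/server minor versions match"
```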
**Paste tip:** run **one** command per line. Pasting `...cp-3.yaml` and
`talosctl` on the same line breaks the filename and can confuse the shell.
### More than one extra disk per node
If you add a third disk later, extend `machine.disks` in `talconfig.yaml` (for
example `/dev/sdc` -> `/var/mnt/longhorn-disk2`) and register that path in
Longhorn as an additional disk for that node.
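Registering the path can be done by editing the Longhorn `Node` custom resource for that node (a hedged sketch; the disk key and node name are assumptions, and you can equally add the disk through the Longhorn UI):

```yaml
# Sketch: add a second disk entry to a Longhorn Node resource.
apiVersion: longhorn.io/v1beta2
kind: Node
metadata:
  name: noble-worker-1        # the node gaining the disk (example)
  namespace: longhorn-system
spec:
  disks:
    disk-2:                   # arbitrary key for the new disk
      path: /var/mnt/longhorn-disk2
      allowScheduling: true
```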
Recommended:
- use one dedicated filesystem per Longhorn disk path
- avoid using the Talos system disk for heavy Longhorn data
- spread replicas across nodes for resiliency
## 11) Upgrade Talos to `v1.12.x`
This repo now pins:
- `talosVersion: v1.12.5` in `talconfig.yaml`
### Regenerate configs
From `talos/`:
```bash
talhelper genconfig
```
### Rolling upgrade order
Upgrade one node at a time, waiting for it to return healthy before moving on.
1. Control plane nodes (`noble-cp-1`, then `noble-cp-2`, then `noble-cp-3`)
2. Worker node (`noble-worker-1`)
Example commands (adjust node IP per step):
```bash
talosctl --talosconfig ./clusterconfig/talosconfig -n 192.168.50.20 upgrade --image ghcr.io/siderolabs/installer:v1.12.5
talosctl --talosconfig ./clusterconfig/talosconfig -n 192.168.50.20 reboot
talosctl --talosconfig ./clusterconfig/talosconfig -n 192.168.50.20 health
```
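The rolling order above can also be sketched as a loop (IPs for `noble-cp-2`, `noble-cp-3`, and `noble-worker-1` are assumptions; each command is printed rather than executed so you can run them per node and wait for `health` before continuing):

```shell
# Control-plane nodes first, worker last; adjust IPs to your clusterconfig.
ORDERED_NODES="192.168.50.20 192.168.50.21 192.168.50.22 192.168.50.23"
IMAGE="ghcr.io/siderolabs/installer:v1.12.5"

for n in $ORDERED_NODES; do
  # Run these three lines for one node, confirm health, then move on.
  echo talosctl --talosconfig ./clusterconfig/talosconfig -n "$n" upgrade --image "$IMAGE"
  echo talosctl --talosconfig ./clusterconfig/talosconfig -n "$n" reboot
  echo talosctl --talosconfig ./clusterconfig/talosconfig -n "$n" health
done
```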
After all nodes are upgraded, verify:
```bash
talosctl --talosconfig ./clusterconfig/talosconfig version
kubectl get nodes -o wide
```