# Talos deployment (4 nodes)

This directory contains a `talhelper` cluster definition for a 4-node Talos cluster:

- 3 hybrid control-plane/worker nodes: `noble-cp-1..3`
- 1 worker-only node: `noble-worker-1`
- `allowSchedulingOnControlPlanes: true`
- CNI: `none` (Cilium is installed via GitOps)

## 1) Update values for your environment

Edit `talconfig.yaml`:

- `endpoint` (Kubernetes API VIP or LB IP)
- each node's `ipAddress`
- each node's `installDisk` (for example `/dev/sda`, `/dev/nvme0n1`)
- `talosVersion` / `kubernetesVersion` if desired

## 2) Generate cluster secrets and machine configs

From this directory:

```bash
talhelper gensecret > talsecret.sops.yaml
talhelper genconfig
```

Generated machine configs are written to `clusterconfig/`.

## 3) Apply Talos configs

Apply each node's file to the matching node IP from `talconfig.yaml`:

```bash
talosctl apply-config --insecure -n 192.168.50.20 -f clusterconfig/noble-noble-cp-1.yaml
talosctl apply-config --insecure -n 192.168.50.30 -f clusterconfig/noble-noble-cp-2.yaml
talosctl apply-config --insecure -n 192.168.50.40 -f clusterconfig/noble-noble-cp-3.yaml
talosctl apply-config --insecure -n 192.168.50.10 -f clusterconfig/noble-noble-worker-1.yaml
```

## 4) Bootstrap the cluster

After all nodes are up, bootstrap etcd once, against any single control-plane node:

```bash
talosctl bootstrap -n 192.168.50.20 -e 192.168.50.230
talosctl kubeconfig -n 192.168.50.20 -e 192.168.50.230 .
```

## 5) Validate

```bash
talosctl -n 192.168.50.20 -e 192.168.50.230 health
kubectl get nodes -o wide
```

### `kubectl` errors: `lookup https: no such host` or `https://https/...`

This means the **active** kubeconfig has a broken `cluster.server` URL (often a **double** `https://` or a **duplicate** `:6443`). Kubernetes then tries to resolve the hostname `https`, which fails.
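To see why the client ends up resolving `https`, note that everything between the scheme and the first `:` or `/` is parsed as the host. A minimal sketch, using a hypothetical doubled-scheme value:

```bash
# Hypothetical broken server value with a doubled scheme
server='https://https://192.168.50.230:6443'
# Strip one scheme, then cut at the first ':' or '/':
# what remains is the "hostname" the client will try to resolve
host="${server#https://}"
host="${host%%[:/]*}"
echo "$host"   # -> https
```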
Inspect what you are using:

```bash
kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}{"\n"}'
```

It must be a **single** valid URL, for example:

- `https://192.168.50.230:6443` (the API VIP from `talconfig.yaml`), or
- `https://kube.noble.lab.pcenicni.dev:6443` (if DNS points at that VIP)

Fix the cluster entry (replace `noble` with your context's cluster name if different):

```bash
kubectl config set-cluster noble --server=https://192.168.50.230:6443
```

Or point `kubectl` at this repo's kubeconfig (known-good server line):

```bash
export KUBECONFIG="$(pwd)/kubeconfig"
kubectl cluster-info
```

Avoid pasting `https://` twice when running `kubectl config set-cluster ... --server=...`.

## 6) GitOps-pinned Cilium values

The Cilium settings that worked for this Talos cluster are persisted in:

- `clusters/noble/apps/cilium/application.yaml`

That Argo CD `Application` pins chart `1.16.6` and includes the required Helm values for this environment (API host/port, cgroup settings, IPAM CIDR, and security capabilities), so future reconciles do not drift back to defaults.

## 7) Argo CD app-of-apps bootstrap

This repo includes an app-of-apps structure for cluster apps:

- Root app: `clusters/noble/root-application.yaml`
- Child apps index: `clusters/noble/apps/kustomization.yaml`
- Argo CD app: `clusters/noble/apps/argocd/application.yaml`
- Cilium app: `clusters/noble/apps/cilium/application.yaml`

Bootstrap once from your workstation:

```bash
kubectl apply -k clusters/noble/bootstrap/argocd
kubectl apply -f clusters/noble/root-application.yaml
```

After this, Argo CD continuously reconciles all applications under `clusters/noble/apps/`.

## 8) kube-vip API VIP (`192.168.50.230`)

HAProxy has been removed in favor of `kube-vip` running on the control-plane nodes.
Manifests are in:

- `clusters/noble/apps/kube-vip/application.yaml`
- `clusters/noble/apps/kube-vip/vip-rbac.yaml`
- `clusters/noble/apps/kube-vip/vip-daemonset.yaml`

The DaemonSet advertises `192.168.50.230` in ARP mode and fronts the Kubernetes API on port `6443`.

Apply manually (or let Argo CD sync it from the root app):

```bash
kubectl apply -k clusters/noble/apps/kube-vip
```

Validate:

```bash
kubectl -n kube-system get pods -l app.kubernetes.io/name=kube-vip-ds -o wide
nc -vz 192.168.50.230 6443
```

## 9) Argo CD via DNS host (no port)

Argo CD is exposed through a kube-vip-managed LoadBalancer Service:

- `argo.noble.lab.pcenicni.dev`

Manifests:

- `clusters/noble/bootstrap/argocd/argocd-server-lb.yaml`
- `clusters/noble/apps/kube-vip/vip-daemonset.yaml` (`svc_enable: "true"`)

After syncing the manifests, create a Pi-hole DNS A record:

- `argo.noble.lab.pcenicni.dev` -> `192.168.50.231`

## 10) Longhorn storage and extra disks

Longhorn is deployed from:

- `clusters/noble/apps/longhorn/application.yaml`

Monitoring apps are configured to use `storageClassName: longhorn`, so you can persist Prometheus/Alertmanager/Loki data once Longhorn is healthy.

### Argo CD: `longhorn` OutOfSync, Health **Missing**, no `longhorn-role`

**Missing** means nothing has been applied yet, or a sync never completed. The Helm chart creates `ClusterRole/longhorn-role` on a successful install.

1. See the failure reason:

   ```bash
   kubectl describe application longhorn -n argocd
   ```

   Check **Status → Conditions** and **Status → Operation State** for the error (for example a Helm render error, a CRD apply failure, or the repo-server cannot reach `https://charts.longhorn.io`).

2. Trigger a sync (Argo CD UI **Sync**, or CLI):

   ```bash
   argocd app sync longhorn
   ```

3. After a good sync, confirm:

   ```bash
   kubectl get clusterrole longhorn-role
   kubectl get pods -n longhorn-system
   ```

### Extra drive layout (this cluster)

Each node uses:

- `/dev/sda` — Talos install disk (`installDisk` in `talconfig.yaml`)
- `/dev/sdb` — dedicated Longhorn data disk

`talconfig.yaml` includes a global patch that partitions `/dev/sdb` and mounts it at `/var/mnt/longhorn`, which matches the Longhorn `defaultDataPath` in the Argo Helm values.

After editing `talconfig.yaml`, regenerate and apply configs:

```bash
cd talos
talhelper genconfig
# apply each node's YAML from clusterconfig/ with talosctl apply-config
```

Then reboot each node once so the new disk layout is applied.

### `talosctl` TLS errors (`unknown authority`, `Ed25519 verification failure`)

`talosctl` **does not** automatically use `talos/clusterconfig/talosconfig`. If you omit it, the client falls back to **`~/.talos/config`**, which usually holds a **different** cluster CA, so you get TLS handshake failures against the noble nodes.

**Always** set this in the shell where you run `talosctl` (use an absolute path if you change directories):

```bash
cd talos
export TALOSCONFIG="$(pwd)/clusterconfig/talosconfig"
export ENDPOINT=192.168.50.230
```

Sanity check (should print Talos and Kubernetes versions, not TLS errors):

```bash
talosctl -e "${ENDPOINT}" -n 192.168.50.20 version
```

Then use the same shell for `apply-config`, `reboot`, and `health`.

If it **still** fails after `TALOSCONFIG` is set, the running cluster was likely bootstrapped with **different** secrets than the ones in your current `talsecret.sops.yaml` / regenerated `clusterconfig/`. In that case you need the **original** `talosconfig` that matched the cluster when it was created, or you must align secrets and cluster state (recovery / rebuild is a larger topic).

Keep **`talosctl`** roughly aligned with the node Talos version (for example `v1.12.x` clients for `v1.12.5` nodes).

**Paste tip:** run **one** command per line.
Pasting `...cp-3.yaml` and `talosctl` on the same line breaks the filename and can confuse the shell.

### More than one extra disk per node

If you add a third disk later, extend `machine.disks` in `talconfig.yaml` (for example `/dev/sdc` → `/var/mnt/longhorn-disk2`) and register that path in Longhorn as an additional disk for that node.

Recommended:

- use one dedicated filesystem per Longhorn disk path
- avoid using the Talos system disk for heavy Longhorn data
- spread replicas across nodes for resiliency

## 11) Upgrade Talos to `v1.12.x`

This repo now pins:

- `talosVersion: v1.12.5` in `talconfig.yaml`

### Regenerate configs

From `talos/`:

```bash
talhelper genconfig
```

### Rolling upgrade order

Upgrade one node at a time, waiting for it to return healthy before moving on:

1. Control-plane nodes (`noble-cp-1`, then `noble-cp-2`, then `noble-cp-3`)
2. Worker node (`noble-worker-1`)

Example commands (adjust the node IP per step):

```bash
talosctl --talosconfig ./clusterconfig/talosconfig -n 192.168.50.20 upgrade --image ghcr.io/siderolabs/installer:v1.12.5
talosctl --talosconfig ./clusterconfig/talosconfig -n 192.168.50.20 reboot
talosctl --talosconfig ./clusterconfig/talosconfig -n 192.168.50.20 health
```

After all nodes are upgraded, verify:

```bash
talosctl --talosconfig ./clusterconfig/talosconfig version
kubectl get nodes -o wide
```
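The rolling-upgrade order above can be sketched as a dry-run script that only prints the per-node commands (node IPs are the ones from `talconfig.yaml`; remove the leading `echo` to execute for real):

```bash
#!/usr/bin/env bash
# Dry-run sketch: print the upgrade/health commands per node in rolling order.
# Nothing is applied; each line is only echoed.
set -euo pipefail

IMAGE="ghcr.io/siderolabs/installer:v1.12.5"
TALOSCONFIG="./clusterconfig/talosconfig"
# cp-1..3 first, then worker-1, matching the rolling order above
NODES=(192.168.50.20 192.168.50.30 192.168.50.40 192.168.50.10)

for ip in "${NODES[@]}"; do
  # Remove 'echo' to run for real, and wait for 'health' to pass
  # before moving on to the next node.
  echo talosctl --talosconfig "$TALOSCONFIG" -n "$ip" upgrade --image "$IMAGE"
  echo talosctl --talosconfig "$TALOSCONFIG" -n "$ip" health
done
```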