# Talos deployment (4 nodes)
This directory contains a `talhelper` cluster definition for a 4-node Talos
cluster:
- 3 hybrid control-plane/worker nodes: `noble-cp-1..3`
- 1 worker-only node: `noble-worker-1`
- `allowSchedulingOnControlPlanes: true`
- CNI: `none` (for Cilium via GitOps)
## 1) Update values for your environment
Edit `talconfig.yaml`:
- `endpoint` (Kubernetes API VIP or LB IP)
- each node `ipAddress`
- each node `installDisk` (for example `/dev/sda`, `/dev/nvme0n1`)
- `talosVersion` / `kubernetesVersion` if desired
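A minimal sketch of the fields involved (field names follow the `talhelper` talconfig schema; every value below is a placeholder for this environment, so substitute your own):
```yaml
# Sketch only -- replace all values with your environment's.
clusterName: noble
endpoint: https://192.168.50.230:6443   # Kubernetes API VIP
talosVersion: v1.12.5
kubernetesVersion: v1.31.0              # example value; pin what you actually run
allowSchedulingOnControlPlanes: true
cniConfig:
  name: none                            # Cilium comes later via GitOps
nodes:
  - hostname: noble-cp-1
    ipAddress: 192.168.50.20
    controlPlane: true
    installDisk: /dev/sda
```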
## 2) Generate cluster secrets and machine configs
From this directory:
```bash
talhelper gensecret > talsecret.sops.yaml
talhelper genconfig
```
Generated machine configs are written to `clusterconfig/`, together with the cluster's `talosconfig`. Note that `talhelper gensecret` writes `talsecret.sops.yaml` in plaintext, so encrypt it with `sops` before committing it.
## 3) Apply Talos configs
Apply each node file to the matching node IP from `talconfig.yaml`:
```bash
talosctl apply-config --insecure -n 192.168.50.20 -f clusterconfig/noble-noble-cp-1.yaml
talosctl apply-config --insecure -n 192.168.50.30 -f clusterconfig/noble-noble-cp-2.yaml
talosctl apply-config --insecure -n 192.168.50.40 -f clusterconfig/noble-noble-cp-3.yaml
talosctl apply-config --insecure -n 192.168.50.10 -f clusterconfig/noble-noble-worker-1.yaml
```
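The four invocations follow one pattern, so a small loop can generate them. This sketch only prints each command (drop the `echo` once you have verified the output); the name-to-IP mapping is copied from the list above:
```bash
#!/usr/bin/env bash
# Print the apply-config command for every node in talconfig.yaml.
# The mapping mirrors this cluster; adjust names/IPs if yours differ.
set -euo pipefail

declare -A NODES=(
  [noble-cp-1]=192.168.50.20
  [noble-cp-2]=192.168.50.30
  [noble-cp-3]=192.168.50.40
  [noble-worker-1]=192.168.50.10
)

for name in noble-cp-1 noble-cp-2 noble-cp-3 noble-worker-1; do
  # Drop the echo to actually apply the config.
  echo talosctl apply-config --insecure -n "${NODES[$name]}" -f "clusterconfig/noble-${name}.yaml"
done
```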
## 4) Bootstrap the cluster
After all nodes are up (bootstrap once, from any control-plane node):
```bash
talosctl bootstrap -n 192.168.50.20 -e 192.168.50.230
# Write the admin kubeconfig into the current directory
talosctl kubeconfig -n 192.168.50.20 -e 192.168.50.230 .
```
## 5) Validate
```bash
talosctl -n 192.168.50.20 -e 192.168.50.230 health
kubectl get nodes -o wide
```
## 6) GitOps-pinned Cilium values
The Cilium settings that worked for this Talos cluster are now persisted in:
- `clusters/noble/apps/cilium/application.yaml`
That Argo CD `Application` pins chart `1.16.6` and includes the required Helm
values for this environment (API host/port, cgroup settings, IPAM CIDR, and
security capabilities), so future reconciles do not drift back to defaults.
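Abbreviated, that `Application` has roughly the following shape (a sketch; the committed file is authoritative, and the single Helm value shown is a placeholder for the full set described above):
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cilium
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://helm.cilium.io
    chart: cilium
    targetRevision: 1.16.6             # pinned so reconciles don't drift
    helm:
      valuesObject:
        kubeProxyReplacement: true     # placeholder; real values live in the repo
  destination:
    server: https://kubernetes.default.svc
    namespace: kube-system
  syncPolicy:
    automated:
      selfHeal: true
```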
## 7) Argo CD app-of-apps bootstrap
This repo includes an app-of-apps structure for cluster apps:
- Root app: `clusters/noble/root-application.yaml`
- Child apps index: `clusters/noble/apps/kustomization.yaml`
- Argo CD app: `clusters/noble/apps/argocd/application.yaml`
- Cilium app: `clusters/noble/apps/cilium/application.yaml`
Bootstrap once from your workstation:
```bash
kubectl apply -k clusters/noble/bootstrap/argocd
kubectl apply -f clusters/noble/root-application.yaml
```
After this, Argo CD continuously reconciles all applications under
`clusters/noble/apps/`.
## 8) kube-vip API VIP (`192.168.50.230`)
HAProxy has been removed in favor of `kube-vip` running on control-plane nodes.
Manifests are in:
- `clusters/noble/apps/kube-vip/application.yaml`
- `clusters/noble/apps/kube-vip/vip-rbac.yaml`
- `clusters/noble/apps/kube-vip/vip-daemonset.yaml`
The DaemonSet advertises `192.168.50.230` in ARP mode and fronts the Kubernetes
API on port `6443`.
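In ARP mode the relevant part of the kube-vip container environment looks roughly like this (an excerpt sketch; the committed `vip-daemonset.yaml` is authoritative):
```yaml
# Sketch of the kube-vip env for this cluster's ARP setup.
env:
  - name: vip_arp
    value: "true"
  - name: address
    value: "192.168.50.230"
  - name: port
    value: "6443"
  - name: cp_enable        # front the Kubernetes API
    value: "true"
  - name: svc_enable       # also handle LoadBalancer Services (section 9)
    value: "true"
```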
Apply manually (or let Argo CD sync from root app):
```bash
kubectl apply -k clusters/noble/apps/kube-vip
```
Validate:
```bash
kubectl -n kube-system get pods -l app.kubernetes.io/name=kube-vip-ds -o wide
nc -vz 192.168.50.230 6443
```
## 9) Argo CD via DNS host (no port)
Argo CD is exposed through a kube-vip-managed `LoadBalancer` Service:
- `argo.noble.lab.pcenicni.dev`
Manifests:
- `clusters/noble/bootstrap/argocd/argocd-server-lb.yaml`
- `clusters/noble/apps/kube-vip/vip-daemonset.yaml` (`svc_enable: "true"`)
After syncing manifests, create a Pi-hole DNS A record:
- `argo.noble.lab.pcenicni.dev` -> `192.168.50.231`
## 10) Longhorn storage and extra disks
Longhorn is deployed from:
- `clusters/noble/apps/longhorn/application.yaml`
Monitoring apps are configured to use `storageClassName: longhorn`, so you can
persist Prometheus/Alertmanager/Loki data once Longhorn is healthy.
### Extra drive layout (this cluster)
Each node uses:
- `/dev/sda` — Talos install disk (`installDisk` in `talconfig.yaml`)
- `/dev/sdb` — dedicated Longhorn data disk
`talconfig.yaml` includes a global patch that partitions `/dev/sdb` and mounts it
at `/var/mnt/longhorn`, which matches Longhorn `defaultDataPath` in the Argo
Helm values.
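The patch has roughly the shape below (a sketch of the Talos `machine.disks` config; the committed `talconfig.yaml` is authoritative):
```yaml
# Global patch sketch: partition /dev/sdb and mount it for Longhorn.
machine:
  disks:
    - device: /dev/sdb
      partitions:
        - mountpoint: /var/mnt/longhorn
```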
After editing `talconfig.yaml`, regenerate and apply configs:
```bash
cd talos
talhelper genconfig
# apply each node's YAML from clusterconfig/ with talosctl apply-config
```
Then reboot each node once so the new disk layout is applied.
### `talosctl` TLS errors (`unknown authority`, `Ed25519 verification failure`)
`talosctl` **does not** automatically use `talos/clusterconfig/talosconfig`. If you
omit it, the client falls back to **`~/.talos/config`**, which is usually a
**different** cluster CA — you then get TLS handshake failures against the noble
nodes.
**Always** set this in the shell where you run `talosctl` (use an absolute path
if you change directories):
```bash
cd talos
export TALOSCONFIG="$(pwd)/clusterconfig/talosconfig"
export ENDPOINT=192.168.50.230
```
Sanity check (should print Talos and Kubernetes versions, not TLS errors):
```bash
talosctl -e "${ENDPOINT}" -n 192.168.50.20 version
```
Then use the same shell for `apply-config`, `reboot`, and `health`.
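To see which file a given shell session would hand to `talosctl`, the fallback described above can be mirrored in shell (a sketch of the lookup, not `talosctl`'s actual code, and it ignores the `--talosconfig` flag):
```bash
#!/usr/bin/env bash
# Resolve the config path the way the text above describes:
# $TALOSCONFIG if set, otherwise ~/.talos/config.
set -eu

resolve_talosconfig() {
  echo "${TALOSCONFIG:-$HOME/.talos/config}"
}

unset TALOSCONFIG
echo "without TALOSCONFIG: $(resolve_talosconfig)"   # falls back to ~/.talos/config

export TALOSCONFIG=/tmp/clusterconfig/talosconfig
echo "with TALOSCONFIG:    $(resolve_talosconfig)"   # uses the cluster config
```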
If it **still** fails after `TALOSCONFIG` is set, the running cluster was likely
bootstrapped with **different** secrets than the ones in your current
`talsecret.sops.yaml` / regenerated `clusterconfig/`. In that case you need the
**original** `talosconfig` that matched the cluster when it was created, or you
must align secrets and cluster state (recovery / rebuild is a larger topic).
Keep **`talosctl`** roughly aligned with the node Talos version (for example
`v1.12.x` clients for `v1.12.5` nodes).
**Paste tip:** run **one** command per line. If the tail of one command (for
example `...cp-3.yaml`) and the next `talosctl` land on the same line, the
filename is mangled and the shell misparses both commands.
### More than one extra disk per node
If you add a third disk later, extend `machine.disks` in `talconfig.yaml` (for
example `/dev/sdc` mounted at `/var/mnt/longhorn-disk2`) and register that path
in Longhorn as an additional disk for that node.
Recommended:
- use one dedicated filesystem per Longhorn disk path
- avoid using the Talos system disk for heavy Longhorn data
- spread replicas across nodes for resiliency
## 11) Upgrade Talos to `v1.12.x`
This repo now pins:
- `talosVersion: v1.12.5` in `talconfig.yaml`
### Regenerate configs
From `talos/`:
```bash
talhelper genconfig
```
### Rolling upgrade order
Upgrade one node at a time, waiting for it to return healthy before moving on.
1. Control plane nodes (`noble-cp-1`, then `noble-cp-2`, then `noble-cp-3`)
2. Worker node (`noble-worker-1`)
Example commands (adjust node IP per step):
```bash
talosctl --talosconfig ./clusterconfig/talosconfig -n 192.168.50.20 upgrade --image ghcr.io/siderolabs/installer:v1.12.5
talosctl --talosconfig ./clusterconfig/talosconfig -n 192.168.50.20 reboot
talosctl --talosconfig ./clusterconfig/talosconfig -n 192.168.50.20 health
```
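The same three steps can be wrapped in a dry-run loop over all four nodes, control plane first. This sketch only prints the commands; change `run_cmd`'s body to `"$@"` to execute, and make sure `health` passes before the loop moves to the next node:
```bash
#!/usr/bin/env bash
# Dry-run sketch of the rolling upgrade: prints each step in order.
set -euo pipefail

TALOSCONFIG=./clusterconfig/talosconfig
IMAGE=ghcr.io/siderolabs/installer:v1.12.5
# Control-plane nodes first, worker last (same order as the text above).
NODE_IPS=(192.168.50.20 192.168.50.30 192.168.50.40 192.168.50.10)

run_cmd() { echo "$@"; }   # dry run; change the body to "$@" to execute

for ip in "${NODE_IPS[@]}"; do
  run_cmd talosctl --talosconfig "$TALOSCONFIG" -n "$ip" upgrade --image "$IMAGE"
  run_cmd talosctl --talosconfig "$TALOSCONFIG" -n "$ip" reboot
  # Wait for this to pass before moving to the next node.
  run_cmd talosctl --talosconfig "$TALOSCONFIG" -n "$ip" health
done
```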
After all nodes are upgraded, verify:
```bash
talosctl --talosconfig ./clusterconfig/talosconfig version
kubectl get nodes -o wide
```