# Talos deployment (4 nodes)

This directory contains a `talhelper` cluster definition for a 4-node Talos
cluster:

- 3 hybrid control-plane/worker nodes: `noble-cp-1..3`
- 1 worker-only node: `noble-worker-1`
- `allowSchedulingOnControlPlanes: true`
- CNI: `none` (for Cilium via GitOps)

## 1) Update values for your environment

Edit `talconfig.yaml`:

- `endpoint` (Kubernetes API VIP or LB IP)
- each node `ipAddress`
- each node `installDisk` (for example `/dev/sda`, `/dev/nvme0n1`)
- `talosVersion` / `kubernetesVersion` if desired
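
For reference, a minimal `talconfig.yaml` skeleton covering those fields looks
roughly like this (an illustrative sketch, not this repo's actual file — swap in
your own IPs, disks, and versions):

```yaml
# Illustrative skeleton only — not this repo's actual talconfig.yaml.
clusterName: noble
talosVersion: v1.12.5
endpoint: https://192.168.50.230:6443
allowSchedulingOnControlPlanes: true
cniConfig:
  name: none            # Cilium is installed via GitOps instead
nodes:
  - hostname: noble-cp-1
    ipAddress: 192.168.50.20
    controlPlane: true
    installDisk: /dev/sda
  - hostname: noble-worker-1
    ipAddress: 192.168.50.10
    controlPlane: false
    installDisk: /dev/sda
```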

## 2) Generate cluster secrets and machine configs

From this directory:

```bash
talhelper gensecret > talsecret.sops.yaml
talhelper genconfig
```

Generated machine configs are written to `clusterconfig/`.

## 3) Apply Talos configs

Apply each node file to the matching node IP from `talconfig.yaml`:

```bash
talosctl apply-config --insecure -n 192.168.50.20 -f clusterconfig/noble-noble-cp-1.yaml
talosctl apply-config --insecure -n 192.168.50.30 -f clusterconfig/noble-noble-cp-2.yaml
talosctl apply-config --insecure -n 192.168.50.40 -f clusterconfig/noble-noble-cp-3.yaml
talosctl apply-config --insecure -n 192.168.50.10 -f clusterconfig/noble-noble-worker-1.yaml
```
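
Equivalently, the per-node commands can be generated from a small node/IP table.
This sketch only prints the commands (a dry run — remove the leading `echo` to
execute them for real); the node-to-IP mapping mirrors `talconfig.yaml`:

```bash
# Dry-run sketch: prints one apply-config command per node.
# Remove the leading `echo` to actually apply the configs.
while read -r ip file; do
  echo talosctl apply-config --insecure -n "$ip" -f "clusterconfig/$file"
done <<'EOF'
192.168.50.20 noble-noble-cp-1.yaml
192.168.50.30 noble-noble-cp-2.yaml
192.168.50.40 noble-noble-cp-3.yaml
192.168.50.10 noble-noble-worker-1.yaml
EOF
```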

## 4) Bootstrap the cluster

After all nodes are up, run the bootstrap once, against any single control-plane
node:

```bash
talosctl bootstrap -n 192.168.50.20 -e 192.168.50.230
talosctl kubeconfig -n 192.168.50.20 -e 192.168.50.230 .
```

## 5) Validate

```bash
talosctl -n 192.168.50.20 -e 192.168.50.230 health
kubectl get nodes -o wide
```

## 6) GitOps-pinned Cilium values

The Cilium settings that worked for this Talos cluster are persisted in:

- `clusters/noble/apps/cilium/application.yaml`

That Argo CD `Application` pins chart `1.16.6` and includes the required Helm
values for this environment (API host/port, cgroup settings, IPAM CIDR, and
security capabilities), so future reconciles do not drift back to defaults.
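
For orientation, the Helm-values portion of such an `Application` typically has
the following shape on Talos. This is a sketch, not a copy of the repo file —
the VIP, CIDR, and capability lists below are taken from the upstream
Talos-with-Cilium guidance and may differ from the pinned values:

```yaml
# Sketch of spec.source.helm.values — the authoritative values live in
# clusters/noble/apps/cilium/application.yaml.
helm:
  values: |
    kubeProxyReplacement: true
    k8sServiceHost: 192.168.50.230   # kube-vip API VIP (see section 8)
    k8sServicePort: 6443
    ipam:
      mode: cluster-pool
      operator:
        clusterPoolIPv4PodCIDRList:
          - 10.244.0.0/16            # illustrative CIDR
    cgroup:
      autoMount:
        enabled: false               # Talos manages the cgroup mount
      hostRoot: /sys/fs/cgroup
    securityContext:
      capabilities:
        ciliumAgent:
          [CHOWN, KILL, NET_ADMIN, NET_RAW, IPC_LOCK, SYS_ADMIN,
           SYS_RESOURCE, DAC_OVERRIDE, FOWNER, SETGID, SETUID]
        cleanCiliumState: [NET_ADMIN, SYS_ADMIN, SYS_RESOURCE]
```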

## 7) Argo CD app-of-apps bootstrap

This repo includes an app-of-apps structure for cluster apps:

- Root app: `clusters/noble/root-application.yaml`
- Child apps index: `clusters/noble/apps/kustomization.yaml`
- Argo CD app: `clusters/noble/apps/argocd/application.yaml`
- Cilium app: `clusters/noble/apps/cilium/application.yaml`

Bootstrap once from your workstation:

```bash
kubectl apply -k clusters/noble/bootstrap/argocd
kubectl apply -f clusters/noble/root-application.yaml
```

After this, Argo CD continuously reconciles all applications under
`clusters/noble/apps/`.

## 8) kube-vip API VIP (`192.168.50.230`)

HAProxy has been removed in favor of `kube-vip` running on control-plane nodes.

Manifests are in:

- `clusters/noble/apps/kube-vip/application.yaml`
- `clusters/noble/apps/kube-vip/vip-rbac.yaml`
- `clusters/noble/apps/kube-vip/vip-daemonset.yaml`

The DaemonSet advertises `192.168.50.230` in ARP mode and fronts the Kubernetes
API on port `6443`.
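
The ARP-mode behavior boils down to a few environment variables on the kube-vip
container. A sketch for orientation only — the authoritative spec is in
`vip-daemonset.yaml`, and the interface name below is an assumption:

```yaml
# Sketch of the relevant container env — see vip-daemonset.yaml for the real spec.
env:
  - name: vip_interface
    value: eth0             # assumption: adjust to the node NIC
  - name: vip_arp
    value: "true"           # ARP mode
  - name: address
    value: 192.168.50.230   # the API VIP
  - name: port
    value: "6443"
  - name: cp_enable
    value: "true"           # front the control plane
  - name: svc_enable
    value: "true"           # also handle LoadBalancer Services (section 9)
```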

Apply manually (or let Argo CD sync it from the root app):

```bash
kubectl apply -k clusters/noble/apps/kube-vip
```

Validate:

```bash
kubectl -n kube-system get pods -l app.kubernetes.io/name=kube-vip-ds -o wide
nc -vz 192.168.50.230 6443
```

## 9) Argo CD via DNS host (no port)

Argo CD is exposed through a kube-vip-managed LoadBalancer Service:

- `argo.noble.lab.pcenicni.dev`

Manifests:

- `clusters/noble/bootstrap/argocd/argocd-server-lb.yaml`
- `clusters/noble/apps/kube-vip/vip-daemonset.yaml` (`svc_enable: "true"`)

After syncing the manifests, create a Pi-hole DNS A record:

- `argo.noble.lab.pcenicni.dev` -> `192.168.50.231`

## 10) Longhorn storage and extra disks

Longhorn is deployed from:

- `clusters/noble/apps/longhorn/application.yaml`

Monitoring apps are configured to use `storageClassName: longhorn`, so you can
persist Prometheus/Alertmanager/Loki data once Longhorn is healthy.

### Extra drive layout (this cluster)

Each node uses:

- `/dev/sda` — Talos install disk (`installDisk` in `talconfig.yaml`)
- `/dev/sdb` — dedicated Longhorn data disk

`talconfig.yaml` includes a global patch that partitions `/dev/sdb` and mounts it
at `/var/mnt/longhorn`, which matches Longhorn's `defaultDataPath` in the Argo
Helm values.
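
In Talos machine-config terms, that patch has roughly the following shape (a
sketch for orientation; the real patch lives in `talconfig.yaml`):

```yaml
# Sketch of a Talos machine.disks patch — the actual patch is in talconfig.yaml.
machine:
  disks:
    - device: /dev/sdb
      partitions:
        - mountpoint: /var/mnt/longhorn
```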

After editing `talconfig.yaml`, regenerate and apply configs:

```bash
cd talos
talhelper genconfig
# apply each node's YAML from clusterconfig/ with talosctl apply-config
```

Then reboot each node once so the new disk layout is applied.

### `talosctl` TLS errors (`unknown authority`, `Ed25519 verification failure`)

`talosctl` **does not** automatically use `talos/clusterconfig/talosconfig`. If you
omit it, the client falls back to **`~/.talos/config`**, which usually contains a
**different** cluster CA — you then get TLS handshake failures against the noble
nodes.

**Always** set this in the shell where you run `talosctl` (use an absolute path
if you change directories):

```bash
cd talos
export TALOSCONFIG="$(pwd)/clusterconfig/talosconfig"
export ENDPOINT=192.168.50.230
```

Sanity check (should print Talos and Kubernetes versions, not TLS errors):

```bash
talosctl -e "${ENDPOINT}" -n 192.168.50.20 version
```

Then use the same shell for `apply-config`, `reboot`, and `health`.

If it **still** fails after `TALOSCONFIG` is set, the running cluster was likely
bootstrapped with **different** secrets than the ones in your current
`talsecret.sops.yaml` / regenerated `clusterconfig/`. In that case you need the
**original** `talosconfig` that matched the cluster when it was created, or you
must align the secrets and the cluster state (recovery/rebuild is a larger topic).

Keep **`talosctl`** roughly aligned with the node Talos version (for example a
`v1.12.x` client for `v1.12.5` nodes).

**Paste tip:** run **one** command per line. Pasting `...cp-3.yaml` and
`talosctl` on the same line breaks the filename and can confuse the shell.

### More than one extra disk per node

If you add a third disk later, extend `machine.disks` in `talconfig.yaml` (for
example `/dev/sdc` → `/var/mnt/longhorn-disk2`) and register that path in
Longhorn as an additional disk for that node.

Recommended:

- use one dedicated filesystem per Longhorn disk path
- avoid using the Talos system disk for heavy Longhorn data
- spread replicas across nodes for resiliency
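
The `/dev/sdc` example above would extend the same patch, roughly like this
(a sketch; adapt device names and mountpoints to your hardware):

```yaml
# Sketch: a second data-disk entry alongside the existing one.
machine:
  disks:
    - device: /dev/sdb
      partitions:
        - mountpoint: /var/mnt/longhorn
    - device: /dev/sdc
      partitions:
        - mountpoint: /var/mnt/longhorn-disk2
```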

## 11) Upgrade Talos to `v1.12.x`

This repo now pins:

- `talosVersion: v1.12.5` in `talconfig.yaml`

### Regenerate configs

From `talos/`:

```bash
talhelper genconfig
```

### Rolling upgrade order

Upgrade one node at a time, waiting for it to return healthy before moving on.

1. Control-plane nodes (`noble-cp-1`, then `noble-cp-2`, then `noble-cp-3`)
2. Worker node (`noble-worker-1`)

Example commands (adjust the node IP per step):

```bash
talosctl --talosconfig ./clusterconfig/talosconfig -n 192.168.50.20 upgrade --image ghcr.io/siderolabs/installer:v1.12.5
talosctl --talosconfig ./clusterconfig/talosconfig -n 192.168.50.20 reboot
talosctl --talosconfig ./clusterconfig/talosconfig -n 192.168.50.20 health
```
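
The rolling order can also be scripted. This sketch is a dry run — it only
prints the per-node command sequence in control-plane-first order (remove the
leading `echo`s to execute, and wait for `health` before moving to the next
node); the IPs mirror `talconfig.yaml`:

```bash
# Dry-run sketch of the rolling upgrade: control-plane nodes first, worker last.
# Remove the leading `echo`s to run for real.
TC=./clusterconfig/talosconfig
IMAGE=ghcr.io/siderolabs/installer:v1.12.5
for ip in 192.168.50.20 192.168.50.30 192.168.50.40 192.168.50.10; do
  echo talosctl --talosconfig "$TC" -n "$ip" upgrade --image "$IMAGE"
  echo talosctl --talosconfig "$TC" -n "$ip" reboot
  echo talosctl --talosconfig "$TC" -n "$ip" health
done
```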

After all nodes are upgraded, verify:

```bash
talosctl --talosconfig ./clusterconfig/talosconfig version
kubectl get nodes -o wide
```