# Talos deployment (4 nodes)
This directory contains a `talhelper` cluster definition for a 4-node Talos
cluster:
- 3 hybrid control-plane/worker nodes: `noble-cp-1..3`
- 1 worker-only node: `noble-worker-1`
- `allowSchedulingOnControlPlanes: true`
- CNI: `none` (for Cilium via GitOps)
## 1) Update values for your environment
Edit `talconfig.yaml`:
- `endpoint` (Kubernetes API VIP or LB IP)
- **`additionalApiServerCertSans`** / **`additionalMachineCertSans`**: must include the
**same VIP** (and DNS name, if you use one) that clients and `talosctl` use —
otherwise TLS to `https://<VIP>:6443` fails because the cert only lists node
IPs by default. This repo sets **`192.168.50.230`** (and
**`kube.noble.lab.pcenicni.dev`**) to match kube-vip.
- each node `ipAddress`
- each node `installDisk` (for example `/dev/sda`, `/dev/nvme0n1`)
- `talosVersion` / `kubernetesVersion` if desired
After changing SANs, run **`talhelper genconfig`**, re-**apply-config** to all
**control-plane** nodes (certs are regenerated), then refresh **`talosctl kubeconfig`**.
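The refresh flow can be sketched as a dry run (the leading `echo`s only print each command; remove them to execute — node IPs and filenames are assumptions taken from this repo's `talconfig.yaml`):

```shell
# Dry run of the SAN refresh flow; remove the leading `echo`s to execute.
# Node IPs and filenames are assumptions from this repo's talconfig.yaml.
echo talhelper genconfig
for n in 1 2 3; do
  ip="192.168.50.$((n * 10 + 10))"   # .20, .30, .40
  echo talosctl apply-config -n "$ip" -f "clusterconfig/noble-noble-cp-${n}.yaml"
done
echo talosctl kubeconfig -n 192.168.50.20 -e 192.168.50.230 .
```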
## 2) Generate cluster secrets and machine configs
From this directory:
```bash
talhelper gensecret > talsecret.sops.yaml
talhelper genconfig
```
Generated machine configs are written to `clusterconfig/`.
## 3) Apply Talos configs
Apply each node file to the matching node IP from `talconfig.yaml`:
```bash
talosctl apply-config --insecure -n 192.168.50.20 -f clusterconfig/noble-noble-cp-1.yaml
talosctl apply-config --insecure -n 192.168.50.30 -f clusterconfig/noble-noble-cp-2.yaml
talosctl apply-config --insecure -n 192.168.50.40 -f clusterconfig/noble-noble-cp-3.yaml
talosctl apply-config --insecure -n 192.168.50.10 -f clusterconfig/noble-noble-worker-1.yaml
```
## 4) Bootstrap the cluster
After all nodes are up (bootstrap once, from any control-plane node):
```bash
talosctl bootstrap -n 192.168.50.20 -e 192.168.50.230
talosctl kubeconfig -n 192.168.50.20 -e 192.168.50.230 .
```
## 5) Validate
```bash
talosctl -n 192.168.50.20 -e 192.168.50.230 health
kubectl get nodes -o wide
```
### `kubectl` errors: `lookup https: no such host` or `https://https/...`
That means the **active** kubeconfig has a broken `cluster.server` URL (often a
**double** `https://` or **duplicate** `:6443`). Kubernetes then tries to resolve
the hostname `https`, which fails.
Inspect what you are using:
```bash
kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}{"\n"}'
```
It must be a **single** valid URL, for example:
- `https://192.168.50.230:6443` (API VIP from `talconfig.yaml`), or
- `https://kube.noble.lab.pcenicni.dev:6443` (if DNS points at that VIP)
Fix the cluster entry (replace `noble` with your context's cluster name if
different):
```bash
kubectl config set-cluster noble --server=https://192.168.50.230:6443
```
Or point `kubectl` at this repo's kubeconfig (known-good server line):
```bash
export KUBECONFIG="$(pwd)/kubeconfig"
kubectl cluster-info
```
Avoid pasting `https://` twice when running `kubectl config set-cluster ... --server=...`.
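To sanity-check a candidate server URL before touching any kubeconfig, a loose shell check helps; the regex is a sketch (one scheme, one host, at most one port), not an exhaustive URL validator:

```shell
# Paste your server value here; the check catches the common breakages:
# a doubled "https://" prefix or a duplicated ":6443" suffix.
server="https://192.168.50.230:6443"
if printf '%s\n' "$server" | grep -Eq '^https://[^/:]+(:[0-9]+)?$'; then
  echo "server URL looks valid: $server"
else
  echo "malformed server URL (double scheme or duplicate port?): $server" >&2
fi
```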
### `kubectl apply` fails: `localhost:8080` / `openapi` connection refused
`kubectl` is **not** using a real cluster config; it falls back to the default
`http://localhost:8080` (no `KUBECONFIG`, empty file, or wrong file).
Fix:
```bash
cd talos
export KUBECONFIG="$(pwd)/kubeconfig"
kubectl config current-context
kubectl cluster-info
```
Then run `kubectl apply` from the **repository root** (parent of `talos/`) in
the same shell. Do **not** use a literal `cd /path/to/...` — that was only a
placeholder. Example (adjust to where you cloned this repo):
```bash
export KUBECONFIG="${HOME}/Developer/home-server/talos/kubeconfig"
```
`kubectl config set-cluster noble ...` only updates the file **`kubectl` is
actually reading** (often `~/.kube/config`). It does nothing if `KUBECONFIG`
points at another path.
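A small guard you can paste before any `kubectl` run to catch this early (a sketch; it only inspects the environment and never contacts the cluster):

```shell
# Fail fast instead of letting kubectl silently fall back to localhost:8080.
if [ -z "${KUBECONFIG:-}" ]; then
  echo "KUBECONFIG is not set; export it before running kubectl" >&2
elif [ ! -s "$KUBECONFIG" ]; then
  echo "KUBECONFIG points at a missing or empty file: $KUBECONFIG" >&2
else
  echo "using kubeconfig: $KUBECONFIG"
fi
```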
## 6) GitOps-pinned Cilium values
The Cilium settings that worked for this Talos cluster are now persisted in:
- `clusters/noble/apps/cilium/helm-values.yaml`
- `clusters/noble/apps/cilium/application.yaml` (Helm chart + `valueFiles` from this repo)
That Argo CD `Application` pins chart `1.16.6` and uses the same values file
for API host/port, cgroup settings, IPAM CIDR, and security capabilities.
### Cilium before Argo CD (`cni: none`)
This cluster uses **`cniConfig.name: none`** in `talconfig.yaml` so Talos does
not install a CNI. **Argo CD pods cannot schedule** until some CNI makes nodes
`Ready` (otherwise the `node.kubernetes.io/not-ready` taint blocks scheduling).
Install Cilium **once** with Helm from your workstation (same chart and values
Argo will manage later), **then** bootstrap Argo CD:
```bash
helm repo add cilium https://helm.cilium.io/
helm repo update
helm upgrade --install cilium cilium/cilium \
  --namespace kube-system \
  --version 1.16.6 \
  -f clusters/noble/apps/cilium/helm-values.yaml \
  --wait --timeout 10m
kubectl get nodes
kubectl wait --for=condition=Ready nodes --all --timeout=300s
```
If **`helm upgrade --install` seems stuck** after “Installing it now”, it is usually still
pulling images (`quay.io/cilium/...`) or waiting for pods to become Ready. In
another terminal run `kubectl get pods -n kube-system -w` and check for
`ImagePullBackOff`, `Pending`, or `CrashLoopBackOff`. To avoid blocking on
Helm's wait logic, install without `--wait`, confirm Cilium pods, then continue:
```bash
helm upgrade --install cilium cilium/cilium \
  --namespace kube-system \
  --version 1.16.6 \
  -f clusters/noble/apps/cilium/helm-values.yaml
kubectl get pods -n kube-system -l app.kubernetes.io/part-of=cilium -w
```
`helm-values.yaml` sets **`operator.replicas: 1`** so the chart default (two
operators with hard anti-affinity) cannot deadlock `helm --wait` when only one
node can take the operator early in bootstrap.
If **`helm upgrade` fails** with server-side apply conflicts and
**`argocd-controller`**, Argo already synced Cilium and **owns those fields**
on live objects. Clearing **`syncPolicy`** on the Application does **not**
remove that ownership; Helm still conflicts until you **take over** the fields
or only use Argo.
**One-shot CLI fix** (Helm 3.13+): add **`--force-conflicts`** so SSA wins the
disputed fields:
```bash
helm upgrade --install cilium cilium/cilium \
  --namespace kube-system \
  --version 1.16.6 \
  -f clusters/noble/apps/cilium/helm-values.yaml \
  --force-conflicts
```
Typical conflicts: Secret **`hubble-server-certs`** (`.data` TLS) and
Deployment **`cilium-operator`** (`.spec.replicas`,
`.spec/strategy/rollingUpdate/maxUnavailable`). The **`cilium` Application**
lists **`ignoreDifferences`** for those paths plus **`RespectIgnoreDifferences`**
so later Argo syncs do not keep overwriting them. Apply the manifest after you
change it: **`kubectl apply -f clusters/noble/apps/cilium/application.yaml`**.
After bootstrap, prefer syncing Cilium **only through Argo** (from Git) instead
of ad hoc Helm, unless you suspend the **`cilium`** Application first.
Shell tip: a comment line must start with a real **`#`**; if the shell
reports **`command not found: #`**, the character is not an actual hash or the
line was pasted wrong. Run **`kubectl apply ...`** as its own command, without a
leading comment in the same paste block.
If nodes were already `Ready`, you can skip straight to section 7.
## 7) Argo CD app-of-apps bootstrap
This repo includes an app-of-apps structure for cluster apps:
- Root app: `clusters/noble/root-application.yaml`
- Child apps index: `clusters/noble/apps/kustomization.yaml`
- Argo CD app: `clusters/noble/apps/argocd/application.yaml`
- Cilium app: `clusters/noble/apps/cilium/application.yaml`
Bootstrap once from your workstation:
```bash
kubectl apply -k clusters/noble/bootstrap/argocd
kubectl wait --for=condition=Established crd/appprojects.argoproj.io --timeout=120s
kubectl apply -f clusters/noble/bootstrap/argocd/default-appproject.yaml
kubectl apply -f clusters/noble/root-application.yaml
```
If the first command errors on `AppProject` (“no matches for kind `AppProject`”), the CRDs were not ready yet; run the `kubectl wait` and `kubectl apply -f .../default-appproject.yaml` lines, then continue.
After this, Argo CD continuously reconciles all applications under
`clusters/noble/apps/`.
## 8) kube-vip API VIP (`192.168.50.230`)
HAProxy has been removed in favor of `kube-vip` running on control-plane nodes.
Manifests are in:
- `clusters/noble/apps/kube-vip/application.yaml`
- `clusters/noble/apps/kube-vip/vip-rbac.yaml`
- `clusters/noble/apps/kube-vip/vip-daemonset.yaml`
The DaemonSet advertises `192.168.50.230` in ARP mode and fronts the Kubernetes
API on port `6443`.
Apply manually (or let Argo CD sync from root app):
```bash
kubectl apply -k clusters/noble/apps/kube-vip
```
Validate:
```bash
kubectl -n kube-system get pods -l app.kubernetes.io/name=kube-vip-ds -o wide
nc -vz 192.168.50.230 6443
```
If **`kube-vip-ds` pods are `CrashLoopBackOff`**, logs usually show
`could not get link for interface '…'`. kube-vip binds the VIP to
**`vip_interface`**; on Talos the uplink is often **`eno1`**, **`enp0s…`**, or
**`enx…`**, not **`eth0`**. List the links on a control-plane node (IP from `talconfig.yaml`):
```bash
talosctl -n 192.168.50.20 get links
```
Do **not** paste that command's **table output** back into the shell: zsh runs
each line as a command (e.g. `192.168.50.20` gives `command not found`), and a line
starting with **`NODE`** can be mistaken for the **`node`** binary and try to
load a file like **`NAMESPACE`** in the current directory. Also avoid pasting
the **prompt** (`(base) … %`) together with the command (a duplicated prompt
causes parse errors).
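A safer pattern is to save the table to a file and filter it locally instead of interacting with it at all. The sample table and the ID-in-column-4 layout below are assumptions about this `talosctl` version's output; check yours before relying on the column number:

```shell
# Filter link IDs out of a saved `talosctl get links` table instead of
# pasting the raw table into the shell. The sample rows are illustrative;
# column 4 (ID) is an assumption about the table layout.
awk 'NR > 1 { print $4 }' <<'EOF'
NODE            NAMESPACE   TYPE   ID     VERSION
192.168.50.20   network     Link   eno1   3
192.168.50.20   network     Link   lo     2
EOF
```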
Set **`vip_interface`** in `clusters/noble/apps/kube-vip/vip-daemonset.yaml` to
that link's **`metadata.id`**, commit, sync (or `kubectl apply -k
clusters/noble/apps/kube-vip`), and confirm pods go **`Running`**.
## 9) Argo CD via DNS host (no port)
Argo CD is exposed through a kube-vip managed LoadBalancer Service:
- `argo.noble.lab.pcenicni.dev`
Manifests:
- `clusters/noble/bootstrap/argocd/argocd-server-lb.yaml`
- `clusters/noble/apps/kube-vip/vip-daemonset.yaml` (`svc_enable: "true"`)
After syncing manifests, create a Pi-hole DNS A record:
- `argo.noble.lab.pcenicni.dev` -> `192.168.50.231`
## 10) Longhorn storage and extra disks
Longhorn is deployed from:
- `clusters/noble/apps/longhorn/application.yaml`
Monitoring apps are configured to use `storageClassName: longhorn`, so you can
persist Prometheus/Alertmanager/Loki data once Longhorn is healthy.
### Argo CD: `longhorn` OutOfSync, Health **Missing**, no `longhorn-role`
**Missing** means nothing has been applied yet, or a sync never completed. The
Helm chart creates `ClusterRole/longhorn-role` on a successful install.
1. See the failure reason:
```bash
kubectl describe application longhorn -n argocd
```
Check **Status → Conditions** and **Status → Operation State** for the error
(for example Helm render error, CRD apply failure, or repo-server cannot reach
`https://charts.longhorn.io`).
2. Trigger a sync (Argo CD UI **Sync**, or CLI):
```bash
argocd app sync longhorn
```
3. After a good sync, confirm:
```bash
kubectl get clusterrole longhorn-role
kubectl get pods -n longhorn-system
```
### Extra drive layout (this cluster)
Each node uses:
- `/dev/sda` — Talos install disk (`installDisk` in `talconfig.yaml`)
- `/dev/sdb` — dedicated Longhorn data disk
`talconfig.yaml` includes a global patch that partitions `/dev/sdb` and mounts it
at `/var/mnt/longhorn`, which matches Longhorn `defaultDataPath` in the Argo
Helm values.
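The patch has roughly this shape (a sketch against the Talos machine-config schema; the authoritative version is the actual patch in `talconfig.yaml`):

```yaml
machine:
  disks:
    - device: /dev/sdb                    # dedicated Longhorn data disk
      partitions:
        - mountpoint: /var/mnt/longhorn   # must match Longhorn defaultDataPath
```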
After editing `talconfig.yaml`, regenerate and apply configs:
```bash
cd talos
talhelper genconfig
# apply each node's YAML from clusterconfig/ with talosctl apply-config
```
Then reboot each node once so the new disk layout is applied.
### `talosctl` TLS errors (`unknown authority`, `Ed25519 verification failure`)
`talosctl` **does not** automatically use `talos/clusterconfig/talosconfig`. If you
omit it, the client falls back to **`~/.talos/config`**, which is usually a
**different** cluster CA — you then get TLS handshake failures against the noble
nodes.
**Always** set this in the shell where you run `talosctl` (use an absolute path
if you change directories):
```bash
cd talos
export TALOSCONFIG="$(pwd)/clusterconfig/talosconfig"
export ENDPOINT=192.168.50.230
```
Sanity check (should print Talos and Kubernetes versions, not TLS errors):
```bash
talosctl -e "${ENDPOINT}" -n 192.168.50.20 version
```
Then use the same shell for `apply-config`, `reboot`, and `health`.
If it **still** fails after `TALOSCONFIG` is set, the running cluster was likely
bootstrapped with **different** secrets than the ones in your current
`talsecret.sops.yaml` / regenerated `clusterconfig/`. In that case you need the
**original** `talosconfig` that matched the cluster when it was created, or you
must align secrets and cluster state (recovery / rebuild is a larger topic).
Keep **`talosctl`** roughly aligned with the node Talos version (for example
`v1.12.x` clients for `v1.12.5` nodes).
**Paste tip:** run **one** command per line. Pasting `...cp-3.yaml` and
`talosctl` on the same line breaks the filename and can confuse the shell.
### More than one extra disk per node
If you add a third disk later, extend `machine.disks` in `talconfig.yaml` (for
example `/dev/sdc` mounted at `/var/mnt/longhorn-disk2`) and register that path
in Longhorn as an additional disk for that node.
Recommended:
- use one dedicated filesystem per Longhorn disk path
- avoid using the Talos system disk for heavy Longhorn data
- spread replicas across nodes for resiliency
## 11) Upgrade Talos to `v1.12.x`
This repo now pins:
- `talosVersion: v1.12.5` in `talconfig.yaml`
### Regenerate configs
From `talos/`:
```bash
talhelper genconfig
```
### Rolling upgrade order
Upgrade one node at a time, waiting for it to return healthy before moving on.
1. Control plane nodes (`noble-cp-1`, then `noble-cp-2`, then `noble-cp-3`)
2. Worker node (`noble-worker-1`)
Example commands (adjust node IP per step):
```bash
talosctl --talosconfig ./clusterconfig/talosconfig -n 192.168.50.20 upgrade --image ghcr.io/siderolabs/installer:v1.12.5
talosctl --talosconfig ./clusterconfig/talosconfig -n 192.168.50.20 reboot
talosctl --talosconfig ./clusterconfig/talosconfig -n 192.168.50.20 health
```
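The rolling order can also be sketched as a loop over node IPs, shown here as a dry run (the leading `echo`s only print each command; remove them to execute, and let `health` settle before the loop moves on):

```shell
# Dry run of the rolling upgrade (control planes first, worker last);
# remove the leading `echo`s to execute for real.
for ip in 192.168.50.20 192.168.50.30 192.168.50.40 192.168.50.10; do
  echo talosctl --talosconfig ./clusterconfig/talosconfig -n "$ip" \
    upgrade --image ghcr.io/siderolabs/installer:v1.12.5
  echo talosctl --talosconfig ./clusterconfig/talosconfig -n "$ip" health
done
```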
After all nodes are upgraded, verify:
```bash
talosctl --talosconfig ./clusterconfig/talosconfig version
kubectl get nodes -o wide
```
## 12) Destroy the cluster and rebuild from scratch
Use this when Kubernetes / etcd / Argo / Longhorn state is corrupted and you want a
**clean** cluster. This **wipes cluster state on the nodes** (etcd, workloads,
Longhorn data on cluster disks). Plan for **downtime** and **backup** anything
you must keep off-cluster first.
### 12.1 Reset every Talos node (Kubernetes is destroyed)
From `talos/` with a working **`talosconfig`** that matches the machines (same
`TALOSCONFIG` / `ENDPOINT` guidance as elsewhere in this README):
```bash
cd talos
export TALOSCONFIG="$(pwd)/clusterconfig/talosconfig"
export ENDPOINT=192.168.50.230
```
Reset **one node at a time**, waiting for each to reboot before the next. Order:
**worker first**, then **non-bootstrap control planes**, then the **bootstrap**
control plane **last** (`noble-cp-1`, `192.168.50.20`).
```bash
talosctl -e "${ENDPOINT}" -n 192.168.50.10 reset --graceful=false
talosctl -e "${ENDPOINT}" -n 192.168.50.30 reset --graceful=false
talosctl -e "${ENDPOINT}" -n 192.168.50.40 reset --graceful=false
talosctl -e "${ENDPOINT}" -n 192.168.50.20 reset --graceful=false
```
If the API VIP is already unreachable, target the **node IP** as endpoint for that
node, for example:
`talosctl -e 192.168.50.10 -n 192.168.50.10 reset --graceful=false`.
Your workstation **`kubeconfig`** will not work for the old cluster after this;
that is expected until you bootstrap again.
### 12.2 (Optional) New cluster secrets
For a fully fresh identity (new cluster CA and `talosconfig`):
```bash
cd talos
talhelper gensecret > talsecret.sops.yaml
# encrypt / store talsecret as you usually do, then:
talhelper genconfig
```
If you **keep** the existing `talsecret.sops.yaml`, still run **`talhelper genconfig`**
so `clusterconfig/` matches what you will apply.
### 12.3 Apply configs, bootstrap, kubeconfig
Repeat **§3 Apply Talos configs** and **§4 Bootstrap the cluster** (and **§5
Validate**) from the top of this README: `apply-config` each node, then
`talosctl bootstrap`, then `talosctl kubeconfig` into `talos/kubeconfig`.
### 12.4 Redeploy GitOps (Argo CD + apps)
From your workstation (repo root), with `KUBECONFIG` pointing at the new
`talos/kubeconfig`:
```bash
# Set REPO to the directory that contains both talos/ and clusters/ (not a literal "path/to")
REPO="${HOME}/Developer/home-server"
export KUBECONFIG="${REPO}/talos/kubeconfig"
cd "${REPO}"
kubectl apply -k clusters/noble/bootstrap/argocd
kubectl apply -f clusters/noble/root-application.yaml
```
Resolve **Argo CD admin** login (secret / password reset) as needed; then let
`noble-root` sync `clusters/noble/apps/`.
## 13) Mid-rebuild issues: etcd, bootstrap, and `apply-config`
### `tls: certificate required` when using `apply-config --insecure`
After a node has **joined** the cluster, the Talos API expects **client
certificates** from your `talosconfig`. `--insecure` only applies to **maintenance**
(before join / after a reset).
**Do one of:**
- Apply config **with** `talosconfig` (no `--insecure`):
```bash
cd talos
export TALOSCONFIG="$(pwd)/clusterconfig/talosconfig"
export ENDPOINT=192.168.50.230
talosctl -e "${ENDPOINT}" apply-config -n 192.168.50.30 -f clusterconfig/noble-noble-cp-2.yaml
```
- Or **`talosctl reset`** that node first (see §12.1), then use
`apply-config --insecure` again while it is in maintenance.
### `bootstrap`: `etcd data directory is not empty`
The bootstrap node (`192.168.50.20`) already has a **previous etcd** on disk (failed
or partial bootstrap). Kubernetes will not bootstrap again until that state is
**wiped**.
**Fix:** run **`talosctl reset --graceful=false`** on the **control plane nodes**
(at minimum the bootstrap node; often **all four nodes** is cleaner). See §12.1.
Then re-apply machine configs and run **`talosctl bootstrap` exactly once**.
### etcd unhealthy / “Preparing” on some control planes
Usually means **split or partial** cluster state. The reliable fix is the same
**full reset** (§12.1), then a single ordered bring-up: apply all configs →
bootstrap once → `talosctl health`.