# Talos — noble lab
- **Cluster build checklist (exported TODO):** [CLUSTER-BUILD.md](./CLUSTER-BUILD.md)
- **Operational runbooks (API VIP, etcd, Longhorn, SOPS):** [runbooks/README.md](./runbooks/README.md)
## Versions
Align with [CLUSTER-BUILD.md](./CLUSTER-BUILD.md): Talos **v1.12.6**; `talosctl` client should match installed node image.
## DNS (prerequisites)
| Name | Points to |
|------|-----------|
| `noble.lab`, `kube.noble.lab` (API SANs) | `192.168.50.230` (kube-vip) |
| `*.apps.noble.lab.pcenicni.dev` | Traefik `LoadBalancer` IP from MetalLB pool (`192.168.50.210`–`229`) once ingress is up |
## 1. Secrets and generated configs
From this directory:
```bash
talhelper gensecret > talsecret.yaml
# Encrypt for git if desired: sops -e -i talsecret.sops.yaml (see talhelper docs)
talhelper genconfig -o out
```
`out/` is ignored via repo root `.gitignore` (`talos/out/`). Do not commit `talsecret.yaml` or generated machine configs.
**Never commit `talos/kubeconfig`** (also gitignored). It contains cluster admin credentials; generate locally with `talosctl kubeconfig` (§3). If it was ever pushed, remove it from git tracking, regenerate kubeconfig, and treat the old credentials as compromised (purge from history with `git filter-repo` or BFG if needed).
**After any `talconfig.yaml` edit, run `genconfig` again** before `apply-config`. Stale `out/*.yaml` is easy to apply by mistake. Quick check: `grep -A8 kind: UserVolumeConfig out/noble-neon.yaml` should match what you expect (e.g. Longhorn `volumeType: disk`, not `grow`/`maxSize` on a partition).
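The stale-output risk described above can also be checked by file timestamps. This is a minimal sketch, assuming that modification times reflect the last `genconfig` run; `stale_configs` is a hypothetical helper, not part of `talhelper` or `talosctl`.

```bash
# Sketch: list generated configs that are not newer than talconfig.yaml.
# stale_configs is a hypothetical helper name used only for illustration.
stale_configs() {
  src="${1:-talconfig.yaml}"   # hand-edited source of truth
  out_dir="${2:-out}"          # directory written by `talhelper genconfig -o out`
  [ -f "$src" ] || { echo "missing $src" >&2; return 2; }
  [ -d "$out_dir" ] || { echo "missing $out_dir (run genconfig first)" >&2; return 2; }
  # Print every generated YAML whose mtime is not later than the source file.
  find "$out_dir" -name '*.yaml' ! -newer "$src" -print
}
```

Run `stale_configs` from this directory before `apply-config`; any path it prints was generated before the last `talconfig.yaml` edit, so run `talhelper genconfig -o out` again. Empty output only means no stale file was found by timestamp, so the `grep` check above is still worth doing.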
## 2. Apply machine config
Order: **§1 `genconfig` → apply all nodes → §3 bootstrap** (not the reverse). Use the same `talsecret` / `out/` generation for the life of the cluster; rotating secrets without reinstalling nodes breaks client trust.
**A) First install — node still in maintenance mode** (no Talos OS on disk yet, or explicitly in maintenance):
```bash
talosctl apply-config --insecure -n 192.168.50.20 --file out/noble-neon.yaml
# repeat for each node; TALOSCONFIG not required for --insecure maintenance API
```
**B) Node already installed / cluster already bootstrapped** (`tls: certificate required` if you use `--insecure` here):
```bash
export TALOSCONFIG="${TALOSCONFIG:-$(pwd)/out/talosconfig}"
talosctl apply-config -n 192.168.50.20 --file out/noble-neon.yaml
```
**Do not pass `--insecure` for (B).** With `--insecure`, `talosctl` does not use client certificates from `TALOSCONFIG`, so the node still responds with `tls: certificate required`. The flag means “maintenance API only,” not “skip server verification.”
**Wrong (what triggers the error):**
```bash
export TALOSCONFIG="$(pwd)/out/talosconfig"
talosctl apply-config --insecure -n 192.168.50.20 --file out/noble-neon.yaml # still broken on joined nodes
```
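To make the (A)/(B) distinction harder to get wrong, the choice of flags can be wrapped in a small dry-run helper. This is only a sketch: `apply_config_cmd` and its `maintenance` / `installed` mode names are hypothetical, and the function prints the command instead of running it.

```bash
# Sketch: print (not run) the apply-config command for one node.
# apply_config_cmd and its mode names are hypothetical, for illustration only.
apply_config_cmd() {
  mode="$1"; ip="$2"; file="$3"
  case "$mode" in
    maintenance)
      # (A) maintenance API: no client certificate is used, so --insecure is needed
      echo "talosctl apply-config --insecure -n $ip --file $file" ;;
    installed)
      # (B) installed node: authenticate via TALOSCONFIG, never --insecure
      echo "talosctl apply-config -n $ip --file $file" ;;
    *)
      echo "usage: apply_config_cmd maintenance|installed <ip> <file>" >&2
      return 2 ;;
  esac
}
```

For example, `apply_config_cmd installed 192.168.50.20 out/noble-neon.yaml` prints the (B) form; review the output, make sure `TALOSCONFIG` is exported, then run it.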
## 3. Bootstrap and kubeconfig
Bootstrap **once** on the first control plane **after** configs are applied (example: neon):
```bash
export TALOSCONFIG="${TALOSCONFIG:-$(pwd)/out/talosconfig}"
talosctl bootstrap -n 192.168.50.20
```
After the API is up (direct node IP first; use VIP after kube-vip is healthy):
```bash
export TALOSCONFIG="${TALOSCONFIG:-$(pwd)/out/talosconfig}"
talosctl kubeconfig ./kubeconfig -n 192.168.50.20 -e 192.168.50.230 --merge=false
export KUBECONFIG="$(pwd)/kubeconfig"
kubectl get nodes
```
Adjust `-n` / `-e` if your bootstrap node or VIP differ.
**Reachability (same idea for Talos and Kubernetes):**
| Command | What it connects to |
|---------|---------------------|
| `talosctl … -e <addr>` | Talos **apid** on `<addr>:50000` (not 6443) |
| `kubectl` / Helm | Kubernetes API on `https://<addr>:6443` from kubeconfig |
If your Mac shows **`network is unreachable`** to `192.168.50.230`, fix **L2/L3** first (same **LAN** as the nodes, **VPN**, or routing). **`talosctl kubeconfig -e 192.168.50.20`** only chooses **which Talos node** fetches the admin cert; the **`server:`** URL inside kubeconfig still comes from **`cluster.controlPlane.endpoint`** in Talos config (here **`https://192.168.50.230:6443`**). So `kubectl` can still dial the **VIP** even when `-e` used a node IP.
After a successful `talosctl kubeconfig`, **point kubectl at a reachable control-plane IP** (same as bootstrap node until kube-vip works from your network):
```bash
export TALOSCONFIG="${TALOSCONFIG:-$(pwd)/out/talosconfig}"
talosctl kubeconfig ./kubeconfig -n 192.168.50.20 -e 192.168.50.20 --merge=false
export KUBECONFIG="$(pwd)/kubeconfig"
# Kubeconfig still says https://192.168.50.230:6443 — override if VIP is unreachable from this machine:
kubectl config set-cluster noble --server=https://192.168.50.20:6443
kubectl get nodes
```
One-liner alternative (macOS/BSD `sed -i ''`; on Linux use `sed -i`):
```bash
sed -i '' 's|https://192.168.50.230:6443|https://192.168.50.20:6443|g' kubeconfig
```
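If the same snippet has to work on both macOS and Linux, one option is to avoid `sed -i` entirely and go through a temporary file. The sketch below assumes the old and new URLs contain no `|` characters; `rewrite_server` is a hypothetical helper name.

```bash
# Sketch: rewrite the API server URL in a kubeconfig without `sed -i`,
# whose syntax differs between macOS/BSD sed and GNU sed.
# rewrite_server is a hypothetical helper name.
rewrite_server() {
  file="$1"; old="$2"; new="$3"
  tmp="$(mktemp)" || return 1
  # Write to a temp file first; replace the original only if sed succeeded.
  if sed "s|$old|$new|g" "$file" > "$tmp"; then
    mv "$tmp" "$file"
  else
    rm -f "$tmp"
    return 1
  fi
}
```

Usage: `rewrite_server kubeconfig https://192.168.50.230:6443 https://192.168.50.20:6443`. Because the result is the moved temp file, it ends up with `mktemp`'s owner-only permissions, which suits a file holding admin credentials.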
Quick check from your Mac: `nc -vz 192.168.50.20 50000` (Talos) and `nc -vz 192.168.50.20 6443` (Kubernetes).
**`dial tcp 192.168.50.230:6443` on nodes:** Host-network components (including **Cilium**) cannot use the in-cluster `kubernetes` Service; by default they follow **`cluster.controlPlane.endpoint`** (the VIP). Talos **KubePrism** on **`127.0.0.1:7445`** (default) load-balances to healthy apiservers. Ensure the CNI Helm values set **`k8sServiceHost: "127.0.0.1"`** and **`k8sServicePort: "7445"`**; see [`clusters/noble/bootstrap/cilium/values.yaml`](../clusters/noble/bootstrap/cilium/values.yaml). Also confirm that **kube-vip**'s **`vip_interface`** matches the uplink (check with `talosctl -n <ip> get links`; it is **`ens18`** on these nodes). A bare **`curl -k https://192.168.50.230:6443/healthz`** often returns **`401 Unauthorized`** because no client cert was sent; that response still means TLS to the VIP worked.
**Verify the VIP with `kubectl` (copy as-is):** use a real kubeconfig path (not ` /path/to/…`). From the **repository root**:
```bash
export KUBECONFIG="${KUBECONFIG:-$(pwd)/talos/kubeconfig}"
kubectl config set-cluster noble --server=https://192.168.50.230:6443
kubectl get --raw /healthz
```
Expect a single line: **`ok`**. If you see **`The connection to the server localhost:8080 was refused`**, `KUBECONFIG` was missing or wrong (e.g. typo **`.export`** instead of **`export`**, or a path that does not exist). Do not put **`#` comments** on the same line as `kubectl config set-cluster` when pasting — some shells copy the comment into the command.
**`kubectl` → `localhost:8080` / connection refused:** `talosctl kubeconfig` did **not** write a valid kubeconfig (often because the step above failed). Fix Talos/API reachability first; do not trust `kubectl` until `talosctl kubeconfig` completes without error.
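A quick guard before any `kubectl` call can catch the missing-or-wrong `KUBECONFIG` case early. This is a sketch that assumes `KUBECONFIG` holds a single path rather than a colon-separated list; `check_kubeconfig` is a hypothetical helper, not a `kubectl` subcommand.

```bash
# Sketch: fail early when KUBECONFIG is unset or points at a missing/empty file.
# check_kubeconfig is a hypothetical helper name; it assumes a single path.
check_kubeconfig() {
  if [ -z "${KUBECONFIG:-}" ]; then
    # Without KUBECONFIG, kubectl uses ~/.kube/config, or localhost:8080 if that is absent.
    echo "KUBECONFIG is not set" >&2
    return 1
  fi
  if [ ! -s "$KUBECONFIG" ]; then
    echo "KUBECONFIG=$KUBECONFIG is missing or empty" >&2
    return 1
  fi
}
```

Usage: `check_kubeconfig && kubectl get --raw /healthz`. The check only confirms that the file exists and is non-empty; it does not validate the credentials inside it.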
## 4. Platform manifests (this repo)
| Component | Apply |
|-----------|--------|
| Cilium | **Before** kube-vip/MetalLB scheduling: Helm from [`clusters/noble/bootstrap/cilium/README.md`](../clusters/noble/bootstrap/cilium/README.md) (`values.yaml`) |
| kube-vip | `kubectl apply -k ../clusters/noble/bootstrap/kube-vip` |
| MetalLB pool | After MetalLB controller install: `kubectl apply -k ../clusters/noble/bootstrap/metallb` |
| Longhorn PSA + Helm | `kubectl apply -k ../clusters/noble/bootstrap/longhorn` then Helm from §5 below |
Set `vip_interface` in `clusters/noble/bootstrap/kube-vip/vip-daemonset.yaml` if it does not match the control-plane uplink (`talosctl -n <cp-ip> get links`).
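To see what the manifest currently sets before comparing it with `get links`, the value can be pulled out with `awk`. This sketch assumes the DaemonSet declares the setting as a container env entry, with a `name: vip_interface` line followed by a `value:` line; the actual layout of `vip-daemonset.yaml` may differ, and `vip_interface_of` is a hypothetical helper name.

```bash
# Sketch: print the vip_interface value from a kube-vip manifest.
# Assumes an env entry of the form:
#   - name: vip_interface
#     value: "ens18"
# vip_interface_of is a hypothetical helper name.
vip_interface_of() {
  awk '
    /name: *vip_interface/ { found = 1; next }
    found && /value:/ {
      gsub(/"/, "", $2)   # drop optional double quotes around the value
      print $2
      exit
    }
  ' "$1"
}
```

Usage from `talos/`: `vip_interface_of ../clusters/noble/bootstrap/kube-vip/vip-daemonset.yaml`, then compare the printed name with the uplink shown by `talosctl -n <cp-ip> get links`.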
## 5. Longhorn (Talos)
1. **Machine image:** `talconfig.yaml` includes `iscsi-tools` and `util-linux-tools` extensions. After `talhelper genconfig`, **upgrade each node** so the running installer image matches (extensions are in the image, not applied live by config alone). If `longhorn-manager` logs **`iscsiadm` / `open-iscsi`**, the node image does not include the extension yet.
2. **Pod Security + path:** Apply `kubectl apply -k ../clusters/noble/bootstrap/longhorn` (privileged `longhorn-system`). The Helm chart host-mounts **`/var/lib/longhorn`**; `talconfig` adds a kubelet **bind** from `/var/mnt/longhorn` → `/var/lib/longhorn` so that path matches the dedicated XFS volume.
3. **Data path:** From the **repository root** (not `talos/`), run Helm with a real release and chart name — not literal `...`:
```bash
helm repo add longhorn https://charts.longhorn.io && helm repo update
helm upgrade --install longhorn longhorn/longhorn -n longhorn-system --create-namespace \
-f clusters/noble/bootstrap/longhorn/values.yaml
```
If Longhorn defaults to `/var/lib/longhorn`, you get **wrong format** / **no space** on the Talos root filesystem.
4. **Disk device:** Second disk is often `/dev/vdb` under Proxmox virtio; `talconfig` selects `sdb` or `vdb`. Confirm with `talosctl get disks -n <ip>`.
5. **`filesystem type mismatch: gpt != xfs` on `volumeType: disk`:** The data disk still has a **GPT** from an older partition attempt. Whole-disk XFS needs a **raw** disk. Talos cannot `wipe disk` while `u-longhorn` claims the device.
**Repo layout:** `talconfig.yaml` = **wipe-phase** (no Longhorn volume / no kubelet bind). `talconfig.with-longhorn.yaml` = restore after wipes.
**Order matters.** `blockdevice "sdb" is in use by volume "u-longhorn"` means you tried to **wipe before** the running nodes received the wipe-phase machine config. You must **`talosctl apply-config`** (wipe YAML) on **every** node first, **reboot** if `u-longhorn` still appears, **then** `talosctl wipe disk`.
**Automated (recommended):** from the repository root (the first command changes into `talos/` and regenerates `out/`):
```bash
cd talos && talhelper genconfig -o out && export TALOSCONFIG="$(pwd)/out/talosconfig"
./scripts/longhorn-gpt-recovery.sh phase1 # apply wipe config to all 4 nodes; reboot cluster if needed
./scripts/longhorn-gpt-recovery.sh phase2 # wipe disk, restore Longhorn talconfig, genconfig, apply all nodes
```
Use `DISK=vdb ./scripts/longhorn-gpt-recovery.sh phase2` if the second disk is `vdb`.
**Manual:** same sequence, but do not paste comment lines into zsh as commands (`#` lines can error if copy-paste breaks).
6. **“Error fetching pod status”** in the Longhorn UI is often API connectivity (VIP/DNS), `longhorn-manager` / CSI pods not ready, or RBAC. Check `kubectl get pods -n longhorn-system` and `kubectl logs -n longhorn-system -l app=longhorn-manager --tail=50` from a working kubeconfig.
## Troubleshooting
### `user=apiserver-kubelet-client` / `verb=get` / `resource=nodes` (authorization error)
That identity is the **client cert the kube-apiserver uses when talking to kubelets** (logs, exec, node metrics, etc.). Audit logs often show it when the apiserver checks **Node** access before proxying. It is **not** your human `kubectl` user.
- If **`kubectl get nodes`** and normal workloads work, treat log noise as **informational** unless something user-facing breaks (`kubectl logs`, `kubectl exec`, **metrics-server** node metrics, **HorizontalPodAutoscaler**).
- If **logs/exec/metrics** fail cluster-wide, check default RBAC still exists (nothing should delete `system:*` ClusterRoles):
```bash
kubectl get clusterrole system:kubelet-api-admin system:node-proxier 2>&1
```
- If you **customized** `authorization-config` / RBAC on the API server, revert or align with [kubelet authentication/authorization](https://kubernetes.io/docs/reference/access-authn-authz/kubelet-authn-authz/) expectations.
## Kubeconfig from running nodes
The repo root `kubeconfig` may be incomplete until you merge credentials; prefer generating `talos/kubeconfig` with the commands in §3 above.