
Talos — noble lab

Versions

Align with CLUSTER-BUILD.md: Talos v1.12.6; talosctl client should match installed node image.

DNS (prerequisites)

Name                                  Points to
noble.lab, kube.noble.lab (API SANs)  192.168.50.230 (kube-vip)
*.apps.noble.lab.pcenicni.dev         Traefik LoadBalancer IP from the MetalLB pool (192.168.50.210-229) once ingress is up

1. Secrets and generated configs

From this directory:

talhelper gensecret > talsecret.yaml
# Encrypt for git if desired: rename to talsecret.sops.yaml, then sops -e -i talsecret.sops.yaml (see talhelper docs)

talhelper genconfig -o out

out/ is ignored via repo root .gitignore (talos/out/). Do not commit talsecret.yaml or generated machine configs.

Never commit talos/kubeconfig (also gitignored). It contains cluster admin credentials; generate locally with talosctl kubeconfig (§3). If it was ever pushed, remove it from git tracking, regenerate kubeconfig, and treat the old credentials as compromised (purge from history with git filter-repo or BFG if needed).
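The untracking step can be seen end to end in a scratch repo. This is a sketch for illustration only: the paths mirror this repo, but the scratch repo, commit messages, and dummy file content are made up.

```shell
# Demo in a scratch repo: `git rm --cached` untracks a committed kubeconfig
# without deleting the local file.
repo=$(mktemp -d) && cd "$repo" && git init -q
mkdir -p talos && echo dummy-creds > talos/kubeconfig
git add . && git -c user.email=t@t -c user.name=t commit -qm "oops: committed kubeconfig"
echo talos/kubeconfig >> .gitignore
git rm --cached -q talos/kubeconfig
git add .gitignore && git -c user.email=t@t -c user.name=t commit -qm "stop tracking kubeconfig"
git ls-files           # kubeconfig gone from the index; file still on disk
# History still holds the old blob: purge (git filter-repo / BFG) and rotate.
```

Note `git rm --cached` only removes the file from the index; the historical blob survives until history is rewritten, which is why rotation is mandatory either way.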

After any talconfig.yaml edit, run genconfig again before apply-config. Stale out/*.yaml is easy to apply by mistake. Quick check: grep -A8 kind: UserVolumeConfig out/noble-neon.yaml should match what you expect (e.g. Longhorn volumeType: disk, not grow/maxSize on a partition).
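The staleness check can be made mechanical with a timestamp comparison. A minimal sketch, assuming you run it from talos/ and that out/talosconfig is regenerated by every genconfig run; check_fresh is a hypothetical helper name:

```shell
# Optional guard (sketch): warn when generated configs predate talconfig.yaml.
# POSIX `test file1 -nt file2` is true when file1 is newer than file2.
check_fresh() {  # $1 = talconfig.yaml  $2 = a generated file from out/
  if [ "$1" -nt "$2" ]; then
    echo "stale: regenerate with 'talhelper genconfig -o out'"
  else
    echo "ok"
  fi
}
check_fresh talconfig.yaml out/talosconfig
```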

2. Apply machine config

Order: §1 genconfig → apply all nodes → §3 bootstrap (not the reverse). Use the same talsecret / out/ generation for the life of the cluster; rotating secrets without reinstalling nodes breaks client trust.

A) First install — node still in maintenance mode (no Talos OS on disk yet, or explicitly in maintenance):

talosctl apply-config --insecure -n 192.168.50.20 --file out/noble-neon.yaml
# repeat for each node; TALOSCONFIG not required for --insecure maintenance API
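With several nodes the maintenance-mode apply is easy to loop. A dry-run sketch: the IP:name pairs below are placeholders (only 192.168.50.20 / noble-neon appears in this doc; noble-argon is invented), and the echo prints each command instead of running it. Drop the echo to execute.

```shell
# Sketch: one maintenance-mode apply per node. Substitute your real
# IP:name pairs; the echo makes this a dry run.
nodes="192.168.50.20:noble-neon 192.168.50.21:noble-argon"
for pair in $nodes; do
  ip=${pair%%:*}
  file="out/${pair##*:}.yaml"
  echo talosctl apply-config --insecure -n "$ip" --file "$file"
done
```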

B) Node already installed / cluster already bootstrapped (tls: certificate required if you use --insecure here):

export TALOSCONFIG="${TALOSCONFIG:-$(pwd)/out/talosconfig}"
talosctl apply-config -n 192.168.50.20 --file out/noble-neon.yaml

Do not pass --insecure for (B). With --insecure, talosctl skips the client certificates in TALOSCONFIG, so the node still responds with tls: certificate required. The flag means "maintenance API only," not "skip server verification."
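The A/B rule can be encoded in a tiny helper so the wrong flag never gets typed. apply_cmd is a hypothetical name, and it only prints the command it would run:

```shell
# Hypothetical helper encoding the rule above: --insecure only for the
# maintenance API; a joined node needs the client certs from TALOSCONFIG.
apply_cmd() {  # $1 = node IP  $2 = config file  $3 = maintenance|installed
  case "$3" in
    maintenance) echo "talosctl apply-config --insecure -n $1 --file $2" ;;
    installed)   echo "talosctl apply-config -n $1 --file $2" ;;
  esac
}
apply_cmd 192.168.50.20 out/noble-neon.yaml installed
# prints: talosctl apply-config -n 192.168.50.20 --file out/noble-neon.yaml
```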

Wrong (what triggers the error):

export TALOSCONFIG="$(pwd)/out/talosconfig"
talosctl apply-config --insecure -n 192.168.50.20 --file out/noble-neon.yaml   # still broken on joined nodes

3. Bootstrap and kubeconfig

Bootstrap once on the first control plane after configs are applied (example: neon):

export TALOSCONFIG="${TALOSCONFIG:-$(pwd)/out/talosconfig}"
talosctl bootstrap -n 192.168.50.20

After the API is up (direct node IP first; use VIP after kube-vip is healthy):

export TALOSCONFIG="${TALOSCONFIG:-$(pwd)/out/talosconfig}"
talosctl kubeconfig ./kubeconfig -n 192.168.50.20 -e 192.168.50.230 --merge=false
export KUBECONFIG="$(pwd)/kubeconfig"
kubectl get nodes

Adjust -n / -e if your bootstrap node or VIP differ.

Reachability (same idea for Talos and Kubernetes):

Command               What it connects to
talosctl … -e <addr>  Talos apid on <addr>:50000 (not 6443)
kubectl / Helm        Kubernetes API on https://<addr>:6443 from the kubeconfig

If your Mac shows network is unreachable for 192.168.50.230, fix L2/L3 connectivity first (same LAN as the nodes, VPN, or routing). talosctl kubeconfig -e 192.168.50.20 only chooses which Talos node serves the admin credentials; the server: URL inside the kubeconfig still comes from cluster.controlPlane.endpoint in the Talos config (here https://192.168.50.230:6443), so kubectl will still dial the VIP even when -e pointed at a node IP.
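To see which endpoint kubectl will actually dial, read the server: fields straight out of the kubeconfig. A sketch (kubeconfig_servers is a made-up helper name; the inline file is a minimal stand-in for a generated kubeconfig):

```shell
# Sketch: print the API server URL(s) a kubeconfig will dial, without kubectl.
kubeconfig_servers() { awk '/^ *server:/ {print $2}' "$1"; }

# Minimal inline example with the same shape as a generated kubeconfig:
cat > /tmp/kc-demo <<'EOF'
apiVersion: v1
clusters:
- cluster:
    server: https://192.168.50.230:6443
  name: noble
EOF
kubeconfig_servers /tmp/kc-demo   # prints https://192.168.50.230:6443
```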

After a successful talosctl kubeconfig, point kubectl at a reachable control-plane IP (same as bootstrap node until kube-vip works from your network):

export TALOSCONFIG="${TALOSCONFIG:-$(pwd)/out/talosconfig}"
talosctl kubeconfig ./kubeconfig -n 192.168.50.20 -e 192.168.50.20 --merge=false
export KUBECONFIG="$(pwd)/kubeconfig"
# Kubeconfig still says https://192.168.50.230:6443 — override if VIP is unreachable from this machine:
kubectl config set-cluster noble --server=https://192.168.50.20:6443
kubectl get nodes

One-liner alternative (macOS/BSD sed -i ''; on Linux use sed -i):

sed -i '' 's|https://192.168.50.230:6443|https://192.168.50.20:6443|g' kubeconfig
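A portable variant sidesteps the BSD/GNU sed -i divergence entirely by writing through a temp file. rewrite_server is a hypothetical helper; the demo file is a stand-in for the real kubeconfig:

```shell
# Portable alternative to sed -i (flag syntax differs between BSD and GNU):
# rewrite through a temp file instead of editing in place.
rewrite_server() {  # $1 = kubeconfig path  $2 = old URL  $3 = new URL
  sed "s|$2|$3|g" "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

printf 'server: https://192.168.50.230:6443\n' > /tmp/kc-port
rewrite_server /tmp/kc-port https://192.168.50.230:6443 https://192.168.50.20:6443
cat /tmp/kc-port   # prints server: https://192.168.50.20:6443
```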

Quick check from your Mac: nc -vz 192.168.50.20 50000 (Talos) and nc -vz 192.168.50.20 6443 (Kubernetes).

dial tcp 192.168.50.230:6443 on nodes: host-network components (including Cilium) cannot use the in-cluster kubernetes Service; they fall back to cluster.controlPlane.endpoint (the VIP). Talos KubePrism on 127.0.0.1:7445 (the default) load-balances across healthy apiservers, so ensure the CNI Helm values set k8sServiceHost: "127.0.0.1" and k8sServicePort: "7445" (see clusters/noble/bootstrap/cilium/values.yaml). Also confirm kube-vip's vip_interface matches the uplink (talosctl -n <ip> get links; e.g. ens18 on these nodes). A bare curl -k https://192.168.50.230:6443/healthz often returns 401 Unauthorized because no client cert was sent; that still means TLS to the VIP worked.
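For reference, the KubePrism wiring is just two keys in the Cilium values. This fragment shows only those keys (both are named in the text above); the real clusters/noble/bootstrap/cilium/values.yaml carries the rest of the chart settings:

```yaml
# Fragment: point Cilium's apiserver client at KubePrism instead of the VIP.
k8sServiceHost: "127.0.0.1"
k8sServicePort: "7445"
```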

Verify the VIP with kubectl (copy as-is): use a real kubeconfig path (not /path/to/…). From the repository root:

export KUBECONFIG="${KUBECONFIG:-$(pwd)/talos/kubeconfig}"
kubectl config set-cluster noble --server=https://192.168.50.230:6443
kubectl get --raw /healthz

Expect a single line: ok. If you see The connection to the server localhost:8080 was refused, KUBECONFIG was missing or wrong (e.g. typo .export instead of export, or a path that does not exist). Do not put # comments on the same line as kubectl config set-cluster when pasting — some shells copy the comment into the command.

kubectl reports localhost:8080 / connection refused: talosctl kubeconfig did not write a valid kubeconfig (often because the step above failed). Fix Talos/API reachability first; do not trust kubectl until talosctl kubeconfig completes without error.

4. Platform manifests (this repo)

Component     Apply
Cilium        Before kube-vip/MetalLB scheduling: Helm from clusters/noble/bootstrap/cilium/README.md (values.yaml)
kube-vip      kubectl apply -k ../clusters/noble/bootstrap/kube-vip
MetalLB pool  After MetalLB controller install: kubectl apply -k ../clusters/noble/bootstrap/metallb
Longhorn      PSA + Helm: kubectl apply -k ../clusters/noble/bootstrap/longhorn, then Helm from §5 below

Set vip_interface in clusters/noble/bootstrap/kube-vip/vip-daemonset.yaml if it does not match the control-plane uplink (talosctl -n <cp-ip> get links).

5. Longhorn (Talos)

  1. Machine image: talconfig.yaml includes the iscsi-tools and util-linux-tools extensions. After talhelper genconfig, upgrade each node so the running installer image matches (extensions are baked into the image, not applied live by config alone). If longhorn-manager logs errors about missing iscsiadm / open-iscsi, the node image does not include the extensions yet.
  2. Pod Security + path: Apply kubectl apply -k ../clusters/noble/bootstrap/longhorn (privileged longhorn-system). The Helm chart host-mounts /var/lib/longhorn; talconfig adds a kubelet bind from /var/mnt/longhorn to /var/lib/longhorn so that path lands on the dedicated XFS volume.
  3. Data path: From the repository root (not talos/), run Helm with a real release and chart name — not literal ...:
helm repo add longhorn https://charts.longhorn.io && helm repo update
helm upgrade --install longhorn longhorn/longhorn -n longhorn-system --create-namespace \
  -f clusters/noble/bootstrap/longhorn/values.yaml

If Longhorn defaults to /var/lib/longhorn, you get wrong format / no space on the Talos root filesystem.
  4. Disk device: The second disk is often /dev/vdb under Proxmox virtio; talconfig selects sdb or vdb. Confirm with talosctl get disks -n <ip>.
  5. filesystem type mismatch: gpt != xfs on volumeType: disk: The data disk still has a GPT from an older partition attempt. Whole-disk XFS needs a raw disk, and Talos cannot wipe the disk while u-longhorn claims the device.

Repo layout: talconfig.yaml = wipe-phase (no Longhorn volume / no kubelet bind). talconfig.with-longhorn.yaml = restore after wipes.

Order matters. blockdevice "sdb" is in use by volume "u-longhorn" means you tried to wipe before the running nodes received the wipe-phase machine config. You must talosctl apply-config (wipe YAML) on every node first, reboot if u-longhorn still appears, then talosctl wipe disk.

Automated (recommended): from talos/ after talhelper genconfig -o out:

cd talos && talhelper genconfig -o out && export TALOSCONFIG="$(pwd)/out/talosconfig"
./scripts/longhorn-gpt-recovery.sh phase1   # apply wipe config to all 4 nodes; reboot cluster if needed
./scripts/longhorn-gpt-recovery.sh phase2   # wipe disk, restore Longhorn talconfig, genconfig, apply all nodes

Use DISK=vdb ./scripts/longhorn-gpt-recovery.sh phase2 if the second disk is vdb.

Manual: same sequence, but do not paste comment lines into zsh as commands (# lines can error if copy-paste breaks).

  • “Error fetching pod status” in the Longhorn UI is often API connectivity (VIP/DNS), longhorn-manager / CSI pods not ready, or RBAC. Check kubectl get pods -n longhorn-system and kubectl logs -n longhorn-system -l app=longhorn-manager --tail=50 from a working kubeconfig.

Troubleshooting

user=apiserver-kubelet-client / verb=get / resource=nodes (authorization error)

That identity is the client cert the kube-apiserver uses when talking to kubelets (logs, exec, node metrics, etc.). Audit logs often show it when the apiserver checks Node access before proxying. It is not your human kubectl user.

  • If kubectl get nodes and normal workloads work, treat log noise as informational unless something user-facing breaks (kubectl logs, kubectl exec, metrics-server node metrics, HorizontalPodAutoscaler).

  • If logs/exec/metrics fail cluster-wide, check default RBAC still exists (nothing should delete system:* ClusterRoles):

    kubectl get clusterrole system:kubelet-api-admin system:node-proxier 2>&1
    
  • If you customized authorization-config / RBAC on the API server, revert or align with kubelet authentication/authorization expectations.

Kubeconfig from running nodes

The repo root kubeconfig may be incomplete until you merge credentials; prefer generating talos/kubeconfig with the commands in §3 above.