
Talos deployment (4 nodes)

This directory contains a talhelper cluster definition for a 4-node Talos cluster:

  • 3 hybrid control-plane/worker nodes: noble-cp-1..3
  • 1 worker-only node: noble-worker-1
  • allowSchedulingOnControlPlanes: true
  • CNI: none (for Cilium via GitOps)

1) Update values for your environment

Edit talconfig.yaml:

  • endpoint (Kubernetes API VIP or LB IP)
  • additionalApiServerCertSans / additionalMachineCertSans: must include the same VIP (and DNS name, if you use one) that clients and talosctl use — otherwise TLS to https://<VIP>:6443 fails because the cert only lists node IPs by default. This repo sets 192.168.50.230 (and kube.noble.lab.pcenicni.dev) to match kube-vip.
  • each node ipAddress
  • each node installDisk (for example /dev/sda, /dev/nvme0n1)
  • talosVersion / kubernetesVersion if desired

After changing SANs, run talhelper genconfig, re-run talosctl apply-config against all control-plane nodes (their serving certificates are regenerated), then refresh your kubeconfig with talosctl kubeconfig.
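Using the VIP and node IPs this README already assumes, that workflow looks like:

```shell
# Regenerate machine configs with the new SANs
talhelper genconfig

# Re-apply to each control-plane node (authenticated; nodes are already joined)
export TALOSCONFIG="$(pwd)/clusterconfig/talosconfig"
talosctl -e 192.168.50.230 -n 192.168.50.20 apply-config -f clusterconfig/noble-noble-cp-1.yaml
talosctl -e 192.168.50.230 -n 192.168.50.30 apply-config -f clusterconfig/noble-noble-cp-2.yaml
talosctl -e 192.168.50.230 -n 192.168.50.40 apply-config -f clusterconfig/noble-noble-cp-3.yaml

# Refresh the local kubeconfig so it trusts the regenerated certificates
talosctl kubeconfig -n 192.168.50.20 -e 192.168.50.230 .
```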

2) Generate cluster secrets and machine configs

From this directory:

talhelper gensecret > talsecret.sops.yaml
talhelper genconfig

Generated machine configs are written to clusterconfig/.

3) Apply Talos configs

Apply each node file to the matching node IP from talconfig.yaml:

talosctl apply-config --insecure -n 192.168.50.20 -f clusterconfig/noble-noble-cp-1.yaml
talosctl apply-config --insecure -n 192.168.50.30 -f clusterconfig/noble-noble-cp-2.yaml
talosctl apply-config --insecure -n 192.168.50.40 -f clusterconfig/noble-noble-cp-3.yaml
talosctl apply-config --insecure -n 192.168.50.10 -f clusterconfig/noble-noble-worker-1.yaml

4) Bootstrap the cluster

After all nodes are up (bootstrap once, from any control-plane node):

talosctl bootstrap -n 192.168.50.20 -e 192.168.50.230
talosctl kubeconfig -n 192.168.50.20 -e 192.168.50.230 .

5) Validate

talosctl -n 192.168.50.20 -e 192.168.50.230 health
kubectl get nodes -o wide

kubectl errors: lookup https: no such host or https://https/...

That means the active kubeconfig has a broken cluster.server URL (often a double https:// or duplicate :6443). Kubernetes then tries to resolve the hostname https, which fails.

Inspect what you are using:

kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}{"\n"}'

It must be a single valid URL, for example:

  • https://192.168.50.230:6443 (API VIP from talconfig.yaml), or
  • https://kube.noble.lab.pcenicni.dev:6443 (if DNS points at that VIP)

Fix the cluster entry (replace noble with your context's cluster name if different):

kubectl config set-cluster noble --server=https://192.168.50.230:6443

Or point kubectl at this repo's kubeconfig (known-good server line):

export KUBECONFIG="$(pwd)/kubeconfig"
kubectl cluster-info

Avoid pasting https:// twice when running kubectl config set-cluster ... --server=....

kubectl apply fails: localhost:8080 / openapi connection refused

kubectl is not using a real cluster config; it falls back to the default http://localhost:8080 (no KUBECONFIG, empty file, or wrong file).

Fix:

cd talos
export KUBECONFIG="$(pwd)/kubeconfig"
kubectl config current-context
kubectl cluster-info

Then run kubectl apply from the repository root (parent of talos/) in the same shell. Do not use a literal cd /path/to/... — that was only a placeholder. Example (adjust to where you cloned this repo):

export KUBECONFIG="${HOME}/Developer/home-server/talos/kubeconfig"

kubectl config set-cluster noble ... only updates the file kubectl is actually reading (often ~/.kube/config). It does nothing if KUBECONFIG points at another path.
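To see which file kubectl will actually read (KUBECONFIG wins when set; otherwise ~/.kube/config), a small pure-shell check:

```shell
# kubectl resolves its config as: $KUBECONFIG if set, else ~/.kube/config
cfg="${KUBECONFIG:-$HOME/.kube/config}"
echo "kubectl reads: ${cfg}"
```

Running kubectl config set-cluster with an explicit --kubeconfig "${cfg}" removes any ambiguity about which file is being edited.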

6) GitOps-pinned Cilium values

The Cilium settings that worked for this Talos cluster are now persisted in:

  • clusters/noble/apps/cilium/helm-values.yaml
  • clusters/noble/apps/cilium/application.yaml (Helm chart + valueFiles from this repo)

That Argo CD Application pins chart 1.16.6 and uses the same values file for API host/port, cgroup settings, IPAM CIDR, and security capabilities.

Cilium before Argo CD (cni: none)

This cluster uses cniConfig.name: none in talconfig.yaml so Talos does not install a CNI. Argo CD pods cannot schedule until some CNI makes nodes Ready (otherwise the node.kubernetes.io/not-ready taint blocks scheduling).
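To confirm the taint is what is blocking scheduling, list each node's taints (plain kubectl, nothing beyond a reachable API assumed):

```shell
# Print "<node>  <taint keys>" per line; expect node.kubernetes.io/not-ready
# on every node until a CNI brings them Ready
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{range .spec.taints[*]}{.key}{" "}{end}{"\n"}{end}'
```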

Install Cilium once with Helm from your workstation (same chart and values Argo will manage later), then bootstrap Argo CD:

helm repo add cilium https://helm.cilium.io/
helm repo update
helm upgrade --install cilium cilium/cilium \
  --namespace kube-system \
  --version 1.16.6 \
  -f clusters/noble/apps/cilium/helm-values.yaml \
  --wait --timeout 10m
kubectl get nodes
kubectl wait --for=condition=Ready nodes --all --timeout=300s

If helm upgrade --install seems stuck after “Installing it now”, it is usually still pulling images (quay.io/cilium/...) or waiting for pods to become Ready. In another terminal, run kubectl get pods -n kube-system -w and check for ImagePullBackOff, Pending, or CrashLoopBackOff. To avoid blocking on Helm's wait logic, install without --wait, confirm the Cilium pods are running, then continue:

helm upgrade --install cilium cilium/cilium \
  --namespace kube-system \
  --version 1.16.6 \
  -f clusters/noble/apps/cilium/helm-values.yaml
kubectl get pods -n kube-system -l app.kubernetes.io/part-of=cilium -w

helm-values.yaml sets operator.replicas: 1 so the chart default (two operators with hard anti-affinity) cannot deadlock helm --wait when only one node can take the operator early in bootstrap.

If helm upgrade fails with a server-side apply conflict on kube-system/hubble-server-certs and argocd-controller, Argo CD already synced Cilium and owns that Secret's TLS fields. The cilium Application uses ignoreDifferences on that Secret plus RespectIgnoreDifferences so that GitOps and occasional CLI Helm runs do not fight over .data. Until that manifest is applied in the cluster, either suspend the cilium Application in Argo CD, or delete the Secret once (kubectl delete secret hubble-server-certs -n kube-system) and re-run helm upgrade --install before Argo reconciles again. After bootstrap, prefer syncing Cilium through Argo CD (kubectl -n argocd get application cilium -o yaml, or the Argo UI) instead of ad hoc Helm, unless you suspend the app first.
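A sketch of the two recovery paths above. One common way to “suspend” automated sync is to clear the Application's syncPolicy; adjust to however you manage sync in your setup:

```shell
# Option A: pause automated sync on the cilium Application, then run Helm
kubectl -n argocd patch application cilium --type merge -p '{"spec":{"syncPolicy":null}}'
helm upgrade --install cilium cilium/cilium -n kube-system \
  --version 1.16.6 -f clusters/noble/apps/cilium/helm-values.yaml

# Option B: delete the contested Secret once so Helm can recreate it
kubectl delete secret hubble-server-certs -n kube-system
helm upgrade --install cilium cilium/cilium -n kube-system \
  --version 1.16.6 -f clusters/noble/apps/cilium/helm-values.yaml
```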

If nodes were already Ready, you can skip straight to section 7.

7) Argo CD app-of-apps bootstrap

This repo includes an app-of-apps structure for cluster apps:

  • Root app: clusters/noble/root-application.yaml
  • Child apps index: clusters/noble/apps/kustomization.yaml
  • Argo CD app: clusters/noble/apps/argocd/application.yaml
  • Cilium app: clusters/noble/apps/cilium/application.yaml

Bootstrap once from your workstation:

kubectl apply -k clusters/noble/bootstrap/argocd
kubectl wait --for=condition=Established crd/appprojects.argoproj.io --timeout=120s
kubectl apply -f clusters/noble/bootstrap/argocd/default-appproject.yaml
kubectl apply -f clusters/noble/root-application.yaml

If the first command errors on AppProject (“no matches for kind AppProject”), the CRDs were not ready yet; run the kubectl wait and kubectl apply -f .../default-appproject.yaml lines, then continue.

After this, Argo CD continuously reconciles all applications under clusters/noble/apps/.

8) kube-vip API VIP (192.168.50.230)

HAProxy has been removed in favor of kube-vip running on control-plane nodes.

Manifests are in:

  • clusters/noble/apps/kube-vip/application.yaml
  • clusters/noble/apps/kube-vip/vip-rbac.yaml
  • clusters/noble/apps/kube-vip/vip-daemonset.yaml

The DaemonSet advertises 192.168.50.230 in ARP mode and fronts the Kubernetes API on port 6443.

Apply manually (or let Argo CD sync from root app):

kubectl apply -k clusters/noble/apps/kube-vip

Validate:

kubectl -n kube-system get pods -l app.kubernetes.io/name=kube-vip-ds -o wide
nc -vz 192.168.50.230 6443

9) Argo CD via DNS host (no port)

Argo CD is exposed through a kube-vip managed LoadBalancer Service:

  • argo.noble.lab.pcenicni.dev

Manifests:

  • clusters/noble/bootstrap/argocd/argocd-server-lb.yaml
  • clusters/noble/apps/kube-vip/vip-daemonset.yaml (svc_enable: "true")

After syncing manifests, create a Pi-hole DNS A record:

  • argo.noble.lab.pcenicni.dev -> 192.168.50.231
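Once the record exists, a quick resolution and reachability check (dig and nc are illustrative; any resolver and port-check tool works):

```shell
# Should print the service VIP configured in Pi-hole
dig +short argo.noble.lab.pcenicni.dev

# Confirm something answers on HTTPS at that VIP
nc -vz 192.168.50.231 443
```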

10) Longhorn storage and extra disks

Longhorn is deployed from:

  • clusters/noble/apps/longhorn/application.yaml

Monitoring apps are configured to use storageClassName: longhorn, so you can persist Prometheus/Alertmanager/Loki data once Longhorn is healthy.
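A minimal smoke test that Longhorn can actually provision volumes, using a throwaway PVC (the claim name is illustrative):

```shell
# StorageClass must exist first
kubectl get storageclass longhorn

# Create a small claim, wait for it to bind, then clean up
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: longhorn-smoke-test
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: longhorn
  resources:
    requests:
      storage: 1Gi
EOF
kubectl wait --for=jsonpath='{.status.phase}'=Bound pvc/longhorn-smoke-test --timeout=120s
kubectl delete pvc longhorn-smoke-test
```

If the StorageClass uses volumeBindingMode: WaitForFirstConsumer, the claim stays Pending until a pod mounts it; in that case attach it to a test pod instead.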

Argo CD: longhorn OutOfSync, Health Missing, no longhorn-role

Missing means nothing has been applied yet, or a sync never completed. The Helm chart creates ClusterRole/longhorn-role on a successful install.

  1. See the failure reason:
kubectl describe application longhorn -n argocd

Check Status → Conditions and Status → Operation State for the error (for example a Helm render error, a CRD apply failure, or the repo-server being unable to reach https://charts.longhorn.io).

  2. Trigger a sync (Argo CD UI Sync, or CLI):
argocd app sync longhorn

  3. After a good sync, confirm:
kubectl get clusterrole longhorn-role
kubectl get pods -n longhorn-system

Extra drive layout (this cluster)

Each node uses:

  • /dev/sda — Talos install disk (installDisk in talconfig.yaml)
  • /dev/sdb — dedicated Longhorn data disk

talconfig.yaml includes a global patch that partitions /dev/sdb and mounts it at /var/mnt/longhorn, which matches Longhorn defaultDataPath in the Argo Helm values.
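After applying configs and rebooting, you can verify the layout per node with talosctl (same TALOSCONFIG/ENDPOINT setup as elsewhere in this README; repeat per node IP):

```shell
export TALOSCONFIG="$(pwd)/clusterconfig/talosconfig"

# /dev/sdb should be listed alongside the install disk
talosctl -e 192.168.50.230 -n 192.168.50.20 get disks

# The Longhorn data path should be mounted
talosctl -e 192.168.50.230 -n 192.168.50.20 mounts | grep /var/mnt/longhorn
```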

After editing talconfig.yaml, regenerate and apply configs:

cd talos
talhelper genconfig
# apply each node's YAML from clusterconfig/ with talosctl apply-config

Then reboot each node once so the new disk layout is applied.

talosctl TLS errors (unknown authority, Ed25519 verification failure)

talosctl does not automatically use talos/clusterconfig/talosconfig. If you omit it, the client falls back to ~/.talos/config, which is usually a different cluster CA — you then get TLS handshake failures against the noble nodes.

Always set this in the shell where you run talosctl (use an absolute path if you change directories):

cd talos
export TALOSCONFIG="$(pwd)/clusterconfig/talosconfig"
export ENDPOINT=192.168.50.230

Sanity check (should print Talos and Kubernetes versions, not TLS errors):

talosctl -e "${ENDPOINT}" -n 192.168.50.20 version

Then use the same shell for apply-config, reboot, and health.

If it still fails after TALOSCONFIG is set, the running cluster was likely bootstrapped with different secrets than the ones in your current talsecret.sops.yaml / regenerated clusterconfig/. In that case you need the original talosconfig that matched the cluster when it was created, or you must align secrets and cluster state (recovery / rebuild is a larger topic).

Keep talosctl roughly aligned with the node Talos version (for example v1.12.x clients for v1.12.5 nodes).

Paste tip: run one command per line. Pasting ...cp-3.yaml and talosctl on the same line breaks the filename and can confuse the shell.

More than one extra disk per node

If you add a third disk later, extend machine.disks in talconfig.yaml (for example /dev/sdc mounted at /var/mnt/longhorn-disk2) and register that path in Longhorn as an additional disk for that node.

Recommended:

  • use one dedicated filesystem per Longhorn disk path
  • avoid using the Talos system disk for heavy Longhorn data
  • spread replicas across nodes for resiliency

11) Upgrade Talos to v1.12.x

This repo now pins:

  • talosVersion: v1.12.5 in talconfig.yaml

Regenerate configs

From talos/:

talhelper genconfig

Rolling upgrade order

Upgrade one node at a time, waiting for it to return healthy before moving on.

  1. Control plane nodes (noble-cp-1, then noble-cp-2, then noble-cp-3)
  2. Worker node (noble-worker-1)

Example commands (adjust node IP per step):

talosctl --talosconfig ./clusterconfig/talosconfig -n 192.168.50.20 upgrade --image ghcr.io/siderolabs/installer:v1.12.5
talosctl --talosconfig ./clusterconfig/talosconfig -n 192.168.50.20 reboot
talosctl --talosconfig ./clusterconfig/talosconfig -n 192.168.50.20 health

After all nodes are upgraded, verify:

talosctl --talosconfig ./clusterconfig/talosconfig version
kubectl get nodes -o wide

12) Destroy the cluster and rebuild from scratch

Use this when Kubernetes / etcd / Argo / Longhorn state is corrupted and you want a clean cluster. This wipes cluster state on the nodes (etcd, workloads, Longhorn data on cluster disks). Plan for downtime and backup anything you must keep off-cluster first.

12.1 Reset every Talos node (Kubernetes is destroyed)

From talos/ with a working talosconfig that matches the machines (same TALOSCONFIG / ENDPOINT guidance as elsewhere in this README):

cd talos
export TALOSCONFIG="$(pwd)/clusterconfig/talosconfig"
export ENDPOINT=192.168.50.230

Reset one node at a time, waiting for each to reboot before the next. Order: worker first, then non-bootstrap control planes, then the bootstrap control plane last (noble-cp-1, 192.168.50.20).

talosctl -e "${ENDPOINT}" -n 192.168.50.10 reset --graceful=false
talosctl -e "${ENDPOINT}" -n 192.168.50.30 reset --graceful=false
talosctl -e "${ENDPOINT}" -n 192.168.50.40 reset --graceful=false
talosctl -e "${ENDPOINT}" -n 192.168.50.20 reset --graceful=false

If the API VIP is already unreachable, target the node IP as endpoint for that node, for example: talosctl -e 192.168.50.10 -n 192.168.50.10 reset --graceful=false.

Your workstation kubeconfig will not work for the old cluster after this; that is expected until you bootstrap again.

12.2 (Optional) New cluster secrets

For a fully fresh identity (new cluster CA and talosconfig):

cd talos
talhelper gensecret > talsecret.sops.yaml
# encrypt / store talsecret as you usually do, then:
talhelper genconfig

If you keep the existing talsecret.sops.yaml, still run talhelper genconfig so clusterconfig/ matches what you will apply.

12.3 Apply configs, bootstrap, kubeconfig

Repeat §3 Apply Talos configs and §4 Bootstrap the cluster (and §5 Validate) from the top of this README: apply-config each node, then talosctl bootstrap, then talosctl kubeconfig into talos/kubeconfig.

12.4 Redeploy GitOps (Argo CD + apps)

From your workstation (repo root), with KUBECONFIG pointing at the new talos/kubeconfig:

# Set REPO to the directory that contains both talos/ and clusters/ (not a literal "path/to")
REPO="${HOME}/Developer/home-server"
export KUBECONFIG="${REPO}/talos/kubeconfig"
cd "${REPO}"
kubectl apply -k clusters/noble/bootstrap/argocd
kubectl apply -f clusters/noble/root-application.yaml

Resolve Argo CD admin login (secret / password reset) as needed; then let noble-root sync clusters/noble/apps/.

13) Mid-rebuild issues: etcd, bootstrap, and apply-config

tls: certificate required when using apply-config --insecure

After a node has joined the cluster, the Talos API expects client certificates from your talosconfig. --insecure only works while the node is in maintenance mode (before joining, or after a reset).

Do one of:

  • Apply config with talosconfig (no --insecure):
cd talos
export TALOSCONFIG="$(pwd)/clusterconfig/talosconfig"
export ENDPOINT=192.168.50.230
talosctl -e "${ENDPOINT}" apply-config -n 192.168.50.30 -f clusterconfig/noble-noble-cp-2.yaml
  • Or talosctl reset that node first (see §12.1), then use apply-config --insecure again while it is in maintenance.

bootstrap: etcd data directory is not empty

The bootstrap node (192.168.50.20) already has a previous etcd on disk (failed or partial bootstrap). Kubernetes will not bootstrap again until that state is wiped.

Fix: run talosctl reset --graceful=false on the control plane nodes (at minimum the bootstrap node; often all four nodes is cleaner). See §12.1. Then re-apply machine configs and run talosctl bootstrap exactly once.

etcd unhealthy / “Preparing” on some control planes

Usually means split or partial cluster state. The reliable fix is the same full reset (§12.1), then a single ordered bring-up: apply all configs → bootstrap once → talosctl health.