Talos deployment (4 nodes)
This directory contains a talhelper cluster definition for a 4-node Talos cluster:

- 3 hybrid control-plane/worker nodes: `noble-cp-1..3`
- 1 worker-only node: `noble-worker-1`
- `allowSchedulingOnControlPlanes: true`
- CNI: `none` (for Cilium via GitOps)
1) Update values for your environment
Edit `talconfig.yaml`:

- `endpoint` (Kubernetes API VIP or LB IP)
- `additionalApiServerCertSans` / `additionalMachineCertSans`: must include the same VIP (and DNS name, if you use one) that clients and `talosctl` use — otherwise TLS to `https://<VIP>:6443` fails because the cert only lists node IPs by default. This repo sets `192.168.50.230` (and `kube.noble.lab.pcenicni.dev`) to match kube-vip.
- each node's `ipAddress`
- each node's `installDisk` (for example `/dev/sda`, `/dev/nvme0n1`)
- `talosVersion` / `kubernetesVersion`, if desired
After changing SANs, run `talhelper genconfig`, re-run `talosctl apply-config` against all control-plane nodes (certs are regenerated), then refresh `talosctl kubeconfig`.
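As a sketch, the relevant top-level keys look like this (key names follow talhelper's `talconfig.yaml` schema; the VIP and DNS name are this repo's values from above):

```yaml
# talconfig.yaml (excerpt): cert SANs must cover every name/IP clients dial
endpoint: https://192.168.50.230:6443
additionalApiServerCertSans:
  - 192.168.50.230
  - kube.noble.lab.pcenicni.dev
additionalMachineCertSans:
  - 192.168.50.230
  - kube.noble.lab.pcenicni.dev
```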
2) Generate cluster secrets and machine configs
From this directory:
talhelper gensecret > talsecret.sops.yaml
talhelper genconfig
Generated machine configs are written to clusterconfig/.
3) Apply Talos configs
Apply each node file to the matching node IP from talconfig.yaml:
talosctl apply-config --insecure -n 192.168.50.20 -f clusterconfig/noble-noble-cp-1.yaml
talosctl apply-config --insecure -n 192.168.50.30 -f clusterconfig/noble-noble-cp-2.yaml
talosctl apply-config --insecure -n 192.168.50.40 -f clusterconfig/noble-noble-cp-3.yaml
talosctl apply-config --insecure -n 192.168.50.10 -f clusterconfig/noble-noble-worker-1.yaml
4) Bootstrap the cluster
After all nodes are up (bootstrap once, from any control-plane node):
talosctl bootstrap -n 192.168.50.20 -e 192.168.50.230
talosctl kubeconfig -n 192.168.50.20 -e 192.168.50.230 .
5) Validate
talosctl -n 192.168.50.20 -e 192.168.50.230 health
kubectl get nodes -o wide
kubectl errors: `lookup https: no such host` or `https://https/...`

That means the active kubeconfig has a broken `cluster.server` URL (often a double `https://` or a duplicated `:6443`). kubectl then tries to resolve the hostname `https`, which fails.
Inspect what you are using:
kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}{"\n"}'
It must be a single valid URL, for example:
- `https://192.168.50.230:6443` (the API VIP from `talconfig.yaml`), or
- `https://kube.noble.lab.pcenicni.dev:6443` (if DNS points at that VIP)
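You can sanity-check the shape of that URL before editing anything. This is a rough pattern check only, not a full URL parser, and `check_server` is a local helper defined here, not a kubectl feature:

```shell
# Classify an API server URL: exactly one scheme and one port.
check_server() {
  case "$1" in
    https://*https://*|*:6443:6443*) echo "broken: $1" ;;
    https://*:6443)                  echo "ok: $1" ;;
    *)                               echo "suspicious: $1" ;;
  esac
}

check_server "https://192.168.50.230:6443"          # ok
check_server "https://https://192.168.50.230:6443"  # broken (double scheme)
```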
Fix the cluster entry (replace noble with your context’s cluster name if
different):
kubectl config set-cluster noble --server=https://192.168.50.230:6443
Or point kubectl at this repo’s kubeconfig (known-good server line):
export KUBECONFIG="$(pwd)/kubeconfig"
kubectl cluster-info
Avoid pasting https:// twice when running kubectl config set-cluster ... --server=....
kubectl apply fails: localhost:8080 / openapi connection refused
kubectl is not using a real cluster config; it falls back to the default
http://localhost:8080 (no KUBECONFIG, empty file, or wrong file).
Fix:
cd talos
export KUBECONFIG="$(pwd)/kubeconfig"
kubectl config current-context
kubectl cluster-info
Then run kubectl apply from the repository root (parent of talos/) in
the same shell. Do not use a literal cd /path/to/... — that was only a
placeholder. Example (adjust to where you cloned this repo):
export KUBECONFIG="${HOME}/Developer/home-server/talos/kubeconfig"
kubectl config set-cluster noble ... only updates the file kubectl is
actually reading (often ~/.kube/config). It does nothing if KUBECONFIG
points at another path.
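Which file kubectl actually edits follows the standard lookup order; this plain POSIX parameter expansion shows it without touching any cluster:

```shell
# kubectl config set-cluster writes to $KUBECONFIG if it is set,
# otherwise to ~/.kube/config.
echo "kubectl will edit: ${KUBECONFIG:-$HOME/.kube/config}"
```

(`KUBECONFIG` may also hold several colon-separated paths that kubectl merges; keeping it pointed at a single file during bootstrap avoids surprises.)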
6) GitOps-pinned Cilium values
The Cilium settings that worked for this Talos cluster are now persisted in:
- `clusters/noble/apps/cilium/helm-values.yaml`
- `clusters/noble/apps/cilium/application.yaml` (Helm chart + `valueFiles` from this repo)
That Argo CD Application pins chart 1.16.6 and uses the same values file
for API host/port, cgroup settings, IPAM CIDR, and security capabilities.
Cilium before Argo CD (cni: none)
This cluster uses cniConfig.name: none in talconfig.yaml so Talos does
not install a CNI. Argo CD pods cannot schedule until some CNI makes nodes
Ready (otherwise the node.kubernetes.io/not-ready taint blocks scheduling).
Install Cilium once with Helm from your workstation (same chart and values Argo will manage later), then bootstrap Argo CD:
helm repo add cilium https://helm.cilium.io/
helm repo update
helm upgrade --install cilium cilium/cilium \
--namespace kube-system \
--version 1.16.6 \
-f clusters/noble/apps/cilium/helm-values.yaml \
--wait --timeout 10m
kubectl get nodes
kubectl wait --for=condition=Ready nodes --all --timeout=300s
If `helm upgrade --install` seems stuck after “Installing it now”, it is usually still
pulling images (quay.io/cilium/...) or waiting for pods to become Ready. In
another terminal run kubectl get pods -n kube-system -w and check for
ImagePullBackOff, Pending, or CrashLoopBackOff. To avoid blocking on
Helm’s wait logic, install without --wait, confirm Cilium pods, then continue:
helm upgrade --install cilium cilium/cilium \
--namespace kube-system \
--version 1.16.6 \
-f clusters/noble/apps/cilium/helm-values.yaml
kubectl get pods -n kube-system -l app.kubernetes.io/part-of=cilium -w
helm-values.yaml sets operator.replicas: 1 so the chart default (two
operators with hard anti-affinity) cannot deadlock helm --wait when only one
node can take the operator early in bootstrap.
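A minimal sketch of the relevant `helm-values.yaml` keys (key names follow the Cilium chart; `operator.replicas: 1` is stated above, while the API host/port mirror this repo's VIP and the cgroup/ipam values are the ones Cilium's Talos guidance commonly uses, assumed here):

```yaml
# clusters/noble/apps/cilium/helm-values.yaml (sketch, not the repo's exact file)
operator:
  replicas: 1               # avoid the 2-replica anti-affinity deadlock at bootstrap
k8sServiceHost: 192.168.50.230   # kube-vip API VIP
k8sServicePort: 6443
cgroup:
  autoMount:
    enabled: false          # Talos manages the cgroup mount itself
  hostRoot: /sys/fs/cgroup
ipam:
  mode: kubernetes
```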
If helm upgrade fails with server-side apply conflicts and
argocd-controller, Argo already synced Cilium and owns those fields
on live objects. Clearing syncPolicy on the Application does not
remove that ownership; Helm still conflicts until you take over the fields
or only use Argo.
One-shot CLI fix (Helm 3.13+): add --force-conflicts so SSA wins the
disputed fields:
helm upgrade --install cilium cilium/cilium \
--namespace kube-system \
--version 1.16.6 \
-f clusters/noble/apps/cilium/helm-values.yaml \
--force-conflicts
Typical conflicts: Secret `hubble-server-certs` (`.data` TLS keys) and
Deployment `cilium-operator` (`.spec.replicas`,
`.spec.strategy.rollingUpdate.maxUnavailable`). The `cilium` Application
lists ignoreDifferences for those paths plus RespectIgnoreDifferences
so later Argo syncs do not keep overwriting them. Apply the manifest after you
change it: kubectl apply -f clusters/noble/apps/cilium/application.yaml.
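The Application-side shape for that looks roughly like this (field names per the Argo CD Application spec; the listed resources and paths are the ones named above):

```yaml
# clusters/noble/apps/cilium/application.yaml (excerpt, sketch)
spec:
  ignoreDifferences:
    - kind: Secret
      name: hubble-server-certs
      jsonPointers:
        - /data
    - group: apps
      kind: Deployment
      name: cilium-operator
      jsonPointers:
        - /spec/replicas
        - /spec/strategy/rollingUpdate/maxUnavailable
  syncPolicy:
    syncOptions:
      - RespectIgnoreDifferences=true
```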
After bootstrap, prefer syncing Cilium only through Argo (from Git) instead
of ad hoc Helm, unless you suspend the cilium Application first.
Shell tip: a line like # comment must start with #; if the shell
reports command not found: #, the character is not a real hash or the
line was pasted wrong—run kubectl apply ... as its own command without a
leading comment on the same paste block.
If nodes were already Ready, you can skip straight to section 7.
7) Argo CD app-of-apps bootstrap
This repo includes an app-of-apps structure for cluster apps:
- Root app: `clusters/noble/root-application.yaml`
- Child apps index: `clusters/noble/apps/kustomization.yaml`
- Argo CD app: `clusters/noble/apps/argocd/application.yaml`
- Cilium app: `clusters/noble/apps/cilium/application.yaml`
Bootstrap once from your workstation:
kubectl apply -k clusters/noble/bootstrap/argocd
kubectl wait --for=condition=Established crd/appprojects.argoproj.io --timeout=120s
kubectl apply -f clusters/noble/bootstrap/argocd/default-appproject.yaml
kubectl apply -f clusters/noble/root-application.yaml
If the first command errors on AppProject (“no matches for kind AppProject”), the CRDs were not ready yet; run the kubectl wait and kubectl apply -f .../default-appproject.yaml lines, then continue.
After this, Argo CD continuously reconciles all applications under
clusters/noble/apps/.
8) kube-vip API VIP (192.168.50.230)
HAProxy has been removed in favor of kube-vip running on control-plane nodes.
Manifests are in:
- `clusters/noble/apps/kube-vip/application.yaml`
- `clusters/noble/apps/kube-vip/vip-rbac.yaml`
- `clusters/noble/apps/kube-vip/vip-daemonset.yaml`
The DaemonSet advertises 192.168.50.230 in ARP mode and fronts the Kubernetes
API on port 6443.
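The core of the DaemonSet is kube-vip's environment configuration; a sketch (env var names are kube-vip's, the address/port/ARP values come from this setup, and `eno1` is a placeholder for your real uplink):

```yaml
# clusters/noble/apps/kube-vip/vip-daemonset.yaml (container env, sketch)
env:
  - name: vip_interface
    value: eno1              # set to the node's actual link (see section below)
  - name: vip_arp
    value: "true"            # ARP mode
  - name: address
    value: 192.168.50.230
  - name: port
    value: "6443"
  - name: cp_enable
    value: "true"            # front the Kubernetes API
```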
Apply manually (or let Argo CD sync from root app):
kubectl apply -k clusters/noble/apps/kube-vip
Validate:
kubectl -n kube-system get pods -l app.kubernetes.io/name=kube-vip-ds -o wide
nc -vz 192.168.50.230 6443
If kube-vip-ds pods are CrashLoopBackOff, logs usually show
could not get link for interface '…'. kube-vip binds the VIP to
vip_interface; on Talos the uplink is often eno1, enp0s…, or
enx…, not eth0. On a control-plane node IP from talconfig.yaml:
talosctl -n 192.168.50.20 get links
Do not paste that command’s table output back into the shell: zsh runs
each line as a command (e.g. 192.168.50.20 → command not found), and a line
starting with NODE can be mistaken for the node binary and try to
load a file like NAMESPACE in the current directory. Also avoid pasting
the prompt ((base) … %) together with the command (duplicate prompt →
parse errors).
Set vip_interface in clusters/noble/apps/kube-vip/vip-daemonset.yaml to
that link’s metadata.id, commit, sync (or kubectl apply -k clusters/noble/apps/kube-vip), and confirm pods go Running.
9) Argo CD via DNS host (no port)
Argo CD is exposed through a kube-vip managed LoadBalancer Service:
argo.noble.lab.pcenicni.dev
Manifests:
- `clusters/noble/bootstrap/argocd/argocd-server-lb.yaml`
- `clusters/noble/apps/kube-vip/vip-daemonset.yaml` (`svc_enable: "true"`)
After syncing manifests, create a Pi-hole DNS A record:
`argo.noble.lab.pcenicni.dev` → `192.168.50.231`
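If you manage Pi-hole from the box rather than the web UI, one way is editing its local-records file directly (this assumes Pi-hole v5, which stores Local DNS records as "IP hostname" lines in `custom.list`):

```
# /etc/pihole/custom.list (Pi-hole v5 Local DNS record format: "IP hostname")
192.168.50.231 argo.noble.lab.pcenicni.dev
```

Then restart the resolver (`pihole restartdns`) so the record is served.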
10) Longhorn storage and extra disks
Longhorn is deployed from:
clusters/noble/apps/longhorn/application.yaml
Monitoring apps are configured to use storageClassName: longhorn, so you can
persist Prometheus/Alertmanager/Loki data once Longhorn is healthy.
Argo CD: longhorn OutOfSync, Health Missing, no longhorn-role
Missing means nothing has been applied yet, or a sync never completed. The
Helm chart creates ClusterRole/longhorn-role on a successful install.
- See the failure reason:
kubectl describe application longhorn -n argocd
Check Status → Conditions and Status → Operation State for the error
(for example Helm render error, CRD apply failure, or repo-server cannot reach
https://charts.longhorn.io).
- Trigger a sync (Argo CD UI Sync, or CLI):
argocd app sync longhorn
- After a good sync, confirm:
kubectl get clusterrole longhorn-role
kubectl get pods -n longhorn-system
Extra drive layout (this cluster)
Each node uses:
- `/dev/sda` — Talos install disk (`installDisk` in `talconfig.yaml`)
- `/dev/sdb` — dedicated Longhorn data disk
talconfig.yaml includes a global patch that partitions /dev/sdb and mounts it
at /var/mnt/longhorn, which matches Longhorn defaultDataPath in the Argo
Helm values.
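The patch boils down to a `machine.disks` entry (field names per the Talos machine config schema; a sketch, not the repo's exact patch):

```yaml
# talconfig.yaml global patch (excerpt, sketch): extra Longhorn data disk
machine:
  disks:
    - device: /dev/sdb
      partitions:
        - mountpoint: /var/mnt/longhorn   # must match Longhorn's defaultDataPath
```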
After editing talconfig.yaml, regenerate and apply configs:
cd talos
talhelper genconfig
# apply each node’s YAML from clusterconfig/ with talosctl apply-config
Then reboot each node once so the new disk layout is applied.
talosctl TLS errors (unknown authority, Ed25519 verification failure)
talosctl does not automatically use talos/clusterconfig/talosconfig. If you
omit it, the client falls back to ~/.talos/config, which is usually a
different cluster CA — you then get TLS handshake failures against the noble
nodes.
Always set this in the shell where you run talosctl (use an absolute path
if you change directories):
cd talos
export TALOSCONFIG="$(pwd)/clusterconfig/talosconfig"
export ENDPOINT=192.168.50.230
Sanity check (should print Talos and Kubernetes versions, not TLS errors):
talosctl -e "${ENDPOINT}" -n 192.168.50.20 version
Then use the same shell for apply-config, reboot, and health.
If it still fails after TALOSCONFIG is set, the running cluster was likely
bootstrapped with different secrets than the ones in your current
talsecret.sops.yaml / regenerated clusterconfig/. In that case you need the
original talosconfig that matched the cluster when it was created, or you
must align secrets and cluster state (recovery / rebuild is a larger topic).
Keep talosctl roughly aligned with the node Talos version (for example
v1.12.x clients for v1.12.5 nodes).
Paste tip: run one command per line. Pasting ...cp-3.yaml and
talosctl on the same line breaks the filename and can confuse the shell.
More than one extra disk per node
If you add a third disk later, extend machine.disks in talconfig.yaml (for
example /dev/sdc → /var/mnt/longhorn-disk2) and register that path in
Longhorn as an additional disk for that node.
Recommended:
- use one dedicated filesystem per Longhorn disk path
- avoid using the Talos system disk for heavy Longhorn data
- spread replicas across nodes for resiliency
11) Upgrade Talos to v1.12.x
This repo now pins:
- `talosVersion: v1.12.5` in `talconfig.yaml`
Regenerate configs
From talos/:
talhelper genconfig
Rolling upgrade order
Upgrade one node at a time, waiting for it to return healthy before moving on.
- Control-plane nodes (`noble-cp-1`, then `noble-cp-2`, then `noble-cp-3`)
- Worker node (`noble-worker-1`)
Example commands (adjust node IP per step):
talosctl --talosconfig ./clusterconfig/talosconfig -n 192.168.50.20 upgrade --image ghcr.io/siderolabs/installer:v1.12.5
talosctl --talosconfig ./clusterconfig/talosconfig -n 192.168.50.20 reboot
talosctl --talosconfig ./clusterconfig/talosconfig -n 192.168.50.20 health
After all nodes are upgraded, verify:
talosctl --talosconfig ./clusterconfig/talosconfig version
kubectl get nodes -o wide
12) Destroy the cluster and rebuild from scratch
Use this when Kubernetes / etcd / Argo / Longhorn state is corrupted and you want a clean cluster. This wipes cluster state on the nodes (etcd, workloads, Longhorn data on cluster disks). Plan for downtime and backup anything you must keep off-cluster first.
12.1 Reset every Talos node (Kubernetes is destroyed)
From talos/ with a working talosconfig that matches the machines (same
TALOSCONFIG / ENDPOINT guidance as elsewhere in this README):
cd talos
export TALOSCONFIG="$(pwd)/clusterconfig/talosconfig"
export ENDPOINT=192.168.50.230
Reset one node at a time, waiting for each to reboot before the next. Order:
worker first, then non-bootstrap control planes, then the bootstrap
control plane last (noble-cp-1 → 192.168.50.20).
talosctl -e "${ENDPOINT}" -n 192.168.50.10 reset --graceful=false
talosctl -e "${ENDPOINT}" -n 192.168.50.30 reset --graceful=false
talosctl -e "${ENDPOINT}" -n 192.168.50.40 reset --graceful=false
talosctl -e "${ENDPOINT}" -n 192.168.50.20 reset --graceful=false
If the API VIP is already unreachable, target the node IP as endpoint for that
node, for example:
talosctl -e 192.168.50.10 -n 192.168.50.10 reset --graceful=false.
Your workstation kubeconfig will not work for the old cluster after this;
that is expected until you bootstrap again.
12.2 (Optional) New cluster secrets
For a fully fresh identity (new cluster CA and talosconfig):
cd talos
talhelper gensecret > talsecret.sops.yaml
# encrypt / store talsecret as you usually do, then:
talhelper genconfig
If you keep the existing talsecret.sops.yaml, still run talhelper genconfig
so clusterconfig/ matches what you will apply.
12.3 Apply configs, bootstrap, kubeconfig
Repeat §3 Apply Talos configs and §4 Bootstrap the cluster (and §5
Validate) from the top of this README: apply-config each node, then
talosctl bootstrap, then talosctl kubeconfig into talos/kubeconfig.
12.4 Redeploy GitOps (Argo CD + apps)
From your workstation (repo root), with KUBECONFIG pointing at the new
talos/kubeconfig:
# Set REPO to the directory that contains both talos/ and clusters/ (not a literal "path/to")
REPO="${HOME}/Developer/home-server"
export KUBECONFIG="${REPO}/talos/kubeconfig"
cd "${REPO}"
kubectl apply -k clusters/noble/bootstrap/argocd
kubectl apply -f clusters/noble/root-application.yaml
Resolve Argo CD admin login (secret / password reset) as needed; then let
noble-root sync clusters/noble/apps/.
13) Mid-rebuild issues: etcd, bootstrap, and apply-config
tls: certificate required when using apply-config --insecure
After a node has joined the cluster, the Talos API expects client
certificates from your talosconfig. --insecure only applies to maintenance
(before join / after a reset).
Do one of:
- Apply the config with your `talosconfig` (no `--insecure`):
cd talos
export TALOSCONFIG="$(pwd)/clusterconfig/talosconfig"
export ENDPOINT=192.168.50.230
talosctl -e "${ENDPOINT}" apply-config -n 192.168.50.30 -f clusterconfig/noble-noble-cp-2.yaml
- Or `talosctl reset` that node first (see §12.1), then use `apply-config --insecure` again while it is back in maintenance mode.
bootstrap: etcd data directory is not empty
The bootstrap node (192.168.50.20) already has a previous etcd data directory on disk (from a failed or partial bootstrap). etcd will not bootstrap again until that state is wiped.
Fix: run talosctl reset --graceful=false on the control plane nodes
(at minimum the bootstrap node; often all four nodes is cleaner). See §12.1.
Then re-apply machine configs and run talosctl bootstrap exactly once.
etcd unhealthy / “Preparing” on some control planes
Usually means split or partial cluster state. The reliable fix is the same
full reset (§12.1), then a single ordered bring-up: apply all configs →
bootstrap once → talosctl health.