Talos deployment (4 nodes)
This directory contains a talhelper cluster definition for a 4-node Talos cluster:

- 3 hybrid control-plane/worker nodes: `noble-cp-1..3`
- 1 worker-only node: `noble-worker-1`
- `allowSchedulingOnControlPlanes: true`
- CNI: `none` (for Cilium via GitOps)
1) Update values for your environment
Edit `talconfig.yaml`:

- `endpoint` (Kubernetes API VIP or LB IP)
- `additionalApiServerCertSans` / `additionalMachineCertSans`: must include the same VIP (and DNS name, if you use one) that clients and `talosctl` use — otherwise TLS to `https://<VIP>:6443` fails because the cert only lists node IPs by default. This repo sets `192.168.50.230` (and `kube.noble.lab.pcenicni.dev`) to match kube-vip.
- each node's `ipAddress`
- each node's `installDisk` (for example `/dev/sda`, `/dev/nvme0n1`)
- `talosVersion` / `kubernetesVersion`, if desired
After changing SANs, run `talhelper genconfig`, re-run `talosctl apply-config` against all control-plane nodes (certs are regenerated), then refresh `talosctl kubeconfig`.
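As a sketch, the relevant top-level keys look like this (key names follow talhelper's `talconfig.yaml` schema; the VIP and DNS name are this repo's values from above):

```yaml
# talconfig.yaml (excerpt): cert SANs must cover every name/IP clients dial
endpoint: https://192.168.50.230:6443
additionalApiServerCertSans:
  - 192.168.50.230
  - kube.noble.lab.pcenicni.dev
additionalMachineCertSans:
  - 192.168.50.230
  - kube.noble.lab.pcenicni.dev
```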
2) Generate cluster secrets and machine configs
From this directory:
talhelper gensecret > talsecret.sops.yaml
talhelper genconfig
Generated machine configs are written to clusterconfig/.
3) Apply Talos configs
Apply each node file to the matching node IP from talconfig.yaml:
talosctl apply-config --insecure -n 192.168.50.20 -f clusterconfig/noble-noble-cp-1.yaml
talosctl apply-config --insecure -n 192.168.50.30 -f clusterconfig/noble-noble-cp-2.yaml
talosctl apply-config --insecure -n 192.168.50.40 -f clusterconfig/noble-noble-cp-3.yaml
talosctl apply-config --insecure -n 192.168.50.10 -f clusterconfig/noble-noble-worker-1.yaml
4) Bootstrap the cluster
After all nodes are up (bootstrap once, from any control-plane node):
talosctl bootstrap -n 192.168.50.20 -e 192.168.50.230
talosctl kubeconfig -n 192.168.50.20 -e 192.168.50.230 .
5) Validate
talosctl -n 192.168.50.20 -e 192.168.50.230 health
kubectl get nodes -o wide
kubectl errors: `lookup https: no such host` or `https://https/...`

That means the active kubeconfig has a broken `cluster.server` URL (often a double `https://` or a duplicated `:6443`). kubectl then tries to resolve the hostname `https`, which fails.
Inspect what you are using:
kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}{"\n"}'
It must be a single valid URL, for example:
- `https://192.168.50.230:6443` (the API VIP from `talconfig.yaml`), or
- `https://kube.noble.lab.pcenicni.dev:6443` (if DNS points at that VIP)
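You can sanity-check the shape of that URL before editing anything. This is a rough pattern check only, not a full URL parser, and `check_server` is a local helper defined here, not a kubectl feature:

```shell
# Classify an API server URL: exactly one scheme and one port.
check_server() {
  case "$1" in
    https://*https://*|*:6443:6443*) echo "broken: $1" ;;
    https://*:6443)                  echo "ok: $1" ;;
    *)                               echo "suspicious: $1" ;;
  esac
}

check_server "https://192.168.50.230:6443"          # ok
check_server "https://https://192.168.50.230:6443"  # broken (double scheme)
```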
Fix the cluster entry (replace noble with your context’s cluster name if
different):
kubectl config set-cluster noble --server=https://192.168.50.230:6443
Or point kubectl at this repo’s kubeconfig (known-good server line):
export KUBECONFIG="$(pwd)/kubeconfig"
kubectl cluster-info
Avoid pasting https:// twice when running kubectl config set-cluster ... --server=....
kubectl apply fails: localhost:8080 / openapi connection refused
kubectl is not using a real cluster config; it falls back to the default
http://localhost:8080 (no KUBECONFIG, empty file, or wrong file).
Fix:
cd talos
export KUBECONFIG="$(pwd)/kubeconfig"
kubectl config current-context
kubectl cluster-info
Then run kubectl apply from the repository root (parent of talos/) in
the same shell. Do not use a literal cd /path/to/... — that was only a
placeholder. Example (adjust to where you cloned this repo):
export KUBECONFIG="${HOME}/Developer/home-server/talos/kubeconfig"
kubectl config set-cluster noble ... only updates the file kubectl is
actually reading (often ~/.kube/config). It does nothing if KUBECONFIG
points at another path.
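Which file kubectl actually edits follows the standard lookup order; this plain POSIX parameter expansion shows it without touching any cluster:

```shell
# kubectl config set-cluster writes to $KUBECONFIG if it is set,
# otherwise to ~/.kube/config.
echo "kubectl will edit: ${KUBECONFIG:-$HOME/.kube/config}"
```

(`KUBECONFIG` may also hold several colon-separated paths that kubectl merges; keeping it pointed at a single file during bootstrap avoids surprises.)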
6) GitOps-pinned Cilium values
The Cilium settings that worked for this Talos cluster are now persisted in:
- `clusters/noble/apps/cilium/helm-values.yaml`
- `clusters/noble/apps/cilium/application.yaml` (Helm chart + `valueFiles` from this repo)
That Argo CD Application pins chart 1.16.6 and uses the same values file
for API host/port, cgroup settings, IPAM CIDR, and security capabilities.
Cilium before Argo CD (cni: none)
This cluster uses cniConfig.name: none in talconfig.yaml so Talos does
not install a CNI. Argo CD pods cannot schedule until some CNI makes nodes
Ready (otherwise the node.kubernetes.io/not-ready taint blocks scheduling).
Install Cilium once with Helm from your workstation (same chart and values Argo will manage later), then bootstrap Argo CD:
helm repo add cilium https://helm.cilium.io/
helm repo update
helm upgrade --install cilium cilium/cilium \
--namespace kube-system \
--version 1.16.6 \
-f clusters/noble/apps/cilium/helm-values.yaml \
--wait --timeout 10m
kubectl get nodes
kubectl wait --for=condition=Ready nodes --all --timeout=300s
If `helm upgrade --install` seems stuck after “Installing it now”, it is usually still
pulling images (quay.io/cilium/...) or waiting for pods to become Ready. In
another terminal run kubectl get pods -n kube-system -w and check for
ImagePullBackOff, Pending, or CrashLoopBackOff. To avoid blocking on
Helm’s wait logic, install without --wait, confirm Cilium pods, then continue:
helm upgrade --install cilium cilium/cilium \
--namespace kube-system \
--version 1.16.6 \
-f clusters/noble/apps/cilium/helm-values.yaml
kubectl get pods -n kube-system -l app.kubernetes.io/part-of=cilium -w
helm-values.yaml sets operator.replicas: 1 so the chart default (two
operators with hard anti-affinity) cannot deadlock helm --wait when only one
node can take the operator early in bootstrap.
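A minimal sketch of the relevant `helm-values.yaml` keys (key names follow the Cilium chart; `operator.replicas: 1` is stated above, while the API host/port mirror this repo's VIP and the cgroup/ipam values are the ones Cilium's Talos guidance commonly uses, assumed here):

```yaml
# clusters/noble/apps/cilium/helm-values.yaml (sketch, not the repo's exact file)
operator:
  replicas: 1               # avoid the 2-replica anti-affinity deadlock at bootstrap
k8sServiceHost: 192.168.50.230   # kube-vip API VIP
k8sServicePort: 6443
cgroup:
  autoMount:
    enabled: false          # Talos manages the cgroup mount itself
  hostRoot: /sys/fs/cgroup
ipam:
  mode: kubernetes
```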
If helm upgrade fails with server-side apply conflicts and
argocd-controller, Argo already synced Cilium and owns those fields
on live objects. Clearing syncPolicy on the Application does not
remove that ownership; Helm still conflicts until you take over the fields
or only use Argo.
One-shot CLI fix (Helm 3.13+): add --force-conflicts so SSA wins the
disputed fields:
helm upgrade --install cilium cilium/cilium \
--namespace kube-system \
--version 1.16.6 \
-f clusters/noble/apps/cilium/helm-values.yaml \
--force-conflicts
Typical conflicts: Secret `hubble-server-certs` (`.data` TLS keys) and
Deployment `cilium-operator` (`.spec.replicas`,
`.spec.strategy.rollingUpdate.maxUnavailable`). The `cilium` Application
lists ignoreDifferences for those paths plus RespectIgnoreDifferences
so later Argo syncs do not keep overwriting them. Apply the manifest after you
change it: kubectl apply -f clusters/noble/apps/cilium/application.yaml.
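The Application-side shape for that looks roughly like this (field names per the Argo CD Application spec; the listed resources and paths are the ones named above):

```yaml
# clusters/noble/apps/cilium/application.yaml (excerpt, sketch)
spec:
  ignoreDifferences:
    - kind: Secret
      name: hubble-server-certs
      jsonPointers:
        - /data
    - group: apps
      kind: Deployment
      name: cilium-operator
      jsonPointers:
        - /spec/replicas
        - /spec/strategy/rollingUpdate/maxUnavailable
  syncPolicy:
    syncOptions:
      - RespectIgnoreDifferences=true
```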
After bootstrap, prefer syncing Cilium only through Argo (from Git) instead
of ad hoc Helm, unless you suspend the cilium Application first.
Shell tip: a line like # comment must start with #; if the shell
reports command not found: #, the character is not a real hash or the
line was pasted wrong—run kubectl apply ... as its own command without a
leading comment on the same paste block.
If nodes were already Ready, you can skip straight to section 7.
7) Argo CD app-of-apps bootstrap
This repo includes an app-of-apps structure for cluster apps:
- Root app: `clusters/noble/root-application.yaml`
- Child apps index: `clusters/noble/apps/kustomization.yaml`
- Argo CD app: `clusters/noble/apps/argocd/application.yaml`
- Cilium app: `clusters/noble/apps/cilium/application.yaml`
Bootstrap once from your workstation:
kubectl apply -k clusters/noble/bootstrap/argocd
kubectl wait --for=condition=Established crd/appprojects.argoproj.io --timeout=120s
kubectl apply -f clusters/noble/bootstrap/argocd/default-appproject.yaml
kubectl apply -f clusters/noble/root-application.yaml
If the first command errors on AppProject (“no matches for kind AppProject”), the CRDs were not ready yet; run the kubectl wait and kubectl apply -f .../default-appproject.yaml lines, then continue.
After this, Argo CD continuously reconciles all applications under
clusters/noble/apps/.
8) kube-vip API VIP (192.168.50.230)
HAProxy has been removed in favor of kube-vip running on control-plane nodes.
Manifests are in:
- `clusters/noble/apps/kube-vip/application.yaml`
- `clusters/noble/apps/kube-vip/vip-rbac.yaml`
- `clusters/noble/apps/kube-vip/vip-daemonset.yaml`
The DaemonSet advertises 192.168.50.230 in ARP mode and fronts the Kubernetes
API on port 6443.
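The core of the DaemonSet is kube-vip's environment configuration; a sketch (env var names are kube-vip's, the address/port/ARP values come from this setup, and `eno1` is a placeholder for your real uplink):

```yaml
# clusters/noble/apps/kube-vip/vip-daemonset.yaml (container env, sketch)
env:
  - name: vip_interface
    value: eno1              # set to the node's actual link (see section below)
  - name: vip_arp
    value: "true"            # ARP mode
  - name: address
    value: 192.168.50.230
  - name: port
    value: "6443"
  - name: cp_enable
    value: "true"            # front the Kubernetes API
```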
Apply manually (or let Argo CD sync from root app):
kubectl apply -k clusters/noble/apps/kube-vip
Validate:
kubectl -n kube-system get pods -l app.kubernetes.io/name=kube-vip-ds -o wide
nc -vz 192.168.50.230 6443
If kube-vip-ds pods are CrashLoopBackOff, logs usually show
could not get link for interface '…'. kube-vip binds the VIP to
vip_interface; on Talos the uplink is often eno1, enp0s…, or
enx…, not eth0. On a control-plane node IP from talconfig.yaml:
talosctl -n 192.168.50.20 get links
Do not paste that command’s table output back into the shell: zsh runs
each line as a command (e.g. 192.168.50.20 → command not found), and a line
starting with NODE can be mistaken for the node binary and try to
load a file like NAMESPACE in the current directory. Also avoid pasting
the prompt ((base) … %) together with the command (duplicate prompt →
parse errors).
Set vip_interface in clusters/noble/apps/kube-vip/vip-daemonset.yaml to
that link’s metadata.id, commit, sync (or kubectl apply -k clusters/noble/apps/kube-vip), and confirm pods go Running.
9) Argo CD via DNS host (no port)
Argo CD is exposed through a kube-vip managed LoadBalancer Service:
argo.noble.lab.pcenicni.dev
Manifests:
- `clusters/noble/bootstrap/argocd/argocd-server-lb.yaml`
- `clusters/noble/apps/kube-vip/vip-daemonset.yaml` (`svc_enable: "true"`)
After syncing manifests, create a Pi-hole DNS A record:
`argo.noble.lab.pcenicni.dev` → `192.168.50.231`
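If you manage Pi-hole from the box rather than the web UI, one way is editing its local-records file directly (this assumes Pi-hole v5, which stores Local DNS records as "IP hostname" lines in `custom.list`):

```
# /etc/pihole/custom.list (Pi-hole v5 Local DNS record format: "IP hostname")
192.168.50.231 argo.noble.lab.pcenicni.dev
```

Then restart the resolver (`pihole restartdns`) so the record is served.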
10) Longhorn storage and extra disks
Longhorn is deployed from:
clusters/noble/apps/longhorn/application.yaml
Monitoring apps are configured to use storageClassName: longhorn, so you can
persist Prometheus/Alertmanager/Loki data once Longhorn is healthy.
Argo CD: longhorn OutOfSync, Health Missing, no longhorn-role
Missing means nothing has been applied yet, or a sync never completed. The
Helm chart creates ClusterRole/longhorn-role on a successful install.
- See the failure reason:
kubectl describe application longhorn -n argocd
Check Status → Conditions and Status → Operation State for the error
(for example Helm render error, CRD apply failure, or repo-server cannot reach
https://charts.longhorn.io).
- Trigger a sync (Argo CD UI Sync, or CLI):
argocd app sync longhorn
- After a good sync, confirm:
kubectl get clusterrole longhorn-role
kubectl get pods -n longhorn-system
Extra drive layout (this cluster)
Each node uses:
- `/dev/sda` — Talos install disk (`installDisk` in `talconfig.yaml`)
- `/dev/sdb` — dedicated Longhorn data disk
talconfig.yaml includes a global patch that partitions /dev/sdb and mounts it
at /var/mnt/longhorn, which matches Longhorn defaultDataPath in the Argo
Helm values.
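The patch boils down to a `machine.disks` entry (field names per the Talos machine config schema; a sketch, not the repo's exact patch):

```yaml
# talconfig.yaml global patch (excerpt, sketch): extra Longhorn data disk
machine:
  disks:
    - device: /dev/sdb
      partitions:
        - mountpoint: /var/mnt/longhorn   # must match Longhorn's defaultDataPath
```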
After editing talconfig.yaml, regenerate and apply configs:
cd talos
talhelper genconfig
# apply each node’s YAML from clusterconfig/ with talosctl apply-config
Then reboot each node once so the new disk layout is applied.
talosctl TLS errors (unknown authority, Ed25519 verification failure)
talosctl does not automatically use talos/clusterconfig/talosconfig. If you
omit it, the client falls back to ~/.talos/config, which is usually a
different cluster CA — you then get TLS handshake failures against the noble
nodes.
Always set this in the shell where you run talosctl (use an absolute path
if you change directories):
cd talos
export TALOSCONFIG="$(pwd)/clusterconfig/talosconfig"
export ENDPOINT=192.168.50.230
Sanity check (should print Talos and Kubernetes versions, not TLS errors):
talosctl -e "${ENDPOINT}" -n 192.168.50.20 version
Then use the same shell for apply-config, reboot, and health.
If it still fails after TALOSCONFIG is set, the running cluster was likely
bootstrapped with different secrets than the ones in your current
talsecret.sops.yaml / regenerated clusterconfig/. In that case you need the
original talosconfig that matched the cluster when it was created, or you
must align secrets and cluster state (recovery / rebuild is a larger topic).
Keep talosctl roughly aligned with the node Talos version (for example
v1.12.x clients for v1.12.5 nodes).
Paste tip: run one command per line. Pasting ...cp-3.yaml and
talosctl on the same line breaks the filename and can confuse the shell.
More than one extra disk per node
If you add a third disk later, extend machine.disks in talconfig.yaml (for
example /dev/sdc → /var/mnt/longhorn-disk2) and register that path in
Longhorn as an additional disk for that node.
Recommended:
- use one dedicated filesystem per Longhorn disk path
- avoid using the Talos system disk for heavy Longhorn data
- spread replicas across nodes for resiliency
11) Upgrade Talos to v1.12.x
This repo now pins:
- `talosVersion: v1.12.5` in `talconfig.yaml`
Regenerate configs
From talos/:
talhelper genconfig
Rolling upgrade order
Upgrade one node at a time, waiting for it to return healthy before moving on.
- Control-plane nodes (`noble-cp-1`, then `noble-cp-2`, then `noble-cp-3`)
- Worker node (`noble-worker-1`)
Example commands (adjust node IP per step):
talosctl --talosconfig ./clusterconfig/talosconfig -n 192.168.50.20 upgrade --image ghcr.io/siderolabs/installer:v1.12.5
talosctl --talosconfig ./clusterconfig/talosconfig -n 192.168.50.20 reboot
talosctl --talosconfig ./clusterconfig/talosconfig -n 192.168.50.20 health
After all nodes are upgraded, verify:
talosctl --talosconfig ./clusterconfig/talosconfig version
kubectl get nodes -o wide
12) Destroy the cluster and rebuild from scratch
Use this when Kubernetes / etcd / Argo / Longhorn state is corrupted and you want a clean cluster. This wipes cluster state on the nodes (etcd, workloads, Longhorn data on cluster disks). Plan for downtime and backup anything you must keep off-cluster first.
12.1 Reset every Talos node (Kubernetes is destroyed)
From talos/ with a working talosconfig that matches the machines (same
TALOSCONFIG / ENDPOINT guidance as elsewhere in this README):
cd talos
export TALOSCONFIG="$(pwd)/clusterconfig/talosconfig"
export ENDPOINT=192.168.50.230
Reset one node at a time, waiting for each to reboot before the next. Order:
worker first, then non-bootstrap control planes, then the bootstrap
control plane last (noble-cp-1 → 192.168.50.20).
talosctl -e "${ENDPOINT}" -n 192.168.50.10 reset --graceful=false
talosctl -e "${ENDPOINT}" -n 192.168.50.30 reset --graceful=false
talosctl -e "${ENDPOINT}" -n 192.168.50.40 reset --graceful=false
talosctl -e "${ENDPOINT}" -n 192.168.50.20 reset --graceful=false
If the API VIP is already unreachable, target the node IP as endpoint for that
node, for example:
talosctl -e 192.168.50.10 -n 192.168.50.10 reset --graceful=false.
Your workstation kubeconfig will not work for the old cluster after this;
that is expected until you bootstrap again.
12.2 (Optional) New cluster secrets
For a fully fresh identity (new cluster CA and talosconfig):
cd talos
talhelper gensecret > talsecret.sops.yaml
# encrypt / store talsecret as you usually do, then:
talhelper genconfig
If you keep the existing talsecret.sops.yaml, still run talhelper genconfig
so clusterconfig/ matches what you will apply.
12.3 Apply configs, bootstrap, kubeconfig
Repeat §3 Apply Talos configs and §4 Bootstrap the cluster (and §5
Validate) from the top of this README: apply-config each node, then
talosctl bootstrap, then talosctl kubeconfig into talos/kubeconfig.
12.4 Redeploy GitOps (Argo CD + apps)
From your workstation (repo root), with KUBECONFIG pointing at the new
talos/kubeconfig:
# Set REPO to the directory that contains both talos/ and clusters/ (not a literal "path/to")
REPO="${HOME}/Developer/home-server"
export KUBECONFIG="${REPO}/talos/kubeconfig"
cd "${REPO}"
kubectl apply -k clusters/noble/bootstrap/argocd
kubectl apply -f clusters/noble/root-application.yaml
Resolve Argo CD admin login (secret / password reset) as needed; then let
noble-root sync clusters/noble/apps/.
13) Mid-rebuild issues: etcd, bootstrap, and apply-config
tls: certificate required when using apply-config --insecure
After a node has joined the cluster, the Talos API expects client
certificates from your talosconfig. --insecure only applies to maintenance
(before join / after a reset).
Do one of:
- Apply the config with your `talosconfig` (no `--insecure`):
cd talos
export TALOSCONFIG="$(pwd)/clusterconfig/talosconfig"
export ENDPOINT=192.168.50.230
talosctl -e "${ENDPOINT}" apply-config -n 192.168.50.30 -f clusterconfig/noble-noble-cp-2.yaml
- Or `talosctl reset` that node first (see §12.1), then use `apply-config --insecure` again while it is back in maintenance mode.
bootstrap: etcd data directory is not empty
The bootstrap node (192.168.50.20) already has a previous etcd data directory on disk (from a failed or partial bootstrap). etcd will not bootstrap again until that state is wiped.
Fix: run talosctl reset --graceful=false on the control plane nodes
(at minimum the bootstrap node; often all four nodes is cleaner). See §12.1.
Then re-apply machine configs and run talosctl bootstrap exactly once.
etcd unhealthy / “Preparing” on some control planes
Usually means split or partial cluster state. The reliable fix is the same
full reset (§12.1), then a single ordered bring-up: apply all configs →
bootstrap once → talosctl health.