Talos deployment (4 nodes)
This directory contains a talhelper cluster definition for a 4-node Talos
cluster:
- 3 hybrid control-plane/worker nodes: noble-cp-1..3
- 1 worker-only node: noble-worker-1
- allowSchedulingOnControlPlanes: true
- CNI: none (for Cilium via GitOps)
1) Update values for your environment
Edit talconfig.yaml:
- endpoint (Kubernetes API VIP or LB IP)
- additionalApiServerCertSans / additionalMachineCertSans: must include the same VIP (and DNS name, if you use one) that clients and talosctl use — otherwise TLS to https://<VIP>:6443 fails because the cert only lists node IPs by default. This repo sets 192.168.50.230 (and kube.noble.lab.pcenicni.dev) to match kube-vip.
- each node's ipAddress
- each node's installDisk (for example /dev/sda, /dev/nvme0n1)
- talosVersion / kubernetesVersion if desired
After changing SANs, run talhelper genconfig, re-apply the regenerated configs to all
control-plane nodes (certificates are regenerated), then refresh the kubeconfig with
talosctl kubeconfig.
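An illustrative talconfig.yaml fragment with matching endpoint and SANs (values taken from this README; field names follow talhelper's talconfig schema, and clusterName: noble is an assumption inferred from the generated noble-noble-*.yaml file names — the real file carries many more fields):

```yaml
# talconfig.yaml (excerpt, illustrative)
clusterName: noble                       # assumed from generated file names
endpoint: https://192.168.50.230:6443    # API VIP served by kube-vip
additionalApiServerCertSans:
  - 192.168.50.230
  - kube.noble.lab.pcenicni.dev
additionalMachineCertSans:
  - 192.168.50.230
  - kube.noble.lab.pcenicni.dev
```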
2) Generate cluster secrets and machine configs
From this directory:
talhelper gensecret > talsecret.sops.yaml
talhelper genconfig
Generated machine configs are written to clusterconfig/.
3) Apply Talos configs
Apply each node file to the matching node IP from talconfig.yaml:
talosctl apply-config --insecure -n 192.168.50.20 -f clusterconfig/noble-noble-cp-1.yaml
talosctl apply-config --insecure -n 192.168.50.30 -f clusterconfig/noble-noble-cp-2.yaml
talosctl apply-config --insecure -n 192.168.50.40 -f clusterconfig/noble-noble-cp-3.yaml
talosctl apply-config --insecure -n 192.168.50.10 -f clusterconfig/noble-noble-worker-1.yaml
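The four commands above can also be generated from a small IP-to-file table, which avoids copy-paste mistakes. A sketch (it only prints the commands; pipe the output to sh to actually run them):

```shell
# Print one talosctl apply-config command per node (dry run).
# The IP <-> file mapping is copied from step 3 of this README.
gen_apply_cmds() {
  while read -r ip file; do
    printf 'talosctl apply-config --insecure -n %s -f clusterconfig/%s\n' \
      "$ip" "$file"
  done <<'EOF'
192.168.50.20 noble-noble-cp-1.yaml
192.168.50.30 noble-noble-cp-2.yaml
192.168.50.40 noble-noble-cp-3.yaml
192.168.50.10 noble-noble-worker-1.yaml
EOF
}
gen_apply_cmds
```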
4) Bootstrap the cluster
After all nodes are up (bootstrap once, from any control-plane node):
talosctl bootstrap -n 192.168.50.20 -e 192.168.50.230
talosctl kubeconfig -n 192.168.50.20 -e 192.168.50.230 .
5) Validate
talosctl -n 192.168.50.20 -e 192.168.50.230 health
kubectl get nodes -o wide
kubectl errors: lookup https: no such host or https://https/...
That means the active kubeconfig has a broken cluster.server URL (often a
double https:// or duplicate :6443). Kubernetes then tries to resolve
the hostname https, which fails.
Inspect what you are using:
kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}{"\n"}'
It must be a single valid URL, for example:
- https://192.168.50.230:6443 (API VIP from talconfig.yaml), or
- https://kube.noble.lab.pcenicni.dev:6443 (if DNS points at that VIP)
Fix the cluster entry (replace noble with your context’s cluster name if
different):
kubectl config set-cluster noble --server=https://192.168.50.230:6443
Or point kubectl at this repo’s kubeconfig (known-good server line):
export KUBECONFIG="$(pwd)/kubeconfig"
kubectl cluster-info
Avoid pasting https:// twice when running kubectl config set-cluster ... --server=....
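The two failure shapes (doubled scheme, duplicated port) can be caught with a small shell check. A sketch; check_server_url is a hypothetical helper, not part of kubectl:

```shell
# Return 0 only for a single well-formed https URL on port 6443;
# reject a doubled scheme or a duplicated port.
check_server_url() {
  case "$1" in
    *https://*https://*) return 1 ;;  # doubled scheme
    *:6443*:6443*)       return 1 ;;  # duplicated port
    https://*:6443)      return 0 ;;
    *)                   return 1 ;;
  esac
}

check_server_url "https://192.168.50.230:6443" && echo "server URL looks sane"
```

Feed it the output of kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}' to check the active kubeconfig.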
kubectl apply fails: localhost:8080 / openapi connection refused
kubectl is not using a real cluster config; it falls back to the default
http://localhost:8080 (no KUBECONFIG, empty file, or wrong file).
Fix:
cd talos
export KUBECONFIG="$(pwd)/kubeconfig"
kubectl config current-context
kubectl cluster-info
Then run kubectl apply from the repository root (parent of talos/) in
the same shell. Do not use a literal cd /path/to/... — that was only a
placeholder. Example (adjust to where you cloned this repo):
export KUBECONFIG="${HOME}/Developer/home-server/talos/kubeconfig"
kubectl config set-cluster noble ... only updates the file kubectl is
actually reading (often ~/.kube/config). It does nothing if KUBECONFIG
points at another path.
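A quick way to see which file kubectl will actually read before running set-cluster. A sketch; it ignores the multi-path KUBECONFIG=a:b form for simplicity:

```shell
# Resolve the kubeconfig path kubectl will use: $KUBECONFIG if set,
# otherwise the default ~/.kube/config.
active_kubeconfig() {
  printf '%s\n' "${KUBECONFIG:-$HOME/.kube/config}"
}

(KUBECONFIG=/tmp/demo-kubeconfig; active_kubeconfig)  # prints /tmp/demo-kubeconfig
```

Run kubectl config set-cluster against whichever path this prints, or export KUBECONFIG first so both agree.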
6) GitOps-pinned Cilium values
The Cilium settings that worked for this Talos cluster are now persisted in:
- clusters/noble/apps/cilium/helm-values.yaml
- clusters/noble/apps/cilium/application.yaml (Helm chart + valueFiles from this repo)
That Argo CD Application pins chart 1.16.6 and uses the same values file
for API host/port, cgroup settings, IPAM CIDR, and security capabilities.
Cilium before Argo CD (cni: none)
This cluster uses cniConfig.name: none in talconfig.yaml so Talos does
not install a CNI. Argo CD pods cannot schedule until some CNI makes nodes
Ready (otherwise the node.kubernetes.io/not-ready taint blocks scheduling).
Install Cilium once with Helm from your workstation (same chart and values Argo will manage later), then bootstrap Argo CD:
helm repo add cilium https://helm.cilium.io/
helm repo update
helm upgrade --install cilium cilium/cilium \
--namespace kube-system \
--version 1.16.6 \
-f clusters/noble/apps/cilium/helm-values.yaml \
--wait --timeout 10m
kubectl get nodes
kubectl wait --for=condition=Ready nodes --all --timeout=300s
If helm upgrade --install seems stuck after “Installing it now”, it is usually still
pulling images (quay.io/cilium/...) or waiting for pods to become Ready. In
another terminal run kubectl get pods -n kube-system -w and check for
ImagePullBackOff, Pending, or CrashLoopBackOff. To avoid blocking on
Helm’s wait logic, install without --wait, confirm Cilium pods, then continue:
helm upgrade --install cilium cilium/cilium \
--namespace kube-system \
--version 1.16.6 \
-f clusters/noble/apps/cilium/helm-values.yaml
kubectl get pods -n kube-system -l app.kubernetes.io/part-of=cilium -w
helm-values.yaml sets operator.replicas: 1 so the chart default (two
operators with hard anti-affinity) cannot deadlock helm --wait when only one
node can take the operator early in bootstrap.
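The relevant fragment looks roughly like this (illustrative; the real helm-values.yaml in this repo carries many more settings for API host/port, cgroups, IPAM, and capabilities):

```yaml
# clusters/noble/apps/cilium/helm-values.yaml (excerpt, illustrative)
operator:
  replicas: 1   # chart default is 2 with hard anti-affinity
```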
If helm upgrade fails with a server-side apply conflict on
kube-system/hubble-server-certs and argocd-controller, Argo already
synced Cilium and owns that Secret’s TLS fields. The cilium Application
uses ignoreDifferences on that Secret plus RespectIgnoreDifferences
so GitOps and occasional CLI Helm runs do not fight over .data. Until that
manifest is applied in the cluster, either suspend the cilium Application
in Argo, or delete the Secret once (kubectl delete secret hubble-server-certs -n kube-system) and re-run helm upgrade --install
before Argo reconciles again. After bootstrap, prefer kubectl -n argocd get application cilium -o yaml / Argo UI to sync Cilium instead of ad hoc
Helm, unless you suspend the app first.
If nodes were already Ready, you can skip straight to section 7.
7) Argo CD app-of-apps bootstrap
This repo includes an app-of-apps structure for cluster apps:
- Root app: clusters/noble/root-application.yaml
- Child apps index: clusters/noble/apps/kustomization.yaml
- Argo CD app: clusters/noble/apps/argocd/application.yaml
- Cilium app: clusters/noble/apps/cilium/application.yaml
Bootstrap once from your workstation:
kubectl apply -k clusters/noble/bootstrap/argocd
kubectl wait --for=condition=Established crd/appprojects.argoproj.io --timeout=120s
kubectl apply -f clusters/noble/bootstrap/argocd/default-appproject.yaml
kubectl apply -f clusters/noble/root-application.yaml
If the first command errors on AppProject (“no matches for kind AppProject”), the CRDs were not ready yet; run the kubectl wait and kubectl apply -f .../default-appproject.yaml lines, then continue.
After this, Argo CD continuously reconciles all applications under
clusters/noble/apps/.
8) kube-vip API VIP (192.168.50.230)
HAProxy has been removed in favor of kube-vip running on control-plane nodes.
Manifests are in:
- clusters/noble/apps/kube-vip/application.yaml
- clusters/noble/apps/kube-vip/vip-rbac.yaml
- clusters/noble/apps/kube-vip/vip-daemonset.yaml
The DaemonSet advertises 192.168.50.230 in ARP mode and fronts the Kubernetes
API on port 6443.
Apply manually (or let Argo CD sync from root app):
kubectl apply -k clusters/noble/apps/kube-vip
Validate:
kubectl -n kube-system get pods -l app.kubernetes.io/name=kube-vip-ds -o wide
nc -vz 192.168.50.230 6443
9) Argo CD via DNS host (no port)
Argo CD is exposed through a kube-vip managed LoadBalancer Service:
argo.noble.lab.pcenicni.dev
Manifests:
- clusters/noble/bootstrap/argocd/argocd-server-lb.yaml
- clusters/noble/apps/kube-vip/vip-daemonset.yaml (svc_enable: "true")
After syncing manifests, create a Pi-hole DNS A record:
argo.noble.lab.pcenicni.dev -> 192.168.50.231
10) Longhorn storage and extra disks
Longhorn is deployed from:
clusters/noble/apps/longhorn/application.yaml
Monitoring apps are configured to use storageClassName: longhorn, so you can
persist Prometheus/Alertmanager/Loki data once Longhorn is healthy.
Argo CD: longhorn OutOfSync, Health Missing, no longhorn-role
Missing means nothing has been applied yet, or a sync never completed. The
Helm chart creates ClusterRole/longhorn-role on a successful install.
- See the failure reason:
kubectl describe application longhorn -n argocd
Check Status → Conditions and Status → Operation State for the error
(for example Helm render error, CRD apply failure, or repo-server cannot reach
https://charts.longhorn.io).
- Trigger a sync (Argo CD UI Sync, or CLI):
argocd app sync longhorn
- After a good sync, confirm:
kubectl get clusterrole longhorn-role
kubectl get pods -n longhorn-system
Extra drive layout (this cluster)
Each node uses:
- /dev/sda — Talos install disk (installDisk in talconfig.yaml)
- /dev/sdb — dedicated Longhorn data disk
talconfig.yaml includes a global patch that partitions /dev/sdb and mounts it
at /var/mnt/longhorn, which matches Longhorn defaultDataPath in the Argo
Helm values.
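The patch has roughly this shape (illustrative; field names follow the Talos machine config and talhelper's global patches key, so compare with the actual talconfig.yaml in this repo):

```yaml
# talconfig.yaml (excerpt, illustrative)
patches:
  - |-
    machine:
      disks:
        - device: /dev/sdb
          partitions:
            - mountpoint: /var/mnt/longhorn
```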
After editing talconfig.yaml, regenerate and apply configs:
cd talos
talhelper genconfig
# apply each node’s YAML from clusterconfig/ with talosctl apply-config
Then reboot each node once so the new disk layout is applied.
talosctl TLS errors (unknown authority, Ed25519 verification failure)
talosctl does not automatically use talos/clusterconfig/talosconfig. If you
omit it, the client falls back to ~/.talos/config, which is usually a
different cluster CA — you then get TLS handshake failures against the noble
nodes.
Always set this in the shell where you run talosctl (use an absolute path
if you change directories):
cd talos
export TALOSCONFIG="$(pwd)/clusterconfig/talosconfig"
export ENDPOINT=192.168.50.230
Sanity check (should print Talos and Kubernetes versions, not TLS errors):
talosctl -e "${ENDPOINT}" -n 192.168.50.20 version
Then use the same shell for apply-config, reboot, and health.
If it still fails after TALOSCONFIG is set, the running cluster was likely
bootstrapped with different secrets than the ones in your current
talsecret.sops.yaml / regenerated clusterconfig/. In that case you need the
original talosconfig that matched the cluster when it was created, or you
must align secrets and cluster state (recovery / rebuild is a larger topic).
Keep talosctl roughly aligned with the node Talos version (for example
v1.12.x clients for v1.12.5 nodes).
Paste tip: run one command per line. Pasting ...cp-3.yaml and
talosctl on the same line breaks the filename and can confuse the shell.
More than one extra disk per node
If you add a third disk later, extend machine.disks in talconfig.yaml (for
example /dev/sdc → /var/mnt/longhorn-disk2) and register that path in
Longhorn as an additional disk for that node.
Recommended:
- use one dedicated filesystem per Longhorn disk path
- avoid using the Talos system disk for heavy Longhorn data
- spread replicas across nodes for resiliency
11) Upgrade Talos to v1.12.x
This repo now pins:
talosVersion: v1.12.5 in talconfig.yaml
Regenerate configs
From talos/:
talhelper genconfig
Rolling upgrade order
Upgrade one node at a time, waiting for it to return healthy before moving on.
- Control plane nodes (noble-cp-1, then noble-cp-2, then noble-cp-3)
- Worker node (noble-worker-1)
Example commands (adjust node IP per step):
talosctl --talosconfig ./clusterconfig/talosconfig -n 192.168.50.20 upgrade --image ghcr.io/siderolabs/installer:v1.12.5
talosctl --talosconfig ./clusterconfig/talosconfig -n 192.168.50.20 reboot
talosctl --talosconfig ./clusterconfig/talosconfig -n 192.168.50.20 health
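The full rolling order can be sketched as a dry run that prints an upgrade/health pair per node, control planes first and worker last (IPs and order from this README; only pipe it to sh if you are prepared for it not to pause between nodes — real upgrades should wait for health before moving on):

```shell
# Print the upgrade + health commands per node, in rolling order (dry run).
upgrade_cmds() {
  for ip in 192.168.50.20 192.168.50.30 192.168.50.40 192.168.50.10; do
    printf 'talosctl --talosconfig ./clusterconfig/talosconfig -n %s upgrade --image ghcr.io/siderolabs/installer:v1.12.5\n' "$ip"
    printf 'talosctl --talosconfig ./clusterconfig/talosconfig -n %s health\n' "$ip"
  done
}
upgrade_cmds
```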
After all nodes are upgraded, verify:
talosctl --talosconfig ./clusterconfig/talosconfig version
kubectl get nodes -o wide
12) Destroy the cluster and rebuild from scratch
Use this when Kubernetes / etcd / Argo / Longhorn state is corrupted and you want a clean cluster. This wipes cluster state on the nodes (etcd, workloads, Longhorn data on cluster disks). Plan for downtime and backup anything you must keep off-cluster first.
12.1 Reset every Talos node (Kubernetes is destroyed)
From talos/ with a working talosconfig that matches the machines (same
TALOSCONFIG / ENDPOINT guidance as elsewhere in this README):
cd talos
export TALOSCONFIG="$(pwd)/clusterconfig/talosconfig"
export ENDPOINT=192.168.50.230
Reset one node at a time, waiting for each to reboot before the next. Order:
worker first, then non-bootstrap control planes, then the bootstrap
control plane last (noble-cp-1 → 192.168.50.20).
talosctl -e "${ENDPOINT}" -n 192.168.50.10 reset --graceful=false
talosctl -e "${ENDPOINT}" -n 192.168.50.30 reset --graceful=false
talosctl -e "${ENDPOINT}" -n 192.168.50.40 reset --graceful=false
talosctl -e "${ENDPOINT}" -n 192.168.50.20 reset --graceful=false
If the API VIP is already unreachable, target the node IP as endpoint for that
node, for example:
talosctl -e 192.168.50.10 -n 192.168.50.10 reset --graceful=false
Your workstation kubeconfig will not work for the old cluster after this;
that is expected until you bootstrap again.
12.2 (Optional) New cluster secrets
For a fully fresh identity (new cluster CA and talosconfig):
cd talos
talhelper gensecret > talsecret.sops.yaml
# encrypt / store talsecret as you usually do, then:
talhelper genconfig
If you keep the existing talsecret.sops.yaml, still run talhelper genconfig
so clusterconfig/ matches what you will apply.
12.3 Apply configs, bootstrap, kubeconfig
Repeat §3 Apply Talos configs and §4 Bootstrap the cluster (and §5
Validate) from the top of this README: apply-config each node, then
talosctl bootstrap, then talosctl kubeconfig into talos/kubeconfig.
12.4 Redeploy GitOps (Argo CD + apps)
From your workstation (repo root), with KUBECONFIG pointing at the new
talos/kubeconfig:
# Set REPO to the directory that contains both talos/ and clusters/ (not a literal "path/to")
REPO="${HOME}/Developer/home-server"
export KUBECONFIG="${REPO}/talos/kubeconfig"
cd "${REPO}"
kubectl apply -k clusters/noble/bootstrap/argocd
kubectl apply -f clusters/noble/root-application.yaml
Resolve Argo CD admin login (secret / password reset) as needed; then let
noble-root sync clusters/noble/apps/.
13) Mid-rebuild issues: etcd, bootstrap, and apply-config
tls: certificate required when using apply-config --insecure
After a node has joined the cluster, the Talos API expects client
certificates from your talosconfig. --insecure only applies to maintenance
(before join / after a reset).
Do one of:
- Apply the config with talosconfig (no --insecure):
cd talos
export TALOSCONFIG="$(pwd)/clusterconfig/talosconfig"
export ENDPOINT=192.168.50.230
talosctl -e "${ENDPOINT}" apply-config -n 192.168.50.30 -f clusterconfig/noble-noble-cp-2.yaml
- Or talosctl reset that node first (see §12.1), then use apply-config --insecure again while it is in maintenance.
bootstrap: etcd data directory is not empty
The bootstrap node (192.168.50.20) already has a previous etcd on disk (failed
or partial bootstrap). Kubernetes will not bootstrap again until that state is
wiped.
Fix: run talosctl reset --graceful=false on the control plane nodes
(at minimum the bootstrap node; often all four nodes is cleaner). See §12.1.
Then re-apply machine configs and run talosctl bootstrap exactly once.
etcd unhealthy / “Preparing” on some control planes
Usually means split or partial cluster state. The reliable fix is the same
full reset (§12.1), then a single ordered bring-up: apply all configs →
bootstrap once → talosctl health.