home-server/talos

Talos deployment (4 nodes)

This directory contains a talhelper cluster definition for a 4-node Talos cluster:

  • 3 hybrid control-plane/worker nodes: noble-cp-1..3
  • 1 worker-only node: noble-worker-1
  • allowSchedulingOnControlPlanes: true
  • CNI: none (for Cilium via GitOps)

1) Update values for your environment

Edit talconfig.yaml:

  • endpoint (Kubernetes API VIP or LB IP)
  • each node ipAddress
  • each node installDisk (for example /dev/sda, /dev/nvme0n1)
  • talosVersion / kubernetesVersion if desired
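For orientation, an abridged talconfig.yaml might look like the following. The field names are talhelper's; the IPs, hostnames, and versions are the examples used throughout this document — substitute your own:

```yaml
clusterName: noble
talosVersion: v1.12.5
endpoint: https://192.168.50.230:6443
allowSchedulingOnControlPlanes: true
cniConfig:
  name: none            # Cilium is installed later via GitOps
nodes:
  - hostname: noble-cp-1
    ipAddress: 192.168.50.20
    controlPlane: true
    installDisk: /dev/sda
  # ...noble-cp-2 and noble-cp-3 follow the same shape...
  - hostname: noble-worker-1
    ipAddress: 192.168.50.10
    controlPlane: false
    installDisk: /dev/sda
```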

2) Generate cluster secrets and machine configs

From this directory:

talhelper gensecret > talsecret.sops.yaml
talhelper genconfig

Generated machine configs are written to clusterconfig/.

3) Apply Talos configs

Apply each node file to the matching node IP from talconfig.yaml:

talosctl apply-config --insecure -n 192.168.50.20 -f clusterconfig/noble-noble-cp-1.yaml
talosctl apply-config --insecure -n 192.168.50.30 -f clusterconfig/noble-noble-cp-2.yaml
talosctl apply-config --insecure -n 192.168.50.40 -f clusterconfig/noble-noble-cp-3.yaml
talosctl apply-config --insecure -n 192.168.50.10 -f clusterconfig/noble-noble-worker-1.yaml

4) Bootstrap the cluster

After all nodes are up, run bootstrap exactly once, against any one control-plane node:

talosctl bootstrap -n 192.168.50.20 -e 192.168.50.230
talosctl kubeconfig -n 192.168.50.20 -e 192.168.50.230 .

5) Validate

talosctl -n 192.168.50.20 -e 192.168.50.230 health
kubectl get nodes -o wide

kubectl errors: "lookup https: no such host" or https://https/...

That means the active kubeconfig has a broken cluster.server URL (often a doubled https:// or a duplicated :6443). kubectl then tries to resolve the literal hostname https, which fails.

Inspect what you are using:

kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}{"\n"}'

It must be a single valid URL, for example:

  • https://192.168.50.230:6443 (API VIP from talconfig.yaml), or
  • https://kube.noble.lab.pcenicni.dev:6443 (if DNS points at that VIP)

Fix the cluster entry (replace noble with your context's cluster name if different):

kubectl config set-cluster noble --server=https://192.168.50.230:6443

Or point kubectl at this repo's kubeconfig (known-good server line):

export KUBECONFIG="$(pwd)/kubeconfig"
kubectl cluster-info

Avoid pasting https:// twice when running kubectl config set-cluster ... --server=....
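The two malformed-URL patterns above are easy to check mechanically. A minimal sketch — check_server is a hypothetical helper, not a kubectl subcommand:

```shell
# Flag a doubled scheme or duplicated :6443 in a kubeconfig server URL.
check_server() {
  case "$1" in
    https://https*) echo "broken (double https://): $1"; return 1 ;;
    *:6443*:6443*)  echo "broken (duplicate :6443): $1"; return 1 ;;
    https://*:6443) echo "ok: $1" ;;
    *)              echo "unexpected server URL: $1"; return 1 ;;
  esac
}
```

Feed it the active kubeconfig's server line, e.g. check_server "$(kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}')".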

6) GitOps-pinned Cilium values

The Cilium settings that worked for this Talos cluster are now persisted in:

  • clusters/noble/apps/cilium/application.yaml

That Argo CD Application pins chart 1.16.6 and includes the required Helm values for this environment (API host/port, cgroup settings, IPAM CIDR, and security capabilities), so future reconciles do not drift back to defaults.

7) Argo CD app-of-apps bootstrap

This repo includes an app-of-apps structure for cluster apps:

  • Root app: clusters/noble/root-application.yaml
  • Child apps index: clusters/noble/apps/kustomization.yaml
  • Argo CD app: clusters/noble/apps/argocd/application.yaml
  • Cilium app: clusters/noble/apps/cilium/application.yaml

Bootstrap once from your workstation:

kubectl apply -k clusters/noble/bootstrap/argocd
kubectl apply -f clusters/noble/root-application.yaml

After this, Argo CD continuously reconciles all applications under clusters/noble/apps/.

8) kube-vip API VIP (192.168.50.230)

HAProxy has been removed in favor of kube-vip running on control-plane nodes.

Manifests are in:

  • clusters/noble/apps/kube-vip/application.yaml
  • clusters/noble/apps/kube-vip/vip-rbac.yaml
  • clusters/noble/apps/kube-vip/vip-daemonset.yaml

The DaemonSet advertises 192.168.50.230 in ARP mode and fronts the Kubernetes API on port 6443.
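For reference, the relevant container environment in vip-daemonset.yaml looks roughly like this. The env names are kube-vip's documented settings and the values are this cluster's; treat it as a sketch, not the exact manifest:

```yaml
env:
  - name: vip_arp       # advertise the VIP in ARP mode
    value: "true"
  - name: address       # the API VIP
    value: "192.168.50.230"
  - name: port          # fronts the Kubernetes API
    value: "6443"
  - name: cp_enable     # control-plane VIP function
    value: "true"
  - name: svc_enable    # LoadBalancer Service support (used later for Argo CD)
    value: "true"
```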

Apply manually (or let Argo CD sync from root app):

kubectl apply -k clusters/noble/apps/kube-vip

Validate:

kubectl -n kube-system get pods -l app.kubernetes.io/name=kube-vip-ds -o wide
nc -vz 192.168.50.230 6443

9) Argo CD via DNS host (no port)

Argo CD is exposed through a kube-vip managed LoadBalancer Service:

  • argo.noble.lab.pcenicni.dev

Manifests:

  • clusters/noble/bootstrap/argocd/argocd-server-lb.yaml
  • clusters/noble/apps/kube-vip/vip-daemonset.yaml (svc_enable: "true")

After syncing manifests, create a Pi-hole DNS A record:

  • argo.noble.lab.pcenicni.dev -> 192.168.50.231

10) Longhorn storage and extra disks

Longhorn is deployed from:

  • clusters/noble/apps/longhorn/application.yaml

Monitoring apps are configured to use storageClassName: longhorn, so you can persist Prometheus/Alertmanager/Loki data once Longhorn is healthy.

Argo CD: longhorn OutOfSync, Health Missing, no longhorn-role

Missing means nothing has been applied yet, or a sync never completed. The Helm chart creates ClusterRole/longhorn-role on a successful install.

  1. See the failure reason:
kubectl describe application longhorn -n argocd

Check Status → Conditions and Status → Operation State for the error (for example Helm render error, CRD apply failure, or repo-server cannot reach https://charts.longhorn.io).

  2. Trigger a sync (Argo CD UI Sync, or CLI):
argocd app sync longhorn
  3. After a good sync, confirm:
kubectl get clusterrole longhorn-role
kubectl get pods -n longhorn-system

Extra drive layout (this cluster)

Each node uses:

  • /dev/sda — Talos install disk (installDisk in talconfig.yaml)
  • /dev/sdb — dedicated Longhorn data disk

talconfig.yaml includes a global patch that partitions /dev/sdb and mounts it at /var/mnt/longhorn, which matches Longhorn defaultDataPath in the Argo Helm values.
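The global patch described above might look roughly like this in talconfig.yaml — a sketch based on talhelper's patches field and Talos's machine.disks schema; your actual file is authoritative:

```yaml
patches:
  - |-
    machine:
      disks:
        - device: /dev/sdb                  # dedicated Longhorn data disk
          partitions:
            - mountpoint: /var/mnt/longhorn # must match Longhorn defaultDataPath
```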

After editing talconfig.yaml, regenerate and apply configs:

cd talos
talhelper genconfig
# apply each node's YAML from clusterconfig/ with talosctl apply-config

Then reboot each node once so the new disk layout is applied.

talosctl TLS errors (unknown authority, Ed25519 verification failure)

talosctl does not automatically use talos/clusterconfig/talosconfig. If you omit it, the client falls back to ~/.talos/config, which is usually a different cluster CA — you then get TLS handshake failures against the noble nodes.

Always set this in the shell where you run talosctl (use an absolute path if you change directories):

cd talos
export TALOSCONFIG="$(pwd)/clusterconfig/talosconfig"
export ENDPOINT=192.168.50.230

Sanity check (should print Talos and Kubernetes versions, not TLS errors):

talosctl -e "${ENDPOINT}" -n 192.168.50.20 version

Then use the same shell for apply-config, reboot, and health.
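An optional guard — a sketch, not part of talosctl — fails fast when the client config is missing, instead of producing a confusing TLS error later:

```shell
# Verify TALOSCONFIG is set and points at a real file before running talosctl.
talos_env_ok() {
  [ -n "${TALOSCONFIG:-}" ] || { echo "TALOSCONFIG is not set" >&2; return 1; }
  [ -f "$TALOSCONFIG" ] || { echo "no such file: $TALOSCONFIG" >&2; return 1; }
}
```

Run talos_env_ok && talosctl -e "${ENDPOINT}" -n 192.168.50.20 version in the same shell.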

If it still fails after TALOSCONFIG is set, the running cluster was likely bootstrapped with different secrets than the ones in your current talsecret.sops.yaml / regenerated clusterconfig/. In that case you need the original talosconfig that matched the cluster when it was created, or you must align secrets and cluster state (recovery / rebuild is a larger topic).

Keep talosctl roughly aligned with the node Talos version (for example v1.12.x clients for v1.12.5 nodes).

Paste tip: run one command per line. Pasting ...cp-3.yaml and talosctl on the same line breaks the filename and can confuse the shell.

More than one extra disk per node

If you add a third disk later, extend machine.disks in talconfig.yaml (for example /dev/sdc mounted at /var/mnt/longhorn-disk2) and register that path in Longhorn as an additional disk for that node.

Recommended:

  • use one dedicated filesystem per Longhorn disk path
  • avoid using the Talos system disk for heavy Longhorn data
  • spread replicas across nodes for resiliency
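If you want the replica spread to be explicit, a hypothetical custom StorageClass (not part of this repo) could pin the count; numberOfReplicas is a documented Longhorn parameter:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-3replica        # hypothetical name, not in this repo
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "3"          # Longhorn schedules these on distinct nodes
```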

11) Upgrade Talos to v1.12.x

This repo now pins:

  • talosVersion: v1.12.5 in talconfig.yaml

Regenerate configs

From talos/:

talhelper genconfig

Rolling upgrade order

Upgrade one node at a time, waiting for it to return healthy before moving on.

  1. Control plane nodes (noble-cp-1, then noble-cp-2, then noble-cp-3)
  2. Worker node (noble-worker-1)

Example commands (adjust node IP per step):

talosctl --talosconfig ./clusterconfig/talosconfig -n 192.168.50.20 upgrade --image ghcr.io/siderolabs/installer:v1.12.5
talosctl --talosconfig ./clusterconfig/talosconfig -n 192.168.50.20 reboot
talosctl --talosconfig ./clusterconfig/talosconfig -n 192.168.50.20 health
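The per-node steps can be wrapped in a small helper. This is a hypothetical convenience (upgrade_node and DRY_RUN are not talosctl features); DRY_RUN=1 only prints the commands, which is useful for double-checking IPs before touching a node:

```shell
# Print commands when DRY_RUN=1, otherwise execute them.
run() { if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi; }

# Upgrade one node ($1 = node IP, $2 = Talos version) and wait for health.
upgrade_node() {
  run talosctl --talosconfig ./clusterconfig/talosconfig -n "$1" upgrade --image "ghcr.io/siderolabs/installer:$2"
  run talosctl --talosconfig ./clusterconfig/talosconfig -n "$1" health
}
```

For example, DRY_RUN=1 upgrade_node 192.168.50.20 v1.12.5 shows the exact commands for noble-cp-1; rerun without DRY_RUN to execute, then move to the next node.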

After all nodes are upgraded, verify:

talosctl --talosconfig ./clusterconfig/talosconfig version
kubectl get nodes -o wide