home-server/talos

Talos deployment (4 nodes)

This directory contains a talhelper cluster definition for a 4-node Talos cluster:

  • 3 hybrid control-plane/worker nodes: noble-cp-1..3
  • 1 worker-only node: noble-worker-1
  • allowSchedulingOnControlPlanes: true
  • CNI: none (for Cilium via GitOps)

1) Update values for your environment

Edit talconfig.yaml:

  • endpoint (Kubernetes API VIP or LB IP)
  • each node ipAddress
  • each node installDisk (for example /dev/sda, /dev/nvme0n1)
  • talosVersion / kubernetesVersion if desired
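For orientation, an abridged talconfig.yaml might look like the following. The field names are talhelper's; the IPs, hostnames, and versions are the examples used throughout this document — substitute your own:

```yaml
clusterName: noble
talosVersion: v1.12.5
endpoint: https://192.168.50.230:6443
allowSchedulingOnControlPlanes: true
cniConfig:
  name: none            # Cilium is installed later via GitOps
nodes:
  - hostname: noble-cp-1
    ipAddress: 192.168.50.20
    controlPlane: true
    installDisk: /dev/sda
  # ...noble-cp-2 and noble-cp-3 follow the same shape...
  - hostname: noble-worker-1
    ipAddress: 192.168.50.10
    controlPlane: false
    installDisk: /dev/sda
```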

2) Generate cluster secrets and machine configs

From this directory:

talhelper gensecret > talsecret.sops.yaml
talhelper genconfig

Generated machine configs are written to clusterconfig/.

3) Apply Talos configs

Apply each node file to the matching node IP from talconfig.yaml:

talosctl apply-config --insecure -n 192.168.50.20 -f clusterconfig/noble-noble-cp-1.yaml
talosctl apply-config --insecure -n 192.168.50.30 -f clusterconfig/noble-noble-cp-2.yaml
talosctl apply-config --insecure -n 192.168.50.40 -f clusterconfig/noble-noble-cp-3.yaml
talosctl apply-config --insecure -n 192.168.50.10 -f clusterconfig/noble-noble-worker-1.yaml

4) Bootstrap the cluster

After all nodes are up, run bootstrap exactly once, against any one control-plane node:

talosctl bootstrap -n 192.168.50.20 -e 192.168.50.230
talosctl kubeconfig -n 192.168.50.20 -e 192.168.50.230 .

5) Validate

talosctl -n 192.168.50.20 -e 192.168.50.230 health
kubectl get nodes -o wide

kubectl errors: "lookup https: no such host" or https://https/...

That means the active kubeconfig has a broken cluster.server URL (often a doubled https:// or a duplicated :6443). kubectl then tries to resolve the literal hostname https, which fails.

Inspect what you are using:

kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}{"\n"}'

It must be a single valid URL, for example:

  • https://192.168.50.230:6443 (API VIP from talconfig.yaml), or
  • https://kube.noble.lab.pcenicni.dev:6443 (if DNS points at that VIP)

Fix the cluster entry (replace noble with your context's cluster name if different):

kubectl config set-cluster noble --server=https://192.168.50.230:6443

Or point kubectl at this repo's kubeconfig (known-good server line):

export KUBECONFIG="$(pwd)/kubeconfig"
kubectl cluster-info

Avoid pasting https:// twice when running kubectl config set-cluster ... --server=....
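The two malformed-URL patterns above are easy to check mechanically. A minimal sketch — check_server is a hypothetical helper, not a kubectl subcommand:

```shell
# Flag a doubled scheme or duplicated :6443 in a kubeconfig server URL.
check_server() {
  case "$1" in
    https://https*) echo "broken (double https://): $1"; return 1 ;;
    *:6443*:6443*)  echo "broken (duplicate :6443): $1"; return 1 ;;
    https://*:6443) echo "ok: $1" ;;
    *)              echo "unexpected server URL: $1"; return 1 ;;
  esac
}
```

Feed it the active kubeconfig's server line, e.g. check_server "$(kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}')".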

6) GitOps-pinned Cilium values

The Cilium settings that worked for this Talos cluster are now persisted in:

  • clusters/noble/apps/cilium/application.yaml

That Argo CD Application pins chart 1.16.6 and includes the required Helm values for this environment (API host/port, cgroup settings, IPAM CIDR, and security capabilities), so future reconciles do not drift back to defaults.

7) Argo CD app-of-apps bootstrap

This repo includes an app-of-apps structure for cluster apps:

  • Root app: clusters/noble/root-application.yaml
  • Child apps index: clusters/noble/apps/kustomization.yaml
  • Argo CD app: clusters/noble/apps/argocd/application.yaml
  • Cilium app: clusters/noble/apps/cilium/application.yaml

Bootstrap once from your workstation:

kubectl apply -k clusters/noble/bootstrap/argocd
kubectl apply -f clusters/noble/root-application.yaml

After this, Argo CD continuously reconciles all applications under clusters/noble/apps/.

8) kube-vip API VIP (192.168.50.230)

HAProxy has been removed in favor of kube-vip running on control-plane nodes.

Manifests are in:

  • clusters/noble/apps/kube-vip/application.yaml
  • clusters/noble/apps/kube-vip/vip-rbac.yaml
  • clusters/noble/apps/kube-vip/vip-daemonset.yaml

The DaemonSet advertises 192.168.50.230 in ARP mode and fronts the Kubernetes API on port 6443.
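For reference, the relevant container environment in vip-daemonset.yaml looks roughly like this. The env names are kube-vip's documented settings and the values are this cluster's; treat it as a sketch, not the exact manifest:

```yaml
env:
  - name: vip_arp       # advertise the VIP in ARP mode
    value: "true"
  - name: address       # the API VIP
    value: "192.168.50.230"
  - name: port          # fronts the Kubernetes API
    value: "6443"
  - name: cp_enable     # control-plane VIP function
    value: "true"
  - name: svc_enable    # LoadBalancer Service support (used later for Argo CD)
    value: "true"
```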

Apply manually (or let Argo CD sync from root app):

kubectl apply -k clusters/noble/apps/kube-vip

Validate:

kubectl -n kube-system get pods -l app.kubernetes.io/name=kube-vip-ds -o wide
nc -vz 192.168.50.230 6443

9) Argo CD via DNS host (no port)

Argo CD is exposed through a kube-vip managed LoadBalancer Service:

  • argo.noble.lab.pcenicni.dev

Manifests:

  • clusters/noble/bootstrap/argocd/argocd-server-lb.yaml
  • clusters/noble/apps/kube-vip/vip-daemonset.yaml (svc_enable: "true")

After syncing manifests, create a Pi-hole DNS A record:

  • argo.noble.lab.pcenicni.dev -> 192.168.50.231

10) Longhorn storage and extra disks

Longhorn is deployed from:

  • clusters/noble/apps/longhorn/application.yaml

Monitoring apps are configured to use storageClassName: longhorn, so you can persist Prometheus/Alertmanager/Loki data once Longhorn is healthy.

Argo CD: longhorn OutOfSync, Health Missing, no longhorn-role

Missing means nothing has been applied yet, or a sync never completed. The Helm chart creates ClusterRole/longhorn-role on a successful install.

  1. See the failure reason:
kubectl describe application longhorn -n argocd

Check Status → Conditions and Status → Operation State for the error (for example Helm render error, CRD apply failure, or repo-server cannot reach https://charts.longhorn.io).

  2. Trigger a sync (Argo CD UI Sync, or CLI):
argocd app sync longhorn
  3. After a good sync, confirm:
kubectl get clusterrole longhorn-role
kubectl get pods -n longhorn-system

Extra drive layout (this cluster)

Each node uses:

  • /dev/sda — Talos install disk (installDisk in talconfig.yaml)
  • /dev/sdb — dedicated Longhorn data disk

talconfig.yaml includes a global patch that partitions /dev/sdb and mounts it at /var/mnt/longhorn, which matches Longhorn defaultDataPath in the Argo Helm values.
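The global patch described above might look roughly like this in talconfig.yaml — a sketch based on talhelper's patches field and Talos's machine.disks schema; your actual file is authoritative:

```yaml
patches:
  - |-
    machine:
      disks:
        - device: /dev/sdb                  # dedicated Longhorn data disk
          partitions:
            - mountpoint: /var/mnt/longhorn # must match Longhorn defaultDataPath
```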

After editing talconfig.yaml, regenerate and apply configs:

cd talos
talhelper genconfig
# apply each node's YAML from clusterconfig/ with talosctl apply-config

Then reboot each node once so the new disk layout is applied.

talosctl TLS errors (unknown authority, Ed25519 verification failure)

talosctl does not automatically use talos/clusterconfig/talosconfig. If you omit it, the client falls back to ~/.talos/config, which is usually a different cluster CA — you then get TLS handshake failures against the noble nodes.

Always set this in the shell where you run talosctl (use an absolute path if you change directories):

cd talos
export TALOSCONFIG="$(pwd)/clusterconfig/talosconfig"
export ENDPOINT=192.168.50.230

Sanity check (should print Talos and Kubernetes versions, not TLS errors):

talosctl -e "${ENDPOINT}" -n 192.168.50.20 version

Then use the same shell for apply-config, reboot, and health.
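An optional guard — a sketch, not part of talosctl — fails fast when the client config is missing, instead of producing a confusing TLS error later:

```shell
# Verify TALOSCONFIG is set and points at a real file before running talosctl.
talos_env_ok() {
  [ -n "${TALOSCONFIG:-}" ] || { echo "TALOSCONFIG is not set" >&2; return 1; }
  [ -f "$TALOSCONFIG" ] || { echo "no such file: $TALOSCONFIG" >&2; return 1; }
}
```

Run talos_env_ok && talosctl -e "${ENDPOINT}" -n 192.168.50.20 version in the same shell.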

If it still fails after TALOSCONFIG is set, the running cluster was likely bootstrapped with different secrets than the ones in your current talsecret.sops.yaml / regenerated clusterconfig/. In that case you need the original talosconfig that matched the cluster when it was created, or you must align secrets and cluster state (recovery / rebuild is a larger topic).

Keep talosctl roughly aligned with the node Talos version (for example v1.12.x clients for v1.12.5 nodes).

Paste tip: run one command per line. Pasting ...cp-3.yaml and talosctl on the same line breaks the filename and can confuse the shell.

More than one extra disk per node

If you add a third disk later, extend machine.disks in talconfig.yaml (for example /dev/sdc mounted at /var/mnt/longhorn-disk2) and register that path in Longhorn as an additional disk for that node.

Recommended:

  • use one dedicated filesystem per Longhorn disk path
  • avoid using the Talos system disk for heavy Longhorn data
  • spread replicas across nodes for resiliency
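If you want the replica spread to be explicit, a hypothetical custom StorageClass (not part of this repo) could pin the count; numberOfReplicas is a documented Longhorn parameter:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-3replica        # hypothetical name, not in this repo
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "3"          # Longhorn schedules these on distinct nodes
```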

11) Upgrade Talos to v1.12.x

This repo now pins:

  • talosVersion: v1.12.5 in talconfig.yaml

Regenerate configs

From talos/:

talhelper genconfig

Rolling upgrade order

Upgrade one node at a time, waiting for it to return healthy before moving on.

  1. Control plane nodes (noble-cp-1, then noble-cp-2, then noble-cp-3)
  2. Worker node (noble-worker-1)

Example commands (adjust node IP per step):

talosctl --talosconfig ./clusterconfig/talosconfig -n 192.168.50.20 upgrade --image ghcr.io/siderolabs/installer:v1.12.5
talosctl --talosconfig ./clusterconfig/talosconfig -n 192.168.50.20 reboot
talosctl --talosconfig ./clusterconfig/talosconfig -n 192.168.50.20 health
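The per-node steps can be wrapped in a small helper. This is a hypothetical convenience (upgrade_node and DRY_RUN are not talosctl features); DRY_RUN=1 only prints the commands, which is useful for double-checking IPs before touching a node:

```shell
# Print commands when DRY_RUN=1, otherwise execute them.
run() { if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi; }

# Upgrade one node ($1 = node IP, $2 = Talos version) and wait for health.
upgrade_node() {
  run talosctl --talosconfig ./clusterconfig/talosconfig -n "$1" upgrade --image "ghcr.io/siderolabs/installer:$2"
  run talosctl --talosconfig ./clusterconfig/talosconfig -n "$1" health
}
```

For example, DRY_RUN=1 upgrade_node 192.168.50.20 v1.12.5 shows the exact commands for noble-cp-1; rerun without DRY_RUN to execute, then move to the next node.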

After all nodes are upgraded, verify:

talosctl --talosconfig ./clusterconfig/talosconfig version
kubectl get nodes -o wide