Enhance the Longhorn application configuration by adding the `skipCrds` option and retry settings to improve deployment resilience and error handling.
```diff
@@ -15,6 +15,7 @@ spec:
     chart: longhorn
     targetRevision: "1.11.1"
     helm:
+      skipCrds: false
       valuesObject:
         defaultSettings:
           createDefaultDiskLabeledNodes: false
@@ -23,7 +24,12 @@ spec:
     automated:
       prune: true
       selfHeal: true
+    retry:
+      limit: 5
+      backoff:
+        duration: 20s
+        factor: 2
+        maxDuration: 3m
     syncOptions:
     - CreateNamespace=true
     - PruneLast=true
-    - ServerSideApply=true
```
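With `limit: 5` and `factor: 2`, the wait between failed syncs grows geometrically from `duration` up to the `maxDuration` cap. A quick sketch of the resulting schedule, assuming the usual exponential-backoff reading of these fields (start at `duration`, multiply by `factor` after each failed attempt, cap at `maxDuration`):

```shell
# Sketch: retry delays implied by the backoff settings above.
# Assumes delay starts at duration (20s), doubles per attempt, caps at 3m.
delay=20 factor=2 max=180   # duration: 20s, factor: 2, maxDuration: 3m
for attempt in 1 2 3 4 5; do   # limit: 5
  echo "retry $attempt waits ${delay}s"
  delay=$(( delay * factor ))
  (( delay > max )) && delay=$max
done
```

So a transiently failing sync is retried for roughly eight minutes in total before Argo CD gives up and reports the error.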
````diff
@@ -55,6 +55,39 @@ talosctl -n 192.168.50.20 -e 192.168.50.230 health
 kubectl get nodes -o wide
 ```
 
+### `kubectl` errors: `lookup https: no such host` or `https://https/...`
+
+That means the **active** kubeconfig has a broken `cluster.server` URL (often a
+**double** `https://` or **duplicate** `:6443`). Kubernetes then tries to resolve
+the hostname `https`, which fails.
+
+Inspect what you are using:
+
+```bash
+kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}{"\n"}'
+```
+
+It must be a **single** valid URL, for example:
+
+- `https://192.168.50.230:6443` (API VIP from `talconfig.yaml`), or
+- `https://kube.noble.lab.pcenicni.dev:6443` (if DNS points at that VIP)
+
+Fix the cluster entry (replace `noble` with your context’s cluster name if
+different):
+
+```bash
+kubectl config set-cluster noble --server=https://192.168.50.230:6443
+```
+
+Or point `kubectl` at this repo’s kubeconfig (known-good server line):
+
+```bash
+export KUBECONFIG="$(pwd)/kubeconfig"
+kubectl cluster-info
+```
+
+Avoid pasting `https://` twice when running `kubectl config set-cluster ... --server=...`.
+
 ## 6) GitOps-pinned Cilium values
 
 The Cilium settings that worked for this Talos cluster are now persisted in:
````
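The two malformations described above (a doubled `https://` scheme, a duplicated `:6443` port) are easy to check mechanically. A minimal sketch, where `check_server` is a helper name invented here; feed it the output of the `kubectl config view --minify` command shown in that section:

```shell
# Sketch: classify a kubeconfig cluster.server URL per the failure modes above.
# check_server is a hypothetical helper, not part of kubectl.
check_server() {
  case "$1" in
    https://https*|*:6443*:6443*) echo "broken: $1" ;;   # doubled scheme or port
    https://*)                    echo "ok: $1" ;;
    *)                            echo "unexpected scheme: $1" ;;
  esac
}

check_server "https://https//192.168.50.230:6443"
check_server "https://192.168.50.230:6443"
```

The first call reports `broken`, the second `ok`; anything not starting with `https://` falls through to `unexpected scheme`.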
````diff
@@ -134,6 +167,34 @@ Longhorn is deployed from:
 Monitoring apps are configured to use `storageClassName: longhorn`, so you can
 persist Prometheus/Alertmanager/Loki data once Longhorn is healthy.
 
+### Argo CD: `longhorn` OutOfSync, Health **Missing**, no `longhorn-role`
+
+**Missing** means nothing has been applied yet, or a sync never completed. The
+Helm chart creates `ClusterRole/longhorn-role` on a successful install.
+
+1. See the failure reason:
+
+   ```bash
+   kubectl describe application longhorn -n argocd
+   ```
+
+   Check **Status → Conditions** and **Status → Operation State** for the error
+   (for example Helm render error, CRD apply failure, or repo-server cannot reach
+   `https://charts.longhorn.io`).
+
+2. Trigger a sync (Argo CD UI **Sync**, or CLI):
+
+   ```bash
+   argocd app sync longhorn
+   ```
+
+3. After a good sync, confirm:
+
+   ```bash
+   kubectl get clusterrole longhorn-role
+   kubectl get pods -n longhorn-system
+   ```
+
 ### Extra drive layout (this cluster)
 
 Each node uses:
````