Enhance Longhorn application configuration by adding skipCrds option and retry settings to improve deployment resilience and error handling.

Nikholas Pcenicni
2026-03-27 17:47:54 -04:00
parent 76700a7b3f
commit 55833b2593
2 changed files with 68 additions and 1 deletion


```diff
@@ -15,6 +15,7 @@ spec:
     chart: longhorn
     targetRevision: "1.11.1"
     helm:
+      skipCrds: false
       valuesObject:
         defaultSettings:
           createDefaultDiskLabeledNodes: false
@@ -23,7 +24,12 @@ spec:
     automated:
       prune: true
       selfHeal: true
+    retry:
+      limit: 5
+      backoff:
+        duration: 20s
+        factor: 2
+        maxDuration: 3m
     syncOptions:
       - CreateNamespace=true
       - PruneLast=true
      - ServerSideApply=true
```
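The `retry` block follows Argo CD's sync-retry semantics: the first retry waits `duration`, each subsequent wait is multiplied by `factor`, and no single wait exceeds `maxDuration`, for up to `limit` retries. A quick sketch of the schedule these values produce (the loop is illustrative, not part of the chart):

```shell
# Illustrative only: the wait before each of the 5 retries
# implied by duration=20s, factor=2, maxDuration=3m (180s).
wait=20 factor=2 max=180
for attempt in 1 2 3 4 5; do
  echo "retry ${attempt}: wait ${wait}s"
  wait=$(( wait * factor ))
  if [ "$wait" -gt "$max" ]; then wait=$max; fi
done
# prints waits of 20s, 40s, 80s, 160s, 180s
```

So a transient failure (for example a slow CRD apply) gets roughly eight minutes of retries before the sync is marked failed.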


@@ -55,6 +55,39 @@ talosctl -n 192.168.50.20 -e 192.168.50.230 health
kubectl get nodes -o wide
```
### `kubectl` errors: `lookup https: no such host` or `https://https/...`
That means the **active** kubeconfig has a broken `cluster.server` URL (often a
**double** `https://` or **duplicate** `:6443`). Kubernetes then tries to resolve
the hostname `https`, which fails.
Inspect what you are using:
```bash
kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}{"\n"}'
```
It must be a **single** valid URL, for example:
- `https://192.168.50.230:6443` (API VIP from `talconfig.yaml`), or
- `https://kube.noble.lab.pcenicni.dev:6443` (if DNS points at that VIP)
Fix the cluster entry (replace `noble` with your context's cluster name if
different):
```bash
kubectl config set-cluster noble --server=https://192.168.50.230:6443
```
Or point `kubectl` at this repo's kubeconfig (known-good server line):
```bash
export KUBECONFIG="$(pwd)/kubeconfig"
kubectl cluster-info
```
Avoid pasting `https://` twice when running `kubectl config set-cluster ... --server=...`.
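Both failure shapes can be caught with plain string matching. This helper is a sketch (the function name is made up here, not a `kubectl` feature):

```shell
# Hypothetical helper: classify a kubeconfig cluster.server URL.
check_server_url() {
  case "$1" in
    *"https://https"*) echo "broken: doubled https://" ;;
    *":6443:6443"*)    echo "broken: duplicated :6443" ;;
    https://*:6443)    echo "ok" ;;
    *)                 echo "suspicious: $1" ;;
  esac
}
check_server_url "https://192.168.50.230:6443"         # ok
check_server_url "https://https//192.168.50.230:6443"  # broken: doubled https://
# To check the active kubeconfig:
#   check_server_url "$(kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}')"
```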
## 6) GitOps-pinned Cilium values
The Cilium settings that worked for this Talos cluster are now persisted in:
@@ -134,6 +167,34 @@ Longhorn is deployed from:
Monitoring apps are configured to use `storageClassName: longhorn`, so you can
persist Prometheus/Alertmanager/Loki data once Longhorn is healthy.
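For reference, a minimal claim against that class might look like this (the name and namespace are illustrative, not taken from this repo):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-data        # hypothetical
  namespace: monitoring  # hypothetical
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: longhorn
  resources:
    requests:
      storage: 1Gi
```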
### Argo CD: `longhorn` OutOfSync, Health **Missing**, no `longhorn-role`
**Missing** means nothing has been applied yet, or a sync never completed. The
Helm chart creates `ClusterRole/longhorn-role` on a successful install.
1. See the failure reason:
```bash
kubectl describe application longhorn -n argocd
```
Check **Status → Conditions** and **Status → Operation State** for the error
(for example a Helm render error, a CRD apply failure, or the repo-server
being unable to reach `https://charts.longhorn.io`).
2. Trigger a sync (Argo CD UI **Sync**, or CLI):
```bash
argocd app sync longhorn
```
3. After a good sync, confirm:
```bash
kubectl get clusterrole longhorn-role
kubectl get pods -n longhorn-system
```
### Extra drive layout (this cluster)
Each node uses: