Update Headlamp and Vault documentation; enhance RBAC configurations in Argo CD. Modify Headlamp README to clarify sessionTTL handling and ServiceAccount permissions. Add Cilium network policy instructions to Vault README. Update Argo CD values.yaml for default RBAC settings, ensuring local admin retains full access while new users start with read-only permissions. Reflect these changes in CLUSTER-BUILD.md.
**talos/runbooks/README.md** (new file, +11 lines)
# Noble lab — operational runbooks

Short recovery / triage notes for the **noble** Talos cluster. Deep procedures live in [`talos/README.md`](../README.md) and [`talos/CLUSTER-BUILD.md`](../CLUSTER-BUILD.md).

| Topic | Runbook |
|-------|---------|
| Kubernetes API VIP (kube-vip) | [`api-vip-kube-vip.md`](./api-vip-kube-vip.md) |
| etcd / Talos control plane | [`etcd-talos.md`](./etcd-talos.md) |
| Longhorn storage | [`longhorn.md`](./longhorn.md) |
| Vault (unseal, auth, ESO) | [`vault.md`](./vault.md) |
| RBAC (Headlamp, Argo CD) | [`rbac.md`](./rbac.md) |
**talos/runbooks/api-vip-kube-vip.md** (new file, +15 lines)
# Runbook: Kubernetes API VIP (kube-vip)

**Symptoms:** `kubectl` timeouts, `connection refused` to `https://192.168.50.230:6443`, or nodes `NotReady` while the apiserver on a node IP still works.

**Checks**

1. VIP and interface align with [`talos/talconfig.yaml`](../talconfig.yaml) (`cluster.network`, `additionalApiServerCertSans`) and [`clusters/noble/apps/kube-vip/`](../../clusters/noble/apps/kube-vip/).
2. `kubectl -n kube-system get pods -l app.kubernetes.io/name=kube-vip -o wide` — the DaemonSet pods should be **Running** on control-plane nodes.
3. From a workstation: `ping 192.168.50.230` (if ICMP is allowed) and `curl -k https://192.168.50.230:6443/healthz`, or `kubectl get --raw /healthz` with the kubeconfig `server:` set to the VIP.
4. `talosctl health` with `TALOSCONFIG` (see [`talos/README.md`](../README.md) §3).

**Common fixes**

- Wrong uplink name in kube-vip (`ens18` vs the actual interface): fix the manifest, re-apply, and verify on the node with `talosctl get links`.
- Workstation routing/DNS: use the VIP only when it is reachable; otherwise temporarily point the kubeconfig `server:` at a control-plane IP (see README §3).
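The checks above can be folded into one quick pass. A sketch in shell, assuming the VIP `192.168.50.230` and label selector from this runbook, plus a working kubeconfig and `TALOSCONFIG`:

```shell
#!/usr/bin/env bash
# Quick kube-vip triage for the noble cluster (VIP per this runbook).
set -u
VIP="192.168.50.230"

# kube-vip DaemonSet pods should be Running on each control-plane node
kubectl -n kube-system get pods -l app.kubernetes.io/name=kube-vip -o wide

# API health through the VIP (-k skips TLS verification for a quick probe)
curl -sk --max-time 5 "https://${VIP}:6443/healthz"; echo

# Talos-level view (requires TALOSCONFIG; see talos/README.md §3)
talosctl health
```

If the `curl` probe fails but step 1 shows Running pods, suspect the interface name or ARP announcement rather than the apiserver itself.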
**talos/runbooks/etcd-talos.md** (new file, +16 lines)
# Runbook: etcd / Talos control plane

**Symptoms:** API flaps, `etcd` alarms, multiple control planes `NotReady`, upgrades stuck.

**Checks**

1. `talosctl health` and `talosctl etcd status` (with `TALOSCONFIG`; target a control-plane node if needed).
2. `kubectl get nodes` — control planes **Ready**; look for disk/memory pressure.
3. Talos version skew: `talosctl version` vs the node image in [`talos/talconfig.yaml`](../talconfig.yaml) / Image Factory schematic.

**Common fixes**

- One bad control plane: cordon/drain workloads only after confirming quorum; follow the Talos maintenance docs for replace/remove.
- Disk full on the etcd volume: resolve the host disk / system partition (Talos ephemeral vs user volumes per machine config).

**References:** [Talos etcd](https://www.talos.dev/latest/advanced/etcd-maintenance/), [`talos/README.md`](../README.md).
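A minimal triage sequence for the checks above (assumes `TALOSCONFIG` targets a control-plane node and a Talos release recent enough to ship `talosctl etcd alarm list`):

```shell
# etcd / control-plane triage in one pass
talosctl etcd status        # per-member leader, DB size, learner/error state
talosctl etcd alarm list    # e.g. NOSPACE when the etcd volume fills
kubectl get nodes -o wide   # control planes should be Ready; watch for pressure conditions
talosctl version            # compare client vs node versions against talconfig.yaml
```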
**talos/runbooks/longhorn.md** (new file, +16 lines)
# Runbook: Longhorn

**Symptoms:** PVCs stuck **Pending**, volumes **Faulted**, workload I/O errors, Longhorn UI/alerts.

**Checks**

1. `kubectl -n longhorn-system get pods` and `kubectl get nodes.longhorn.io -o wide`.
2. Talos user disk + extensions for Longhorn (see [`talos/README.md`](../README.md) §5 and `talconfig.with-longhorn.yaml`).
3. `kubectl get sc` — **longhorn** is the default as expected; PVC events: `kubectl describe pvc -n <ns> <name>`.

**Common fixes**

- Node disk pressure / missing mount: fix the Talos machine config, reboot the node per Talos docs.
- Recovery / GPT wipe scripts: [`talos/scripts/longhorn-gpt-recovery.sh`](../scripts/longhorn-gpt-recovery.sh) and the CLUSTER-BUILD notes.

**References:** [`clusters/noble/apps/longhorn/`](../../clusters/noble/apps/longhorn/), [Longhorn docs](https://longhorn.io/docs/).
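The PVC-side checks above as one sequence (a sketch; `<ns>`/`<name>` stand for the affected claim):

```shell
# Walk from the stuck claim down to Longhorn's view of volumes and nodes
kubectl describe pvc -n <ns> <name>                  # provisioning events and errors
kubectl -n longhorn-system get volumes.longhorn.io   # volume state/robustness
kubectl get nodes.longhorn.io -o wide                # per-node schedulable/disk status
kubectl get sc                                       # longhorn should be marked (default)
```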
**talos/runbooks/rbac.md** (new file, +13 lines)
# Runbook: Kubernetes RBAC (noble)

**Headlamp** (`clusters/noble/apps/headlamp/values.yaml`): the chart’s **ClusterRoleBinding** uses the built-in **`edit`** ClusterRole — not **`cluster-admin`**. Break-glass changes use **`kubectl`** with an admin kubeconfig.

**Argo CD** (`clusters/noble/bootstrap/argocd/values.yaml`): **`policy.default: role:readonly`** — new OIDC/Git users get read-only unless you add **`g, <user-or-group>, role:admin`** (or another role) in **`configs.rbac.policy.csv`**. Local user **`admin`** stays **`role:admin`** via **`g, admin, role:admin`**.

**Audits**

```bash
kubectl get clusterrolebindings -o custom-columns='NAME:.metadata.name,ROLE:.roleRef.name,SA:.subjects[?(@.kind=="ServiceAccount")].name,NS:.subjects[?(@.kind=="ServiceAccount")].namespace' | grep -E 'NAME|cluster-admin|headlamp|argocd'
```

**References:** [Headlamp chart RBAC](https://github.com/kubernetes-sigs/headlamp/tree/main/charts/headlamp), [Argo CD RBAC](https://argo-cd.readthedocs.io/en/stable/operator-manual/rbac/).
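The Argo CD defaults described above correspond to a values fragment along these lines (a sketch against the Argo CD Helm chart's `configs.rbac` block; the `alice@example.com` grant is purely illustrative):

```yaml
configs:
  rbac:
    # New OIDC/Git users fall through to read-only
    policy.default: role:readonly
    policy.csv: |
      # Local admin keeps full access
      g, admin, role:admin
      # Grant additional users/groups explicitly, e.g.:
      # g, alice@example.com, role:admin
```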
**talos/runbooks/vault.md** (new file, +15 lines)
# Runbook: Vault (in-cluster)

**Symptoms:** External Secrets **not syncing**, `ClusterSecretStore` **InvalidProviderConfig**, Vault UI/API **503 sealed**, pods **CrashLoop** on auth.

**Checks**

1. `kubectl -n vault exec -i sts/vault -- vault status` — **Sealed** / **Initialized**.
2. Unseal key Secret + optional CronJob: [`clusters/noble/apps/vault/README.md`](../../clusters/noble/apps/vault/README.md), `unseal-cronjob.yaml`.
3. Kubernetes auth for ESO: [`clusters/noble/apps/vault/configure-kubernetes-auth.sh`](../../clusters/noble/apps/vault/configure-kubernetes-auth.sh) and `kubectl describe clustersecretstore vault`.
4. **Cilium** policy: if Vault is unreachable from `external-secrets`, check [`clusters/noble/apps/vault/cilium-network-policy.yaml`](../../clusters/noble/apps/vault/cilium-network-policy.yaml) and extend `ingress` for new client namespaces.

**Common fixes**

- Sealed: `vault operator unseal` or fix the auto-unseal CronJob + `vault-unseal-key` Secret.
- **403/invalid role** on ESO: re-run the Kubernetes auth setup (issuer/CA/reviewer JWT) per the README.
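For the sealed case, a manual recovery sketch (Secret/CronJob names per this runbook; `vault operator unseal` reads the key share from stdin):

```shell
# Confirm the seal state first
kubectl -n vault exec -i sts/vault -- vault status

# Manual unseal: repeat once per key share until the unseal threshold is met
kubectl -n vault exec -i sts/vault -- vault operator unseal

# Auto-unseal path: the Secret and CronJob this runbook relies on
kubectl -n vault get secret vault-unseal-key
kubectl -n vault get cronjob
```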