# Talos deployment (4 nodes)

This directory contains a `talhelper` cluster definition for a 4-node Talos
cluster:

- 3 hybrid control-plane/worker nodes: `noble-cp-1..3`
- 1 worker-only node: `noble-worker-1`
- `allowSchedulingOnControlPlanes: true`
- CNI: `none` (for Cilium via GitOps)
## 1) Update values for your environment

Edit `talconfig.yaml`:

- `endpoint` (Kubernetes API VIP or LB IP)
- **`additionalApiServerCertSans`** / **`additionalMachineCertSans`**: must include the
  **same VIP** (and DNS name, if you use one) that clients and `talosctl` use —
  otherwise TLS to `https://<VIP>:6443` fails because the cert only lists node
  IPs by default. This repo sets **`192.168.50.230`** (and
  **`kube.noble.lab.pcenicni.dev`**) to match kube-vip.
- each node `ipAddress`
- each node `installDisk` (for example `/dev/sda`, `/dev/nvme0n1`)
- `talosVersion` / `kubernetesVersion` if desired

After changing SANs, run **`talhelper genconfig`**, re-**apply-config** to all
**control-plane** nodes (certs are regenerated), then refresh **`talosctl kubeconfig`**.
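
You can confirm the regenerated certificate actually carries the new SANs. A minimal sketch, assuming OpenSSL 1.1.1+ (for `x509 -ext`); the helper name and the example VIP are illustrative:

```bash
# Print the SANs of the certificate served on the API endpoint.
# Usage: print_api_sans <host:port>
print_api_sans() {
  echo | openssl s_client -connect "$1" 2>/dev/null \
    | openssl x509 -noout -ext subjectAltName
}
# Example: print_api_sans 192.168.50.230:6443
# The output must include the VIP (and DNS name) your clients use.
```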

## 2) Generate cluster secrets and machine configs

From this directory:

```bash
talhelper gensecret > talsecret.sops.yaml
talhelper genconfig
```

Generated machine configs are written to `clusterconfig/`.

## 3) Apply Talos configs

Apply each node file to the matching node IP from `talconfig.yaml`:

```bash
talosctl apply-config --insecure -n 192.168.50.20 -f clusterconfig/noble-noble-cp-1.yaml
talosctl apply-config --insecure -n 192.168.50.30 -f clusterconfig/noble-noble-cp-2.yaml
talosctl apply-config --insecure -n 192.168.50.40 -f clusterconfig/noble-noble-cp-3.yaml
talosctl apply-config --insecure -n 192.168.50.10 -f clusterconfig/noble-noble-worker-1.yaml
```

## 4) Bootstrap the cluster

After all nodes are up, bootstrap exactly once against any control-plane node:

```bash
talosctl bootstrap -n 192.168.50.20 -e 192.168.50.230
talosctl kubeconfig -n 192.168.50.20 -e 192.168.50.230 .
```

## 5) Validate

```bash
talosctl -n 192.168.50.20 -e 192.168.50.230 health
kubectl get nodes -o wide
```

### `kubectl` errors: `lookup https: no such host` or `https://https/...`

This means the **active** kubeconfig has a broken `cluster.server` URL (often a
**double** `https://` or a **duplicate** `:6443`). The client then tries to resolve
the hostname `https`, which fails.

Inspect what you are using:

```bash
kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}{"\n"}'
```

It must be a **single** valid URL, for example:

- `https://192.168.50.230:6443` (API VIP from `talconfig.yaml`), or
- `https://kube.noble.lab.pcenicni.dev:6443` (if DNS points at that VIP)

Fix the cluster entry (replace `noble` with your context’s cluster name if
different):

```bash
kubectl config set-cluster noble --server=https://192.168.50.230:6443
```

Or point `kubectl` at this repo’s kubeconfig (known-good server line):

```bash
export KUBECONFIG="$(pwd)/kubeconfig"
kubectl cluster-info
```

Avoid pasting `https://` twice when running `kubectl config set-cluster ... --server=...`.
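
If you keep hitting the doubled-scheme mistake, a small helper (a sketch, assuming the URL is otherwise correct and only needs de-duplication) can normalize the value before you pass it to `--server=`:

```bash
# Collapse a repeated "https://" prefix and a repeated ":6443" suffix.
fix_server_url() {
  printf '%s\n' "$1" | sed -E 's#^(https://)+#https://#; s#(:6443)+$#:6443#'
}
# fix_server_url 'https://https://192.168.50.230:6443:6443'
#   -> https://192.168.50.230:6443
```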

### `kubectl apply` fails: `localhost:8080` / `openapi` connection refused

`kubectl` is **not** using a real cluster config; it falls back to the default
`http://localhost:8080` (no `KUBECONFIG`, an empty file, or the wrong file).

Fix:

```bash
cd talos
export KUBECONFIG="$(pwd)/kubeconfig"
kubectl config current-context
kubectl cluster-info
```

Then run `kubectl apply` from the **repository root** (the parent of `talos/`) in
the same shell. Do **not** use a literal `cd /path/to/...` — that was only a
placeholder. Example (adjust to where you cloned this repo):

```bash
export KUBECONFIG="${HOME}/Developer/home-server/talos/kubeconfig"
```

`kubectl config set-cluster noble ...` only updates the file **`kubectl` is
actually reading** (often `~/.kube/config`). It does nothing if `KUBECONFIG`
points at another path.
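
As a first approximation (ignoring that `KUBECONFIG` may be a colon-separated list), the file in effect can be printed with a tiny helper:

```bash
# Print the kubeconfig path kubectl will read: $KUBECONFIG if set,
# otherwise the default ~/.kube/config.
kubeconfig_in_use() {
  echo "${KUBECONFIG:-$HOME/.kube/config}"
}
# Example: kubeconfig_in_use
```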

## 6) GitOps-pinned Cilium values

The Cilium settings that worked for this Talos cluster are persisted in:

- `clusters/noble/apps/cilium/helm-values.yaml`
- `clusters/noble/apps/cilium/application.yaml` (Helm chart + `valueFiles` from this repo)

The Argo CD `Application` pins chart `1.16.6` and uses the same values file
for API host/port, cgroup settings, IPAM CIDR, and security capabilities.

### Cilium before Argo CD (`cni: none`)

This cluster uses **`cniConfig.name: none`** in `talconfig.yaml`, so Talos does
not install a CNI. **Argo CD pods cannot schedule** until a CNI makes the nodes
`Ready` (otherwise the `node.kubernetes.io/not-ready` taint blocks scheduling).

Install Cilium **once** with Helm from your workstation (same chart and values
Argo CD will manage later), **then** bootstrap Argo CD:

```bash
helm repo add cilium https://helm.cilium.io/
helm repo update
helm upgrade --install cilium cilium/cilium \
  --namespace kube-system \
  --version 1.16.6 \
  -f clusters/noble/apps/cilium/helm-values.yaml \
  --wait --timeout 10m
kubectl get nodes
kubectl wait --for=condition=Ready nodes --all --timeout=300s
```

If **`helm upgrade --install`** seems stuck after “Installing it now”, it is usually still
pulling images (`quay.io/cilium/...`) or waiting for pods to become Ready. In
another terminal, run `kubectl get pods -n kube-system -w` and check for
`ImagePullBackOff`, `Pending`, or `CrashLoopBackOff`. To avoid blocking on
Helm’s wait logic, install without `--wait`, confirm the Cilium pods, then continue:

```bash
helm upgrade --install cilium cilium/cilium \
  --namespace kube-system \
  --version 1.16.6 \
  -f clusters/noble/apps/cilium/helm-values.yaml
kubectl get pods -n kube-system -l app.kubernetes.io/part-of=cilium -w
```

`helm-values.yaml` sets **`operator.replicas: 1`** so that the chart default (two
operators with hard anti-affinity) cannot deadlock `helm --wait` when only one
node can take the operator early in bootstrap.

If **`helm upgrade` fails** with server-side apply conflicts mentioning
**`argocd-controller`**, Argo CD already synced Cilium and **owns those fields**
on the live objects. Clearing **`syncPolicy`** on the Application does **not**
remove that ownership; Helm keeps conflicting until you **take over** the fields
or manage Cilium only through Argo CD.

**One-shot CLI fix** (Helm 3.13+): add **`--force-conflicts`** so server-side apply
wins the disputed fields:

```bash
helm upgrade --install cilium cilium/cilium \
  --namespace kube-system \
  --version 1.16.6 \
  -f clusters/noble/apps/cilium/helm-values.yaml \
  --force-conflicts
```

Typical conflicts: the Secret **`hubble-server-certs`** (`.data` TLS keys) and the
Deployment **`cilium-operator`** (`.spec.replicas`,
`.spec.strategy.rollingUpdate.maxUnavailable`). The **`cilium` Application**
lists **`ignoreDifferences`** for those paths plus **`RespectIgnoreDifferences`**
so later Argo CD syncs do not keep overwriting them. Apply the manifest after you
change it: **`kubectl apply -f clusters/noble/apps/cilium/application.yaml`**.

After bootstrap, prefer syncing Cilium **only through Argo CD** (from Git) instead
of ad hoc Helm, unless you suspend the **`cilium`** Application first.

Shell tip: a comment line must start with a real **`#`** character. If the shell
reports **`command not found: #`**, the character is not a true hash or the line
was pasted wrong; run **`kubectl apply ...`** as its own command, without a
leading comment in the same paste block.

If the nodes were already `Ready`, you can skip straight to section 7.

## 7) Argo CD app-of-apps bootstrap

This repo includes an app-of-apps structure for cluster apps:

- Root app: `clusters/noble/root-application.yaml`
- Child apps index: `clusters/noble/apps/kustomization.yaml`
- Argo CD app: `clusters/noble/apps/argocd/application.yaml`
- Cilium app: `clusters/noble/apps/cilium/application.yaml`

Bootstrap once from your workstation:

```bash
kubectl apply -k clusters/noble/bootstrap/argocd
kubectl wait --for=condition=Established crd/appprojects.argoproj.io --timeout=120s
kubectl apply -f clusters/noble/bootstrap/argocd/default-appproject.yaml
kubectl apply -f clusters/noble/root-application.yaml
```

If the first command errors on `AppProject` (“no matches for kind `AppProject`”),
the CRDs were not ready yet; run the `kubectl wait` and
`kubectl apply -f .../default-appproject.yaml` lines, then continue.

After this, Argo CD continuously reconciles all applications under
`clusters/noble/apps/`.
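
To watch that reconciliation happen, a convenience helper (hypothetical; the sync/health columns come from the `Application` CRD's printer columns) lists all Applications:

```bash
# List Argo CD Applications with their Sync and Health status columns.
argo_apps() {
  kubectl get applications.argoproj.io -n argocd
}
# Example: argo_apps
```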

## 8) kube-vip API VIP (`192.168.50.230`)

HAProxy has been removed in favor of `kube-vip` running on the control-plane nodes.

Manifests are in:

- `clusters/noble/apps/kube-vip/application.yaml`
- `clusters/noble/apps/kube-vip/vip-rbac.yaml`
- `clusters/noble/apps/kube-vip/vip-daemonset.yaml`

The DaemonSet advertises `192.168.50.230` in ARP mode and fronts the Kubernetes
API on port `6443`.

Apply manually (or let Argo CD sync it from the root app):

```bash
kubectl apply -k clusters/noble/apps/kube-vip
```

Validate:

```bash
kubectl -n kube-system get pods -l app.kubernetes.io/name=kube-vip-ds -o wide
nc -vz 192.168.50.230 6443
```

If the **`kube-vip-ds` pods are in `CrashLoopBackOff`**, the logs usually show
`could not get link for interface '…'`. kube-vip binds the VIP to
**`vip_interface`**; on Talos the uplink is often **`eno1`**, **`enp0s…`**, or
**`enx…`**, not **`eth0`**. Inspect a control-plane node IP from `talconfig.yaml`:

```bash
talosctl -n 192.168.50.20 get links
```

Do **not** paste that command’s **table output** back into the shell: zsh runs
each line as a command (e.g. `192.168.50.20` → `command not found`), and a line
starting with **`NODE`** can be mistaken for the **`node`** binary and try to
load a file like **`NAMESPACE`** from the current directory. Also avoid pasting
the **prompt** (`(base) … %`) together with the command (a duplicated prompt causes
parse errors).

Set **`vip_interface`** in `clusters/noble/apps/kube-vip/vip-daemonset.yaml` to
that link’s **`metadata.id`**, commit, sync (or run
`kubectl apply -k clusters/noble/apps/kube-vip`), and confirm the pods go **`Running`**.

## 9) Argo CD via DNS host (no port)

Argo CD is exposed through a kube-vip-managed LoadBalancer Service:

- `argo.noble.lab.pcenicni.dev`

Manifests:

- `clusters/noble/bootstrap/argocd/argocd-server-lb.yaml`
- `clusters/noble/apps/kube-vip/vip-daemonset.yaml` (`svc_enable: "true"`)

After syncing the manifests, create a Pi-hole DNS A record:

- `argo.noble.lab.pcenicni.dev` -> `192.168.50.231`

## 10) Longhorn storage and extra disks

Longhorn is deployed from:

- `clusters/noble/apps/longhorn/application.yaml`

The monitoring apps are configured to use `storageClassName: longhorn`, so you can
persist Prometheus/Alertmanager/Loki data once Longhorn is healthy.
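
A quick readiness check before pointing the monitoring stack at Longhorn (a hypothetical helper; the StorageClass and namespace names are the ones this repo deploys):

```bash
# Confirm the Longhorn StorageClass exists and its pods are up.
longhorn_ready() {
  kubectl get storageclass longhorn \
    && kubectl -n longhorn-system get pods
}
# Example: longhorn_ready
```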

### Argo CD: `longhorn` OutOfSync, Health **Missing**, no `longhorn-role`

**Missing** means nothing has been applied yet, or a sync never completed. The
Helm chart creates `ClusterRole/longhorn-role` on a successful install.

1. See the failure reason:

   ```bash
   kubectl describe application longhorn -n argocd
   ```

   Check **Status → Conditions** and **Status → Operation State** for the error
   (for example a Helm render error, a CRD apply failure, or the repo-server not
   reaching `https://charts.longhorn.io`).

2. Trigger a sync (Argo CD UI **Sync**, or the CLI):

   ```bash
   argocd app sync longhorn
   ```

3. After a good sync, confirm:

   ```bash
   kubectl get clusterrole longhorn-role
   kubectl get pods -n longhorn-system
   ```

### Extra drive layout (this cluster)

Each node uses:

- `/dev/sda` — Talos install disk (`installDisk` in `talconfig.yaml`)
- `/dev/sdb` — dedicated Longhorn data disk

`talconfig.yaml` includes a global patch that partitions `/dev/sdb` and mounts it
at `/var/mnt/longhorn`, which matches the Longhorn `defaultDataPath` in the Argo CD
Helm values.

After editing `talconfig.yaml`, regenerate and apply the configs:

```bash
cd talos
talhelper genconfig
# apply each node’s YAML from clusterconfig/ with talosctl apply-config
```

Then reboot each node once so the new disk layout is applied.
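
One way to roll that reboot across the nodes, one at a time (a sketch; the helper name is illustrative, the IPs come from `talconfig.yaml`, and `ENDPOINT` is set as in the `TALOSCONFIG` section below):

```bash
# Reboot a node, then block on its health check before moving to the next.
reboot_node() {
  talosctl -e "${ENDPOINT:?export ENDPOINT first}" -n "$1" reboot
  talosctl -e "${ENDPOINT:?}" -n "$1" health
}
# for ip in 192.168.50.20 192.168.50.30 192.168.50.40 192.168.50.10; do reboot_node "$ip"; done
```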

### `talosctl` TLS errors (`unknown authority`, `Ed25519 verification failure`)

`talosctl` does **not** automatically use `talos/clusterconfig/talosconfig`. If you
omit it, the client falls back to **`~/.talos/config`**, which usually holds a
**different** cluster CA — you then get TLS handshake failures against the noble
nodes.

**Always** set this in the shell where you run `talosctl` (use an absolute path
if you change directories):

```bash
cd talos
export TALOSCONFIG="$(pwd)/clusterconfig/talosconfig"
export ENDPOINT=192.168.50.230
```

Sanity check (should print Talos and Kubernetes versions, not TLS errors):

```bash
talosctl -e "${ENDPOINT}" -n 192.168.50.20 version
```

Then use the same shell for `apply-config`, `reboot`, and `health`.

If it **still** fails after `TALOSCONFIG` is set, the running cluster was likely
bootstrapped with **different** secrets than the ones in your current
`talsecret.sops.yaml` / regenerated `clusterconfig/`. In that case you need the
**original** `talosconfig` that matched the cluster when it was created, or you
must align the secrets and the cluster state (recovery / rebuild is a larger topic).

Keep **`talosctl`** roughly aligned with the node Talos version (for example
`v1.12.x` clients for `v1.12.5` nodes).
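
To eyeball that alignment, you can filter the `talosctl version` output down to the version tags (a sketch; the helper name is made up, and the `Tag:` lines are what recent `talosctl` releases print for client and server):

```bash
# Show just the client and server version tags for one node.
talos_versions() {
  talosctl -e "${ENDPOINT:?}" -n 192.168.50.20 version | grep -i tag
}
# Example: talos_versions   # the two tags should be close, e.g. both v1.12.x
```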

**Paste tip:** run **one** command per line. Pasting `...cp-3.yaml` and
`talosctl` on the same line breaks the filename and can confuse the shell.

### More than one extra disk per node

If you add a third disk later, extend `machine.disks` in `talconfig.yaml` (for
example `/dev/sdc` → `/var/mnt/longhorn-disk2`) and register that path in
Longhorn as an additional disk for that node.

Recommended:

- use one dedicated filesystem per Longhorn disk path
- avoid using the Talos system disk for heavy Longhorn data
- spread replicas across nodes for resiliency
## 11) Upgrade Talos to `v1.12.x`
|
||
|
||
This repo now pins:
|
||
|
||
- `talosVersion: v1.12.5` in `talconfig.yaml`
|
||
|
||
### Regenerate configs
|
||
|
||
From `talos/`:
|
||
|
||
```bash
|
||
talhelper genconfig
|
||
```
|
||
|
||
### Rolling upgrade order
|
||
|
||
Upgrade one node at a time, waiting for it to return healthy before moving on.
|
||
|
||
1. Control plane nodes (`noble-cp-1`, then `noble-cp-2`, then `noble-cp-3`)
|
||
2. Worker node (`noble-worker-1`)
|
||
|
||
Example commands (adjust node IP per step):
|
||
|
||
```bash
|
||
talosctl --talosconfig ./clusterconfig/talosconfig -n 192.168.50.20 upgrade --image ghcr.io/siderolabs/installer:v1.12.5
|
||
talosctl --talosconfig ./clusterconfig/talosconfig -n 192.168.50.20 reboot
|
||
talosctl --talosconfig ./clusterconfig/talosconfig -n 192.168.50.20 health
|
||
```
|
||
|
||
After all nodes are upgraded, verify:
|
||
|
||
```bash
|
||
talosctl --talosconfig ./clusterconfig/talosconfig version
|
||
kubectl get nodes -o wide
|
||
```

## 12) Destroy the cluster and rebuild from scratch

Use this when the Kubernetes / etcd / Argo CD / Longhorn state is corrupted and you
want a **clean** cluster. This **wipes cluster state on the nodes** (etcd, workloads,
Longhorn data on cluster disks). Plan for **downtime** and **back up** anything
you must keep off-cluster first.

### 12.1 Reset every Talos node (Kubernetes is destroyed)

From `talos/`, with a working **`talosconfig`** that matches the machines (same
`TALOSCONFIG` / `ENDPOINT` guidance as elsewhere in this README):

```bash
cd talos
export TALOSCONFIG="$(pwd)/clusterconfig/talosconfig"
export ENDPOINT=192.168.50.230
```

Reset **one node at a time**, waiting for each to reboot before the next. Order:
the **worker** first, then the **non-bootstrap control planes**, then the
**bootstrap** control plane **last** (`noble-cp-1` → `192.168.50.20`):

```bash
talosctl -e "${ENDPOINT}" -n 192.168.50.10 reset --graceful=false
talosctl -e "${ENDPOINT}" -n 192.168.50.30 reset --graceful=false
talosctl -e "${ENDPOINT}" -n 192.168.50.40 reset --graceful=false
talosctl -e "${ENDPOINT}" -n 192.168.50.20 reset --graceful=false
```

If the API VIP is already unreachable, target the **node IP** as the endpoint for
that node, for example:
`talosctl -e 192.168.50.10 -n 192.168.50.10 reset --graceful=false`.

Your workstation **`kubeconfig`** will no longer work for the old cluster after
this; that is expected until you bootstrap again.

### 12.2 (Optional) New cluster secrets

For a fully fresh identity (new cluster CA and `talosconfig`):

```bash
cd talos
talhelper gensecret > talsecret.sops.yaml
# encrypt / store talsecret as you usually do, then:
talhelper genconfig
```

If you **keep** the existing `talsecret.sops.yaml`, still run **`talhelper genconfig`**
so that `clusterconfig/` matches what you will apply.

### 12.3 Apply configs, bootstrap, kubeconfig

Repeat **§3 Apply Talos configs** and **§4 Bootstrap the cluster** (and **§5
Validate**) from the top of this README: `apply-config` each node, then
`talosctl bootstrap`, then `talosctl kubeconfig` into `talos/kubeconfig`.

### 12.4 Redeploy GitOps (Argo CD + apps)

From your workstation (repo root), with `KUBECONFIG` pointing at the new
`talos/kubeconfig`:

```bash
# Set REPO to the directory that contains both talos/ and clusters/ (not a literal "path/to")
REPO="${HOME}/Developer/home-server"
export KUBECONFIG="${REPO}/talos/kubeconfig"
cd "${REPO}"
kubectl apply -k clusters/noble/bootstrap/argocd
kubectl apply -f clusters/noble/root-application.yaml
```

Resolve the **Argo CD admin** login (secret / password reset) as needed; then let
`noble-root` sync `clusters/noble/apps/`.
## 13) Mid-rebuild issues: etcd, bootstrap, and `apply-config`

### `tls: certificate required` when using `apply-config --insecure`

After a node has **joined** the cluster, the Talos API expects **client
certificates** from your `talosconfig`. `--insecure` only works in **maintenance**
mode (before join / after a reset).
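
A rough probe for which state a node is in (an assumption about Talos behavior, not an official API contract: a maintenance-mode node answers `--insecure` requests, a joined node rejects them and requires your `talosconfig` client certs):

```bash
# Returns success if the node accepts an unauthenticated request.
in_maintenance() {
  talosctl -n "$1" get disks --insecure >/dev/null 2>&1
}
# in_maintenance 192.168.50.30 && echo "maintenance" || echo "joined (use talosconfig)"
```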

**Do one of:**

- Apply the config **with** `talosconfig` (no `--insecure`):

  ```bash
  cd talos
  export TALOSCONFIG="$(pwd)/clusterconfig/talosconfig"
  export ENDPOINT=192.168.50.230
  talosctl -e "${ENDPOINT}" apply-config -n 192.168.50.30 -f clusterconfig/noble-noble-cp-2.yaml
  ```

- Or **`talosctl reset`** that node first (see §12.1), then use
  `apply-config --insecure` again while it is in maintenance mode.
### `bootstrap`: `etcd data directory is not empty`

The bootstrap node (`192.168.50.20`) already has a **previous etcd** on disk (a
failed or partial bootstrap). Kubernetes will not bootstrap again until that state
is **wiped**.

**Fix:** run **`talosctl reset --graceful=false`** on the **control-plane nodes**
(at minimum the bootstrap node; resetting **all four nodes** is often cleaner).
See §12.1. Then re-apply the machine configs and run **`talosctl bootstrap` exactly
once**.
### etcd unhealthy / “Preparing” on some control planes

This usually means **split or partial** cluster state. The reliable fix is the same
**full reset** (§12.1), then a single ordered bring-up: apply all configs →
bootstrap once → `talosctl health`.
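
After the bring-up, all three control planes should appear as etcd members. A sanity-check helper (illustrative; `talosctl etcd members` with the same endpoint/node conventions as above):

```bash
# List the etcd members the cluster currently knows about.
etcd_members() {
  talosctl -e "${ENDPOINT:?}" -n 192.168.50.20 etcd members
}
# Example: etcd_members   # expect noble-cp-1, noble-cp-2, noble-cp-3
```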