Update Cilium application configuration to ignore differences for hubble-server-certs Secret, add Helm value files for better management, and enhance Argo CD kustomization with resource ordering and sync options.
Edit `talconfig.yaml`:

- `endpoint` (Kubernetes API VIP or LB IP)
- **`additionalApiServerCertSans`** / **`additionalMachineCertSans`**: must include the
  **same VIP** (and DNS name, if you use one) that clients and `talosctl` use —
  otherwise TLS to `https://<VIP>:6443` fails because the cert only lists node
  IPs by default. This repo sets **`192.168.50.230`** (and
  **`kube.noble.lab.pcenicni.dev`**) to match kube-vip.
- each node `ipAddress`
- each node `installDisk` (for example `/dev/sda`, `/dev/nvme0n1`)
- `talosVersion` / `kubernetesVersion` if desired
After changing SANs, run **`talhelper genconfig`**, re-**apply-config** to all
**control-plane** nodes (certs are regenerated), then refresh **`talosctl kubeconfig`**.
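Spelled out, that SAN refresh looks roughly like this (node IPs match this repo; the `noble-noble-cp-*.yaml` file names are assumed from `talhelper`'s output pattern, so check your `clusterconfig/` directory):

```bash
cd talos
talhelper genconfig   # regenerate machine configs with the new SANs
export TALOSCONFIG="$(pwd)/clusterconfig/talosconfig"
export ENDPOINT=192.168.50.230

# Re-apply to every control-plane node so new certs are issued
talosctl -e "${ENDPOINT}" apply-config -n 192.168.50.20 -f clusterconfig/noble-noble-cp-1.yaml
talosctl -e "${ENDPOINT}" apply-config -n 192.168.50.30 -f clusterconfig/noble-noble-cp-2.yaml
talosctl -e "${ENDPOINT}" apply-config -n 192.168.50.40 -f clusterconfig/noble-noble-cp-3.yaml

# Refresh the workstation kubeconfig against the regenerated certs
talosctl -e "${ENDPOINT}" -n 192.168.50.20 kubeconfig ./kubeconfig --force
```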
## 2) Generate cluster secrets and machine configs

From this directory:
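These are the same generation commands used later for rebuilds (§12.2); a minimal sketch:

```bash
cd talos
talhelper gensecret > talsecret.sops.yaml   # once; encrypt with sops before committing
talhelper genconfig                          # writes machine configs into clusterconfig/
```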
Avoid pasting `https://` twice when running `kubectl config set-cluster ... --server=...`.
### `kubectl apply` fails: `localhost:8080` / `openapi` connection refused

`kubectl` is **not** using a real cluster config; it falls back to the default
`http://localhost:8080` (no `KUBECONFIG`, empty file, or wrong file).

Fix:

```bash
cd talos
export KUBECONFIG="$(pwd)/kubeconfig"
kubectl config current-context
kubectl cluster-info
```

Then run `kubectl apply` from the **repository root** (parent of `talos/`) in
the same shell. Do **not** use a literal `cd /path/to/...` — that was only a
placeholder. Example (adjust to where you cloned this repo):

```bash
export KUBECONFIG="${HOME}/Developer/home-server/talos/kubeconfig"
```

`kubectl config set-cluster noble ...` only updates the file **`kubectl` is
actually reading** (often `~/.kube/config`). It does nothing if `KUBECONFIG`
points at another path.
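To see which file that actually is, a small helper (the function name is mine, not from this repo) reproduces kubectl's lookup order in plain shell:

```bash
# Print the kubeconfig path kubectl will actually read:
# the first entry of KUBECONFIG if set, else the default ~/.kube/config.
effective_kubeconfig() {
  local paths="${KUBECONFIG:-$HOME/.kube/config}"
  printf '%s\n' "${paths%%:*}"   # KUBECONFIG may be a colon-separated list
}

effective_kubeconfig
```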
## 6) GitOps-pinned Cilium values

The Cilium settings that worked for this Talos cluster are now persisted in:

- `clusters/noble/apps/cilium/helm-values.yaml`
- `clusters/noble/apps/cilium/application.yaml` (Helm chart + `valueFiles` from this repo)

That Argo CD `Application` pins chart `1.16.6` and uses the same values file
for API host/port, cgroup settings, IPAM CIDR, and security capabilities, so
future reconciles do not drift back to defaults.
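For orientation, the kind of values such a file pins for a Talos cluster looks roughly like this (key names follow the upstream Cilium chart; the CIDR and capability lists here are illustrative, not a copy of the repo's file):

```yaml
# Illustrative sketch of clusters/noble/apps/cilium/helm-values.yaml
k8sServiceHost: 192.168.50.230    # API host/port -> this repo's VIP
k8sServicePort: 6443
ipam:
  operator:
    clusterPoolIPv4PodCIDRList:
      - 10.244.0.0/16             # example IPAM CIDR; use the cluster's real one
operator:
  replicas: 1                     # single operator during bootstrap
cgroup:
  autoMount:
    enabled: false                # Talos manages the cgroup2 mount itself
  hostRoot: /sys/fs/cgroup
securityContext:
  capabilities:                   # Talos needs explicit capability lists
    ciliumAgent: [CHOWN, KILL, NET_ADMIN, NET_RAW, IPC_LOCK, SYS_ADMIN, SYS_RESOURCE, PERF_MON, BPF]
    cleanCiliumState: [NET_ADMIN, SYS_ADMIN, SYS_RESOURCE]
```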
### Cilium before Argo CD (`cni: none`)

This cluster uses **`cniConfig.name: none`** in `talconfig.yaml` so Talos does
not install a CNI. **Argo CD pods cannot schedule** until some CNI makes nodes
`Ready` (otherwise the `node.kubernetes.io/not-ready` taint blocks scheduling).

Install Cilium **once** with Helm from your workstation (same chart and values
Argo will manage later), **then** bootstrap Argo CD:
```bash
helm repo add cilium https://helm.cilium.io/
helm repo update
helm upgrade --install cilium cilium/cilium \
  --namespace kube-system \
  --version 1.16.6 \
  -f clusters/noble/apps/cilium/helm-values.yaml \
  --wait --timeout 10m
kubectl get nodes
kubectl wait --for=condition=Ready nodes --all --timeout=300s
```
If **`helm upgrade --install` seems stuck** after “Installing it now”, it is usually still
pulling images (`quay.io/cilium/...`) or waiting for pods to become Ready. In
another terminal run `kubectl get pods -n kube-system -w` and check for
`ImagePullBackOff`, `Pending`, or `CrashLoopBackOff`. To avoid blocking on
Helm’s wait logic, install without `--wait`, confirm Cilium pods, then continue:
```bash
helm upgrade --install cilium cilium/cilium \
  --namespace kube-system \
  --version 1.16.6 \
  -f clusters/noble/apps/cilium/helm-values.yaml
kubectl get pods -n kube-system -l app.kubernetes.io/part-of=cilium -w
```
`helm-values.yaml` sets **`operator.replicas: 1`** so the chart default (two
operators with hard anti-affinity) cannot deadlock `helm --wait` when only one
node can take the operator early in bootstrap.
If **`helm upgrade` fails** with a server-side apply conflict on
`kube-system/hubble-server-certs` owned by **`argocd-controller`**, Argo already
synced Cilium and owns that Secret’s TLS fields. The **`cilium` Application**
uses **`ignoreDifferences`** on that Secret plus the **`RespectIgnoreDifferences`**
sync option so GitOps and occasional CLI Helm runs do not fight over `.data`.
Until that manifest is applied in the cluster, either **suspend** the `cilium`
Application in Argo, or delete the Secret once (`kubectl delete secret
hubble-server-certs -n kube-system`) and re-run **`helm upgrade --install`**
before Argo reconciles again. After bootstrap, prefer syncing Cilium through
Argo (the UI, or inspect with `kubectl -n argocd get application cilium -o yaml`)
instead of ad hoc Helm, unless you suspend the app first.
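In Application terms, that arrangement looks roughly like the following fragment (illustrative; field paths may differ from the repo's actual `application.yaml`):

```yaml
# Sketch of the relevant parts of clusters/noble/apps/cilium/application.yaml
spec:
  ignoreDifferences:
    - kind: Secret
      namespace: kube-system
      name: hubble-server-certs
      jsonPointers:
        - /data          # let Helm / cert rotation own the TLS payload
  syncPolicy:
    syncOptions:
      - RespectIgnoreDifferences=true
```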
If nodes were already `Ready`, you can skip straight to section 7.

## 7) Argo CD app-of-apps bootstrap
Bootstrap once from your workstation:
```bash
kubectl apply -k clusters/noble/bootstrap/argocd
kubectl wait --for=condition=Established crd/appprojects.argoproj.io --timeout=120s
kubectl apply -f clusters/noble/bootstrap/argocd/default-appproject.yaml
kubectl apply -f clusters/noble/root-application.yaml
```
If the first command errors on `AppProject` (“no matches for kind `AppProject`”),
the CRDs were not ready yet; run the `kubectl wait` and
`kubectl apply -f .../default-appproject.yaml` lines, then continue.

After this, Argo CD continuously reconciles all applications under
`clusters/noble/apps/`.
```bash
talosctl --talosconfig ./clusterconfig/talosconfig version
kubectl get nodes -o wide
```
## 12) Destroy the cluster and rebuild from scratch

Use this when Kubernetes / etcd / Argo / Longhorn state is corrupted and you want a
**clean** cluster. This **wipes cluster state on the nodes** (etcd, workloads,
Longhorn data on cluster disks). Plan for **downtime** and **backup** anything
you must keep off-cluster first.
### 12.1 Reset every Talos node (Kubernetes is destroyed)

From `talos/` with a working **`talosconfig`** that matches the machines (same
`TALOSCONFIG` / `ENDPOINT` guidance as elsewhere in this README):

```bash
cd talos
export TALOSCONFIG="$(pwd)/clusterconfig/talosconfig"
export ENDPOINT=192.168.50.230
```
Reset **one node at a time**, waiting for each to reboot before the next. Order:
**worker first**, then **non-bootstrap control planes**, then the **bootstrap**
control plane **last** (`noble-cp-1` → `192.168.50.20`).

```bash
talosctl -e "${ENDPOINT}" -n 192.168.50.10 reset --graceful=false
talosctl -e "${ENDPOINT}" -n 192.168.50.30 reset --graceful=false
talosctl -e "${ENDPOINT}" -n 192.168.50.40 reset --graceful=false
talosctl -e "${ENDPOINT}" -n 192.168.50.20 reset --graceful=false
```
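A tiny helper for the "wait for each to reboot" step; the function is hypothetical (not in this repo) and simply polls until the node answers pings again:

```bash
# Hypothetical helper: block until a node responds to ping again after reset,
# so the next reset only starts after the previous node is back up.
wait_for_reboot() {
  local ip="$1"
  until ping -c 1 -W 2 "$ip" >/dev/null 2>&1; do
    sleep 5
  done
}

# Usage between resets, e.g.:
# talosctl -e "${ENDPOINT}" -n 192.168.50.10 reset --graceful=false
# wait_for_reboot 192.168.50.10
```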
If the API VIP is already unreachable, target the **node IP** as endpoint for that
node, for example:
`talosctl -e 192.168.50.10 -n 192.168.50.10 reset --graceful=false`.

Your workstation **`kubeconfig`** will not work for the old cluster after this;
that is expected until you bootstrap again.
### 12.2 (Optional) New cluster secrets

For a fully fresh identity (new cluster CA and `talosconfig`):

```bash
cd talos
talhelper gensecret > talsecret.sops.yaml
# encrypt / store talsecret as you usually do, then:
talhelper genconfig
```

If you **keep** the existing `talsecret.sops.yaml`, still run **`talhelper genconfig`**
so `clusterconfig/` matches what you will apply.
### 12.3 Apply configs, bootstrap, kubeconfig

Repeat **§3 Apply Talos configs** and **§4 Bootstrap the cluster** (and **§5
Validate**) from the top of this README: `apply-config` each node, then
`talosctl bootstrap`, then `talosctl kubeconfig` into `talos/kubeconfig`.
### 12.4 Redeploy GitOps (Argo CD + apps)

From your workstation (repo root), with `KUBECONFIG` pointing at the new
`talos/kubeconfig`:

```bash
# Set REPO to the directory that contains both talos/ and clusters/ (not a literal "path/to")
REPO="${HOME}/Developer/home-server"
export KUBECONFIG="${REPO}/talos/kubeconfig"
cd "${REPO}"
kubectl apply -k clusters/noble/bootstrap/argocd
kubectl apply -f clusters/noble/root-application.yaml
```

Resolve **Argo CD admin** login (secret / password reset) as needed; then let
`noble-root` sync `clusters/noble/apps/`.
## 13) Mid-rebuild issues: etcd, bootstrap, and `apply-config`

### `tls: certificate required` when using `apply-config --insecure`

After a node has **joined** the cluster, the Talos API expects **client
certificates** from your `talosconfig`. `--insecure` only applies to **maintenance
mode** (before join / after a reset).

**Do one of:**
- Apply config **with** `talosconfig` (no `--insecure`):

  ```bash
  cd talos
  export TALOSCONFIG="$(pwd)/clusterconfig/talosconfig"
  export ENDPOINT=192.168.50.230
  talosctl -e "${ENDPOINT}" apply-config -n 192.168.50.30 -f clusterconfig/noble-noble-cp-2.yaml
  ```

- Or **`talosctl reset`** that node first (see §12.1), then use
  `apply-config --insecure` again while it is in maintenance.
### `bootstrap`: `etcd data directory is not empty`

The bootstrap node (`192.168.50.20`) already has a **previous etcd** on disk (failed
or partial bootstrap). Kubernetes will not bootstrap again until that state is
**wiped**.

**Fix:** run **`talosctl reset --graceful=false`** on the **control plane nodes**
(at minimum the bootstrap node; often **all four nodes** is cleaner). See §12.1.
Then re-apply machine configs and run **`talosctl bootstrap` exactly once**.
### etcd unhealthy / “Preparing” on some control planes

Usually means **split or partial** cluster state. The reliable fix is the same
**full reset** (§12.1), then a single ordered bring-up: apply all configs →
bootstrap once → `talosctl health`.
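That ordered bring-up can be sketched as follows (IPs match this repo; the `noble-noble-*.yaml` file names are assumed from `talhelper`'s output pattern, so check your `clusterconfig/`):

```bash
cd talos
export TALOSCONFIG="$(pwd)/clusterconfig/talosconfig"

# 1) apply all configs (nodes are in maintenance mode after the reset)
talosctl -n 192.168.50.20 apply-config --insecure -f clusterconfig/noble-noble-cp-1.yaml
talosctl -n 192.168.50.30 apply-config --insecure -f clusterconfig/noble-noble-cp-2.yaml
talosctl -n 192.168.50.40 apply-config --insecure -f clusterconfig/noble-noble-cp-3.yaml
talosctl -n 192.168.50.10 apply-config --insecure -f clusterconfig/noble-noble-worker-1.yaml

# 2) bootstrap exactly once, on the bootstrap node
talosctl -e 192.168.50.20 -n 192.168.50.20 bootstrap

# 3) verify
talosctl -e 192.168.50.20 -n 192.168.50.20 health
```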