# Ansible — noble cluster
Automates [`talos/CLUSTER-BUILD.md`](../talos/CLUSTER-BUILD.md): optional **Talos Phase A** (genconfig → apply → bootstrap → kubeconfig), then **Phase B+** (CNI → add-ons → ingress → Argo CD → Kyverno → observability, etc.). **Argo CD** does not reconcile core charts — optional GitOps starts from an empty [`clusters/noble/apps/kustomization.yaml`](../clusters/noble/apps/kustomization.yaml).
## Order of operations
1. **From `talos/`:** `talhelper gensecret` / `talsecret` as in [`talos/README.md`](../talos/README.md) §1 (if not already done).
2. **Talos Phase A (automated):** run [`playbooks/talos_phase_a.yml`](playbooks/talos_phase_a.yml) **or** the full pipeline [`playbooks/deploy.yml`](playbooks/deploy.yml). This runs **`talhelper genconfig -o out`**, **`talosctl apply-config`** on each node, **`talosctl bootstrap`**, and **`talosctl kubeconfig`** → **`talos/kubeconfig`**.
3. **Platform stack:** [`playbooks/noble.yml`](playbooks/noble.yml) (included at the end of **`deploy.yml`**).
Your workstation must be able to reach **node IPs on the lab LAN** (Talos API **:50000** for `talosctl`, Kubernetes **:6443** for `kubectl` / Helm). If `kubectl` cannot reach the VIP (`192.168.50.230`), use `-e 'noble_k8s_api_server_override=https://<control-plane-ip>:6443'` on **`noble.yml`** (see `group_vars/all.yml`).
**One-shot full deploy** (after nodes are booted and reachable):
```bash
cd ansible
ansible-playbook playbooks/deploy.yml
```
## Deploy secrets (`.env`)
Copy **`.env.sample`** to **`.env`** at the repository root (`.env` is gitignored). At minimum set **`CLOUDFLARE_DNS_API_TOKEN`** for cert-manager DNS-01. The **cert-manager** role applies it automatically during **`noble.yml`**. See **`.env.sample`** for optional placeholders (e.g. Newt/Pangolin).
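A minimal `.env` sketch (the variable name comes from this README; the token value is a placeholder, and the optional lines are illustrative, not an exhaustive list of what `.env.sample` contains):

```bash
# .env at the repository root (gitignored)
CLOUDFLARE_DNS_API_TOKEN=<your-cloudflare-api-token>   # required for cert-manager DNS-01
# Optional placeholders: see .env.sample (e.g. Newt/Pangolin credentials)
```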
## Prerequisites
- `talosctl` (matches node Talos version), `talhelper`, `helm`, `kubectl`.
- **SOPS secrets:** `sops` and `age` on the control host if you use **`clusters/noble/secrets/`** with **`age-key.txt`** (see **`clusters/noble/secrets/README.md`**).
- **Phase A:** same LAN/VPN as nodes so **Talos :50000** and **Kubernetes :6443** are reachable (see [`talos/README.md`](../talos/README.md) §3).
- **noble.yml:** bootstrapped cluster and **`talos/kubeconfig`** (or `KUBECONFIG`).
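The kubeconfig prerequisite can be checked before a `noble.yml` run; a small sketch (the helper name `check_kubeconfig` is hypothetical, not part of this repo):

```bash
# check_kubeconfig: return 0 if the given kubeconfig file exists and is
# non-empty, 1 otherwise. Default path matches this repo's layout.
check_kubeconfig() {
  local cfg="${1:-../talos/kubeconfig}"
  if [[ -s "$cfg" ]]; then
    echo "using kubeconfig: $cfg"
    return 0
  fi
  echo "kubeconfig not found at $cfg; run Phase A first" >&2
  return 1
}
```

A non-empty file is only a sanity check, not proof the cluster is reachable; `kubectl get nodes` remains the real test.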
## Playbooks
| Playbook | Purpose |
|----------|---------|
| [`playbooks/deploy.yml`](playbooks/deploy.yml) | **Talos Phase A** then **`noble.yml`** (full automation). |
| [`playbooks/talos_phase_a.yml`](playbooks/talos_phase_a.yml) | `genconfig` → `apply-config` → `bootstrap` → `kubeconfig` only. |
| [`playbooks/noble.yml`](playbooks/noble.yml) | Helm + `kubectl` platform (after Phase A). |
| [`playbooks/post_deploy.yml`](playbooks/post_deploy.yml) | SOPS reminders and optional Argo root Application note. |
| [`playbooks/talos_bootstrap.yml`](playbooks/talos_bootstrap.yml) | **`talhelper genconfig` only** (legacy shortcut; prefer **`talos_phase_a.yml`**). |
| [`playbooks/debian_harden.yml`](playbooks/debian_harden.yml) | Baseline hardening for Debian servers (SSH/sysctl/fail2ban/unattended-upgrades). |
| [`playbooks/debian_maintenance.yml`](playbooks/debian_maintenance.yml) | Debian maintenance run (apt upgrades, autoremove/autoclean, reboot when required). |
| [`playbooks/debian_rotate_ssh_keys.yml`](playbooks/debian_rotate_ssh_keys.yml) | Rotate managed users' `authorized_keys`. |
| [`playbooks/debian_ops.yml`](playbooks/debian_ops.yml) | Convenience pipeline: harden then maintenance for Debian servers. |
| [`playbooks/proxmox_prepare.yml`](playbooks/proxmox_prepare.yml) | Configure Proxmox community repos and disable no-subscription UI warning. |
| [`playbooks/proxmox_upgrade.yml`](playbooks/proxmox_upgrade.yml) | Proxmox maintenance run (apt dist-upgrade, cleanup, reboot when required). |
| [`playbooks/proxmox_cluster.yml`](playbooks/proxmox_cluster.yml) | Create a Proxmox cluster on the master and join additional hosts. |
| [`playbooks/proxmox_ops.yml`](playbooks/proxmox_ops.yml) | Convenience pipeline: prepare, upgrade, then cluster Proxmox hosts. |
**Run the platform stack** (after Phase A, with `talos/kubeconfig` available):
```bash
cd ansible
export KUBECONFIG=/absolute/path/to/home-server/talos/kubeconfig
# noble.yml only — if VIP is unreachable from this host:
# ansible-playbook playbooks/noble.yml -e 'noble_k8s_api_server_override=https://192.168.50.20:6443'
ansible-playbook playbooks/noble.yml
ansible-playbook playbooks/post_deploy.yml
```
### Talos Phase A variables (role `talos_phase_a` defaults)
Override with `-e` when needed, e.g. **`-e noble_talos_skip_bootstrap=true`** if etcd is already initialized.
| Variable | Default | Meaning |
|----------|---------|---------|
| `noble_talos_genconfig` | `true` | Run **`talhelper genconfig -o out`** first. |
| `noble_talos_apply_mode` | `auto` | **`auto`** — **`talosctl apply-config --dry-run`** on the first node picks maintenance mode (**`--insecure`**) vs joined mode (**`TALOSCONFIG`**). **`insecure`** / **`secure`** force flow A or B from [`talos/README.md`](../talos/README.md) §2. |
| `noble_talos_skip_bootstrap` | `false` | Skip **`talosctl bootstrap`**. If etcd is **already** initialized, bootstrap is treated as a no-op (same as **`talosctl`** “etcd data directory is not empty”). |
| `noble_talos_apid_wait_delay` / `noble_talos_apid_wait_timeout` | `20` / `900` | Seconds to wait for **apid :50000** on the bootstrap node after **apply-config** (nodes reboot). Increase if bootstrap hits **connection refused** to `:50000`. |
| `noble_talos_nodes` | neon/argon/krypton/helium | IP + **`out/*.yaml`** filename — align with **`talos/talconfig.yaml`**. |
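As a hypothetical illustration of the shape a `noble_talos_nodes` override might take (node names come from the table above; the IPs, key names, and filenames below are placeholders to verify against `roles/talos_phase_a/defaults/main.yml` and `talos/talconfig.yaml`):

```yaml
# Hypothetical shape; check the role defaults for the real key names
noble_talos_nodes:
  - name: neon
    ip: 192.168.50.x        # placeholder; must match talos/talconfig.yaml
    config: out/neon.yaml   # generated by `talhelper genconfig -o out`
```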
### Tags (partial runs)
```bash
ansible-playbook playbooks/noble.yml --tags cilium,metallb
ansible-playbook playbooks/noble.yml --skip-tags newt
ansible-playbook playbooks/noble.yml --tags velero -e noble_velero_install=true -e noble_velero_s3_bucket=... -e noble_velero_s3_url=...
```
### Variables — `group_vars/all.yml` and role defaults
- **`group_vars/all.yml`:** **`noble_newt_install`**, **`noble_velero_install`**, **`noble_cert_manager_require_cloudflare_secret`**, **`noble_argocd_apply_root_application`**, **`noble_argocd_apply_bootstrap_root_application`**, **`noble_k8s_api_server_override`**, **`noble_k8s_api_server_auto_fallback`**, **`noble_k8s_api_server_fallback`**, **`noble_skip_k8s_health_check`**
- **`roles/noble_platform/defaults/main.yml`:** **`noble_apply_sops_secrets`**, **`noble_sops_age_key_file`** (SOPS secrets under **`clusters/noble/secrets/`**)
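SOPS decryption can be toggled per run with an extra-vars file; a sketch using the two variables above (the filename `sops-vars.yml` and the key path are examples, not repo conventions):

```yaml
# sops-vars.yml: example extra-vars file
noble_apply_sops_secrets: true
noble_sops_age_key_file: /absolute/path/to/age-key.txt   # example path
```

Pass it with `ansible-playbook playbooks/noble.yml -e @sops-vars.yml`.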
## Roles
| Role | Contents |
|------|----------|
| `talos_phase_a` | Talos genconfig, apply-config, bootstrap, kubeconfig |
| `helm_repos` | `helm repo add` / `update` |
| `noble_*` | Cilium, CSI Volume Snapshot CRDs + controller, metrics-server, Longhorn, MetalLB (20m Helm wait), kube-vip, Traefik, cert-manager, Newt, Argo CD, Kyverno, platform stack, Velero (optional) |
| `noble_landing_urls` | Writes **`ansible/output/noble-lab-ui-urls.md`** — URLs, service names, and (optional) Argo/Grafana passwords from Secrets |
| `noble_post_deploy` | Post-install reminders |
| `talos_bootstrap` | Genconfig-only (used by older playbook) |
| `debian_baseline_hardening` | Baseline Debian hardening (SSH policy, sysctl profile, fail2ban, unattended upgrades) |
| `debian_maintenance` | Routine Debian maintenance tasks (updates, cleanup, reboot-on-required) |
| `debian_ssh_key_rotation` | Declarative `authorized_keys` rotation for server users |
| `proxmox_baseline` | Proxmox repo prep (community repos) and no-subscription warning suppression |
| `proxmox_maintenance` | Proxmox package maintenance (dist-upgrade, cleanup, reboot-on-required) |
| `proxmox_cluster` | Proxmox cluster bootstrap/join automation using `pvecm` |
## Debian server ops quick start
These playbooks are separate from the Talos/noble flow and target hosts in `debian_servers`.
1. Copy `inventory/debian.example.yml` to `inventory/debian.yml` and update hosts/users.
2. Update `group_vars/debian_servers.yml` with your allowed SSH users and real public keys.
3. Run with the Debian inventory:
```bash
cd ansible
ansible-playbook -i inventory/debian.yml playbooks/debian_harden.yml
ansible-playbook -i inventory/debian.yml playbooks/debian_rotate_ssh_keys.yml
ansible-playbook -i inventory/debian.yml playbooks/debian_maintenance.yml
```
Or run the combined maintenance pipeline:
```bash
cd ansible
ansible-playbook -i inventory/debian.yml playbooks/debian_ops.yml
```
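A sketch of the inventory shape, assuming the standard Ansible YAML inventory layout (the group name `debian_servers` comes from this README; hostnames, IPs, and user are placeholders — copy `inventory/debian.example.yml` for the real template):

```yaml
# inventory/debian.yml: hypothetical example
all:
  children:
    debian_servers:
      hosts:
        web01:
          ansible_host: 192.168.1.10   # placeholder
          ansible_user: admin          # placeholder
```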
## Proxmox host + cluster quick start
These playbooks are separate from the Talos/noble flow and target hosts in `proxmox_hosts`.
1. Copy `inventory/proxmox.example.yml` to `inventory/proxmox.yml` and update hosts/users.
2. Update `group_vars/proxmox_hosts.yml` with your cluster name (`proxmox_cluster_name`), chosen cluster master, and root public key file paths to install.
3. First run (no SSH keys yet): use `--ask-pass` **or** set `ansible_password` (prefer Ansible Vault). Keep `ansible_ssh_common_args: "-o StrictHostKeyChecking=accept-new"` in inventory for first-contact hosts.
4. Run prepare first to install your public keys on each host, then continue:
```bash
cd ansible
ansible-playbook -i inventory/proxmox.yml playbooks/proxmox_prepare.yml --ask-pass
ansible-playbook -i inventory/proxmox.yml playbooks/proxmox_upgrade.yml
ansible-playbook -i inventory/proxmox.yml playbooks/proxmox_cluster.yml
```
After `proxmox_prepare.yml` finishes, SSH key auth should work for root (keys from `proxmox_root_authorized_key_files`), so `--ask-pass` is usually no longer needed.
If `pvecm add` still prompts for the master root password during join, set `proxmox_cluster_master_root_password` (prefer Vault) to run join non-interactively.
Changing `proxmox_cluster_name` only affects new cluster creation; it does not rename an already-created cluster.
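To keep that password out of plain text, a vault-encrypted value can be generated and pasted into `group_vars/proxmox_hosts.yml`; a sketch (the ciphertext below is a redacted placeholder):

```yaml
# group_vars/proxmox_hosts.yml (excerpt)
# Generate with: ansible-vault encrypt_string '<password>' --name proxmox_cluster_master_root_password
proxmox_cluster_master_root_password: !vault |
  $ANSIBLE_VAULT;1.1;AES256
  <redacted ciphertext>
```

Runs then need `--ask-vault-pass` (or a vault password file) so Ansible can decrypt the value at play time.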
Or run the full Proxmox pipeline:
```bash
cd ansible
ansible-playbook -i inventory/proxmox.yml playbooks/proxmox_ops.yml
```
## Migrating from Argo-managed `noble-platform`
```bash
kubectl delete application -n argocd noble-platform noble-kyverno noble-kyverno-policies --ignore-not-found
kubectl apply -f clusters/noble/bootstrap/argocd/root-application.yaml
```
Then run `playbooks/noble.yml` so Helm state matches git values.