# Noble platform architecture
This document describes the **noble** Talos lab cluster: node topology, networking, platform stack, observability, secrets/policy, and storage. Facts align with [`talos/CLUSTER-BUILD.md`](../talos/CLUSTER-BUILD.md), [`talos/talconfig.yaml`](../talos/talconfig.yaml), and manifests under [`clusters/noble/`](../clusters/noble/).
## Legend
| Shape / style | Meaning |
|---------------|---------|
| **Subgraph “Cluster”** | Kubernetes cluster boundary (`noble`) |
| **External / DNS / cloud** | Services outside the data plane (internet, registrar, Pangolin) |
| **Data store** | Durable data (etcd, Longhorn, Loki, Vault storage) |
| **Secrets / policy** | Secret material, Vault, admission policy |
| **LB / VIP** | Load balancer, MetalLB assignment, or API VIP |
---
## Physical / node topology
Four Talos nodes on **LAN `192.168.50.0/24`**: three control planes (**neon**, **argon**, **krypton**) and one worker (**helium**). `allowSchedulingOnControlPlanes: true` in `talconfig.yaml` lets the control planes also run workloads. The Kubernetes API is fronted by **kube-vip** on **`192.168.50.230`** (not a separate hardware load balancer).
```mermaid
flowchart TB
subgraph LAN["LAN 192.168.50.0/24"]
subgraph CP["Control planes (kube-vip VIP 192.168.50.230:6443)"]
neon["neon<br/>192.168.50.20<br/>control-plane + schedulable"]
argon["argon<br/>192.168.50.30<br/>control-plane + schedulable"]
krypton["krypton<br/>192.168.50.40<br/>control-plane + schedulable"]
end
subgraph W["Worker"]
helium["helium<br/>192.168.50.10<br/>worker only"]
end
VIP["API VIP 192.168.50.230<br/>kube-vip on ens18<br/>→ apiserver :6443"]
end
neon --- VIP
argon --- VIP
krypton --- VIP
kubectl["kubectl / talosctl clients<br/>(workstation on LAN/VPN)"] -->|"HTTPS :6443"| VIP
```
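The VIP above can be sketched as kube-vip settings (variable names follow the upstream kube-vip documentation; the exact manifest on the nodes may differ and should be checked against `talconfig.yaml`):

```yaml
# Illustrative kube-vip environment for the API VIP -- a sketch,
# not the literal manifest deployed on the control planes.
env:
  - name: vip_interface
    value: ens18                 # interface named in this doc
  - name: address
    value: 192.168.50.230        # shared API VIP
  - name: port
    value: "6443"
  - name: cp_enable              # control-plane load balancing
    value: "true"
  - name: vip_leaderelection     # one node holds the VIP at a time
    value: "true"
```

Clients (kubectl, talosctl) target the VIP rather than any single node IP, so a control-plane failure only moves the VIP instead of breaking the endpoint.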
---
## Network and ingress
**North–south (apps on LAN):** DNS for **`*.apps.noble.lab.pcenicni.dev`** resolves to the **Traefik** **`LoadBalancer`** at **`192.168.50.211`**. **MetalLB** advertises the L2 pool **`192.168.50.210`–`192.168.50.229`**; **Argo CD** is pinned to **`192.168.50.210`**. **Public** access does not use in-cluster ExternalDNS: it goes through **Newt** (Pangolin tunnel) with **CNAME** and **Integration API** steps per [`clusters/noble/apps/newt/README.md`](../clusters/noble/apps/newt/README.md).
```mermaid
flowchart TB
user["User"]
subgraph DNS["DNS"]
pub["Public: CNAME → Pangolin<br/>(per Newt README; not ExternalDNS)"]
split["LAN / split horizon:<br/>*.apps.noble.lab.pcenicni.dev<br/>→ 192.168.50.211"]
end
subgraph LAN["LAN"]
ML["MetalLB L2<br/>pool 192.168.50.210–229<br/>IPAddressPool noble-l2"]
T["Traefik Service LoadBalancer<br/>192.168.50.211<br/>IngressClass: traefik"]
Argo["Argo CD server LoadBalancer<br/>192.168.50.210"]
Newt["Newt (Pangolin tunnel)<br/>outbound to Pangolin"]
end
subgraph Cluster["Cluster workloads"]
Ing["Ingress resources<br/>cert-manager HTTP-01"]
App["Apps / Grafana Ingress<br/>e.g. grafana.apps.noble.lab.pcenicni.dev"]
end
user --> pub
user --> split
split --> T
pub -.->|"tunnel path"| Newt
T --> Ing --> App
ML --- T
ML --- Argo
user -->|"optional direct to LB IP"| Argo
```
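The MetalLB pool and its L2 advertisement can be sketched as manifests (the pool name `noble-l2` and range come from this doc; metadata details are assumptions to verify against the real pool manifests):

```yaml
# Sketch of the MetalLB L2 setup described above -- check against
# the actual manifests under clusters/noble/.
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: noble-l2
  namespace: metallb-system
spec:
  addresses:
    - 192.168.50.210-192.168.50.229
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: noble-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - noble-l2
```

Pinning a Service to a specific address from the pool (e.g. Traefik to `192.168.50.211`) is typically done with `spec.loadBalancerIP` or the `metallb.universe.tf/loadBalancerIPs` annotation on the Service.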
---
## Platform stack (bootstrap → workloads)
Order: **Talos** → **Cilium** (cluster uses `cni: none` until the CNI is installed) → **metrics-server**, **Longhorn**, **MetalLB** + pool manifests, **kube-vip** → **Traefik**, **cert-manager** → **Argo CD** (Helm + app-of-apps under `clusters/noble/bootstrap/argocd/`). Platform namespaces include `cert-manager`, `traefik`, `metallb-system`, `longhorn-system`, `monitoring`, `loki`, `logging`, `argocd`, `vault`, `external-secrets`, `sealed-secrets`, `kyverno`, `newt`, and others as deployed.
```mermaid
flowchart TB
subgraph L0["OS / bootstrap"]
Talos["Talos v1.12.6<br/>Image Factory schematic"]
end
subgraph L1["CNI"]
Cilium["Cilium<br/>(cni: none until installed)"]
end
subgraph L2["Core add-ons"]
MS["metrics-server"]
LH["Longhorn + default StorageClass"]
MB["MetalLB + pool manifests"]
KV["kube-vip (API VIP)"]
end
subgraph L3["Ingress and TLS"]
Traefik["Traefik"]
CM["cert-manager + ClusterIssuers"]
end
subgraph L4["GitOps"]
Argo["Argo CD<br/>app-of-apps under bootstrap/argocd/"]
end
subgraph L5["Platform namespaces (examples)"]
NS["cert-manager, traefik, metallb-system,<br/>longhorn-system, monitoring, loki, logging,<br/>argocd, vault, external-secrets, sealed-secrets,<br/>kyverno, newt, …"]
end
Talos --> Cilium --> MS
Cilium --> LH
Cilium --> MB
Cilium --> KV
MB --> Traefik
Traefik --> CM
CM --> Argo
Argo --> NS
```
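The app-of-apps entry point can be sketched as a root Argo CD `Application` (the `repoURL` is a placeholder — the open questions below note it still needs confirming against `root-application.yaml` — and the sync policy is an assumption):

```yaml
# Hypothetical shape of the root app-of-apps Application;
# repoURL is a placeholder, not the confirmed value.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.invalid/home-server.git  # placeholder
    targetRevision: main
    path: clusters/noble          # child Applications discovered here
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```

The root Application only points at a directory of further `Application` manifests; everything under `clusters/noble/` then reconciles from Git.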
---
## Observability path
**kube-prometheus-stack** in **`monitoring`**: Prometheus, Grafana, Alertmanager, node-exporter, etc. **Loki** (SingleBinary) in **`loki`** with **Fluent Bit** in **`logging`** shipping to **`loki-gateway`**. Grafana Loki datasource is applied via **ConfigMap** [`clusters/noble/apps/grafana-loki-datasource/loki-datasource.yaml`](../clusters/noble/apps/grafana-loki-datasource/loki-datasource.yaml). Prometheus, Grafana, Alertmanager, and Loki use **Longhorn** PVCs where configured.
```mermaid
flowchart LR
subgraph Nodes["All nodes"]
NE["node-exporter DaemonSet"]
FB["Fluent Bit DaemonSet<br/>namespace: logging"]
end
subgraph mon["monitoring"]
PROM["Prometheus"]
AM["Alertmanager"]
GF["Grafana"]
SC["ServiceMonitors / kube-state-metrics / operator"]
end
subgraph lok["loki"]
LG["loki-gateway Service"]
LO["Loki SingleBinary"]
end
NE --> PROM
PROM --> GF
AM --> GF
FB -->|"to loki-gateway:80"| LG --> LO
GF -->|"Explore / datasource ConfigMap<br/>grafana-loki-datasource"| LO
subgraph PVC["Longhorn PVCs"]
P1["Prometheus / Grafana /<br/>Alertmanager PVCs"]
P2["Loki PVC"]
end
PROM --- P1
LO --- P2
```
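The Grafana↔Loki wiring can be sketched as a sidecar-provisioned datasource ConfigMap (the real file is the linked `loki-datasource.yaml`; the namespace, label, and service URL here are assumptions following common kube-prometheus-stack conventions):

```yaml
# Sketch of the datasource ConfigMap picked up by the Grafana
# sidecar -- verify names against the linked manifest.
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-loki-datasource
  namespace: monitoring
  labels:
    grafana_datasource: "1"       # sidecar discovery label
data:
  loki-datasource.yaml: |
    apiVersion: 1
    datasources:
      - name: Loki
        type: loki
        access: proxy
        url: http://loki-gateway.loki.svc.cluster.local
```

Fluent Bit ships to the same `loki-gateway` Service, so Grafana's Explore view and the log pipeline converge on one endpoint.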
---
## Secrets and policy
**Sealed Secrets** decrypts `SealedSecret` objects in-cluster. **External Secrets Operator** syncs from **Vault** using a **`ClusterSecretStore`** (see [`examples/vault-cluster-secret-store.yaml`](../clusters/noble/apps/external-secrets/examples/vault-cluster-secret-store.yaml)). Trust flows **cluster → Vault**: ESO authenticates to Vault, and Vault never initiates connections into the cluster. **Kyverno** with **kyverno-policies** applies the **PSS baseline** profile in **Audit** mode.
```mermaid
flowchart LR
subgraph Git["Git repo"]
SSman["SealedSecret manifests<br/>(optional)"]
end
subgraph cluster["Cluster"]
SSC["Sealed Secrets controller<br/>sealed-secrets"]
ESO["External Secrets Operator<br/>external-secrets"]
V["Vault<br/>vault namespace<br/>HTTP listener"]
K["Kyverno + kyverno-policies<br/>PSS baseline Audit"]
end
SSman -->|"encrypted"| SSC -->|"decrypt to Secret"| workloads["Workload Secrets"]
ESO -->|"ClusterSecretStore →"| V
ESO -->|"sync ExternalSecret"| workloads
K -.->|"admission / audit<br/>(PSS baseline)"| workloads
```
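The ESO path can be sketched as a `ClusterSecretStore` plus one consuming `ExternalSecret` (the linked example file is authoritative; the auth method, mount paths, and secret names below are illustrative placeholders):

```yaml
# Sketch only -- auth role, paths, and names are assumptions;
# see examples/vault-cluster-secret-store.yaml for the real store.
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: vault
spec:
  provider:
    vault:
      server: http://vault.vault.svc.cluster.local:8200  # HTTP listener per this doc
      path: secret
      version: v2
      auth:
        kubernetes:
          mountPath: kubernetes
          role: external-secrets
---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: example-app
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault
  target:
    name: example-app             # resulting Kubernetes Secret
  data:
    - secretKey: password
      remoteRef:
        key: apps/example         # placeholder Vault path
        property: password
```

Sealed Secrets covers material committed to Git; ESO covers material that lives only in Vault and is synced on a refresh interval.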
---
## Data and storage
**StorageClass:** **`longhorn`** (default). Talos mounts **user volume** data at **`/var/mnt/longhorn`** (bind paths for Longhorn). Stateful consumers include **Vault**, **kube-prometheus-stack** PVCs, and **Loki**.
```mermaid
flowchart TB
subgraph disks["Per-node Longhorn data path"]
UD["Talos user volume →<br/>/var/mnt/longhorn (bind to Longhorn paths)"]
end
subgraph LH["Longhorn"]
SC["StorageClass: longhorn (default)"]
end
subgraph consumers["Stateful / durable consumers"]
V["Vault PVC data-vault-0"]
PGL["kube-prometheus-stack PVCs"]
L["Loki PVC"]
end
UD --> SC
SC --> V
SC --> PGL
SC --> L
```
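The per-node data path can be sketched as a Talos user volume (field names follow the Talos user-volume machine config documents; the disk selector and size are placeholders — the real values live in `talconfig.yaml`):

```yaml
# Sketch of a Talos user volume; a volume named "longhorn"
# mounts at /var/mnt/longhorn. Selector and size are placeholders.
apiVersion: v1alpha1
kind: UserVolumeConfig
name: longhorn
provisioning:
  diskSelector:
    match: disk.transport == "virtio"   # example expression only
  minSize: 100GiB
```

PVCs that omit `storageClassName` land on the default `longhorn` class, which replicates volumes across the node data paths above.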
---
## Component versions
See [`talos/CLUSTER-BUILD.md`](../talos/CLUSTER-BUILD.md) for the authoritative checklist. Summary:
| Component | Chart / app (from CLUSTER-BUILD.md) |
|-----------|-------------------------------------|
| Talos / Kubernetes | v1.12.6 / 1.35.2 bundled |
| Cilium | Helm 1.16.6 |
| MetalLB | 0.15.3 |
| Longhorn | 1.11.1 |
| Traefik | 39.0.6 / app v3.6.11 |
| cert-manager | v1.20.0 |
| Argo CD | 9.4.17 / app v3.3.6 |
| kube-prometheus-stack | 82.15.1 |
| Loki / Fluent Bit | 6.55.0 / 0.56.0 |
| Sealed Secrets / ESO / Vault | 2.18.4 / 2.2.0 / 0.32.0 |
| Kyverno | 3.7.1 / policies 3.7.1 |
| Newt | 1.2.0 / app 1.10.1 |
---
## Narrative
The **noble** environment is a **Talos** lab cluster on **`192.168.50.0/24`** with **three control-plane nodes and one worker**, workload scheduling enabled on the control planes, and the Kubernetes API exposed through **kube-vip** at **`192.168.50.230`**. **Cilium** provides the CNI after a Talos bootstrap with **`cni: none`**; **MetalLB** advertises **`192.168.50.210`–`192.168.50.229`**, pinning **Argo CD** to **`192.168.50.210`** and **Traefik** to **`192.168.50.211`** for **`*.apps.noble.lab.pcenicni.dev`**. **cert-manager** issues certificates for Traefik Ingresses; **GitOps** is **Helm plus Argo CD**, with manifests under **`clusters/noble/`** and bootstrap under **`clusters/noble/bootstrap/argocd/`**. **Observability** uses **kube-prometheus-stack** in **`monitoring`** plus **Loki** and **Fluent Bit**, with Grafana wired to Loki via a **ConfigMap** datasource and **Longhorn** PVCs backing Prometheus, Grafana, Alertmanager, Loki, and **Vault**. **Secrets** combine **Sealed Secrets** for git-encrypted material with **Vault** and **External Secrets** for dynamic sync, while **Kyverno** applies the **Pod Security Standards baseline** in **Audit** mode. **Public** access uses **Newt** to **Pangolin** with **CNAME** and Integration API steps as documented, not generic in-cluster public DNS.
---
## Assumptions and open questions
**Assumptions**
- **Hypervisor vs bare metal:** not recorded in the inventory tables; `talconfig.yaml` comments mention Proxmox virtio disk paths as examples, so treat the actual host platform as **TBD** unless confirmed.
- **Workstation path:** Operators reach the VIP and node IPs from the **LAN or VPN** per [`talos/README.md`](../talos/README.md).
- **Optional components** (Headlamp, Renovate, Velero, Phase G hardening) are described in CLUSTER-BUILD.md; they are not required for the diagrams above until deployed.
**Open questions**
- **Split horizon:** Confirm whether only LAN DNS resolves `*.apps.noble.lab.pcenicni.dev` to **`192.168.50.211`** or whether public resolvers also point at that address.
- **Velero / S3:** **TBD** until an S3-compatible backend is configured.
- **Argo CD:** Confirm **`repoURL`** in `root-application.yaml` and what is actually applied on-cluster.
---
*Keep in sync with [`talos/CLUSTER-BUILD.md`](../talos/CLUSTER-BUILD.md) and manifests under [`clusters/noble/`](../clusters/noble/).*