home-server/docs/architecture.md


# Noble platform architecture

This document describes the noble Talos lab cluster: node topology, networking, platform stack, observability, secrets/policy, and storage. Facts align with talos/CLUSTER-BUILD.md, talos/talconfig.yaml, and manifests under clusters/noble/.

## Legend

| Shape / style | Meaning |
| --- | --- |
| Subgraph “Cluster” | Kubernetes cluster boundary (noble) |
| External / DNS / cloud | Services outside the data plane (internet, registrar, Pangolin) |
| Data store | Durable data (etcd, Longhorn, Loki) |
| Secrets / policy | Secret material (SOPS in git), admission policy |
| LB / VIP | Load balancer, MetalLB assignment, or API VIP |

## Physical / node topology

Four Talos nodes on LAN 192.168.50.0/24: three control planes (neon, argon, krypton) and one worker (helium). talconfig.yaml sets allowSchedulingOnControlPlanes: true, so workloads also run on the control planes. The Kubernetes API is fronted by kube-vip on 192.168.50.230 (not a separate hardware load balancer).
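For reference, the node list above corresponds to a talhelper-style talconfig.yaml along these lines (a sketch: hostnames, IPs, the VIP endpoint, and allowSchedulingOnControlPlanes come from this document; all other fields and values are assumed):

```yaml
# Sketch of the relevant talconfig.yaml fields (talhelper format)
clusterName: noble
endpoint: https://192.168.50.230:6443   # kube-vip API VIP
allowSchedulingOnControlPlanes: true
nodes:
  - hostname: neon
    ipAddress: 192.168.50.20
    controlPlane: true
  - hostname: argon
    ipAddress: 192.168.50.30
    controlPlane: true
  - hostname: krypton
    ipAddress: 192.168.50.40
    controlPlane: true
  - hostname: helium
    ipAddress: 192.168.50.10
    controlPlane: false
```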

```mermaid
flowchart TB
  subgraph LAN["LAN 192.168.50.0/24"]
    subgraph CP["Control planes (kube-vip VIP 192.168.50.230:6443)"]
      neon["neon<br/>192.168.50.20<br/>control-plane + schedulable"]
      argon["argon<br/>192.168.50.30<br/>control-plane + schedulable"]
      krypton["krypton<br/>192.168.50.40<br/>control-plane + schedulable"]
    end
    subgraph W["Worker"]
      helium["helium<br/>192.168.50.10<br/>worker only"]
    end
    VIP["API VIP 192.168.50.230<br/>kube-vip on ens18<br/>→ apiserver :6443"]
  end
  neon --- VIP
  argon --- VIP
  krypton --- VIP
  kubectl["kubectl / talosctl clients<br/>(workstation on LAN/VPN)"] -->|"HTTPS :6443"| VIP
```

## Network and ingress

North–south (apps on LAN): DNS resolves *.apps.noble.lab.pcenicni.dev to the Traefik LoadBalancer at 192.168.50.211. MetalLB L2 pool is 192.168.50.210–192.168.50.229; Argo CD uses 192.168.50.210. Public DNS is not managed by in-cluster ExternalDNS: public access goes through Newt (Pangolin tunnel) plus the CNAME and Integration API steps per clusters/noble/bootstrap/newt/README.md.
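The MetalLB pieces described above (the noble-l2 pool from the diagram plus an L2 advertisement) would look roughly like this; the L2Advertisement name is assumed, not taken from the repo:

```yaml
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: noble-l2
  namespace: metallb-system
spec:
  addresses:
    - 192.168.50.210-192.168.50.229   # pool; .210 → Argo CD, .211 → Traefik
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: noble-l2          # name assumed
  namespace: metallb-system
spec:
  ipAddressPools:
    - noble-l2
```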

```mermaid
flowchart TB
  user["User"]
  subgraph DNS["DNS"]
    pub["Public: CNAME → Pangolin<br/>(per Newt README; not ExternalDNS)"]
    split["LAN / split horizon:<br/>*.apps.noble.lab.pcenicni.dev<br/>→ 192.168.50.211"]
  end
  subgraph LAN["LAN"]
    ML["MetalLB L2<br/>pool 192.168.50.210–229<br/>IPAddressPool noble-l2"]
    T["Traefik Service LoadBalancer<br/>192.168.50.211<br/>IngressClass: traefik"]
    Argo["Argo CD server LoadBalancer<br/>192.168.50.210"]
    Newt["Newt (Pangolin tunnel)<br/>outbound to Pangolin"]
  end
  subgraph Cluster["Cluster workloads"]
    Ing["Ingress resources<br/>cert-manager HTTP-01"]
    App["Apps / Grafana Ingress<br/>e.g. grafana.apps.noble.lab.pcenicni.dev"]
  end
  user --> pub
  user --> split
  split --> T
  pub -.->|"tunnel path"| Newt
  T --> Ing --> App
  ML --- T
  ML --- Argo
  user -->|"optional direct to LB IP"| Argo
```
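The cert-manager HTTP-01 path in the diagram implies a ClusterIssuer along these lines (a sketch; the issuer name, email, and account-key Secret name are placeholders, not taken from the repo):

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod        # hypothetical name
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com    # placeholder
    privateKeySecretRef:
      name: letsencrypt-prod-account-key   # placeholder
    solvers:
      - http01:
          ingress:
            ingressClassName: traefik   # matches the IngressClass above
```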

## Platform stack (bootstrap → workloads)

Order: Talos → Cilium (the cluster uses cni: none until the CNI is installed) → metrics-server, Longhorn, MetalLB + pool manifests, kube-vip → Traefik, cert-manager → Argo CD (Helm only; optional empty app-of-apps). Automated install: ansible/playbooks/noble.yml (see ansible/README.md). Platform namespaces include cert-manager, traefik, metallb-system, longhorn-system, monitoring, loki, logging, argocd, kyverno, newt, and others as deployed.
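The optional app-of-apps mentioned above would be a single Argo CD Application pointing at the apps directory, roughly as follows (a sketch; repoURL and targetRevision are placeholders — the actual repoURL in root-application.yaml should be confirmed):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root                    # hypothetical name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/home-server.git   # placeholder
    targetRevision: main                           # placeholder
    path: clusters/noble/apps
  destination:
    server: https://kubernetes.default.svc
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```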

```mermaid
flowchart TB
  subgraph L0["OS / bootstrap"]
    Talos["Talos v1.12.6<br/>Image Factory schematic"]
  end
  subgraph L1["CNI"]
    Cilium["Cilium<br/>(cni: none until installed)"]
  end
  subgraph L2["Core add-ons"]
    MS["metrics-server"]
    LH["Longhorn + default StorageClass"]
    MB["MetalLB + pool manifests"]
    KV["kube-vip (API VIP)"]
  end
  subgraph L3["Ingress and TLS"]
    Traefik["Traefik"]
    CM["cert-manager + ClusterIssuers"]
  end
  subgraph L4["GitOps"]
    Argo["Argo CD<br/>(optional app-of-apps; platform via Ansible)"]
  end
  subgraph L5["Platform namespaces (examples)"]
    NS["cert-manager, traefik, metallb-system,<br/>longhorn-system, monitoring, loki, logging,<br/>argocd, kyverno, newt, …"]
  end
  Talos --> Cilium --> MS
  Cilium --> LH
  Cilium --> MB
  Cilium --> KV
  MB --> Traefik
  Traefik --> CM
  CM --> Argo
  Argo --> NS
```

## Observability path

kube-prometheus-stack runs in monitoring: Prometheus, Grafana, Alertmanager, node-exporter, etc. Loki (SingleBinary mode) runs in loki, with Fluent Bit in logging shipping logs to loki-gateway. The Grafana Loki datasource is applied via the ConfigMap clusters/noble/bootstrap/grafana-loki-datasource/loki-datasource.yaml. Prometheus, Grafana, Alertmanager, and Loki use Longhorn PVCs where configured.
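The datasource ConfigMap follows the Grafana sidecar convention used by kube-prometheus-stack, roughly as below (a sketch; the ConfigMap name, namespace, and Loki URL are assumed — the real manifest lives at clusters/noble/bootstrap/grafana-loki-datasource/loki-datasource.yaml):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-datasource       # name assumed
  namespace: monitoring       # namespace assumed
  labels:
    grafana_datasource: "1"   # label the Grafana sidecar watches for
data:
  loki-datasource.yaml: |
    apiVersion: 1
    datasources:
      - name: Loki
        type: loki
        access: proxy
        url: http://loki-gateway.loki.svc.cluster.local:80   # URL assumed
```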

```mermaid
flowchart LR
  subgraph Nodes["All nodes"]
    NE["node-exporter DaemonSet"]
    FB["Fluent Bit DaemonSet<br/>namespace: logging"]
  end
  subgraph mon["monitoring"]
    PROM["Prometheus"]
    AM["Alertmanager"]
    GF["Grafana"]
    SC["ServiceMonitors / kube-state-metrics / operator"]
  end
  subgraph lok["loki"]
    LG["loki-gateway Service"]
    LO["Loki SingleBinary"]
  end
  NE --> PROM
  PROM --> GF
  AM --> GF
  FB -->|"to loki-gateway:80"| LG --> LO
  GF -->|"Explore / datasource ConfigMap<br/>grafana-loki-datasource"| LO
  subgraph PVC["Longhorn PVCs"]
    P1["Prometheus / Grafana /<br/>Alertmanager PVCs"]
    P2["Loki PVC"]
  end
  PROM --- P1
  LO --- P2
```

## Secrets and policy

Mozilla SOPS with age encrypts plain Kubernetes Secret manifests under clusters/noble/secrets/; operators decrypt at apply time (ansible/playbooks/noble.yml or sops -d … | kubectl apply). The private key is age-key.txt at the repo root (gitignored). Kyverno with kyverno-policies enforces the Pod Security Standards (PSS) baseline profile in Audit mode.
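The encryption side of this workflow is typically driven by a .sops.yaml at the repo root, along these lines (a sketch; the path regex and the age recipient are placeholders, not the repo's actual values):

```yaml
# .sops.yaml at the repo root (sketch)
creation_rules:
  - path_regex: clusters/noble/secrets/.*\.yaml$   # placeholder pattern
    encrypted_regex: ^(data|stringData)$           # encrypt only Secret payloads
    age: age1examplepublickeyplaceholder0000000000000000000000000000   # placeholder recipient
```

With this in place, `sops -e` encrypts only the data/stringData fields, leaving metadata readable for review, and `sops -d … | kubectl apply` reverses it at apply time.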

```mermaid
flowchart LR
  subgraph Git["Git repo"]
    SM["SOPS-encrypted Secret YAML<br/>clusters/noble/secrets/"]
  end
  subgraph ops["Apply path"]
    SOPS["sops -d + kubectl apply<br/>(or Ansible noble.yml)"]
  end
  subgraph cluster["Cluster"]
    K["Kyverno + kyverno-policies<br/>PSS baseline Audit"]
  end
  SM --> SOPS -->|"plain Secret"| workloads["Workload Secrets"]
  K -.->|"admission / audit<br/>(PSS baseline)"| workloads
```

## Data and storage

StorageClass: longhorn (default). Talos mounts a user volume at /var/mnt/longhorn, which is bound into Longhorn's data paths. Stateful consumers include the kube-prometheus-stack PVCs and Loki.
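A stateful consumer lands on Longhorn simply by requesting the default class, e.g. (a hypothetical PVC for illustration, not a manifest from the repo):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-data            # hypothetical name
  namespace: monitoring
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: longhorn    # default class; could also be omitted
  resources:
    requests:
      storage: 10Gi             # illustrative size
```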

```mermaid
flowchart TB
  subgraph disks["Per-node Longhorn data path"]
    UD["Talos user volume →<br/>/var/mnt/longhorn (bind to Longhorn paths)"]
  end
  subgraph LH["Longhorn"]
    SC["StorageClass: longhorn (default)"]
  end
  subgraph consumers["Stateful / durable consumers"]
    PGL["kube-prometheus-stack PVCs"]
    L["Loki PVC"]
  end
  UD --> SC
  SC --> PGL
  SC --> L
```

## Component versions

See talos/CLUSTER-BUILD.md for the authoritative checklist. Summary:

| Component | Chart / app (from CLUSTER-BUILD.md) |
| --- | --- |
| Talos / Kubernetes | v1.12.6 / 1.35.2 bundled |
| Cilium | Helm 1.16.6 |
| MetalLB | 0.15.3 |
| Longhorn | 1.11.1 |
| Traefik | 39.0.6 / app v3.6.11 |
| cert-manager | v1.20.0 |
| Argo CD | 9.4.17 / app v3.3.6 |
| kube-prometheus-stack | 82.15.1 |
| Loki / Fluent Bit | 6.55.0 / 0.56.0 |
| SOPS (client tooling) | see clusters/noble/secrets/README.md |
| Kyverno | 3.7.1 / policies 3.7.1 |
| Newt | 1.2.0 / app 1.10.1 |

## Narrative

The noble environment is a Talos lab cluster on 192.168.50.0/24 with three control-plane nodes and one worker, workloads schedulable on control planes, and the Kubernetes API exposed through kube-vip at 192.168.50.230. Cilium provides the CNI after Talos bootstraps with cni: none; MetalLB advertises 192.168.50.210–192.168.50.229, pinning Argo CD to 192.168.50.210 and Traefik to 192.168.50.211 for *.apps.noble.lab.pcenicni.dev. cert-manager issues certificates for Traefik Ingresses; GitOps is Ansible-driven Helm for the platform (clusters/noble/bootstrap/) plus an optional Argo CD app-of-apps (clusters/noble/apps/, clusters/noble/bootstrap/argocd/). Observability uses kube-prometheus-stack in monitoring and Loki with Fluent Bit, with Grafana wired to Loki via a ConfigMap datasource and Longhorn PVCs backing Prometheus, Grafana, Alertmanager, and Loki. Secrets in git use SOPS + age under clusters/noble/secrets/; Kyverno enforces the Pod Security Standards baseline profile in Audit mode. Public access uses Newt to Pangolin with the CNAME and Integration API steps as documented, not generic in-cluster public DNS.


## Assumptions and open questions

### Assumptions

- Hypervisor vs bare metal: not fixed in the inventory tables; talconfig.yaml comments mention Proxmox virtio disk paths as examples. Treat the actual host platform as TBD unless confirmed.
- Workstation path: operators reach the VIP and node IPs from the LAN or VPN per talos/README.md.
- Optional components (Headlamp, Renovate, Velero, Phase G hardening) are described in CLUSTER-BUILD.md; they are not required for the diagrams above until deployed.

### Open questions

- Split horizon: confirm whether only LAN DNS resolves *.apps.noble.lab.pcenicni.dev to 192.168.50.211, or whether public resolvers also point at that address.
- Velero / S3: optional Ansible install (noble_velero_install) from clusters/noble/bootstrap/velero/ once an S3-compatible backend and credentials exist (see talos/CLUSTER-BUILD.md Phase F).
- Argo CD: confirm the repoURL in root-application.yaml and what is actually applied on-cluster.

Keep in sync with talos/CLUSTER-BUILD.md and manifests under clusters/noble/.