feat(stack): pluggable backend system with native k3s support #135

Open
bussyjd wants to merge 6 commits into integration-okr-1 from feature/k3s-backend

Conversation


bussyjd commented Feb 6, 2026

Summary

  • Introduces a Backend interface that abstracts cluster lifecycle, enabling both k3d (default) and native k3s backends (a sketch follows this list)
  • Native k3s is a prerequisite for TEE/Confidential Computing — k3d cannot provide the direct hardware access needed for AMD SEV-SNP, Intel TDX, or GPU TEE workloads
  • Fixes pre-existing helmfile template issues (eRPC secretEnv type mismatch, obol-frontend escaped quotes, .Values.* unavailable during gotmpl first-pass rendering)
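
For orientation, here is a minimal Go sketch of what a Backend interface along these lines could look like. Only the method names come from this PR's summary; the signatures and comments are assumptions.

```go
package stack

// Backend abstracts cluster lifecycle so the stack CLI can drive either
// k3d (Docker-based) or native k3s. Method names are from the PR summary;
// exact signatures are assumptions.
type Backend interface {
	Init() error              // render templates, prepare cluster config
	Up() error                // start or create the cluster
	Down() error              // stop the cluster, keeping data on disk
	Destroy() error           // remove the cluster and its data
	IsRunning() (bool, error) // liveness check (e.g. sudo kill -0 for k3s)
	DataDir() string          // backend-specific data directory
}
```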

What changed

  • Backend interface: new Backend with Init, Up, Down, Destroy, IsRunning, DataDir; k3d logic extracted into K3dBackend, new K3sBackend added
  • k3s process management: PID tracking, sudo kill -0 liveness checks (see the sketch after this list), process group signals, k3s-killall.sh cleanup, API server readiness polling
  • Helmfile templates: helmfile.yaml → helmfile.yaml.gotmpl; env vars replace .Values.* references; KUBECONFIG propagated to hooks
  • eRPC values: secretEnv changed from a nested map to {}; the secret is now injected via extraEnv with valueFrom.secretKeyRef
  • obol-frontend values: replaced {{ printf \"...\" }} with direct interpolation and single-quoted env calls
  • Tests: 26 unit tests (backend selection, PID parsing, config, Init templates) plus 10 integration test scenarios behind //go:build integration
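
The sudo kill -0 liveness check from the list above can be sketched as below; the function name and shape are assumptions, not the PR's actual code. Signal 0 delivers no signal but reports whether the PID exists and may be signalled.

```go
package stack

import (
	"os/exec"
	"strconv"
)

// k3sAlive reports whether the k3s process still exists. Signal 0
// performs the existence/permission check without delivering a signal;
// sudo is required because k3s runs as root.
func k3sAlive(pid int) bool {
	return exec.Command("sudo", "kill", "-0", strconv.Itoa(pid)).Run() == nil
}
```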

Test results

  • Unit tests: 26/26 pass with -race
  • K3s integration: 32/33 flow tests pass (11 scenarios, including init, up, kubectl, idempotent re-run, down, restart, and purge)
  • Helmfile deploy: All 10 releases succeed on k3s (base, reloader, monitoring, gateway-api-crds, traefik, cloudflared, erpc, erpc-httproute, obol-frontend, obol-frontend-httproute)

Test plan

  • Unit tests pass (go test -race ./internal/stack/)
  • K3s: stack init --backend k3s → stack up → full helmfile deploy
  • K3s: stack down → stack up restart cycle
  • K3s: stack purge --force full cleanup
  • K3d: backward compatibility (blocked by local Docker/kernel issue, not code-related)
  • Network install on k3s backend

Closes #134

bussyjd and others added 4 commits February 7, 2026 01:16

Introduce a Backend interface that abstracts cluster lifecycle management,
enabling both k3d (Docker-based, default) and k3s (native bare-metal) backends.
This is a prerequisite for TEE/Confidential Computing workloads which require
direct hardware access that k3d cannot provide.

Changes:
- Add Backend interface (Init, Up, Down, Destroy, IsRunning, DataDir)
- Extract k3d logic into K3dBackend with backward-compatible fallback
- Add K3sBackend with sudo process management, PID tracking, and
  API server readiness checks (a readiness-polling sketch follows this
  commit message)
- Convert helmfile.yaml to helmfile.yaml.gotmpl using env vars instead
  of .Values references (fixes first-pass template rendering)
- Fix eRPC secretEnv type mismatch (map vs string for b64enc)
- Fix obol-frontend escaped quotes in gotmpl expressions
- Add KUBECONFIG env var to helmfile command for hook compatibility
- Add 26 unit tests and 10 integration test scenarios

Closes #134
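
The API server readiness check mentioned above could look roughly like the following. The /readyz endpoint, TLS handling, and timings are assumptions; the commit only states that readiness is polled before proceeding.

```go
package stack

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"time"
)

// waitForAPIServer polls the k3s API server until it reports ready or
// the deadline passes. /readyz is served to anonymous clients by default.
func waitForAPIServer(addr string, timeout time.Duration) error {
	client := &http.Client{
		Timeout: 2 * time.Second,
		// The API server presents a self-signed certificate at this stage.
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		resp, err := client.Get(fmt.Sprintf("https://%s/readyz", addr))
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return nil
			}
		}
		time.Sleep(time.Second)
	}
	return fmt.Errorf("k3s API server at %s not ready after %s", addr, timeout)
}
```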

Adds a Claude Code skill (`/test-backend`) with bash scripts that
exercise the full backend lifecycle: init, up, kubectl, down, restart,
and purge for both k3d and k3s backends.

Update dependency versions to latest stable releases:
- kubectl: 1.31.0 → 1.35.0
- helm: 3.19.1 → 3.19.4
- helmfile: 1.2.2 → 1.2.3
- k9s: 0.32.5 → 0.50.18
- helm-diff: 3.9.11 → 3.14.1

k3d remains at 5.8.3 (already current).

Update documentation to reflect the upgraded dependency versions
in obolup.sh. This keeps the documentation in sync with the actual
pinned versions used by the bootstrap installer.
Comment on lines 1 to 10

- {{- $network := .Values.network -}}
+ {{- $network := env "STACK_NETWORK" | default "mainnet" -}}
+ {{- $publicDomain := env "STACK_PUBLIC_DOMAIN" | default "obol.stack" -}}
+ {{- $chainId := 1 -}} {{/* Default: mainnet */}}
+ {{- if eq $network "hoodi" -}}
+ {{- $chainId = 560048 -}}
+ {{- else if eq $network "sepolia" -}}
+ {{- $chainId = 11155111 -}}
+ {{- else if ne $network "mainnet" -}}
+ {{- fail (printf "Unknown network: %s. Supported networks: mainnet, hoodi, sepolia" $network) -}}
+ {{- end -}}
Contributor

Why do we have to choose only one L1? Can't we have all of these wired up? (Well, Hoodi and mainnet are what we host on the DV Labs side.)

Comment on lines 20 to 21
- name: stakater
url: https://stakater.github.io/stakater-charts
Contributor

What is this for?

The k3s Down() method was using kill -TERM with a negative PID (process
group kill), which could kill unrelated system processes like
systemd-logind sharing the same process group as the sudo wrapper. This
caused the entire desktop session to crash.

Changes:
- Kill only the specific sudo/k3s process, not the process group (see
  the sketch after this message)
- Remove unused Setpgid/syscall since we no longer use process groups
- Add containerd-shim cleanup fallback for binary-only k3s installs
- Add 600s helm timeout for kube-prometheus-stack deployment
- Disable admission webhook pre-install hooks that timeout on fresh k3s
- Fix flaky test: replace fixed sleep with polling loop for API shutdown
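
A sketch of the shutdown path this commit describes: SIGTERM the exact PID rather than the process group, then poll for exit instead of sleeping a fixed interval. The function name, SIGKILL escalation, and timings are assumptions.

```go
package stack

import (
	"os/exec"
	"strconv"
	"time"
)

// stopK3s terminates only the tracked k3s PID. Signalling -pid would hit
// the whole process group, which is exactly the bug this commit fixes.
func stopK3s(pid int, timeout time.Duration) error {
	p := strconv.Itoa(pid)
	if err := exec.Command("sudo", "kill", "-TERM", p).Run(); err != nil {
		return err
	}
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		// kill -0 starts failing once the PID is gone.
		if exec.Command("sudo", "kill", "-0", p).Run() != nil {
			return nil
		}
		time.Sleep(500 * time.Millisecond)
	}
	// Assumed fallback if SIGTERM is ignored within the timeout.
	return exec.Command("sudo", "kill", "-KILL", p).Run()
}
```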