Valet (Alpha)

Control Plane Internals

Fly.io integration, routing, warm pool, and dormancy.

The control plane is a single Rust service that manages project lifecycle on Fly.io. It handles API requests, routing via fly-replay headers, warm pool management, and dormancy sweeps. It runs as two or more instances on Fly.io, using PostgreSQL advisory locks for high availability. The code lives in valet-control-plane/src/.

valet-control-plane/src/
├── main.rs — Entry point, router setup, background task spawning
├── config.rs — Config struct (clap + env vars)
├── lib.rs — AppState, module re-exports
├── fly.rs — Fly Machines API client (volumes, machines, lifecycle)
├── routing.rs — fly-replay routing, subdomain/path parsing, wake logic
├── warm_pool.rs — Pre-provisioned machine pool, claim + configure
├── dormancy.rs — Hourly sweep, R2 archival, machine/volume cleanup
├── r2.rs — Cloudflare R2 storage client
├── project_id.rs — Human-readable project ID generation
├── health.rs — Health check endpoint
├── api/
│   ├── projects.rs — POST /api/projects, GET /api/projects/:id, wake, activity
│   ├── auth.rs — Bearer token middleware (CONTROL_PLANE_SECRET)
│   ├── routing.rs — GET /api/route/:project_id (route lookup)
│   ├── rate_limit.rs — Per-IP rate limiting for project creation
│   └── error.rs — API error types
└── db/
    ├── queries.rs — PostgreSQL queries (projects, warm pool, dormancy)
    ├── models.rs — Row types (Project, WarmMachine)
    └── migrations/ — SQL migration files

Project lifecycle

Projects move through three states:

States
├── active — Fly Machine running + Volume attached
│   └── Fly auto-suspends after idle timeout
├── suspended — Machine suspended + Volume retained
│   └── Fly auto-starts on next request via fly-replay
└── dormant — No Fly resources. DB archived to R2
    └── Wake restores from R2, provisions a new machine
Transitions
├── active → suspended — Fly auto-suspend (idle timeout)
├── suspended → active — Fly auto-start (next request)
├── active → dormant — Dormancy sweep (inactive > 7 days)
└── dormant → active — Wake (restore from R2, provision machine)
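As a compact reference, the state machine above can be sketched in Rust. This is illustrative only; the actual service stores the state as a string column in PostgreSQL, and the enum and event names here are invented for the sketch.

```rust
// Illustrative sketch of the three project states and their legal
// transitions. Names mirror this document, not the real control-plane types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ProjectState {
    Active,    // Fly Machine running, Volume attached
    Suspended, // Machine suspended, Volume retained
    Dormant,   // no Fly resources; DB archived to R2
}

#[derive(Debug, Clone, Copy)]
enum Event {
    IdleTimeout,   // Fly auto-suspend
    Request,       // Fly auto-start via fly-replay
    DormancySweep, // inactive > 7 days
    Wake,          // restore from R2, provision machine
}

fn transition(state: ProjectState, event: Event) -> Option<ProjectState> {
    use Event::*;
    use ProjectState::*;
    match (state, event) {
        (Active, IdleTimeout) => Some(Suspended),
        (Suspended, Request) => Some(Active),
        (Active, DormancySweep) => Some(Dormant),
        (Dormant, Wake) => Some(Active),
        _ => None, // every other combination is not a valid transition
    }
}
```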

fly.rs

Thin wrapper around the Fly.io Machines REST API (https://api.machines.dev). Handles volumes, machines, and their lifecycle.

FlyClient
├── base_url: String (https://api.machines.dev)
├── app_name: String (from VALET_FLY_APP)
├── token: String (from VALET_FLY_TOKEN)
└── client: reqwest::Client
Volumes
├── create_volume(name, region, size_gb) → Volume
└── delete_volume(volume_id)
Machines
├── create_machine(config, region, skip_launch) → Machine
├── update_machine(machine_id, config) → Machine
├── delete_machine(machine_id)
├── get_machine(machine_id) → Machine
├── list_machines() → Vec<Machine>
├── start_machine(machine_id)
├── stop_machine(machine_id)
├── wait_for_machine(machine_id, timeout)
└── wait_for_state(machine_id, state, timeout)
MachineConfig
├── image: String
├── size: Option<String>
├── env: Option<HashMap<String, String>>
├── mounts: Option<Vec<MachineMount>>
├── guest: Option<GuestConfig>
├── services: Option<Vec<ServiceConfig>>
├── restart: Option<RestartPolicy>
└── auto_destroy: Option<bool>

All requests use bearer token auth. Errors are captured as FlyError::Api { status, body } or FlyError::Http.
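A hedged sketch of the two pieces described above: building a Machines API URL (assuming the standard /v1/apps/{app}/machines/{id} path) and classifying a response into the FlyError shape. This is not the real fly.rs, which wraps reqwest; the app and machine names in the checks are made up.

```rust
// Illustrative sketch of the FlyError mapping. The real client issues the
// HTTP request via reqwest; this pure function only shows the classification.
#[derive(Debug, PartialEq)]
enum FlyError {
    Api { status: u16, body: String }, // Fly returned a non-2xx response
    Http,                              // transport-level failure
}

// Machines API URL for a single machine, e.g. GET /v1/apps/{app}/machines/{id}.
fn machine_url(base: &str, app: &str, machine_id: &str) -> String {
    format!("{base}/v1/apps/{app}/machines/{machine_id}")
}

// A 2xx status yields the body; anything else becomes FlyError::Api.
fn check_response(status: u16, body: &str) -> Result<String, FlyError> {
    if (200..300).contains(&status) {
        Ok(body.to_string())
    } else {
        Err(FlyError::Api { status, body: body.to_string() })
    }
}
```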


routing.rs

Handles all incoming requests that are not matched by explicit API routes. Parses the project ID, looks up the project state, and returns a fly-replay header to route the request to the correct Fly Machine.

handle_request(state, request) — catch-all handler
├── /health → 200 OK
├── OPTIONS → CORS preflight response
├── Parse project ID:
│   ├── Subdomain: demo.fly.valet.host → "demo"
│   └── Path: /projects/demo/ws → "demo"
├── Look up project in PostgreSQL
├── If state == "active":
│   └── Return fly-replay: instance=<machine_id>;app=<fly_app>
├── If state == "dormant":
│   ├── wake_dormant_project(state, project_id)
│   │   ├── Acquire per-project wake lock
│   │   ├── Claim machine from warm pool (or cold-create)
│   │   ├── Restore from R2 if r2_key exists
│   │   ├── Update DB state to active
│   │   └── Trigger async pool replenishment
│   └── Return fly-replay header
└── Otherwise → error response

The fly-replay header tells Fly's edge proxy to replay the request to the specified Machine instance in the target app. The format is instance={machine_id};app={app_name}, where app is needed because the server machines live in a different Fly app than the control plane.

Subdomain parsing accepts only single-level subdomains (no a.b.fly.valet.host). The ROUTER_DOMAIN env var controls the base domain for subdomain routing.
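The parsing and header rules above can be sketched as pure functions. This is a simplification, not the actual routing.rs; the hostnames, app name, and machine ID in the checks are examples only.

```rust
// Extract a project ID from either the Host header (exactly one subdomain
// level below ROUTER_DOMAIN) or the request path (/projects/<id>/...).
fn parse_project_id(host: &str, path: &str, router_domain: &str) -> Option<String> {
    // Subdomain form: demo.fly.valet.host → "demo" (reject a.b.fly.valet.host)
    if let Some(sub) = host.strip_suffix(&format!(".{router_domain}")) {
        if !sub.is_empty() && !sub.contains('.') {
            return Some(sub.to_string());
        }
    }
    // Path form: /projects/demo/ws → "demo"
    let mut parts = path.trim_start_matches('/').splitn(3, '/');
    if parts.next() == Some("projects") {
        if let Some(id) = parts.next() {
            if !id.is_empty() {
                return Some(id.to_string());
            }
        }
    }
    None
}

// fly-replay value: instance={machine_id};app={app_name}. The app part is
// required because the server machines live in a different Fly app.
fn fly_replay_value(machine_id: &str, app_name: &str) -> String {
    format!("instance={machine_id};app={app_name}")
}
```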


warm_pool.rs

Maintains a pool of pre-provisioned, running Fly Machines with attached volumes. When a project is created or woken from dormancy, a warm machine is claimed and configured in place via POST /internal/configure on the running machine; no restart is needed.

replenish_pool(pool, fly, config)
├── reconcile() — remove stale DB rows (machines no longer in Fly)
├── Count current warm machines in target region
├── If current < target (default 2):
│   ├── Create volume (1 GB)
│   ├── Create machine in started state (with VALET_INTERNAL_SECRET)
│   ├── Wait for machine to reach "started"
│   └── Register in warm_pool table
└── Runs every 30s (WARM_POOL_REPLENISH_INTERVAL_SECS)
claim_and_configure(pool, fly, config, region, project_id, deploy_key)
├── Try warm pool: claim_warm_machine(pool, region)
│   ├── Found: configure_machine() via POST /internal/configure
│   │   └── Sends project_id + deploy_key to running machine
│   └── On failure: cleanup + trigger pool replenishment
└── Cold path (no warm machines available):
    ├── Create volume
    ├── Create machine with project env vars
    └── Wait for "started" state
Timing
├── Warm claim: ~800ms (machine already running)
└── Cold create: ~4s (volume + machine + startup)
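The claim-or-cold-create fallback and the replenishment count reduce to simple arithmetic. A sketch with illustrative names (these are not the actual warm_pool.rs signatures):

```rust
// Which provisioning path a new or waking project takes.
#[derive(Debug, PartialEq)]
enum ProvisionPath {
    Warm, // claim a pool machine + configure in place (~800ms)
    Cold, // create volume + machine and wait for start (~4s)
}

// claim_and_configure falls back to the cold path when the pool is empty.
fn choose_path(warm_available: usize) -> ProvisionPath {
    if warm_available > 0 { ProvisionPath::Warm } else { ProvisionPath::Cold }
}

// replenish_pool creates enough machines to bring the region back to
// WARM_POOL_SIZE; saturating_sub avoids underflow if the pool is over target.
fn machines_to_create(current: usize, target: usize) -> usize {
    target.saturating_sub(current)
}
```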

Internal requests to machines use Fly's private 6PN networking: http://{machine_id}.vm.{app}.internal:3000. Requests retry up to 5 times with exponential backoff.
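A minimal sketch of that pattern: the 6PN URL format and a retry loop capped at 5 attempts. The 100 ms base delay is an assumption, and the sleep is injected here so the policy stays testable; the real code presumably awaits something like tokio::time::sleep between attempts.

```rust
use std::time::Duration;

// 6PN private-network address of a machine's internal server.
fn internal_url(machine_id: &str, app: &str) -> String {
    format!("http://{machine_id}.vm.{app}.internal:3000")
}

// Retry `op` up to `max_attempts` times, doubling the delay after each
// failure (100ms, 200ms, 400ms, ...). Returns the last error if all fail.
fn retry_with_backoff<T, E>(
    max_attempts: u32,
    mut op: impl FnMut() -> Result<T, E>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(v) => return Ok(v),
            Err(e) if attempt + 1 >= max_attempts => return Err(e),
            Err(_) => {
                sleep(Duration::from_millis(100 * 2u64.pow(attempt)));
                attempt += 1;
            }
        }
    }
}
```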


dormancy.rs

Hourly sweep that archives inactive projects to Cloudflare R2 and releases their Fly resources.

run_dormancy_sweep(pool, fly, config)
├── Query for active projects with last_active_at < threshold
│   (default: 7 days, configurable via DORMANCY_THRESHOLD_DAYS)
├── For each candidate:
│   ├── Start the machine (if suspended)
│   ├── Wait for "started" state
│   ├── POST /internal/archive-to-r2 on the machine
│   │   └── Returns r2_key + r2_checksum
│   ├── Update DB: set state = dormant, store r2_key + r2_checksum
│   ├── Delete Fly Machine
│   └── Delete Fly Volume
└── Returns count of archived projects

The archive endpoint is implemented in valet-server and uploads the project's SQLite database to R2. The control plane only orchestrates the process.
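The candidacy test the sweep applies can be sketched as a pure predicate. The real check runs as a PostgreSQL query against last_active_at; this version uses std::time only for self-containment.

```rust
use std::time::{Duration, SystemTime};

// True when the project has been idle longer than the dormancy threshold
// (DORMANCY_THRESHOLD_DAYS, default 7).
fn is_dormancy_candidate(last_active_at: SystemTime, now: SystemTime, threshold_days: u64) -> bool {
    match now.duration_since(last_active_at) {
        Ok(idle) => idle > Duration::from_secs(threshold_days * 24 * 60 * 60),
        Err(_) => false, // last_active_at is in the future; not a candidate
    }
}
```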


Multi-instance safety

The control plane runs as 2+ Fly instances for availability. PostgreSQL advisory locks prevent concurrent execution of background tasks:

Advisory locks
├── Lock #100 — Warm pool replenishment (every 30s)
│   └── pg_try_advisory_lock(100) — non-blocking, skip if held
├── Lock #101 — Dormancy sweep (every hour)
│   └── pg_try_advisory_lock(101) — non-blocking, skip if held
└── Per-project wake locks — prevent concurrent wake of same project
    └── Acquired/released via queries::try_acquire_wake_lock

All instances can serve API requests and routing; only one instance runs each background task at a time.
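The lock calls themselves are plain PostgreSQL built-ins; a sketch of the session-level pattern, using lock keys 100 and 101 as above:

```sql
-- Non-blocking acquisition: returns true only if this session got the lock.
SELECT pg_try_advisory_lock(100);   -- warm pool replenishment
SELECT pg_try_advisory_lock(101);   -- dormancy sweep

-- Release when the task finishes. Session-level advisory locks are also
-- dropped when the connection ends, so a crashed instance cannot hold
-- them forever.
SELECT pg_advisory_unlock(100);
```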


Background tasks

Spawned in main.rs via tokio::spawn:

1. Warm pool replenisher
   ├── Runs immediately on startup, then every 30s
   ├── Acquires advisory lock #100
   └── Calls warm_pool::replenish_pool()
2. Dormancy sweep
   ├── Runs every hour
   ├── Acquires advisory lock #101
   └── Calls dormancy::run_dormancy_sweep()

API routes

Public (no auth)
├── POST /api/projects — Create project (rate limited: 3 burst, 6/15min, 15/hr per IP)
├── GET /api/projects/:id — Get project info
└── GET /api/route/:project_id — Route lookup (state + machine_id)
Protected (requires CONTROL_PLANE_SECRET)
├── POST /api/projects/:id/wake — Wake a dormant project
└── POST /api/projects/:id/activity — Touch last_active_at
Catch-all (routing)
└── * — fly-replay routing to project machines

Configuration

All configuration is via environment variables (parsed by clap):

Core
├── CONTROL_PLANE_PORT (default 7000)
├── CONTROL_PLANE_HOST (default 0.0.0.0)
├── DATABASE_URL — PostgreSQL connection string
├── CONTROL_PLANE_SECRET — Shared secret for protected API endpoints
└── CONTROL_PLANE_INTERNAL_SECRET — Override for internal machine auth (falls back to CP secret)
Fly.io
├── VALET_FLY_TOKEN — Fly API token
├── VALET_FLY_APP — Fly app name for project machines
├── VALET_FLY_REGION (default sjc)
├── VALET_FLY_VM_SIZE (default shared-cpu-1x)
├── VALET_FLY_VM_MEMORY_MB (default 256)
└── VALET_FLY_IMAGE — Docker image for project machines
Warm Pool
├── WARM_POOL_SIZE (default 2)
└── WARM_POOL_REPLENISH_INTERVAL_SECS (default 30)
R2 / Dormancy
├── R2_ENDPOINT, R2_ACCESS_KEY_ID, R2_SECRET_ACCESS_KEY
├── R2_BUCKET (default valet-dormant)
└── DORMANCY_THRESHOLD_DAYS (default 7)
Routing
└── ROUTER_DOMAIN — Base domain for subdomain routing (e.g. fly.valet.host)

The VALET_FLY_* env vars are prefixed to avoid collision with Fly's auto-injected FLY_API_TOKEN, FLY_APP_NAME, and FLY_REGION.
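A sketch of the defaulting behavior using only std::env. The real config.rs derives this via clap; env_or and internal_secret are hypothetical helpers, shown to illustrate the defaults and the internal-secret fallback to CONTROL_PLANE_SECRET.

```rust
use std::env;

// Read an env var, falling back to a default when it is unset.
fn env_or(key: &str, default: &str) -> String {
    env::var(key).unwrap_or_else(|_| default.to_string())
}

// CONTROL_PLANE_INTERNAL_SECRET overrides the internal machine auth secret,
// falling back to CONTROL_PLANE_SECRET when unset.
fn internal_secret() -> Option<String> {
    env::var("CONTROL_PLANE_INTERNAL_SECRET")
        .or_else(|_| env::var("CONTROL_PLANE_SECRET"))
        .ok()
}

fn example_defaults() -> (String, String, String) {
    (
        env_or("CONTROL_PLANE_PORT", "7000"),
        env_or("VALET_FLY_REGION", "sjc"),
        env_or("VALET_FLY_VM_SIZE", "shared-cpu-1x"),
    )
}
```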


Shared state

AppState
├── pool: PgPool — PostgreSQL connection pool
├── config: Arc<Config> — Parsed configuration
└── fly: Arc<FlyClient> — Fly Machines API client

All handlers and background tasks share AppState through Axum's state extraction. PostgreSQL handles concurrency via connection pooling and advisory locks, so there is no in-process mutex contention.
