Kernel Decisions
Ten decisions that define how Rumpk works — and why it doesn't work like Linux.
K1: Unikernel Over Monolithic Kernel
Status: Accepted
Context
Traditional monolithic kernels mix isolation boundaries with scheduling logic: 2000+ cycle context switches, hundreds of syscalls, ambient-authority security. Microkernels (seL4, MINIX) trade performance for formal verifiability. Neither fits a system designed for 8-bit MCUs and Mars rovers in the same binary.
Decision
Rumpk is a modular unikernel: single address space, no process boundary, application carries its own OS logic via libnexus.a (the Membrane layer).
Alternatives Rejected
| Option | Why Not |
|---|---|
| Linux monolith | ~2000-cycle syscall/context-switch overhead, /proc complexity, ambient-authority model |
| Full microkernel (seL4) | Per-message IPC overhead on every interaction; formal-verification effort not justified for commercial systems |
| Container stack (Docker) | Requires underlying OS, 200MB+ runtime overhead per container |
Consequences
- Zero kernel context switches for I/O (2-5 cycle latency)
- Deterministic, verifiable execution paths
- POSIX compatibility requires Membrane translation layer
- Each application instance includes its own kernel code (~280KB)
K2: Zig HAL + Nim Kernel Logic
Status: Accepted
Context
Hardware interactions demand precise memory layouts, volatile access, and atomic operations. Kernel scheduling and fiber management benefit from higher-level abstractions like pattern matching and ARC memory management. One language forces compromise in either direction.
Decision
Split the kernel into two language domains:
- `hal/` (Zig): direct hardware — UART, VirtIO, GIC, MMU, interrupt vectors
- `core/` (Nim): scheduler, fibers, channels, VFS, capability enforcement
They meet at the HAL struct — a C-compatible function pointer table.
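A minimal sketch of that boundary in C, with a host-side stub standing in for the Zig backend. The field names and signatures here are illustrative assumptions, not the actual Rumpk ABI — the document only fixes that the two layers meet at a C-compatible function pointer table.

```c
#include <stdint.h>

/* Hypothetical HAL boundary: core/ (Nim) calls hal/ (Zig) only through
 * this C-compatible function pointer table. Field names are illustrative. */
typedef struct Hal {
    void     (*uart_putc)(uint8_t byte);   /* blocking UART transmit      */
    int      (*uart_getc)(void);           /* -1 when no byte is pending  */
    uint64_t (*time_ns)(void);             /* monotonic nanoseconds       */
} Hal;

/* Host-side stub backend standing in for the Zig implementation. */
static char log_buf[16];
static int  log_len = 0;

static void stub_putc(uint8_t b) {
    if (log_len < (int)sizeof log_buf) log_buf[log_len++] = (char)b;
}
static int      stub_getc(void) { return -1; }   /* no input pending */
static uint64_t stub_time(void) { return 42; }

static const Hal stub_hal = { stub_putc, stub_getc, stub_time };
```

Because the table is plain C, either side can be swapped out (or mocked on a host, as above) without the other layer noticing — this is what "ABI discipline required" buys.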
Alternatives Rejected
| Option | Why Not |
|---|---|
| All Zig | Requires custom async/await runtime; loses Nim's fiber elegance |
| All Nim | Low-level operations become brittle; inline asm escapes are unsafe |
| C for HAL | As verbose as Zig but more fragile (no comptime, no safety guarantees) |
Consequences
- Clean separation: Physics (Zig) vs. Logic (Nim)
- Nim fibers survive HAL panics (channels persist at fixed addresses)
- Future Janus language can slot into either layer
- Two runtime ecosystems to maintain; ABI discipline required
K3: 12 Frozen Syscalls
Status: Accepted
Context
POSIX defines 400+ syscalls. Each is a security boundary crossing and a context switch point. Most are vestigial. Nexus routes all bulk I/O through ION Ring buffers — the kernel only needs a handful of control operations.
Decision
Exactly 12 syscalls + 1 meta-slot. Frozen forever:
`pledge`, `unveil`, `read(0)`, `write(1)`, `map`, `unmap`, `spawn`, `kill`, `yield`, `time`, `random`, `halt`
All data I/O goes through ION Rings (zero context switch). The syscall ABI is compiled into binaries at link time — no versioning, no compat layers.
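One way to express the frozen table in C — the slot ordering is an assumption (the document fixes the set of 12 + 1, not the numbering), but the compile-time freeze check is the point:

```c
/* Illustrative numbering of the 12 frozen syscalls plus the meta-slot.
 * Slot order is a guess; only the set itself comes from the spec. */
enum NexusSyscall {
    SYS_PLEDGE = 0,
    SYS_UNVEIL,
    SYS_READ,      /* fd 0 only: control-plane input   */
    SYS_WRITE,     /* fd 1 only: control-plane output  */
    SYS_MAP,
    SYS_UNMAP,
    SYS_SPAWN,
    SYS_KILL,
    SYS_YIELD,
    SYS_TIME,
    SYS_RANDOM,
    SYS_HALT,
    SYS_META,      /* the single reserved meta-slot    */
    SYS_COUNT
};

/* Frozen forever: adding a 14th slot must fail to compile. */
_Static_assert(SYS_COUNT == 13, "syscall table is frozen at 12 + 1 meta-slot");
```

Since the numbering is fixed at link time, an application can index the table directly with these constants — no runtime discovery, no versioned dispatch.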
Alternatives Rejected
| Option | Why Not |
|---|---|
| Full POSIX | 400+ audit points, each with versioning and compat requirements |
| seL4 model | 100+ capability operations, more complex interface |
| Plan 9 RPC | Everything-as-RPC is flexible but adds latency |
Consequences
- Formal verification feasible (12 entry points vs. 400+)
- Security surface trivially auditable
- Binary ABI stable forever (no libc version wars)
- Legacy applications require Membrane translation
- No backfill possible — missed use cases stay missed
K4: ION Rings Over Pipes and Sockets
Status: Accepted
Context
POSIX pipes are byte-streams requiring syscalls per read/write. Sockets add TCP/IP overhead for local communication. Financial exchanges proved the Disruptor pattern achieves nanosecond latencies with zero allocations.
Decision
All IPC uses Disruptor-style ring buffers (ION Rings):
- 128-byte fixed header + dynamic payload in shared memory
- Pre-allocated at boot — no dynamic allocation
- Each fiber gets dedicated RX/TX rings
- Backpressure via atomic head/tail pointers
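The mechanics above can be sketched as a single-producer/single-consumer ring with atomic head/tail pointers. Slot count and the flat slot layout are simplifications (the document specifies a 128-byte header plus dynamic payload); backpressure shows up as a failed push:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Minimal SPSC ring in the spirit of an ION Ring: slots pre-allocated
 * up front, no dynamic allocation, backpressure = push returns false. */
#define RING_SLOTS 8            /* must be a power of two */
#define SLOT_BYTES 128          /* the document's fixed header size */

typedef struct IonRing {
    _Atomic uint32_t head;      /* next slot the consumer will read  */
    _Atomic uint32_t tail;      /* next slot the producer will write */
    uint8_t slots[RING_SLOTS][SLOT_BYTES];
} IonRing;

static bool ring_push(IonRing *r, const void *msg, size_t len) {
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == RING_SLOTS || len > SLOT_BYTES)
        return false;                              /* full: backpressure */
    memcpy(r->slots[tail % RING_SLOTS], msg, len);
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

static bool ring_pop(IonRing *r, void *out, size_t len) {
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head == tail || len > SLOT_BYTES)
        return false;                              /* empty */
    memcpy(out, r->slots[head % RING_SLOTS], len);
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}
```

No syscall appears anywhere in this path: producer and consumer touch only shared memory, which is why local IPC stays in the nanosecond range.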
Alternatives Rejected
| Option | Why Not |
|---|---|
| POSIX pipes | Byte-oriented, syscall per operation, 2-5 µs minimum latency |
| POSIX sockets | TCP overhead, kernel routing, ARP lookups for local IPC |
| Actor model (Erlang) | Elegant but requires heavyweight runtime (50MB Erlang VM) |
Consequences
- ~100ns latency for local IPC (vs. 2-5 µs for pipes)
- Zero-copy with proper buffer ownership
- Automatic backpressure (no overflow, no unbounded queues)
- Fixed-size rings require capacity pre-planning
- Buffer ownership semantics require discipline
K5: Tickless Event-Driven Scheduler
Status: Accepted
Context
Traditional schedulers wake the CPU every 1-10ms via timer interrupt. This prevents deep sleep states and wastes power even when the system is completely idle. Modern hardware (V-Sync, DMA completion, network IRQs) provides natural event signals.
Decision
No periodic scheduler tick. The scheduler is an ISR, not a loop:
- When `RunQueue` is empty, the CPU executes `WFI` and power drops to leakage levels
- Scheduling occurs only as a side effect of hardware interrupts
- Wakes on V-Sync (120Hz), packet arrival, DMA completion
- Software timeouts registered as explicit deadline events
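The last bullet is the subtle one: with no periodic tick, a sleeping fiber wakes only if its timeout was registered as an explicit deadline event. A sketch of that registry (the API names are hypothetical; a real implementation would program a one-shot hardware timer for the earliest deadline before executing `WFI`):

```c
#include <stdint.h>

#define MAX_DEADLINES 16
#define NO_DEADLINE   UINT64_MAX

/* Hypothetical deadline registry for a tickless idle path. */
static uint64_t deadlines[MAX_DEADLINES];
static int      n_deadlines = 0;

/* Register a software timeout as an explicit wake event. */
static int register_deadline(uint64_t wake_ns) {
    if (n_deadlines == MAX_DEADLINES) return -1;
    deadlines[n_deadlines++] = wake_ns;
    return 0;
}

/* Earliest pending deadline, i.e. what the idle path would program
 * into the one-shot timer. NO_DEADLINE means: sleep in WFI until an
 * external interrupt (V-Sync, DMA, packet) arrives. */
static uint64_t next_deadline(void) {
    uint64_t earliest = NO_DEADLINE;
    for (int i = 0; i < n_deadlines; i++)
        if (deadlines[i] < earliest)
            earliest = deadlines[i];
    return earliest;
}
```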
Alternatives Rejected
| Option | Why Not |
|---|---|
| Fixed tick (1ms) | Forces idle wake-ups, 1-3% CPU overhead on idle |
| Dynamic tick (Linux NO_HZ) | Heuristic tuning adds complexity without dramatic benefit |
| Per-app event loop | Requires application awareness, not transparent |
Consequences
- 5-10% power reduction on idle (mobile/embedded)
- Events trigger immediate scheduler response (deterministic latency)
- No fairness pathology (Spectrum model prevents starvation)
- Software timeouts must be explicitly registered
- Some batch workloads lose natural fairness preemption points
K6: Fibers Over Processes and Threads
Status: Accepted
Context
Processes cost 1-5 MB RAM and 1-5 µs per context switch. Threads share memory but invite race conditions and deadlocks. Nexus operates in a single address space — process isolation provides no benefit, only cost.
Decision
All concurrency uses fibers (Nim async/await):
- ~1 KB RAM per fiber (vs. 1-5 MB per process)
- Cooperative switching: ~100-200 ns (vs. 1-5 µs for context switch)
- No preemption within a priority level — no race conditions by construction
- Spectrum-based fairness: Photon > Matter > Gravity > Void
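The cooperative model can be reduced to its scheduling shape: a fiber is a resumable step function that runs until its next voluntary yield point. Real Rumpk fibers are Nim async/await coroutines with ~1 KB of state; this C sketch keeps only the run-until-yield contract.

```c
#include <stdbool.h>

/* A "fiber" as a resumable step function: each call runs to the next
 * voluntary yield point; returning false means the fiber is finished. */
typedef bool (*FiberStep)(void *state);

typedef struct Fiber {
    FiberStep step;
    void     *state;
    bool      done;
} Fiber;

/* Round-robin over live fibers until all finish; returns total steps.
 * Note there is no preemption: a step that never returns blocks everyone,
 * which is exactly the "must yield voluntarily" consequence below. */
static int run_all(Fiber *fibers, int n) {
    int steps = 0;
    bool live = true;
    while (live) {
        live = false;
        for (int i = 0; i < n; i++) {
            if (fibers[i].done) continue;
            steps++;
            if (fibers[i].step(fibers[i].state))
                live = true;            /* yielded, more work pending */
            else
                fibers[i].done = true;  /* finished */
        }
    }
    return steps;
}

/* Example fiber: counts to a limit, yielding between increments. */
typedef struct { int n, limit; } Counter;
static bool counter_step(void *s) {
    Counter *c = s;
    c->n++;                     /* one unit of work */
    return c->n < c->limit;     /* true = yielded, not yet done */
}
```

Because control transfers only at yield points, two fibers sharing `Counter` state could never race — the "no race conditions by construction" claim is a property of this scheduling shape, not of locks.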
Alternatives Rejected
| Option | Why Not |
|---|---|
| Heavy processes | Isolation already provided by capabilities; still pay full context switch |
| Shared-memory threads | Lock-free data structures are hard; deadlocks and races omnipresent |
| Actor model (Erlang) | Elegant but 50MB base runtime; too heavy for 280KB kernel |
Consequences
- Thousands of fibers practical (MCUs can run 20+, desktops run 10,000+)
- No race conditions — shared memory with cooperative scheduling
- 4KB default stack per fiber
- No true parallelism on single core (concurrency only)
- Long-running fibers must yield voluntarily
A1: Single Address Space
Status: Accepted
Context
Multiple address spaces (one per process) provide isolation but require TLB flushes on context switch (100-500 cycles). In a unikernel where isolation comes from capabilities, not virtual address separation, the overhead is pure waste.
Decision
One virtual address range for the entire system:
- Cell A: `0x08000000`–`0x0FFFFFFF`
- Cell B: `0x10000000`–`0x17FFFFFF`
- Isolation is physical (PMP registers / MMU page tables), not VA ranges
- No TLB flush needed when switching between cells
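With a static partition, ownership of an address is a pure range check against the boot-time layout — no page-table walk, no TLB involvement. A sketch using the two example cell ranges above:

```c
#include <stdint.h>

/* Static cell partitioning of the single address space, using the two
 * example ranges from this decision. Extending to more cells is just
 * more table rows, fixed at boot. */
typedef struct Cell { uint64_t base, limit; } Cell;   /* inclusive bounds */

static const Cell cells[] = {
    { 0x08000000u, 0x0FFFFFFFu },   /* Cell A */
    { 0x10000000u, 0x17FFFFFFu },   /* Cell B */
};

/* Returns the owning cell index, or -1 for unowned addresses.
 * Enforcement itself is done by PMP/MMU hardware; this is only the
 * kernel's bookkeeping view. */
static int owning_cell(uint64_t addr) {
    for (int i = 0; i < (int)(sizeof cells / sizeof cells[0]); i++)
        if (addr >= cells[i].base && addr <= cells[i].limit)
            return i;
    return -1;
}
```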
Alternatives Rejected
| Option | Why Not |
|---|---|
| Per-process address spaces | TLB flush = 100-500 cycles per switch |
| ASID tagging | Mitigates flushes but complex hardware requirement, not universal |
| No isolation | Fast but insecure |
Consequences
- Zero TLB flush overhead on cell switches
- Simpler kernel (no VA→PA remapping per cell)
- Shared read-only pages between cells (deduplication)
- Address space must be statically partitioned at boot
- Some hardware (ARM Realm Management) conflicts with this model
A2: SysTable Frozen ABI
Status: Accepted
Context
Dynamic registration (UEFI-style) is flexible but requires boot negotiation. Fixed addresses are simple and can be compiled into binaries at link time. The kernel only needs 12 function pointers.
Decision
SysTable at a fixed physical address, immutable layout:
| Architecture | Address |
|---|---|
| RISC-V 64 | 0x83000000 |
| ARM64 | 0x50000000 |
240 bytes. 12 syscall pointers + 1 meta-slot + ring buffer descriptors + framebuffer info. The layout never changes across versions.
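An illustrative C layout summing to the documented 240 bytes. Only the total size and the 12 + 1 pointer slots come from the document; the descriptor and framebuffer fields, and the reserved tail, are guesses (assuming a 64-bit target, where pointers are 8 bytes):

```c
#include <stdint.h>

/* Hypothetical SysTable layout. The _Static_assert is the real contract:
 * the frozen ABI means this struct may never change size or field order. */
typedef struct RingDesc { uint64_t base, len; } RingDesc;       /*  16 B */

typedef struct SysTable {
    void    *syscalls[13];      /* 12 frozen syscalls + meta-slot: 104 B */
    RingDesc rings[4];          /* RX/TX ring descriptors:          64 B */
    uint64_t fb_base;           /* framebuffer info:                24 B */
    uint32_t fb_width, fb_height, fb_pitch, fb_format;
    uint8_t  reserved[48];      /* pads the frozen layout to 240 B       */
} SysTable;

_Static_assert(sizeof(SysTable) == 240,
               "SysTable layout is frozen at 240 bytes");

/* Fixed physical addresses from the table above. */
#define SYSTABLE_ADDR_RISCV64 0x83000000u
#define SYSTABLE_ADDR_ARM64   0x50000000u
```

This is also why "runtime structure validation trivial (size check)" holds: a binary can verify it is looking at a SysTable with a single `sizeof` comparison rather than a discovery protocol.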
Alternatives Rejected
| Option | Why Not |
|---|---|
| Dynamic registration | Boot complexity, discovery protocol needed |
| Multiple SysTables per cell | Memory overhead, harder to debug |
| Syscalls only (no table) | Loses zero-copy ring optimization |
Consequences
- Apps hardcode SysTable address — no discovery code
- Binary ABI stable forever (NPLs compiled against v0.1 run on v99.0)
- Runtime structure validation trivial (size check)
- Cannot extend beyond 240 bytes
- Wrong address = hard crash (no fallback)
A3: DragonflyBSD LWKT Scheduler Model
Status: Accepted
Context
Linux CFS optimizes for fairness (O(log n), preemptive). DragonflyBSD LWKT optimizes for predictability (priority-based, cooperative). Nexus needs deterministic latency, not max fairness.
Decision
Adopt DragonflyBSD's Lightweight Kernel Thread model:
- Fixed priority per fiber (Spectrum tier: Photon > Matter > Gravity > Void)
- Round-robin within same priority level
- Cooperative: fibers yield voluntarily, no preemption overhead
- Data moves between cores via ION Ring messages (no shared-memory mutexes)
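The selection rule above fits in a dozen lines: strict priority across Spectrum tiers, round-robin within a tier. Tier names come from the document; the queue representation is illustrative.

```c
/* LWKT-style selection: highest non-empty Spectrum tier wins outright;
 * fibers within a tier rotate round-robin. */
enum Spectrum { PHOTON = 0, MATTER, GRAVITY, VOID_TIER, N_TIERS };

#define QUEUE_CAP 8

typedef struct RunQueue {
    int ids[N_TIERS][QUEUE_CAP];   /* runnable fiber ids per tier    */
    int count[N_TIERS];
    int cursor[N_TIERS];           /* round-robin position per tier  */
} RunQueue;

/* Returns the next fiber id to run, or -1 for idle (the WFI path). */
static int pick_next(RunQueue *rq) {
    for (int t = 0; t < N_TIERS; t++) {
        if (rq->count[t] == 0) continue;
        int id = rq->cursor[t];
        rq->cursor[t] = (rq->cursor[t] + 1) % rq->count[t];
        return rq->ids[t][id];
    }
    return -1;
}
```

The "not perfectly fair" consequence is visible directly: as long as any Matter fiber is runnable, a Void fiber is never selected.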
Alternatives Rejected
| Option | Why Not |
|---|---|
| Linux CFS | Designed for multitenant fairness; preemption adds jitter |
| Rate Monotonic | Requires static scheduling; less flexible for dynamic workloads |
| Custom from scratch | High development risk, debugging nightmare |
Consequences
- Simpler scheduler (fewer lines = fewer bugs)
- Predictable: high-priority fibers always run first
- No starvation: Gravity/Void get time after Photon/Matter sleep
- Not perfectly fair (background jobs may starve under sustained foreground load)
- Long-running fibers must yield() voluntarily
A4: No Microkernel Message-Passing
Status: Accepted
Context
seL4 routes all inter-process communication through the kernel via formal RPC. This enables formal verification but adds ~1 µs overhead per message. Nexus already separates capability checking from data transfer.
Decision
The kernel does not implement RPC or message-passing:
- Kernel only performs capability checks + side effects (memory allocation, pledge verification)
- Data exchange happens via shared ION Rings — userland responsibility
- Kernel is completely passive (no mediator, no routing, no serialization)
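The kernel's entire role in IPC reduces to a predicate: verify the caller's pledged capabilities, then perform the side effect. A sketch of that check — the pledge names are hypothetical, and note that no data movement appears anywhere:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical pledge bits; the actual capability vocabulary is not
 * specified here. Data transfer itself never enters the kernel — it
 * happens in userland over shared ION Rings. */
enum Pledge {
    PLEDGE_ION_TX = 1u << 0,    /* may transmit on its TX ring */
    PLEDGE_ION_RX = 1u << 1,    /* may receive on its RX ring  */
    PLEDGE_SPAWN  = 1u << 2,    /* may create new fibers       */
};

typedef struct Caps { uint32_t pledged; } Caps;

/* Capability check: a pure predicate over the pledge mask. This, plus
 * the resulting side effect (e.g. mapping a ring), is all the kernel
 * contributes to IPC — no routing, no serialization. */
static bool cap_allows(const Caps *c, uint32_t need) {
    return (c->pledged & need) == need;
}
```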
Alternatives Rejected
| Option | Why Not |
|---|---|
| seL4-style RPC | Every message = hypercall, ~1 µs overhead each |
| Shared memory without kernel help | Easy to deadlock, hard to debug |
| Hybrid (RPC for control, rings for data) | Complexity without clear benefit over pure rings |
Consequences
- Kernel stays tiny (no RPC logic, no message routing)
- Verification scope smaller (only capability checks, not datapath)
- IPC latency is nanoseconds, not microseconds
- Applications must implement their own wire protocols
- Debugging requires understanding both capability system and app-level protocols