Kernel Decisions
Ten decisions that define how Rumpk works — and why it doesn't work like Linux.
K1: Unikernel Over Monolithic Kernel
Status: Accepted
Context
Traditional monolithic kernels mix isolation boundaries with scheduling logic: 2000+ cycle context switches, hundreds of syscalls, ambient-authority security. Microkernels (seL4, MINIX) trade performance for formal verifiability. Neither fits a system designed for 8-bit MCUs and Mars rovers in the same binary.
Decision
Rumpk is a modular unikernel: single address space, no process boundary, application carries its own OS logic via libnexus.a (the Membrane layer).
Alternatives Rejected
| Option | Why Not |
|---|---|
| Linux monolith | ~2000-cycle syscall/context-switch overhead, /proc complexity, ambient-authority model |
| Full microkernel (seL4) | Per-message IPC overhead on every interaction; formal-verification effort not justified for commercial systems |
| Container stack (Docker) | Requires underlying OS, 200MB+ runtime overhead per container |
Consequences
- Zero kernel context switches for I/O (2-5 cycle latency)
- Deterministic, verifiable execution paths
- POSIX compatibility requires Membrane translation layer
- Each application instance includes its own kernel code (~280KB)
K2: Zig HAL + Nim Kernel Logic
Status: Accepted
Context
Hardware interactions demand precise memory layouts, volatile access, and atomic operations. Kernel scheduling and fiber management benefit from higher-level abstractions like pattern matching and ARC memory management. One language forces compromise in either direction.
Decision
Split the kernel into two language domains:
- `hal/` (Zig): direct hardware — UART, VirtIO, GIC, MMU, interrupt vectors
- `core/` (Nim): scheduler, fibers, channels, VFS, capability enforcement
They meet at the HAL struct — a C-compatible function pointer table.
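A minimal sketch of that boundary in C, with a host-side stub standing in for the Zig backend. The field names and signatures here are illustrative assumptions, not the actual Rumpk ABI — the document only fixes that the two layers meet at a C-compatible function pointer table.

```c
#include <stdint.h>

/* Hypothetical HAL boundary: core/ (Nim) calls hal/ (Zig) only through
 * this C-compatible function pointer table. Field names are illustrative. */
typedef struct Hal {
    void     (*uart_putc)(uint8_t byte);   /* blocking UART transmit      */
    int      (*uart_getc)(void);           /* -1 when no byte is pending  */
    uint64_t (*time_ns)(void);             /* monotonic nanoseconds       */
} Hal;

/* Host-side stub backend standing in for the Zig implementation. */
static char log_buf[16];
static int  log_len = 0;

static void stub_putc(uint8_t b) {
    if (log_len < (int)sizeof log_buf) log_buf[log_len++] = (char)b;
}
static int      stub_getc(void) { return -1; }   /* no input pending */
static uint64_t stub_time(void) { return 42; }

static const Hal stub_hal = { stub_putc, stub_getc, stub_time };
```

Because the table is plain C, either side can be swapped out (or mocked on a host, as above) without the other layer noticing — this is what "ABI discipline required" buys.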
Alternatives Rejected
| Option | Why Not |
|---|---|
| All Zig | Requires custom async/await runtime; loses Nim's fiber elegance |
| All Nim | Low-level operations become brittle; inline asm escapes are unsafe |
| C for HAL | As verbose as Zig but more fragile (no comptime, no safety guarantees) |
Consequences
- Clean separation: Physics (Zig) vs. Logic (Nim)
- Nim fibers survive HAL panics (channels persist at fixed addresses)
- Future Janus language can slot into either layer
- Two runtime ecosystems to maintain; ABI discipline required
K3: 12 Frozen Syscalls
Status: Accepted
Context
POSIX defines 400+ syscalls. Each is a security boundary crossing and a context switch point. Most are vestigial. Nexus routes all bulk I/O through ION Ring buffers — the kernel only needs a handful of control operations.
Decision
Exactly 12 syscalls + 1 meta-slot. Frozen forever:
`pledge`, `unveil`, `read(0)`, `write(1)`, `map`, `unmap`, `spawn`, `kill`, `yield`, `time`, `random`, `halt`
All data I/O goes through ION Rings (zero context switch). The syscall ABI is compiled into binaries at link time — no versioning, no compat layers.
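One way to express the frozen table in C — the slot ordering is an assumption (the document fixes the set of 12 + 1, not the numbering), but the compile-time freeze check is the point:

```c
/* Illustrative numbering of the 12 frozen syscalls plus the meta-slot.
 * Slot order is a guess; only the set itself comes from the spec. */
enum NexusSyscall {
    SYS_PLEDGE = 0,
    SYS_UNVEIL,
    SYS_READ,      /* fd 0 only: control-plane input   */
    SYS_WRITE,     /* fd 1 only: control-plane output  */
    SYS_MAP,
    SYS_UNMAP,
    SYS_SPAWN,
    SYS_KILL,
    SYS_YIELD,
    SYS_TIME,
    SYS_RANDOM,
    SYS_HALT,
    SYS_META,      /* the single reserved meta-slot    */
    SYS_COUNT
};

/* Frozen forever: adding a 14th slot must fail to compile. */
_Static_assert(SYS_COUNT == 13, "syscall table is frozen at 12 + 1 meta-slot");
```

Since the numbering is fixed at link time, an application can index the table directly with these constants — no runtime discovery, no versioned dispatch.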
Alternatives Rejected
| Option | Why Not |
|---|---|
| Full POSIX | 400+ audit points, each with versioning and compat requirements |
| seL4 model | 100+ capability operations, more complex interface |
| Plan 9 RPC | Everything-as-RPC is flexible but adds latency |
Consequences
- Formal verification feasible (12 entry points vs. 400+)
- Security surface trivially auditable
- Binary ABI stable forever (no libc version wars)
- Legacy applications require Membrane translation
- No backfill possible — missed use cases stay missed
K4: ION Rings Over Pipes and Sockets
Status: Accepted
Context
POSIX pipes are byte-streams requiring syscalls per read/write. Sockets add TCP/IP overhead for local communication. Financial exchanges proved the Disruptor pattern achieves nanosecond latencies with zero allocations.
Decision
All IPC uses Disruptor-style ring buffers (ION Rings):
- 128-byte fixed header + dynamic payload in shared memory
- Pre-allocated at boot — no dynamic allocation
- Each fiber gets dedicated RX/TX rings
- Backpressure via atomic head/tail pointers
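The mechanics above can be sketched as a single-producer/single-consumer ring with atomic head/tail pointers. Slot count and the flat slot layout are simplifications (the document specifies a 128-byte header plus dynamic payload); backpressure shows up as a failed push:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Minimal SPSC ring in the spirit of an ION Ring: slots pre-allocated
 * up front, no dynamic allocation, backpressure = push returns false. */
#define RING_SLOTS 8            /* must be a power of two */
#define SLOT_BYTES 128          /* the document's fixed header size */

typedef struct IonRing {
    _Atomic uint32_t head;      /* next slot the consumer will read  */
    _Atomic uint32_t tail;      /* next slot the producer will write */
    uint8_t slots[RING_SLOTS][SLOT_BYTES];
} IonRing;

static bool ring_push(IonRing *r, const void *msg, size_t len) {
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == RING_SLOTS || len > SLOT_BYTES)
        return false;                              /* full: backpressure */
    memcpy(r->slots[tail % RING_SLOTS], msg, len);
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

static bool ring_pop(IonRing *r, void *out, size_t len) {
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head == tail || len > SLOT_BYTES)
        return false;                              /* empty */
    memcpy(out, r->slots[head % RING_SLOTS], len);
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}
```

No syscall appears anywhere in this path: producer and consumer touch only shared memory, which is why local IPC stays in the nanosecond range.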
Alternatives Rejected
| Option | Why Not |
|---|---|
| POSIX pipes | Byte-oriented, syscall per operation, 2-5 µs minimum latency |
| POSIX sockets | TCP overhead, kernel routing, ARP lookups for local IPC |
| Actor model (Erlang) | Elegant but requires heavyweight runtime (50MB Erlang VM) |
Consequences
- ~100ns latency for local IPC (vs. 2-5 µs for pipes)
- Zero-copy with proper buffer ownership
- Automatic backpressure (no overflow, no unbounded queues)
- Fixed-size rings require capacity pre-planning
- Buffer ownership semantics require discipline
K5: Tickless Event-Driven Scheduler
Status: Accepted
Context
Traditional schedulers wake the CPU every 1-10ms via timer interrupt. This prevents deep sleep states and wastes power even when the system is completely idle. Modern hardware (V-Sync, DMA completion, network IRQs) provides natural event signals.
Decision
No periodic scheduler tick. The scheduler is an ISR, not a loop:
- When `RunQueue` is empty, the CPU executes `WFI` and power drops to leakage levels
- Scheduling occurs only as a side effect of hardware interrupts
- Wakes on V-Sync (120Hz), packet arrival, DMA completion
- Software timeouts registered as explicit deadline events
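The last bullet is the subtle one: with no periodic tick, a sleeping fiber wakes only if its timeout was registered as an explicit deadline event. A sketch of that registry (the API names are hypothetical; a real implementation would program a one-shot hardware timer for the earliest deadline before executing `WFI`):

```c
#include <stdint.h>

#define MAX_DEADLINES 16
#define NO_DEADLINE   UINT64_MAX

/* Hypothetical deadline registry for a tickless idle path. */
static uint64_t deadlines[MAX_DEADLINES];
static int      n_deadlines = 0;

/* Register a software timeout as an explicit wake event. */
static int register_deadline(uint64_t wake_ns) {
    if (n_deadlines == MAX_DEADLINES) return -1;
    deadlines[n_deadlines++] = wake_ns;
    return 0;
}

/* Earliest pending deadline, i.e. what the idle path would program
 * into the one-shot timer. NO_DEADLINE means: sleep in WFI until an
 * external interrupt (V-Sync, DMA, packet) arrives. */
static uint64_t next_deadline(void) {
    uint64_t earliest = NO_DEADLINE;
    for (int i = 0; i < n_deadlines; i++)
        if (deadlines[i] < earliest)
            earliest = deadlines[i];
    return earliest;
}
```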
Alternatives Rejected
| Option | Why Not |
|---|---|
| Fixed tick (1ms) | Forces idle wake-ups, 1-3% CPU overhead on idle |
| Dynamic tick (Linux NO_HZ) | Heuristic tuning adds complexity without dramatic benefit |
| Per-app event loop | Requires application awareness, not transparent |
Consequences
- 5-10% power reduction on idle (mobile/embedded)
- Events trigger immediate scheduler response (deterministic latency)
- No fairness pathology (Spectrum model prevents starvation)
- Software timeouts must be explicitly registered
- Some batch workloads lose natural fairness preemption points
K6: Fibers Over Processes and Threads
Status: Accepted
Context
Processes cost 1-5 MB RAM and 1-5 µs per context switch. Threads share memory but invite race conditions and deadlocks. Nexus operates in a single address space — process isolation provides no benefit, only cost.
Decision
All concurrency uses fibers (Nim async/await):
- ~1 KB RAM per fiber (vs. 1-5 MB per process)
- Cooperative switching: ~100-200 ns (vs. 1-5 µs for context switch)
- No preemption within a priority level — no race conditions by construction
- Spectrum-based fairness: Photon > Matter > Gravity > Void
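The cooperative model can be reduced to its scheduling shape: a fiber is a resumable step function that runs until its next voluntary yield point. Real Rumpk fibers are Nim async/await coroutines with ~1 KB of state; this C sketch keeps only the run-until-yield contract.

```c
#include <stdbool.h>

/* A "fiber" as a resumable step function: each call runs to the next
 * voluntary yield point; returning false means the fiber is finished. */
typedef bool (*FiberStep)(void *state);

typedef struct Fiber {
    FiberStep step;
    void     *state;
    bool      done;
} Fiber;

/* Round-robin over live fibers until all finish; returns total steps.
 * Note there is no preemption: a step that never returns blocks everyone,
 * which is exactly the "must yield voluntarily" consequence below. */
static int run_all(Fiber *fibers, int n) {
    int steps = 0;
    bool live = true;
    while (live) {
        live = false;
        for (int i = 0; i < n; i++) {
            if (fibers[i].done) continue;
            steps++;
            if (fibers[i].step(fibers[i].state))
                live = true;            /* yielded, more work pending */
            else
                fibers[i].done = true;  /* finished */
        }
    }
    return steps;
}

/* Example fiber: counts to a limit, yielding between increments. */
typedef struct { int n, limit; } Counter;
static bool counter_step(void *s) {
    Counter *c = s;
    c->n++;                     /* one unit of work */
    return c->n < c->limit;     /* true = yielded, not yet done */
}
```

Because control transfers only at yield points, two fibers sharing `Counter` state could never race — the "no race conditions by construction" claim is a property of this scheduling shape, not of locks.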
Alternatives Rejected
| Option | Why Not |
|---|---|
| Heavy processes | Isolation already provided by capabilities; still pay full context switch |
| Shared-memory threads | Lock-free data structures are hard; deadlocks and races omnipresent |
| Actor model (Erlang) | Elegant but 50MB base runtime; too heavy for 280KB kernel |
Consequences
- Thousands of fibers practical (MCUs can run 20+, desktops run 10,000+)
- No race conditions — shared memory with cooperative scheduling
- 4KB default stack per fiber
- No true parallelism on single core (concurrency only)
- Long-running fibers must yield voluntarily
A1: Single Address Space
Status: Accepted
Context
Multiple address spaces (one per process) provide isolation but require TLB flushes on context switch (100-500 cycles). In a unikernel where isolation comes from capabilities, not virtual address separation, the overhead is pure waste.
Decision
One virtual address range for the entire system:
- Cell A: `0x08000000`–`0x0FFFFFFF`
- Cell B: `0x10000000`–`0x17FFFFFF`
- Isolation is physical (PMP registers / MMU page tables), not VA ranges
- No TLB flush needed when switching between cells
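With a static partition, ownership of an address is a pure range check against the boot-time layout — no page-table walk, no TLB involvement. A sketch using the two example cell ranges above:

```c
#include <stdint.h>

/* Static cell partitioning of the single address space, using the two
 * example ranges from this decision. Extending to more cells is just
 * more table rows, fixed at boot. */
typedef struct Cell { uint64_t base, limit; } Cell;   /* inclusive bounds */

static const Cell cells[] = {
    { 0x08000000u, 0x0FFFFFFFu },   /* Cell A */
    { 0x10000000u, 0x17FFFFFFu },   /* Cell B */
};

/* Returns the owning cell index, or -1 for unowned addresses.
 * Enforcement itself is done by PMP/MMU hardware; this is only the
 * kernel's bookkeeping view. */
static int owning_cell(uint64_t addr) {
    for (int i = 0; i < (int)(sizeof cells / sizeof cells[0]); i++)
        if (addr >= cells[i].base && addr <= cells[i].limit)
            return i;
    return -1;
}
```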
Alternatives Rejected
| Option | Why Not |
|---|---|
| Per-process address spaces | TLB flush = 100-500 cycles per switch |
| ASID tagging | Mitigates flushes but complex hardware requirement, not universal |
| No isolation | Fast but insecure |
Consequences
- Zero TLB flush overhead on cell switches
- Simpler kernel (no VA→PA remapping per cell)
- Shared read-only pages between cells (deduplication)
- Address space must be statically partitioned at boot
- Some hardware (ARM Realm Management) conflicts with this model
A2: SysTable Frozen ABI
Status: Accepted
Context
Dynamic registration (UEFI-style) is flexible but requires boot negotiation. Fixed addresses are simple and can be compiled into binaries at link time. The kernel only needs 12 function pointers.
Decision
SysTable at a fixed physical address, immutable layout:
| Architecture | Address |
|---|---|
| RISC-V 64 | 0x83000000 |
| ARM64 | 0x50000000 |
240 bytes. 12 syscall pointers + 1 meta-slot + ring buffer descriptors + framebuffer info. The layout never changes across versions.
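An illustrative C layout summing to the documented 240 bytes. Only the total size and the 12 + 1 pointer slots come from the document; the descriptor and framebuffer fields, and the reserved tail, are guesses (assuming a 64-bit target, where pointers are 8 bytes):

```c
#include <stdint.h>

/* Hypothetical SysTable layout. The _Static_assert is the real contract:
 * the frozen ABI means this struct may never change size or field order. */
typedef struct RingDesc { uint64_t base, len; } RingDesc;       /*  16 B */

typedef struct SysTable {
    void    *syscalls[13];      /* 12 frozen syscalls + meta-slot: 104 B */
    RingDesc rings[4];          /* RX/TX ring descriptors:          64 B */
    uint64_t fb_base;           /* framebuffer info:                24 B */
    uint32_t fb_width, fb_height, fb_pitch, fb_format;
    uint8_t  reserved[48];      /* pads the frozen layout to 240 B       */
} SysTable;

_Static_assert(sizeof(SysTable) == 240,
               "SysTable layout is frozen at 240 bytes");

/* Fixed physical addresses from the table above. */
#define SYSTABLE_ADDR_RISCV64 0x83000000u
#define SYSTABLE_ADDR_ARM64   0x50000000u
```

This is also why "runtime structure validation trivial (size check)" holds: a binary can verify it is looking at a SysTable with a single `sizeof` comparison rather than a discovery protocol.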
Alternatives Rejected
| Option | Why Not |
|---|---|
| Dynamic registration | Boot complexity, discovery protocol needed |
| Multiple SysTables per cell | Memory overhead, harder to debug |
| Syscalls only (no table) | Loses zero-copy ring optimization |
Consequences
- Apps hardcode SysTable address — no discovery code
- Binary ABI stable forever (NPLs compiled against v0.1 run on v99.0)
- Runtime structure validation trivial (size check)
- Cannot extend beyond 240 bytes
- Wrong address = hard crash (no fallback)
A3: DragonflyBSD LWKT Scheduler Model
Status: Accepted
Context
Linux CFS optimizes for fairness (O(log n), preemptive). DragonflyBSD LWKT optimizes for predictability (priority-based, cooperative). Nexus needs deterministic latency, not max fairness.
Decision
Adopt DragonflyBSD's Lightweight Kernel Thread model:
- Fixed priority per fiber (Spectrum tier: Photon > Matter > Gravity > Void)
- Round-robin within same priority level
- Cooperative: fibers yield voluntarily, no preemption overhead
- Data moves between cores via ION Ring messages (no shared-memory mutexes)
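The selection rule above fits in a dozen lines: strict priority across Spectrum tiers, round-robin within a tier. Tier names come from the document; the queue representation is illustrative.

```c
/* LWKT-style selection: highest non-empty Spectrum tier wins outright;
 * fibers within a tier rotate round-robin. */
enum Spectrum { PHOTON = 0, MATTER, GRAVITY, VOID_TIER, N_TIERS };

#define QUEUE_CAP 8

typedef struct RunQueue {
    int ids[N_TIERS][QUEUE_CAP];   /* runnable fiber ids per tier    */
    int count[N_TIERS];
    int cursor[N_TIERS];           /* round-robin position per tier  */
} RunQueue;

/* Returns the next fiber id to run, or -1 for idle (the WFI path). */
static int pick_next(RunQueue *rq) {
    for (int t = 0; t < N_TIERS; t++) {
        if (rq->count[t] == 0) continue;
        int id = rq->cursor[t];
        rq->cursor[t] = (rq->cursor[t] + 1) % rq->count[t];
        return rq->ids[t][id];
    }
    return -1;
}
```

The "not perfectly fair" consequence is visible directly: as long as any Matter fiber is runnable, a Void fiber is never selected.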
Alternatives Rejected
| Option | Why Not |
|---|---|
| Linux CFS | Designed for multitenant fairness; preemption adds jitter |
| Rate Monotonic | Requires static scheduling; less flexible for dynamic workloads |
| Custom from scratch | High development risk, debugging nightmare |
Consequences
- Simpler scheduler (fewer lines = fewer bugs)
- Predictable: high-priority fibers always run first
- No starvation: Gravity/Void get time after Photon/Matter sleep
- Not perfectly fair (background jobs may starve under sustained foreground load)
- Long-running fibers must yield() voluntarily
A4: No Microkernel Message-Passing
Status: Accepted
Context
seL4 routes all inter-process communication through the kernel via formal RPC. This enables formal verification but adds ~1 µs overhead per message. Nexus already separates capability checking from data transfer.
Decision
The kernel does not implement RPC or message-passing:
- Kernel only performs capability checks + side effects (memory allocation, pledge verification)
- Data exchange happens via shared ION Rings — userland responsibility
- Kernel is completely passive (no mediator, no routing, no serialization)
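The kernel's entire role in IPC reduces to a predicate: verify the caller's pledged capabilities, then perform the side effect. A sketch of that check — the pledge names are hypothetical, and note that no data movement appears anywhere:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical pledge bits; the actual capability vocabulary is not
 * specified here. Data transfer itself never enters the kernel — it
 * happens in userland over shared ION Rings. */
enum Pledge {
    PLEDGE_ION_TX = 1u << 0,    /* may transmit on its TX ring */
    PLEDGE_ION_RX = 1u << 1,    /* may receive on its RX ring  */
    PLEDGE_SPAWN  = 1u << 2,    /* may create new fibers       */
};

typedef struct Caps { uint32_t pledged; } Caps;

/* Capability check: a pure predicate over the pledge mask. This, plus
 * the resulting side effect (e.g. mapping a ring), is all the kernel
 * contributes to IPC — no routing, no serialization. */
static bool cap_allows(const Caps *c, uint32_t need) {
    return (c->pledged & need) == need;
}
```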
Alternatives Rejected
| Option | Why Not |
|---|---|
| seL4-style RPC | Every message = hypercall, ~1 µs overhead each |
| Shared memory without kernel help | Easy to deadlock, hard to debug |
| Hybrid (RPC for control, rings for data) | Complexity without clear benefit over pure rings |
Consequences
- Kernel stays tiny (no RPC logic, no message routing)
- Verification scope smaller (only capability checks, not datapath)
- IPC latency is nanoseconds, not microseconds
- Applications must implement their own wire protocols
- Debugging requires understanding both capability system and app-level protocols