
Kernel Decisions

Ten decisions that define how Rumpk works — and why it doesn't work like Linux.

K1: Unikernel Over Monolithic Kernel

Status: Accepted

Context

Traditional monolithic kernels mix isolation boundaries with scheduling logic: 2000+ cycle context switches, hundreds of syscalls, and an ambient-authority security model. Microkernels (seL4, MINIX) trade performance for formal verification. Neither fits a system designed to ship the same binary to 8-bit MCUs and Mars rovers.

Decision

Rumpk is a modular unikernel: single address space, no process boundary, application carries its own OS logic via libnexus.a (the Membrane layer).

Alternatives Rejected

  • Linux monolith: ~2000 cycle syscall overhead, /proc complexity, ambient authority model
  • Full microkernel (seL4): 10-100ms latencies; formal verification overhead not justified for commercial systems
  • Container stack (Docker): requires an underlying OS, 200MB+ runtime overhead per container

Consequences

  • Zero kernel context switches for I/O (2-5 cycle latency)
  • Deterministic, verifiable execution paths
  • POSIX compatibility requires Membrane translation layer
  • Each application instance includes its own kernel code (~280KB)

K2: Zig HAL + Nim Kernel Logic

Status: Accepted

Context

Hardware interactions demand precise memory layouts, volatile access, and atomic operations. Kernel scheduling and fiber management benefit from higher-level abstractions like pattern matching and ARC memory management. One language forces compromise in either direction.

Decision

Split the kernel into two language domains:

  • hal/ (Zig): Direct hardware — UART, VirtIO, GIC, MMU, interrupt vectors
  • core/ (Nim): Scheduler, fibers, channels, VFS, capability enforcement

They meet at the HAL struct — a C-compatible function pointer table.

Alternatives Rejected

  • All Zig: requires a custom async/await runtime; loses Nim's fiber elegance
  • All Nim: low-level operations become brittle; inline asm escapes are unsafe
  • C for HAL: as verbose as Zig but more fragile (no comptime, no safety)

Consequences

  • Clean separation: Physics (Zig) vs. Logic (Nim)
  • Nim fibers survive HAL panics (channels persist at fixed addresses)
  • Future Janus language can slot into either layer
  • Two runtime ecosystems to maintain; ABI discipline required

K3: 12 Frozen Syscalls

Status: Accepted

Context

POSIX defines 400+ syscalls. Each is a security boundary crossing and a context switch point. Most are vestigial. Nexus routes all bulk I/O through ION Ring buffers — the kernel only needs a handful of control operations.

Decision

Exactly 12 syscalls + 1 meta-slot. Frozen forever:

pledge, unveil, read(0), write(1), map, unmap, spawn, kill, yield, time, random, halt

All data I/O goes through ION Rings (zero context switch). The syscall ABI is compiled into binaries at link time — no versioning, no compat layers.

Alternatives Rejected

  • Full POSIX: 400+ audit points, each with versioning and compat requirements
  • seL4 model: 100+ capability operations, more complex interface
  • Plan 9 RPC: everything-as-RPC is flexible but adds latency

Consequences

  • Formal verification feasible (12 entry points vs. 400+)
  • Security surface trivially auditable
  • Binary ABI stable forever (no libc version wars)
  • Legacy applications require Membrane translation
  • No backfill possible — missed use cases stay missed

K4: ION Rings Over Pipes and Sockets

Status: Accepted

Context

POSIX pipes are byte-streams requiring syscalls per read/write. Sockets add TCP/IP overhead for local communication. Financial exchanges proved the Disruptor pattern achieves nanosecond latencies with zero allocations.

Decision

All IPC uses Disruptor-style ring buffers (ION Rings):

  • 128-byte fixed header + dynamic payload in shared memory
  • Pre-allocated at boot — no dynamic allocation
  • Each fiber gets dedicated RX/TX rings
  • Backpressure via atomic head/tail pointers

Alternatives Rejected

  • POSIX pipes: byte-oriented, syscall per operation, 2-5 µs minimum latency
  • POSIX sockets: TCP overhead, kernel routing, ARP lookups for local IPC
  • Actor model (Erlang): elegant but requires a heavyweight runtime (50MB Erlang VM)

Consequences

  • ~100ns latency for local IPC (vs. 2-5 µs for pipes)
  • Zero-copy with proper buffer ownership
  • Automatic backpressure (no overflow, no unbounded queues)
  • Fixed-size rings require capacity pre-planning
  • Buffer ownership semantics require discipline

K5: Tickless Event-Driven Scheduler

Status: Accepted

Context

Traditional schedulers wake the CPU every 1-10ms via timer interrupt. This prevents deep sleep states and wastes power even when the system is completely idle. Modern hardware (V-Sync, DMA completion, network IRQs) provides natural event signals.

Decision

No periodic scheduler tick. The scheduler is an ISR, not a loop:

  • When RunQueue is empty: CPU executes WFI, power drops to leakage levels
  • Scheduling occurs only as side-effect of hardware interrupts
  • Wakes on V-Sync (120Hz), packet arrival, DMA completion
  • Software timeouts registered as explicit deadline events

Alternatives Rejected

  • Fixed tick (1ms): forces idle wake-ups, 1-3% CPU overhead when idle
  • Dynamic tick (Linux NO_HZ): heuristic tuning adds complexity without dramatic benefit
  • Per-app event loop: requires application awareness, not transparent

Consequences

  • 5-10% power reduction on idle (mobile/embedded)
  • Events trigger immediate scheduler response (deterministic latency)
  • No fairness pathology (Spectrum model prevents starvation)
  • Software timeouts must be explicitly registered
  • Some batch workloads lose natural fairness preemption points

K6: Fibers Over Processes and Threads

Status: Accepted

Context

Processes cost 1-5 MB RAM and 1-5 µs per context switch. Threads share memory but invite race conditions and deadlocks. Nexus operates in a single address space — process isolation provides no benefit, only cost.

Decision

All concurrency uses fibers (Nim async/await):

  • ~1 KB RAM per fiber (vs. 1-5 MB per process)
  • Cooperative switching: ~100-200 ns (vs. 1-5 µs for context switch)
  • No preemption within a priority level — no race conditions by construction
  • Spectrum-based fairness: Photon > Matter > Gravity > Void

Alternatives Rejected

  • Heavy processes: isolation already provided by capabilities; still pay the full context switch
  • Shared-memory threads: lock-free data structures are hard; deadlocks and races omnipresent
  • Actor model (Erlang): elegant but 50MB base runtime; too heavy for a 280KB kernel

Consequences

  • Thousands of fibers practical (MCUs can run 20+, desktops run 10,000+)
  • No race conditions — shared memory with cooperative scheduling
  • 4KB default stack per fiber
  • No true parallelism on single core (concurrency only)
  • Long-running fibers must yield voluntarily

A1: Single Address Space

Status: Accepted

Context

Multiple address spaces (one per process) provide isolation but require TLB flushes on context switch (100-500 cycles). In a unikernel where isolation comes from capabilities, not virtual address separation, the overhead is pure waste.

Decision

One virtual address range for the entire system:

  • Cell A: 0x08000000-0x0FFFFFFF
  • Cell B: 0x10000000-0x17FFFFFF
  • Isolation is physical (PMP registers / MMU page tables), not VA ranges
  • No TLB flush needed when switching between cells

Alternatives Rejected

  • Per-process address spaces: TLB flush = 100-500 cycles per switch
  • ASID tagging: mitigates flushes but a complex hardware requirement, not universal
  • No isolation: fast but insecure

Consequences

  • Zero TLB flush overhead on cell switches
  • Simpler kernel (no VA→PA remapping per cell)
  • Shared read-only pages between cells (deduplication)
  • Address space must be statically partitioned at boot
  • Some hardware (ARM Realm Management) conflicts with this model

A2: SysTable Frozen ABI

Status: Accepted

Context

Dynamic registration (UEFI-style) is flexible but requires boot negotiation. Fixed addresses are simple and can be compiled into binaries at link time. The kernel only needs 12 function pointers.

Decision

SysTable at a fixed physical address, immutable layout:

  • RISC-V 64: 0x83000000
  • ARM64: 0x50000000

240 bytes. 12 syscall pointers + 1 meta-slot + ring buffer descriptors + framebuffer info. The layout never changes across versions.

Alternatives Rejected

  • Dynamic registration: boot complexity, discovery protocol needed
  • Multiple SysTables per cell: memory overhead, harder to debug
  • Syscalls only (no table): loses the zero-copy ring optimization

Consequences

  • Apps hardcode SysTable address — no discovery code
  • Binary ABI stable forever (NPLs compiled against v0.1 run on v99.0)
  • Runtime structure validation trivial (size check)
  • Cannot extend beyond 240 bytes
  • Wrong address = hard crash (no fallback)

A3: DragonflyBSD LWKT Scheduler Model

Status: Accepted

Context

Linux CFS optimizes for fairness (O(log n), preemptive). DragonflyBSD LWKT optimizes for predictability (priority-based, cooperative). Nexus needs deterministic latency, not max fairness.

Decision

Adopt DragonflyBSD's Lightweight Kernel Thread model:

  • Fixed priority per fiber (Spectrum tier: Photon > Matter > Gravity > Void)
  • Round-robin within same priority level
  • Cooperative: fibers yield voluntarily, no preemption overhead
  • Data moves between cores via ION Ring messages (no shared-memory mutexes)

Alternatives Rejected

  • Linux CFS: designed for multitenant fairness; preemption adds jitter
  • Rate Monotonic: requires static scheduling; less flexible for dynamic workloads
  • Custom from scratch: high development risk, debugging nightmare

Consequences

  • Simpler scheduler (fewer lines = fewer bugs)
  • Predictable: high-priority fibers always run first
  • No starvation in practice: Gravity/Void get time once Photon/Matter sleep
  • Not perfectly fair: background jobs may starve under sustained foreground load
  • Long-running fibers must yield() voluntarily

A4: No Microkernel Message-Passing

Status: Accepted

Context

seL4 routes all inter-process communication through the kernel via formal RPC. This enables formal verification but adds ~1 µs overhead per message. Nexus already separates capability checking from data transfer.

Decision

The kernel does not implement RPC or message-passing:

  • Kernel only performs capability checks + side effects (memory allocation, pledge verification)
  • Data exchange happens via shared ION Rings — userland responsibility
  • Kernel is completely passive (no mediator, no routing, no serialization)

Alternatives Rejected

  • seL4-style RPC: every message = hypercall, ~1 µs overhead each
  • Shared memory without kernel help: easy to deadlock, hard to debug
  • Hybrid (RPC for control, rings for data): complexity without clear benefit over pure rings

Consequences

  • Kernel stays tiny (no RPC logic, no message routing)
  • Verification scope smaller (only capability checks, not datapath)
  • IPC latency is nanoseconds, not microseconds
  • Applications must implement their own wire protocols
  • Debugging requires understanding both capability system and app-level protocols