Restart Trap — NPL Immortality Core

Phase 6.2

The Restart Trap is the survivability subsystem that makes NPL fibers unkillable. When a fiber panics – whether from a radiation-induced bitflip, a page fault, or a budget violation – the system traps the failure, logs forensic data, and respawns the fiber from its immutable .npk image with a fresh capability space. No halt. No reboot. No operator intervention.

This is the core of the "unkillable" promise for satellite and aerospace deployments where physical access is impossible and downtime is mission failure.

Design Principles

  1. L0 independence. The trap handler and error buffer live entirely in Zig (L0 HAL). They have zero Nim dependencies. If the L1 kernel runtime is broken, the trap handler still functions.
  2. Deterministic recovery. The respawn sequence is fixed, bounded, and auditable. No dynamic allocation, no garbage collection, no runtime decisions that vary between executions.
  3. Zero capability leak. When a fiber restarts, its old capabilities are atomically revoked via epoch bump before any new capabilities are granted. There is no window where stale capabilities could be exercised.
  4. Circuit breaker. A fiber that panics repeatedly within a time window is quarantined, not restarted. This prevents death spirals in power-constrained environments.

Architecture

The Restart Trap is a three-component pipeline in L0, with notification hooks into L1:

┌─────────────────────────────────────────────────────┐
│  NPL Fiber (User Mode)                              │
│  ┌───────────────┐                                  │
│  │ Application   │ ── panic/fault ──┐               │
│  └───────────────┘                  │               │
├─────────────────────────────────────┼───────────────┤
│  L0 TRAP LAYER (Zig)               ▼               │
│  ┌──────────────────────────────────────┐           │
│  │  trap.zig — Panic Trap Handler       │           │
│  │  Classifies fault, checks circuit    │           │
│  │  breaker, returns verdict            │           │
│  └──────────┬───────────────────────────┘           │
│             │                                       │
│  ┌──────────▼───────────────────────────┐           │
│  │  beb.zig — Boot Error Buffer         │           │
│  │  Append-only, tamper-evident ring    │           │
│  │  XXH64-chained entries, TMR metadata │           │
│  └──────────────────────────────────────┘           │
│                                                     │
│  ┌──────────────────────────────────────┐           │
│  │  respawn.zig — Respawn Engine        │           │
│  │  Sterilize → Verify → Reload →       │           │
│  │  Re-enter (cold restart)             │           │
│  └──────────────────────────────────────┘           │
├─────────────────────────────────────────────────────┤
│  L1 KERNEL (Nim)                                    │
│  Notified after the fact via STL events.            │
│  Does not authorize or block respawn.               │
└─────────────────────────────────────────────────────┘

Exception Routing

The existing trap handler in entry_riscv.zig (rss_trap_handler) currently routes all exceptions to the Nim kernel's k_handle_exception, which halts the system in an infinite wfi/wfe loop (it never returns). The Restart Trap changes this routing based on the faulting privilege level.

RISC-V Routing (entry_riscv.zig)

The SPP bit in the saved sstatus register (available in frame.sstatus) indicates the privilege level at the time of the trap:

```zig
// In rss_trap_handler, AFTER existing interrupt handling,
// REPLACE the unconditional k_handle_exception call:

const spp = (frame.sstatus >> 8) & 1;

if (spp == 0) {
    // User-mode fault (NPL fiber) — recoverable via L0
    const fiber_id = trap.get_current_fiber_id();
    const verdict = trap.npl_trap_handler(
        fiber_id, scause, frame.sepc, frame.stval, hal_get_time_ns(),
    );
    switch (verdict) {
        .Respawn => respawn.respawn_fiber(fiber_id),
        .Quarantine => trap.mark_fiber_dormant(fiber_id),
    }
    // Return from trap — scheduler picks up next fiber
} else {
    // Kernel-mode fault — not recoverable
    // k_handle_exception halts; this is intentional
    k_handle_exception(scause, frame.sepc, frame.stval);
}
```

Key point: for user-mode faults, k_handle_exception is never called. The L0 trap handler owns the entire recovery flow. L1 observes the result via STL events after the fact.

ARM64 Routing (entry_aarch64.zig)

The equivalent check on ARM64 reads SPSR_EL1.M[3:0]:

```zig
const spsr = frame.spsr_el1;
const mode = spsr & 0xF;

if (mode == 0x0) {
    // EL0 fault (NPL fiber) — recoverable via L0
    // Same verdict logic as RISC-V
} else {
    // EL1 fault (kernel) — not recoverable
    k_handle_exception(esr_el1, frame.elr_el1, frame.far_el1);
}
```

ARM64 uses different register names (ESR_EL1 for syndrome, ELR_EL1 for return address, FAR_EL1 for fault address) but the flow is identical. Full ARM64 routing is a follow-up deliverable after RISC-V is validated on QEMU.

Fiber ID in L0 — The Handoff Variable

The trap handler must know which fiber was running when the fault occurred, without calling into Nim. This is solved with a static handoff variable in L0:

```zig
// In trap.zig — written by L1, read by L0
pub var current_fiber_id: u64 = 0;

/// Called by L1 (Nim) before entering userland
pub export fn trap_set_current_fiber(fiber_id: u64) void {
    current_fiber_id = fiber_id;
}

/// Called by L0 trap handler — no Nim dependency
pub fn get_current_fiber_id() u64 {
    return current_fiber_id;
}
```

L1 calls trap_set_current_fiber(fiber.id) immediately before hal_enter_userland(). The trap handler reads it without crossing the L0/L1 boundary.

Bounds check: The trap handler validates current_fiber_id < MAX_FIBERS before any array access. An out-of-bounds fiber ID indicates kernel memory corruption — the system halts via k_handle_exception since recovery is not possible.
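A minimal sketch of this guard, assuming `MAX_FIBERS` and `k_handle_exception` are visible from trap.zig (names illustrative):

```zig
const MAX_FIBERS: u64 = 16;

/// Illustrative sketch: validate the handoff variable before any
/// per-fiber array (circuit breakers, NPK registry) is indexed.
fn checked_fiber_id() u64 {
    const id = get_current_fiber_id();
    if (id >= MAX_FIBERS) {
        // The handoff variable itself is untrustworthy, which implies
        // kernel memory corruption. Recovery is impossible: halt.
        k_handle_exception(0, 0, 0); // never returns
        unreachable;
    }
    return id;
}
```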

Boot Error Buffer (BEB)

The BEB is a fixed-size, append-only ring buffer in a compile-time known physical memory region. It records every panic, respawn attempt, signature failure, and checkpoint rejection. It survives soft reboots and can be downlinked for ground station analysis.

Memory Layout

BEB Region (32 KB fixed, no dynamic allocation)
┌─────────────────────────────────────────────┐
│  Header (TMR — 3 identical copies)          │
│  ┌───────────┬───────────┬───────────┐      │
│  │ Copy A    │ Copy B    │ Copy C    │      │
│  └───────────┴───────────┴───────────┘      │
│  64 bytes (3 × 20 bytes + 4 pad)            │
├─────────────────────────────────────────────┤
│  Entry Ring [0..510]                        │
│  Each entry: 64 bytes (cache-line aligned)  │
│  Total: 511 entries                         │
│  (32768 - 64) / 64 = 511                   │
└─────────────────────────────────────────────┘

TMR Header Structure

Each of the three header copies has this layout:

| Field | Type | Size | Purpose |
|---|---|---|---|
| magic | u32 | 4 bytes | 0x42454221 ("BEB!") |
| head | u16 | 2 bytes | Next write index (0..510) |
| count | u32 | 4 bytes | Total panics since boot (may exceed ring size) |
| _reserved | u16 | 2 bytes | Alignment padding |
| chain_hash | u64 | 8 bytes | XXH64 of the most recent entry (chain tail) |
| **Per-copy total** | | 20 bytes | |

Three copies = 60 bytes. Padded to 64 bytes for cache-line alignment.

Majority vote: On read, all three copies are compared. If two agree, their values are used and the third is corrected. If all three disagree, the BEB is marked corrupt and a BebCorrupt event is emitted — the system can still append new entries but old data is untrusted.
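The voting rule can be sketched as a small helper over one metadata field, with in-place scrubbing of the outvoted copy (a sketch; function names are illustrative):

```zig
/// Majority vote over one metadata field across the three header copies.
/// Returns the agreed value, or null when all three disagree.
fn vote(a: u64, b: u64, c: u64) ?u64 {
    if (a == b or a == c) return a; // a has at least one partner
    if (b == c) return b;           // a is the corrupted copy
    return null;                    // triple disagreement: BEB corrupt
}

/// Usage sketch: read the head index, correcting the losing copy.
fn read_head(copies: *[3]u64) ?u64 {
    const v = vote(copies[0], copies[1], copies[2]) orelse return null;
    for (copies) |*c| c.* = v; // scrub: rewrite all copies to the winner
    return v;
}
```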

BEB Entry Structure

| Field | Type | Size | Purpose |
|---|---|---|---|
| prev_hash | u64 | 8 bytes | XXH64 chain link to previous entry |
| fiber_id | u64 | 8 bytes | Which fiber panicked |
| panic_class | u8 | 1 byte | Classified fault type (PanicClass enum) |
| _pad | [3]u8 | 3 bytes | Alignment |
| respawn_count | u32 | 4 bytes | How many times this fiber has restarted |
| scause | u64 | 8 bytes | Raw hardware cause register |
| sepc | u64 | 8 bytes | Faulting program counter |
| stval | u64 | 8 bytes | Fault address or instruction |
| timestamp_ns | u64 | 8 bytes | Nanoseconds since boot |
| entry_hash | u64 | 8 bytes | XXH64 of prev_hash \|\| entry_data |
| **Total** | | 64 bytes | Cache-line aligned |

Integrity Guarantees

Hash chain. Each entry's entry_hash = XXH64(prev_hash || entry_data[0..56]). If any entry is corrupted by a bitflip, walking the chain detects the exact corruption point. Ground station tooling verifies the chain on downlink.

Why XXH64, not a cryptographic hash. The threat model is radiation-induced random corruption, not adversarial forgery. XXH64 provides excellent avalanche properties (a single bitflip changes ~50% of hash bits) with a collision probability of ~2^-64 per entry — astronomically sufficient for detecting random corruption in a 511-entry ring. Zig's std.hash.XxHash64 returns u64 natively, making this zero-overhead. If BEB entries are exported for ProvChain audit, Ed25519 signatures are added at export time — not in the hot path.

TMR metadata. The header (head index, total count, chain tail hash) is stored in triplicate. Reads use majority voting — two copies must agree. This protects the navigation structure of the BEB. Entry corruption is caught by the hash chain; metadata corruption is caught by TMR.

No locks. The BEB is single-writer (the trap handler runs in exception context with interrupts disabled). Readers walk the chain without synchronization.
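The chain computation and walk can be sketched as follows, assuming Zig's std.hash.XxHash64 (the standard-library XXH64 implementation), little-endian entries viewed as 64 raw bytes, and entry_hash stored in the last 8 bytes (offset 56):

```zig
const std = @import("std");

/// Sketch of the per-entry chain hash:
/// entry_hash = XXH64(prev_hash || entry_data[0..56]).
fn chain_hash(prev_hash: u64, entry_bytes: *const [64]u8) u64 {
    var h = std.hash.XxHash64.init(0);
    h.update(std.mem.asBytes(&prev_hash));
    h.update(entry_bytes[0..56]); // everything except the trailing entry_hash
    return h.final();
}

/// Walk the ring from the oldest entry. Returns the index of the first
/// corrupted entry, or null when the chain verifies end to end.
fn verify_chain(entries: []const [64]u8, genesis: u64) ?usize {
    var prev = genesis;
    for (entries, 0..) |*e, i| {
        const stored = std.mem.readInt(u64, e[56..64], .little);
        if (chain_hash(prev, e) != stored) return i;
        prev = stored;
    }
    return null;
}
```

The same walk, ported to ground station tooling, localizes corruption to an exact entry on downlink.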

Panic Trap Handler (trap.zig)

The trap handler classifies the fault, checks the circuit breaker, logs to BEB, emits STL events, and returns a verdict.

Panic Classification

| PanicClass | Value | Source |
|---|---|---|
| PageFault | 0 | Load/store/instruction page fault |
| IllegalInsn | 1 | Illegal instruction (possible bitflip in code segment) |
| Alignment | 2 | Misaligned memory access |
| BudgetExhausted | 3 | Spectrum Ratchet kill — fiber exceeded budget |
| StackOverflow | 4 | Guard page hit at stack boundary |
| NimPanic | 5 | L1-forwarded panic from Nim runtime |
| Ecall | 6 | Invalid syscall number |
| Unknown | 7 | Unrecognized scause value |
| (reserved) | 8-15 | Reserved for future classification |

Reserved values for future anomaly detection integration (Phase 6.3+):

| PanicClass | Value | Source |
|---|---|---|
| BehaviorAnomaly | 16 | Statistical deviation detected |
| IntegrityFault | 17 | CSpace or capability corruption |

Verdict

The trap handler returns one of two verdicts:

| Verdict | Meaning | Action |
|---|---|---|
| Respawn | Under circuit breaker threshold | respawn.zig restarts the fiber |
| Quarantine | Over threshold — death spiral detected | Fiber marked dormant, FiberQuarantine event emitted |

Kernel faults (SPP=1) are routed to k_handle_exception before the verdict is computed — they never reach npl_trap_handler. Similarly, an out-of-bounds fiber_id triggers an immediate halt before the verdict path.
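The verdict type and handler entry point can be sketched as follows (a sketch; `classify_panic`, `beb_append`, and `breaker_check` are illustrative helper names, though the `npl_trap_handler` signature matches the routing code above):

```zig
pub const Verdict = enum(u8) {
    Respawn,
    Quarantine,
};

/// Sketch of the L0 entry point called from the architecture trap stub.
/// All parameters are raw register values — no Nim types cross here.
pub fn npl_trap_handler(
    fiber_id: u64,
    scause: u64,
    sepc: u64,
    stval: u64,
    now_ns: u64,
) Verdict {
    const class = classify_panic(scause, stval); // PanicClass from the table
    beb_append(fiber_id, class, scause, sepc, stval, now_ns); // forensic log
    return breaker_check(fiber_id, now_ns); // .Respawn or .Quarantine
}
```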

Circuit Breaker

The circuit breaker prevents infinite restart loops. It tracks per-fiber panic frequency within a sliding time window.

Circuit breaker entry structure:

| Field | Type | Size | Purpose |
|---|---|---|---|
| panic_count | u32 | 4 bytes | Panics within current window |
| first_panic_ns | u64 | 8 bytes | Window start timestamp |
| last_panic_ns | u64 | 8 bytes | Most recent panic timestamp |
| quarantined | bool | 1 byte | Currently quarantined? |
| _pad | [3]u8 | 3 bytes | Alignment |
| **Total** | | 24 bytes | |

Statically allocated: var breakers: [16]CircuitBreaker — one per fiber, indexed by fiber_id. 384 bytes total. No heap.

Rules:

  • Threshold: 5 panics within 10 seconds triggers quarantine
  • Window sliding: When now - first_panic_ns > 10s, reset the window. Old panics age out. A fiber that panics once per hour restarts indefinitely — that is the correct behavior for transient radiation events.
  • Quarantine release: Manual (ground station command) or timed (configurable hold-off period). Released via FiberUnquarantine event. The release mechanism is specified in the telemetry command interface (out of scope for this spec — see Capsule LCC protocol).
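The rules above reduce to a short check, sketched here with the spec's constants (5 panics / 10 s, 16 static breakers; the `Verdict` enum and helper name are illustrative):

```zig
const std = @import("std");

const WINDOW_NS: u64 = 10 * std.time.ns_per_s; // 10 s sliding window
const THRESHOLD: u32 = 5;                      // 5th panic in window quarantines

const Verdict = enum { Respawn, Quarantine };

const CircuitBreaker = struct {
    panic_count: u32 = 0,
    first_panic_ns: u64 = 0,
    last_panic_ns: u64 = 0,
    quarantined: bool = false,
};

var breakers = [_]CircuitBreaker{.{}} ** 16; // one per fiber, no heap

fn breaker_check(fiber_id: u64, now_ns: u64) Verdict {
    const b = &breakers[fiber_id];
    if (b.quarantined) return .Quarantine; // already dormant: no count change
    if (now_ns - b.first_panic_ns > WINDOW_NS) {
        // Old panics age out; start a fresh window at this panic.
        b.panic_count = 0;
        b.first_panic_ns = now_ns;
    }
    b.panic_count += 1;
    b.last_panic_ns = now_ns;
    if (b.panic_count >= THRESHOLD) {
        b.quarantined = true;
        return .Quarantine;
    }
    return .Respawn;
}
```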

Respawn Engine (respawn.zig)

When the trap handler returns Respawn, the respawn engine executes a deterministic four-step recovery. (Checkpoint restore is deferred to Phase 6.3 — this phase implements cold restart only.)

Step 1: Sterilize

  • Revoke all capabilities in the fiber's CSpace via revoke_all() (epoch bump)
  • Zero the fiber's user-mode stack pages
  • Flush TLB entries for the fiber's SATP

Why sterilize first: The old fiber's capabilities might include Channel access to other fibers' ring buffers. If we reload ELF first, there is a window where the half-initialized fiber could be preempted and use stale capabilities to write garbage into another fiber's channel. revoke_all() with epoch bump atomically invalidates everything.

Step 2: Verify

  • Locate the fiber's .npk image in the immutable store (initrd or flash) via the NPK Registry
  • Verify Ed25519 signature over the entire ELF binary using hal_crypto_ed25519_verify
  • If signature fails: log to BEB (SignatureReject event), quarantine the fiber, stop

A signature failure is a critical anomaly — either flash memory is degrading or the image has been tampered with. Either way, that fiber must not run.

Step 3: Reload

  • Load ELF segments into the fiber's physical memory via the L0 ELF loader (see sub-deliverable below)
  • Extract BKDL manifest from .nexus.manifest ELF section
  • Apply capabilities from the manifest into a fresh CSpace
  • Reset fiber metadata: state=Ready, budget=default, violations=0

After this step, the fiber is indistinguishable from a first boot.

Step 4: Re-enter

  • Call hal_enter_userland(entry, SYSTABLE_BASE, sp) to transition to user mode
  • Fiber begins execution from its ELF entry point (cold restart)
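The four steps compose into one deterministic sequence, sketched below. Every helper name (`cspace_revoke_all`, `npk_lookup`, `ed25519_verify`, `elf_load`, and the rest) is illustrative; failure at any step quarantines the fiber rather than running untrusted code:

```zig
/// Sketch of the cold-restart sequence (helper names illustrative).
pub fn respawn_fiber(fiber_id: u64) void {
    // 1. Sterilize: epoch bump atomically revokes every capability,
    //    then wipe the old stack and stale translations.
    cspace_revoke_all(fiber_id);
    zero_user_stack(fiber_id);
    flush_tlb_for(fiber_id);

    // 2. Verify: the immutable .npk image must carry a valid signature.
    const img = npk_lookup(fiber_id) orelse return quarantine(fiber_id);
    if (!ed25519_verify(img.bytes, img.pubkey)) {
        beb_log_signature_reject(fiber_id); // SignatureReject event
        return quarantine(fiber_id);
    }

    // 3. Reload: PT_LOAD segments, BKDL manifest, fresh CSpace.
    const entry = elf_load(img.bytes, fiber_id) orelse return quarantine(fiber_id);
    apply_manifest_caps(fiber_id, img.bytes);
    reset_fiber_metadata(fiber_id); // state=Ready, budget=default, violations=0

    // 4. Re-enter at the ELF entry point (cold restart).
    hal_enter_userland(entry, SYSTABLE_BASE, initial_sp(fiber_id));
}
```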

Sub-deliverable: L0 ELF Loader

The existing ELF loader lives in Nim (kernel.nim:kload_phys). Since the respawn engine must operate when L1 is potentially broken, a minimal Zig ELF loader is required as a sub-deliverable of this phase. Scope:

  • Parse ELF header and program headers
  • Walk PT_LOAD segments only (no dynamic linking, no relocations beyond simple delta)
  • Copy segments to physical memory at fiber's offset
  • Zero BSS
  • Flush I-cache (fence.i on RISC-V, ic iallu on ARM64)
  • Return entry point address

Estimated: ~150 lines of Zig. The Nim loader remains the primary path for initial boot; the L0 loader is the fallback path for respawn.
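The PT_LOAD walk can be sketched with Zig's std.elf definitions, assuming a little-endian ELF64 image and a simple copy into the fiber's physical window (bounds checks abbreviated; `load_base` and the function name are illustrative):

```zig
const std = @import("std");
const elf = std.elf;

/// Minimal PT_LOAD loader sketch. Returns the ELF entry point.
fn elf_load(image: []const u8, load_base: [*]u8) !u64 {
    const ehdr = @as(*const elf.Elf64_Ehdr, @ptrCast(@alignCast(image.ptr)));
    if (!std.mem.eql(u8, ehdr.e_ident[0..4], "\x7fELF")) return error.BadMagic;

    var i: u16 = 0;
    while (i < ehdr.e_phnum) : (i += 1) {
        const off = ehdr.e_phoff + @as(u64, i) * ehdr.e_phentsize;
        if (off + @sizeOf(elf.Elf64_Phdr) > image.len) return error.Truncated;
        const ph = @as(*const elf.Elf64_Phdr, @ptrCast(@alignCast(image.ptr + off)));
        if (ph.p_type != elf.PT_LOAD) continue; // no dynamic linking, no relocs

        const dst = load_base + ph.p_vaddr;
        @memcpy(dst[0..ph.p_filesz], image[ph.p_offset..][0..ph.p_filesz]);
        @memset(dst[ph.p_filesz..ph.p_memsz], 0); // zero BSS
    }
    // Caller flushes the I-cache (fence.i / ic iallu) before re-entry.
    return ehdr.e_entry;
}
```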

Checkpoint Support (Phase 6.4 — Deferred)

Warm restart from checkpoints is deferred to Phase 6.4 (Phase 6.3 is Adaptive FDIR). The checkpoint subsystem is fully specified in checkpoint-warm-restart.md and defines:

  • Checkpoint format (registers + stack + optional 32KB application blob, dual-copy BLAKE3)
  • Creation mechanism (SYS_CHECKPOINT syscall 0x200, fiber-initiated at safe points)
  • Storage location (dedicated 1.26 MB region at 0x8600_0000)
  • BLAKE3 integrity verification before restore

Phase 6.2 implements cold restart only. The respawn code paths include stubs that return no_checkpoint, ensuring the architecture is ready for Phase 6.4 without rework.

NPK Registry

The respawn engine needs to know which .npk image belongs to which fiber. The NPK Registry is a static mapping table populated at boot time:

| Field | Type | Size | Purpose |
|---|---|---|---|
| fiber_id | u64 | 8 bytes | Owning fiber |
| image_offset | u64 | 8 bytes | Offset into initrd/flash |
| image_size | u64 | 8 bytes | Size of the ELF binary |
| pubkey | [32]u8 | 32 bytes | Ed25519 public key for this .npk |
| path_hash | u64 | 8 bytes | XXH64 of the package path (for lookup) |

Fixed array of 16 entries (matching MAX_FIBERS). Immutable after boot.
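A sketch of the table and its lookup path (a linear scan is adequate at 16 entries; names illustrative):

```zig
const NpkEntry = extern struct {
    fiber_id: u64,
    image_offset: u64,
    image_size: u64,
    pubkey: [32]u8,
    path_hash: u64,
};

var registry: [16]NpkEntry = undefined; // populated once at boot, then immutable

/// Find the .npk image owned by a fiber; null if the slot was never populated.
fn npk_lookup(fiber_id: u64) ?*const NpkEntry {
    for (&registry) |*e| {
        if (e.fiber_id == fiber_id) return e;
    }
    return null;
}
```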

Ontology Extensions

The survivability subsystem adds new event kinds to the System Truth Ledger. These occupy the 60-69 range (survivability) and 80-89 range (reserved for anomaly detection).

Survivability Events (60-69)

| EventKind | Value | data0 | data1 | data2 |
|---|---|---|---|---|
| FiberPanic | 60 | scause | sepc | stval |
| FiberRespawn | 61 | respawn_count | | |
| FiberQuarantine | 62 | panic_count | window_ns | |
| FiberUnquarantine | 63 | | | |
| CheckpointWrite | 64 | xxh64_hash | | |
| CheckpointReject | 65 | | | |
| SignatureReject | 66 | | | |
| BebOverflow | 67 | overwritten_count | | |

Note: BebOverflow fires when the ring wraps and the oldest entry is overwritten. overwritten_count is the total number of entries overwritten since boot — this is expected behavior for a ring buffer, but the event signals to ground station that forensic data from the oldest panics is no longer available.

Anomaly Detection Seams (80-89, reserved for Phase 6.3+)

| EventKind | Value | Purpose |
|---|---|---|
| AnomalyDetected | 80 | Behavioral deviation score |
| BehaviorBaseline | 81 | New baseline established |
| BehaviorViolation | 82 | Pattern deviation without panic |
| IntegrityAudit | 83 | Periodic CSpace integrity check |

SystemStats Extension

The SystemStats struct in ontology.zig gains a new field:

```zig
survivability_events: u32,  // Count of EventKind 60-67 and reserved 80-83
```

The stl_get_stats switch statement handles the 60-67 and 80-83 ranges, incrementing survivability_events.
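The range handling is a single switch arm, sketched here with a minimal `SystemStats` stand-in (the real struct in ontology.zig has more fields):

```zig
const SystemStats = struct {
    survivability_events: u32 = 0,
    // ... existing fields elided ...
};

/// Sketch of the new arms added to the stl_get_stats switch.
fn tally(stats: *SystemStats, kind: u16) void {
    switch (kind) {
        60...67, 80...83 => stats.survivability_events += 1,
        else => {}, // existing ranges handled by the original arms
    }
}
```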

Causal Chain

Every survivability event carries cause_id pointing to the triggering event:

  • FiberRespawn.cause_id → the FiberPanic event that triggered it
  • FiberQuarantine.cause_id → the Nth FiberPanic that tripped the breaker
  • CheckpointReject.cause_id → the FiberRespawn that attempted restore

Ground station tooling reconstructs the full causal DAG from these links.

Telemetry Value

For satellite operations, the survivability events provide critical mission data:

  • Restart frequency per fiber correlated with orbital position maps radiation belt exposure
  • SignatureReject is a red alert — flash memory degradation or tampering requires immediate mission planning response
  • CheckpointReject means the fiber recovered but lost state — operations team must assess data continuity
  • BebOverflow indicates the system is panicking faster than the buffer can hold — a system-wide health signal
  • Event IDs (epoch << 32 | index) are monotonically increasing, enabling exact ordering reconstruction even if downlink packets arrive out of sequence

Files

New Files

| File | Purpose | Est. LOC | Dependencies |
|---|---|---|---|
| hal/trap.zig | Panic classification, circuit breaker, verdict | ~200 | beb.zig, ontology.zig |
| hal/beb.zig | Boot Error Buffer (TMR header, XXH64 chain, ring) | ~250 | std.hash.XxHash64 |
| hal/respawn.zig | Sterilize, verify, reload, re-enter | ~300 | cspace.zig, crypto.zig, abi.zig |
| hal/elf_loader.zig | Minimal ELF PT_LOAD segment loader for L0 | ~150 | (standalone) |

Modified Files

| File | Change |
|---|---|
| hal/entry_riscv.zig | Add SPP check in exception path; route user-mode faults to npl_trap_handler instead of k_handle_exception. See Exception Routing for diff sketch. |
| hal/entry_aarch64.zig | Add SPSR_EL1.M check; same routing for ARM64. Follow-up after RISC-V validation. |
| hal/ontology.zig | Add EventKind values 60-67 and 80-83 to the enum. Add survivability_events to SystemStats. Handle new ranges in stl_get_stats. |
| hal/abi.zig | Add to comptime block: `_ = @import("trap.zig");` and likewise for beb.zig, respawn.zig, elf_loader.zig. |
| build.zig | No changes needed — new files are pulled in via abi.zig comptime imports. |
| core/kernel.nim | Call trap_set_current_fiber(fiber.id) before hal_enter_userland(). No changes to k_handle_exception — it is simply not called for user-mode faults. |

Estimated total: ~900 lines of new Zig code (750 + 150 for L0 ELF loader).

Testing Strategy

Unit Tests (per-file, zig build test)

beb.zig:

  • Append entry, verify chain hash
  • Fill ring to capacity (511 entries), verify BebOverflow event on wrap
  • TMR metadata: corrupt one copy, verify majority vote reads correct value
  • TMR metadata: corrupt two copies, verify detection (no silent wrong answer)
  • Entry with known bitflip: chain walk detects corruption location
  • Verify entry size is exactly 64 bytes (comptime assert)

trap.zig:

  • Each PanicClass from each scause value (table-driven, every RISC-V exception code)
  • fiber_id bounds check: value 16 → immediate halt path, not verdict
  • Circuit breaker: 4 panics in window → Respawn verdict
  • Circuit breaker: 5 panics in window → Quarantine verdict
  • Circuit breaker: panics spread across two windows → both Respawn (no false quarantine)
  • Window expiry: old panics age out after 10s, counter resets
  • Already-quarantined fiber: verify Quarantine verdict without incrementing count

respawn.zig:

  • CSpace empty after sterilize (all 64 slots Null, epoch incremented)
  • Valid Ed25519 signature → Success
  • Invalid signature (one bit flipped) → SignatureInvalid, BEB entry written
  • Out-of-bounds fiber_id → LoaderFailed
  • CSpace full after manifest apply → CSpaceFull result

elf_loader.zig:

  • Valid ELF with single PT_LOAD → segments copied, entry returned
  • Valid ELF with multiple PT_LOAD + BSS → all segments, BSS zeroed
  • Invalid ELF magic → error
  • Truncated program headers → error
  • Entry point within expected range → pass; outside → error

Integration Tests

  • Full panic → respawn cycle: inject page fault → verify BEB entry → verify fiber respawned → verify fresh CSpace → verify STL causal chain (FiberPanic.event_id == FiberRespawn.cause_id)
  • Death spiral quarantine: 5 rapid panics → verify fiber quarantined → verify no 6th respawn → verify FiberQuarantine event
  • Signature corruption: flip one bit in .npk → trigger respawn → verify SignatureReject → verify quarantine
  • BEB chain integrity: write 20 entries, corrupt entry 10, verify chain walk identifies entry 10

Architecture Coverage

All tests run on RISC-V 64 and ARM64 targets via QEMU. The classify_panic function has architecture-specific scause/ESR mappings — both architectures are tested.

Related Specifications

  • SPEC-020: Capability Algebra — CSpace revocation on respawn
  • SPEC-051: Capability Space — Per-fiber capability tables
  • SPEC-060: System Ontology — STL event framework
  • RFC-0649: EPOE — System survival through restart
  • RFC-0110: Membrane Agent — Future anomaly detection integration (Phase 6.3+)