Autonomous FDIR — Adaptive Dual-Layer Anomaly Detection
Phase 6.3
Fault Detection, Isolation, and Recovery (FDIR) with independent multi-layer verification. Two observers – L0 (Zig) and L1 (Nim) – reach conclusions independently, neither trusting the other. Maps to ECSS-E-ST-70-41C requirements for onboard autonomy.
Design Principles
- L0 verdicts are authoritative. If L0 says "quarantine fiber X" and L1 disagrees, L0 wins. Always. L1 can recommend actions via the verdict ring; L0 has veto power.
- Trust flows down. Hardware timer → L0 → L1 → userland. Information flows up, authority flows down.
- Zero cost when disabled. Every FDIR check gates on a cached feature mask. Disabled features cost one register comparison – not a function call, not a branch prediction miss.
- Boot profile is the floor. Runtime escalation can raise the paranoia level but never drop below the boot profile. A ground station cannot remotely disable your heartbeat monitor.
- Adaptive, not static. The system escalates its own FDIR profile based on observed conditions. A laptop that starts seeing ECC errors auto-escalates. A satellite in calm space doesn't waste cycles on full TMR voting.
Boot Profiles
Set in boot KDL. Determines the initial feature mask and the de-escalation floor.
survivability {
profile "sovereign"
// profile "hardened"
// profile "radiation"
}
Feature Matrix
| Bit | Feature | Sovereign | Hardened | Radiation |
|---|---|---|---|---|
| 0 | Restart Trap (trap.zig) | on | on | on |
| 1 | Circuit Breaker | on | on | on |
| 2 | BEB Logging | on | on | on |
| 3 | L1 Heartbeat Monitor | off | on | on |
| 4 | Baseline Learning (L1) | off | on | on |
| 5 | CSpace Integrity Audit | off | off | on |
| 6 | Panic Correlator | off | off | on |
| 7 | Verdict Ring | off | off | on |
pub const Profile = enum(u8) {
sovereign = 0b0000_0111,
hardened = 0b0001_1111,
radiation = 0b1111_1111,
};
// INVARIANT: Profiles must be strict supersets (numerically increasing).
// Each higher profile enables all bits of the lower profiles plus additional ones.
// This enables simple numeric comparison for escalation/de-escalation.
comptime {
std.debug.assert(@intFromEnum(Profile.sovereign) < @intFromEnum(Profile.hardened));
std.debug.assert(@intFromEnum(Profile.hardened) < @intFromEnum(Profile.radiation));
// Every higher profile is a bitmask superset of the lower:
const s = @intFromEnum(Profile.sovereign);
const h = @intFromEnum(Profile.hardened);
const r = @intFromEnum(Profile.radiation);
std.debug.assert((h & s) == s); // hardened ⊇ sovereign
std.debug.assert((r & h) == h); // radiation ⊇ hardened
}
pub var feature_mask: u8 = @intFromEnum(Profile.sovereign);
pub var boot_floor: u8 = @intFromEnum(Profile.sovereign);
/// De-escalation step mapping. Profiles form an ordered chain.
/// Returns the next lower profile, or the same profile if already at minimum.
pub fn deescalate_one_step(mask: u8) u8 {
if (mask == @intFromEnum(Profile.radiation)) return @intFromEnum(Profile.hardened);
if (mask == @intFromEnum(Profile.hardened)) return @intFromEnum(Profile.sovereign);
return mask; // Already at sovereign or unknown — no change
}
The mask is read once per check path and cached in a register. Feature checks are single-instruction:
inline fn feature_enabled(bit: u3) bool {
return (feature_mask >> bit) & 1 != 0;
}
/// Export for L1 (Nim) to read the current feature mask for gating its own checks.
pub export fn anomaly_get_feature_mask() u8 {
return feature_mask;
}
Configurable Thresholds
Profiles carry default thresholds; missions override them in boot KDL. Different hardware, different missions, different worlds, different settings.
survivability {
profile "radiation"
thresholds {
heartbeat-warn-ns 500000000 // 500ms (default for radiation)
heartbeat-soft-mult 5 // soft restart at 5× warn
heartbeat-kexec-mult 20 // kexec at 20× warn
correlation-window-ns 100000000 // 100ms panic correlation window
correlation-threshold 3 // minimum distinct fibers
audit-interval-ticks 1024 // CSpace audit every N heartbeats
baseline-lock-samples 1000 // lock baseline after N samples
deviation-sigma 3 // BehaviorViolation threshold (σ)
deescalation-cooldown-s 300 // clean window before de-escalation
hold-timeout-s 21600 // max FDIR_HOLD duration (6 hours)
}
}
All thresholds have compile-time defaults per profile. KDL overrides are applied at anomaly_init() time. L0 stores them in a static FdirConfig struct – no heap, no parsing after boot. KDL values suffixed with -s (seconds) are multiplied by 1_000_000_000 during anomaly_init() to convert to the nanosecond representation used internally.
pub const FdirConfig = struct {
heartbeat_warn_ns: u64,
heartbeat_soft_mult: u8,
heartbeat_kexec_mult: u8,
correlation_window_ns: u64,
correlation_threshold: u8,
audit_interval_ticks: u32,
baseline_lock_samples: u32,
deviation_sigma: u8,
deescalation_cooldown_ns: u64,
hold_timeout_ns: u64,
l1_correlation_window_ns: u64, // L1 cross-fiber correlation window (default 500ms)
};
Default configurations per profile:
| Threshold | Sovereign | Hardened | Radiation |
|---|---|---|---|
| heartbeat_warn_ns | n/a | 1_000_000_000 (1s) | 500_000_000 (500ms) |
| heartbeat_soft_mult | n/a | 5 | 5 |
| heartbeat_kexec_mult | n/a | 20 | 20 |
| correlation_window_ns | n/a | n/a | 100_000_000 (100ms) |
| correlation_threshold | n/a | n/a | 3 |
| audit_interval_ticks | n/a | n/a | 1024 (~10s) |
| baseline_lock_samples | n/a | 1000 | 1000 |
| deviation_sigma | n/a | 3 | 3 |
| deescalation_cooldown_ns | n/a | 300_000_000_000 (300s) | 300_000_000_000 (300s) |
| hold_timeout_ns | n/a | n/a | 21_600_000_000_000 (6h) |
| l1_correlation_window_ns | n/a | 500_000_000 (500ms) | 500_000_000 (500ms) |
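The seconds-to-nanoseconds conversion described above can be sketched as follows. This is a minimal illustration, not the actual anomaly_init() code: the helper names secondsToNs and applySecondsOverride are hypothetical, and falling back to the compile-time default on overflow is an assumption.

```zig
const std = @import("std");

const NS_PER_S: u64 = 1_000_000_000;

/// Convert a KDL value given in seconds to nanoseconds.
/// Returns null if the product does not fit in a u64.
fn secondsToNs(seconds: u64) ?u64 {
    return std.math.mul(u64, seconds, NS_PER_S) catch null;
}

/// Hypothetical override step: keep the profile default unless a valid
/// KDL override (in seconds) is present.
fn applySecondsOverride(default_ns: u64, override_s: ?u64) u64 {
    const s = override_s orelse return default_ns;
    return secondsToNs(s) orelse default_ns; // assumption: overflow keeps the default
}

test "deescalation-cooldown-s 300 becomes 300_000_000_000 ns" {
    try std.testing.expectEqual(@as(u64, 300_000_000_000), applySecondsOverride(0, 300));
}
```

Doing the multiplication once at boot keeps every later comparison in a single unit, which matches the "no parsing after boot" rule.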
Architecture
┌─────────────────────────────────────────────────────────────┐
│ USERLAND │
│ ┌───────────────────────────────────────────┐ │
│ │ Membrane Fiber │ │
│ │ Consumes AnomalyDetected events │ │
│ │ Relays telemetry to ground station │ │
│ │ Forwards FDIR commands → ION → L1 │ │
│ └───────────────────┬───────────────────────┘ │
├──────────────────────┼──────────────────────────────────────┤
│ L1 KERNEL (Nim) │ │
│ ┌───────────────────▼───────────────────────┐ │
│ │ Behavioral Analyst (baseline.nim) │ │
│ │ Welford baselines per fiber │ │
│ │ Cross-fiber correlation │ │
│ │ Writes recommendations → Verdict Ring │ │
│ └───────────────────┬───────────────────────┘ │
│ │ verdict ring (shared memory) │
├──────────────────────┼──────────────────────────────────────┤
│ L0 HAL (Zig) ▼ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Paranoid Watchdog (anomaly.zig) │ │
│ │ │ │
│ │ ┌─────────────┐ ┌──────────────┐ │ │
│ │ │ Panic │ │ CSpace │ │ │
│ │ │ Correlator │ │ Integrity │ │ │
│ │ │ (bit 6) │ │ Audit │ │ │
│ │ └─────────────┘ │ (bit 5) │ │ │
│ │ └──────────────┘ │ │
│ │ ┌─────────────┐ ┌──────────────┐ │ │
│ │ │ Heartbeat │ │ Verdict │ │ │
│ │ │ Monitor │ │ Ring │ │ │
│ │ │ (bit 3) │ │ Reader │ │ │
│ │ └─────────────┘ │ (bit 7) │ │ │
│ │ └──────────────┘ │ │
│ │ ┌────────────────────────────────┐ │ │
│ │ │ Profile Manager │ │ │
│ │ │ feature_mask + escalation │ │ │
│ │ └────────────────────────────────┘ │ │
│ └──────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
L0 Paranoid Watchdog (hal/anomaly.zig)
Temporal Panic Correlator (bit 6)
Detects correlated multi-fiber panics that indicate environmental events (radiation) rather than independent software bugs.
const PanicRecord = struct {
fiber_id: u64,
timestamp_ns: u64,
};
pub const PanicCorrelator = struct {
recent: [32]PanicRecord,
head: u8,
/// Number of valid entries in the ring. Capped at 32 (ring capacity).
/// This is the walk depth, not a lifetime counter — it never overflows.
count: u8, // invariant: count <= 32
/// Radiation suppression window end timestamp. While active,
/// circuit breaker quarantine is overridden to Respawn.
radiation_suppression_ns: u64,
};
The correlator is called from npl_trap_handler after every panic. It appends the panic to the ring (wrapping at 32), caps count at 32, and then walks the ring backward, counting distinct fiber IDs within the correlation window.
Correlation vs. independent failure:
| Observation | Diagnosis | Response |
|---|---|---|
| ≥ N distinct fibers panic within window | Correlated radiation event | Suppress individual quarantine, emit AnomalyDetected(80) with data0 = fiber_count, data1 = CORRELATION_RADIATION |
| Single fiber repeated panics | Independent bug | Per-fiber circuit breaker as usual |
When a correlated event is detected, L0 sets a radiation_suppression_ns timestamp. For the next correlation window duration, the circuit breaker's quarantine verdict is overridden to Respawn – the fibers are fine, the environment is hostile. Individual panic logging to BEB continues unchanged.
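The backward walk over the panic ring can be sketched like this. It is an illustration under stated assumptions, not the anomaly.zig implementation: the function name distinctFibersInWindow is hypothetical, head is assumed to be the next write slot, and records are assumed to be in chronological order.

```zig
const std = @import("std");

const RING_LEN = 32;
const PanicRecord = struct { fiber_id: u64, timestamp_ns: u64 };

/// Count distinct fiber IDs among the newest `count` records whose
/// timestamp lies within `window_ns` of `now_ns`.
fn distinctFibersInWindow(
    recent: *const [RING_LEN]PanicRecord,
    head: u8, // assumption: index of the next write slot
    count: u8,
    now_ns: u64,
    window_ns: u64,
) u8 {
    std.debug.assert(count <= RING_LEN);
    var seen: [RING_LEN]u64 = undefined;
    var distinct: u8 = 0;
    const cutoff = now_ns -| window_ns; // saturating, so early boot does not wrap
    var idx: usize = (@as(usize, head) + RING_LEN - 1) % RING_LEN;
    var i: u8 = 0;
    while (i < count) : (i += 1) {
        const rec = recent[idx];
        if (rec.timestamp_ns < cutoff) break; // older entries cannot match
        var dup = false;
        for (seen[0..distinct]) |id| {
            if (id == rec.fiber_id) {
                dup = true;
                break;
            }
        }
        if (!dup) {
            seen[distinct] = rec.fiber_id;
            distinct += 1;
        }
        idx = (idx + RING_LEN - 1) % RING_LEN;
    }
    return distinct;
}
```

The caller would compare the result against correlation_threshold; a repeated panic from one fiber counts once, which is what separates a correlated event from a single buggy fiber.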
CSpace Integrity Audit (bit 5)
Periodic walk of all MAX_FIBERS CSpaces × CSPACE_SIZE slots.
| Check | Detects |
|---|---|
| Epoch monotonically non-decreasing between audits | Epoch counter corruption |
| No Null-typed cap with non-zero object_id | Ghost capabilities |
| No bounds where end < start | Boundary inversion (bitflip) |
| No cap_type value outside the CapType enum range (0–6) | Type field corruption |
Triggered by the heartbeat tick counter every audit_interval_ticks (configurable; default 1024 ticks ≈ 10s at a 10ms tick).
On failure:
- Emit IntegrityAudit(83) with data0 = fiber_id, data1 = slot_index, data2 = check_type
- Log IntegrityFault(17) to BEB for the owning fiber
- Quarantine the owning fiber
Previous epoch values are stored in audit_epochs: [MAX_FIBERS]u32 and updated after each clean audit pass.
Epoch wrapping: CSpace revoke_all() uses epoch +%= 1 (wrapping add). At u32, this wraps after ~4 billion revocations – not a realistic concern in practice. The audit checks current_epoch != audit_epochs[fiber_id] (not >=) to detect any change. If the epoch changed but the CSpace is consistent, the audit passes and records the new epoch. An unexpected epoch change (no corresponding FiberRespawn event in the STL) is flagged as corruption.
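The three per-slot checks from the table can be sketched as a single function. The Cap layout, the CheckType codes, and the assumption that the Null capability type is variant 0 are all hypothetical; the real definitions live in cspace.zig.

```zig
const std = @import("std");

const CAP_TYPE_NULL: u8 = 0; // assumption: Null is variant 0
const CAP_TYPE_MAX: u8 = 6; // CapType enum range is 0–6

/// Hypothetical capability slot. cap_type is kept as a raw byte so that
/// a corrupted value remains representable.
const Cap = struct {
    cap_type: u8,
    object_id: u64,
    start: u64,
    end: u64,
};

/// Hypothetical codes for the data2 = check_type field.
const CheckType = enum(u8) { ok = 0, ghost_cap = 1, bounds_inverted = 2, bad_type = 3 };

fn auditSlot(cap: Cap) CheckType {
    if (cap.cap_type > CAP_TYPE_MAX) return .bad_type;
    if (cap.cap_type == CAP_TYPE_NULL and cap.object_id != 0) return .ghost_cap;
    if (cap.end < cap.start) return .bounds_inverted;
    return .ok;
}

test "a Null cap with an object_id is a ghost capability" {
    const ghost = Cap{ .cap_type = CAP_TYPE_NULL, .object_id = 9, .start = 0, .end = 0 };
    try std.testing.expectEqual(CheckType.ghost_cap, auditSlot(ghost));
}
```

The type check runs first so that a corrupted type byte is never interpreted as a valid variant by the later checks.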
L1 Heartbeat Monitor (bit 3)
Two distinct operations, two distinct call sites:
- L1 writes the heartbeat. L1's scheduler calls anomaly_heartbeat() every tick. This exported C ABI function updates heartbeat_seq and last_heartbeat_ns in shared L0 memory. L1 is the writer; it proves it's alive by advancing the sequence.
- L0 checks the heartbeat. In entry_riscv.zig, the timer interrupt handler (intr_id == 5 path) calls anomaly_check_heartbeat(), a pure L0 function that reads the heartbeat state and compares it against rumpk_timer_now_ns() (the canonical L0 time source). L0 is the reader; it independently verifies L1's liveness.
L0 checks both a timestamp and a sequence number independently.
pub const HeartbeatMonitor = struct {
last_heartbeat_ns: u64,
heartbeat_seq: u64,
last_checked_seq: u64,
last_checked_ns: u64,
};
Why sequence number + timestamp (not just timestamp):
last_heartbeat_ns alone cannot distinguish "L1 is alive but the clock source is corrupted" from "L1 is alive and ticking normally." If the timer register gets hit by a bitflip and starts returning garbage timestamps, you see either a massive gap (false kexec) or no gap at all (false calm).
L1 increments heartbeat_seq every tick. L0 checks both:
- Sequence must advance AND timestamp must be plausible
- If sequence advances but timestamp is stuck or jumping → clock corruption event → recalibrate clock source, do NOT restart L1
- If sequence is stuck AND timestamp is stuck → L1 hang → escalation ladder
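The three cases above can be sketched as a small classifier. Several things here are assumptions rather than the spec: last_checked_ns is taken to hold the heartbeat timestamp seen at the previous check, "plausible" is taken to mean a non-zero forward step of at most max_step_ns, and a stuck sequence is treated as a hang even if the timestamp moved.

```zig
const std = @import("std");

const HeartbeatMonitor = struct {
    last_heartbeat_ns: u64,
    heartbeat_seq: u64,
    last_checked_seq: u64,
    last_checked_ns: u64,
};

const Liveness = enum { alive, clock_corruption, hang };

fn classifyHeartbeat(hb: HeartbeatMonitor, max_step_ns: u64) Liveness {
    const seq_advanced = hb.heartbeat_seq != hb.last_checked_seq;
    // Wrapping subtraction: a backward jump shows up as a huge step.
    const step = hb.last_heartbeat_ns -% hb.last_checked_ns;
    const ts_plausible = step != 0 and step <= max_step_ns;
    if (!seq_advanced) return .hang; // assumption: the sequence is the liveness proof
    return if (ts_plausible) .alive else .clock_corruption;
}

test "advancing sequence with a stuck timestamp is clock corruption, not a hang" {
    const hb = HeartbeatMonitor{
        .last_heartbeat_ns = 1_000,
        .heartbeat_seq = 11,
        .last_checked_seq = 10,
        .last_checked_ns = 1_000,
    };
    try std.testing.expectEqual(Liveness.clock_corruption, classifyHeartbeat(hb, 1_000_000_000));
}
```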
Escalation Ladder:
| Gap | Multiplier | Time (radiation) | Time (hardened) | Action |
|---|---|---|---|---|
| Warn | 1× | 500ms | 1s | Emit AnomalyDetected(80) with data0 = gap_ns, auto-escalate profile |
| Soft restart | soft_mult× | 2.5s | 5s | L1 soft restart sequence (see below) |
| Kexec | kexec_mult× | 10s | 20s | Full hal_kexec from immutable kernel image (Phoenix Protocol). No negotiation. |
Soft restart sequence (5 steps, all in L0):
- Quarantine all fibers. Walk breakers[0..MAX_FIBERS], set quarantined = true. No fiber can be scheduled.
- CSpace audit. Run cspace_audit_all(). If corruption is found, log to BEB but continue: soft restart is the remediation.
- Sterilize all CSpaces. Call revoke_all() on every CSpace. Epoch bumps atomically invalidate all stale capabilities.
- Emit event. BehaviorViolation(82) with data0 = gap_ns, data1 = heartbeat_seq_delta.
- Re-enter L1. Call NimMain() then kmain(). The Nim runtime re-initializes, the scheduler restarts, and fibers are respawned from their NPK images via the existing Phase 6.2 respawn engine. The verdict ring is zeroed. BEB and STL state are preserved.
If soft restart succeeds (L1 heartbeat resumes within 1× threshold after re-init), the system continues. If the heartbeat gap reaches kexec_mult×, the soft restart is considered failed and Phoenix Protocol fires.
Why these numbers:
- 500ms warn gives 50 missed ticks at 10ms scheduler rate. No transient GCR event lasts that long. Fast enough that ground station telemetry barely notices before the system self-corrects.
- 5× soft restart (2.5s) gives the system time to self-heal from transient deadlocks. A soft restart at 300ms would pre-empt a deadlock that would have resolved at 400ms.
- 20× kexec (10s) confirms this isn't transient, isn't a deadlock, and that soft restart failed. On a Mars rover, kexec kills the navigation fiber tracking position relative to a cliff edge, so the 10s wait ensures the decision is made with confidence.
All multipliers and base thresholds are configurable per profile via KDL.
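As a worked check of the ladder arithmetic, the sketch below maps a heartbeat gap to a ladder step. The names LadderAction and ladderAction are hypothetical, and treating each boundary as inclusive (gap ≥ threshold) is an assumption; the spec does not say whether the comparison is strict.

```zig
const std = @import("std");

const LadderAction = enum { none, warn, soft_restart, kexec };

fn ladderAction(gap_ns: u64, warn_ns: u64, soft_mult: u8, kexec_mult: u8) LadderAction {
    // Saturating multiply so an extreme KDL override cannot wrap around.
    if (gap_ns >= warn_ns *| @as(u64, kexec_mult)) return .kexec;
    if (gap_ns >= warn_ns *| @as(u64, soft_mult)) return .soft_restart;
    if (gap_ns >= warn_ns) return .warn;
    return .none;
}

test "radiation defaults: 500ms warn, 2.5s soft restart, 10s kexec" {
    const warn_ns: u64 = 500_000_000;
    try std.testing.expectEqual(LadderAction.warn, ladderAction(500_000_000, warn_ns, 5, 20));
    try std.testing.expectEqual(LadderAction.soft_restart, ladderAction(2_500_000_000, warn_ns, 5, 20));
    try std.testing.expectEqual(LadderAction.kexec, ladderAction(10_000_000_000, warn_ns, 5, 20));
}
```

Checking the largest threshold first means a long gap maps straight to the most severe applicable step.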
Verdict Ring (bit 7)
Shared memory ring where L1 recommends actions to L0. Single-writer (L1), single-reader (L0). No locks.
pub const VerdictAction = enum(u8) {
Noop = 0,
Quarantine = 1,
Demote = 2,
Release = 3,
Escalate = 4, // L1 recommends profile escalation
FdirHold = 5, // Ground station: freeze current profile
FdirRelease = 6, // Ground station: resume auto-de-escalation
};
pub const VerdictEntry = struct {
fiber_id: u64,
action: VerdictAction,
reason: u64, // STL event_id that triggered recommendation
timestamp_ns: u64,
};
pub const VerdictRing = struct {
entries: [16]VerdictEntry,
l1_head: u8, // Written by L1 only (mod 16)
l0_tail: u8, // Read by L0 only (mod 16)
};
Memory ordering (RISC-V weak ordering): L1 must write the entry data before advancing l1_head. On RISC-V this requires a fence w,w (write-write barrier) between the entry store and the head update. L0 must read l1_head before reading entry data, which requires a fence r,r (read-read barrier). In Zig: @atomicStore(.release) for L1's head write, @atomicLoad(.acquire) for L0's head read. Entry data itself does not need atomics since it is fully written before the head advances.
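The release/acquire pairing can be sketched as a single-producer, single-consumer push and pop. This is an illustration, not the anomaly.zig code: the push and pop names, the "keep one slot empty" full policy, and the release/acquire handling of l0_tail are assumptions, and the lowercase ordering names assume a recent Zig release.

```zig
const std = @import("std");

const RING_LEN = 16;
const VerdictAction = enum(u8) { Noop = 0, Quarantine = 1, Demote = 2, Release = 3, Escalate = 4, FdirHold = 5, FdirRelease = 6 };
const VerdictEntry = struct { fiber_id: u64, action: VerdictAction, reason: u64, timestamp_ns: u64 };
const VerdictRing = struct { entries: [RING_LEN]VerdictEntry, l1_head: u8, l0_tail: u8 };

/// Writer side (L1). Returns false when the ring is full.
fn push(ring: *VerdictRing, e: VerdictEntry) bool {
    const head = ring.l1_head; // only the writer ever modifies l1_head
    const next = (head + 1) % RING_LEN;
    if (next == @atomicLoad(u8, &ring.l0_tail, .acquire)) return false;
    ring.entries[head] = e; // plain store of the entry data...
    @atomicStore(u8, &ring.l1_head, next, .release); // ...published by the release store
    return true;
}

/// Reader side (L0). Returns null when the ring is empty.
fn pop(ring: *VerdictRing) ?VerdictEntry {
    const tail = ring.l0_tail; // only the reader ever modifies l0_tail
    if (tail == @atomicLoad(u8, &ring.l1_head, .acquire)) return null;
    const e = ring.entries[tail]; // safe: the acquire load ordered this read
    @atomicStore(u8, &ring.l0_tail, (tail + 1) % RING_LEN, .release);
    return e;
}

test "an entry written by L1 is read back by L0" {
    var ring = VerdictRing{ .entries = undefined, .l1_head = 0, .l0_tail = 0 };
    try std.testing.expect(push(&ring, .{ .fiber_id = 3, .action = .Quarantine, .reason = 82, .timestamp_ns = 1 }));
    try std.testing.expectEqual(@as(u64, 3), pop(&ring).?.fiber_id);
    try std.testing.expect(pop(&ring) == null);
}
```

Because each index has exactly one writer, no compare-and-swap is needed; the ordering on the index stores is what makes the plain entry copies safe.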
L0 processes the ring during heartbeat checks:
- If L0's own sensors agree with the recommendation → execute
- If L0's sensors disagree → veto, emit BehaviorViolation(82) noting the disagreement
- FdirHold and Escalate from ground station are honored unconditionally (ground station outranks L1)
Ground Station Override Channel
Three commands flow through UTCP → userland membrane → ION → L1 → verdict ring → L0:
| Command | VerdictAction | Behavior |
|---|---|---|
| FDIR_ESCALATE <profile> | Escalate | Force profile escalation. L0 honors unconditionally. |
| FDIR_HOLD | FdirHold | Freeze current profile, prevent auto-de-escalation. Timeout after hold_timeout_ns (default 6h). |
| FDIR_RELEASE | FdirRelease | Resume automatic de-escalation. Respects floor constraint. |
Ground station outranks L1 but cannot violate the boot floor. FDIR_ESCALATE sovereign when booted as hardened is a no-op.
Utility Functions
Time source: All timestamps in anomaly.zig use rumpk_timer_now_ns() — the same canonical time source used by the trap handler and BEB. Declared as extern fn rumpk_timer_now_ns() u64 in anomaly.zig (resolved at link time from entry_riscv.zig).
BEB window query:
/// Count BEB entries whose timestamp_ns falls within [now - window_ns, now].
/// Walks the ring backward from head. Returns 0 if BEB is empty or TMR corrupt.
pub fn beb_entries_in_window(window_ns: u64) u32 {
var head: u16 = 0;
var count: u32 = 0;
var chain: u64 = 0;
if (!beb.tmr_read(&head, &count, &chain)) return 0;
const now = rumpk_timer_now_ns();
const cutoff = now -| window_ns; // saturate at 0: a wrapping subtract would hide every entry when now < window_ns
const depth = @min(count, beb.RING_CAPACITY);
var idx: usize = (@as(usize, head) + beb.RING_CAPACITY - 1) % beb.RING_CAPACITY;
var found: u32 = 0;
var i: usize = 0;
while (i < depth) : (i += 1) {
const entry = &beb.beb_buf.ring[idx];
if (entry.timestamp_ns < cutoff) break; // entries are chronological; stop early
found += 1;
idx = (idx + beb.RING_CAPACITY - 1) % beb.RING_CAPACITY;
}
return found;
}
Feature mask FFI for L1:
L1 (Nim) reads the feature mask via anomaly_get_feature_mask() (exported C ABI) to gate its own checks. The scheduler calls this once per tick and caches the result locally — not once per feature check.
Dynamic Escalation
Escalation (anomaly_escalate)
/// Escalate the FDIR profile. Exported for L1 FFI (verdict ring path).
/// Internal L0 callers (panic correlator, heartbeat monitor) call this
/// directly via Zig's normal call convention — the `export` keyword
/// only affects the C ABI symbol generation, not the internal call path.
pub export fn anomaly_escalate(new_profile: u8) void {
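// Escalation is a single store to feature_mask.
// New profile must be a strict bitmask superset of the current mask
// (and therefore numerically greater), and must be >= boot floor.
const is_superset = (new_profile & feature_mask) == feature_mask;
if (is_superset and new_profile > feature_mask and new_profile >= boot_floor) {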
const old = feature_mask;
feature_mask = new_profile;
last_escalation_ns = rumpk_timer_now_ns();
_ = emit_stl(80, 0, 0, 0, new_profile, old, 0);
}
}
Automatic escalation triggers:
| Trigger | Source | Target Profile |
|---|---|---|
| Panic correlator fires (≥N fibers in window) | L0 | radiation |
| L1 heartbeat gap > 1× threshold | L0 | hardened |
| 3+ ECC corrections in 60s (from HAL memory scrubber) | L0 | radiation |
| 10+ BEB entries with same panic_class in 60s window | L0 | hardened |
| L1 cross-fiber correlation (≥2 fibers deviating) | L1 via verdict ring | hardened |
De-escalation
One step at a time (radiation → hardened → sovereign). Requires ALL of:
- Cooldown period: deescalation_cooldown_ns (default 300s) since last escalation or anomaly event
- CSpace audit pass: Clean walk with zero integrity faults
- Zero BEB entries in the cooldown window
- No active FDIR_HOLD
- Target >= boot floor (non-negotiable)
pub fn anomaly_try_deescalate() void {
if (fdir_hold_active) return;
const now = rumpk_timer_now_ns(); // canonical L0 time source, as everywhere else in anomaly.zig
if (now -% last_escalation_ns < config.deescalation_cooldown_ns) return;
if (beb_entries_in_window(config.deescalation_cooldown_ns) > 0) return;
if (!last_cspace_audit_clean) return;
const target = deescalate_one_step(feature_mask);
if (target != feature_mask and target >= boot_floor) { // skip when already at the lowest profile
feature_mask = target;
_ = emit_stl(81, 0, 0, 0, target, 0, 0); // BehaviorBaseline = new stable state
}
}
L1 Behavioral Analyst (core/baseline.nim)
Baseline Learning (Welford's Online Algorithm)
Per-fiber running statistics – O(1) per sample, no buffering, no batch computation:
type BehaviorBaseline* = object
  burst_mean*: float64     # Running mean of burst_ns
  burst_m2*: float64       # Running M2 for variance (Welford)
  syscall_rate*: float64   # Syscalls per scheduling quantum
  channel_ratio*: float64  # send / (send + recv) ratio
  sample_count*: uint32    # Total samples collected
  locked*: bool            # Baseline solidified after N samples
Updated every sched_analyze_burst() call – already in the scheduler hot path:
proc baseline_update*(b: var BehaviorBaseline, burst_ns: uint64) =
  b.sample_count += 1
  let n = b.sample_count.float64
  let x = burst_ns.float64
  let delta = x - b.burst_mean
  b.burst_mean += delta / n
  let delta2 = x - b.burst_mean
  b.burst_m2 += delta * delta2
  if b.sample_count >= config.baseline_lock_samples:
    b.locked = true
After locked = true:
- σ = sqrt(M2 / (sample_count - 1))
- If |current_burst - mean| > Nσ (configurable, default 3σ) → emit BehaviorViolation(82)
- N is configurable via deviation_sigma in KDL
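To make the σ computation and the Nσ test concrete, here is a small worked example. L1 itself is written in Nim; this is a Zig transliteration for illustration only, with hypothetical names (Baseline, update, sigma, violates) and a guard for fewer than two samples that the spec does not mention.

```zig
const std = @import("std");

const Baseline = struct {
    mean: f64 = 0,
    m2: f64 = 0,
    n: u32 = 0,

    /// One Welford step: O(1), no sample buffer.
    fn update(self: *Baseline, x: f64) void {
        self.n += 1;
        const delta = x - self.mean;
        self.mean += delta / @as(f64, @floatFromInt(self.n));
        self.m2 += delta * (x - self.mean);
    }

    /// Sample standard deviation: sqrt(M2 / (n - 1)).
    fn sigma(self: Baseline) f64 {
        if (self.n < 2) return 0; // assumption: undefined variance reads as 0
        return @sqrt(self.m2 / @as(f64, @floatFromInt(self.n - 1)));
    }

    /// True if x deviates from the mean by more than k standard deviations.
    fn violates(self: Baseline, x: f64, k: f64) bool {
        const d = x - self.mean;
        return (if (d < 0) -d else d) > k * self.sigma();
    }
};

test "3-sigma check on a known sequence" {
    var b = Baseline{};
    for ([_]f64{ 2, 4, 4, 4, 5, 5, 7, 9 }) |x| b.update(x);
    // mean = 5, M2 = 32, sigma = sqrt(32 / 7) ≈ 2.14, so 3 sigma ≈ 6.41
    try std.testing.expectApproxEqAbs(@as(f64, 5.0), b.mean, 1e-9);
    try std.testing.expect(b.violates(12, 3)); // |12 - 5| = 7 > 6.41
    try std.testing.expect(!b.violates(11, 3)); // |11 - 5| = 6 < 6.41
}
```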
Cross-Fiber Correlation
var anomaly_scores: array[MAX_FIBERS, float64]
var anomaly_window_start_ns: uint64
var anomalous_fiber_count: uint32
- Each BehaviorViolation increments the fiber's anomaly score
- If ≥2 fibers exceed threshold within l1_correlation_window_ns (configurable, default 500ms) → systemic event
- L1 writes recommendation to verdict ring: Escalate with reason = STL event_id
- L0 independently validates via its own panic correlator
- Two independent observers reaching the same conclusion = high-confidence anomaly
BEB Preservation Guarantee
The BEB lives in a fixed physical memory region. hal_kexec loads a new kernel via the L0 ELF loader. The ELF loader must refuse to map any PT_LOAD segment that overlaps the BEB region.
In elf_loader.zig, add a BEB exclusion check:
// Before memcpy in the load loop:
if (ranges_overlap(dest_addr, dest_addr + p_memsz, BEB_BASE, BEB_BASE + BEB_SIZE)) {
return ElfError.SegmentOverlap; // Refuse to overwrite BEB
}
This requires adding SegmentOverlap to the ElfError enum in elf_loader.zig.
The BEB base address is known at compile time. This is one branch per PT_LOAD segment – zero cost in practice since kernels have 2–4 segments.
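The exclusion check calls ranges_overlap, which is not defined in this document. One plausible definition, assuming half-open [start, end) ranges, is sketched below. The wrapper segmentTouchesRegion and its treatment of address overflow as an overlap are hypothetical additions.

```zig
const std = @import("std");

/// True if half-open ranges [a_start, a_end) and [b_start, b_end) share a byte.
fn ranges_overlap(a_start: u64, a_end: u64, b_start: u64, b_end: u64) bool {
    return a_start < b_end and b_start < a_end;
}

/// Hypothetical wrapper: computes both end addresses with overflow checks
/// and treats an overflowing range as overlapping, so the load is refused.
fn segmentTouchesRegion(dest: u64, memsz: u64, base: u64, size: u64) bool {
    const dest_end = std.math.add(u64, dest, memsz) catch return true;
    const region_end = std.math.add(u64, base, size) catch return true;
    return ranges_overlap(dest, dest_end, base, region_end);
}

test "a segment ending exactly at the region base does not overlap" {
    try std.testing.expect(!segmentTouchesRegion(0x1000, 0x1000, 0x2000, 0x1000));
    try std.testing.expect(segmentTouchesRegion(0x1000, 0x1001, 0x2000, 0x1000));
}
```

Half-open ranges let a segment sit directly below or above the BEB without being rejected, while any shared byte is caught.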
Additionally, anomaly_init() must verify the BEB is intact after a kexec by checking the TMR header magic bytes. If the BEB survived, the chain is walked to establish the prev_hash for continued logging. If the BEB is corrupted, it is re-initialized (accepting the forensic data loss) and a BebOverflow(67) event is emitted to signal the discontinuity.
Ontology Extensions
All event kinds were pre-wired in Phase 6.2. No new EventKind values needed.
| EventKind | Value | Emitter | data0 | data1 | data2 |
|---|---|---|---|---|---|
| AnomalyDetected | 80 | L0 | context-dependent | context-dependent | — |
| BehaviorBaseline | 81 | L0/L1 | new_profile (de-escalation) | — | — |
| BehaviorViolation | 82 | L0/L1 | fiber_id or gap_ns | deviation_score | — |
| IntegrityAudit | 83 | L0 | fiber_id | slot_index | check_type |
AnomalyDetected(80) context by source:
| Source | data0 | data1 |
|---|---|---|
| Panic correlator | fiber_count | CORRELATION_RADIATION (1) |
| Heartbeat warn | gap_ns | heartbeat_seq delta |
| ECC escalation | ecc_count | window_ns |
| Profile escalation | new_profile | old_profile |
Files
New Files
| File | Purpose | Est. LOC | Layer |
|---|---|---|---|
| hal/anomaly.zig | Paranoid Watchdog – correlator, audit, heartbeat, verdict ring, profile manager, escalation | ~400 | L0 |
| core/baseline.nim | Behavioral Analyst – Welford baselines, cross-fiber correlation, verdict ring writer | ~200 | L1 |
Modified Files
| File | Change |
|---|---|
| hal/trap.zig | Call anomaly_record_panic() after verdict (gated on bit 6) |
| hal/entry_riscv.zig | Heartbeat check in timer interrupt path (gated on bit 3) |
| hal/abi.zig | Comptime import anomaly.zig + FFI re-exports for anomaly_heartbeat, anomaly_escalate, anomaly_init, anomaly_get_feature_mask |
| hal/elf_loader.zig | Add SegmentOverlap to ElfError enum; BEB exclusion range check in PT_LOAD loop |
| hal/cspace.zig | Add cspace_audit_all() function for L0 integrity walk |
| core/sched.nim | Call baseline_update() in sched_analyze_burst() (gated on bit 4); call anomaly_heartbeat() every tick (gated on bit 3) |
| core/fiber.nim | Add baseline: BehaviorBaseline field to FiberObject |
Testing Strategy
Unit Tests (per-file, zig build test)
anomaly.zig:
- Panic correlator: 2 fibers in window → no trigger; 3 fibers → trigger; fibers outside window → no trigger
- CSpace audit: clean CSpace → pass; corrupted epoch → fail with correct fiber_id; ghost cap → fail
- Heartbeat: sequence advances with plausible timestamp → alive; sequence stuck → hang detected; sequence advances but timestamp stuck → clock corruption (different from hang)
- Verdict ring: L1 writes, L0 reads in order; ring wrap; L0 veto on disagreement
- Profile escalation: escalate above current → success; escalate below floor → no-op; de-escalation with dirty BEB → blocked; de-escalation with clean window → success
- Feature mask gating: disabled feature check returns false; enabled returns true
- Config: default thresholds per profile are correct; KDL override applies
baseline.nim:
- Welford: known sequence → correct mean and variance; locked after N samples; deviation detection at 3σ
- Cross-fiber: single fiber anomaly → no systemic flag; 2+ fibers → systemic; verdict ring write
Integration Tests
- Full panic → correlator → escalation: inject 3 page faults from 3 different fibers within 100ms → verify AnomalyDetected(80) → verify profile escalated → verify radiation suppression active
- Heartbeat gap → soft restart: stop heartbeat ticks → verify warn at 1× → verify soft restart at 5×
- CSpace corruption → quarantine: flip a bit in a capability → verify IntegrityAudit(83) → verify fiber quarantined
- Ground station override: inject FDIR_HOLD → verify no de-escalation during hold → inject FDIR_RELEASE → verify de-escalation resumes
- De-escalation: escalate to radiation → wait cooldown → clean audit → verify step-down to hardened
- BEB preservation: verify elf_loader rejects PT_LOAD overlapping BEB region
Related Specifications
- Phase 6.2: Restart Trap – trap.zig, beb.zig, respawn.zig, elf_loader.zig
- SPEC-020: Capability Algebra – CSpace integrity model
- SPEC-060: System Ontology – STL event framework (EventKind 80–83)
- RFC-0110: Membrane Agent – telemetry relay (consumer of anomaly events)
- RFC-0649: EPOE – System survival through restart
- ECSS-E-ST-70-41C: Onboard autonomy requirements for FDIR