NexFS
v1.3.0 — 375+ Tests Passing
NexFS is the Nexus sovereign filesystem — a flash-native, graph-structured storage layer implemented in Zig with zero dynamic allocation. It runs on everything from bare-metal microcontrollers with 64 KB RAM to multi-hundred-terabyte Homenodes. No libc. No OS required.
Objects exist in a directed graph — location is a relationship, not an identity.
Address Space (v4)
NexFS v4 uses u64 block addressing throughout. At 4 KB block size, a single volume addresses 64 ZB — enough for a Homenode with 352 TB of mixed NVMe + SAS storage, a 1,000-node Chapter at 352 PB, or planetwide mesh at exabyte-to-zettabyte scale over a 15+ year horizon.
| Field | Width | Ceiling @ 4 KB blocks |
|---|---|---|
| BlockAddr | u64 | 64 ZB per volume |
| block_count (Superblock) | u64 | 64 ZB |
| bucket_id (BamEntry) | u32 | 4B buckets (~8 EB at Hydra bucket size) |
| data_count (Allocator) | u64 | 64 ZB |
| Inode.block_count | u64 | 64 ZB per file |
| InodeId | u32 | 4B inodes per volume (cross-node uses CAS CIDs) |
| Inode.size | u64 | 16 EB per file |
| device_size | u64 | 16 EB |
BamEntry overhead: ~0.4% of volume capacity (16 bytes per 4 KB block) — negligible. The Superblock at 256 bytes still fits in a single 512-byte minimum block with room for future mesh and sovereignty fields.
Architecture
NexFS is built from 18 composable modules. Each module is a self-contained compilation unit with no hidden state:
```
superblock ─── inode ─── dir ─── dir_ops ─── path ─── graph
     │           │                  │
  format      file ── cow       alloc ─── bam
                 │                  │
            checkpoint        journal ── cas ── cas_journal
                 │                  │
         scrub  health      features  compress
                 │
         xattr ── lock ── watch
```

On-device layout (block-linear):
```
┌──────────────┬──────────────┬──────┬─────────────────┬───────────────┐
│ Superblock 0 │ Superblock 1 │ BAM  │   Inode Table   │  Data Blocks  │
│ (256B, dual) │ (256B, dual) │      │   (4 blocks)    │               │
└──────────────┴──────────────┴──────┴─────────────────┴───────────────┘
   Block 0        Block 1      Blk 2    Blocks 3-6        Block 7+
```

Core Format
Fixed-size, alignment-safe structures for deterministic flash access:
| Structure | Size | Purpose |
|---|---|---|
| Superblock | 256 bytes (dual + scattered) | Volume identity, format v5, generation counter, feature flags, mesh fields |
| Inode | 128 bytes | File/directory metadata, inline extent, checksum |
| Extent | 16 bytes | Contiguous block range (logical → physical mapping) |
| DirEntry | 48 bytes + name + tag | Directory entry with edge type and optional tag |
| BAM Entry | 16 bytes | Block Allocation Map — state, erase count, owner (u32 bucket_id) |
Superblock Resilience: Dual + Scattered Replicas
NexFS writes superblocks at blocks 0 and 1 as primary and backup. v1.3.0 adds scattered superblock replicas — up to 16 additional copies spread deterministically across the data region. On mount, if blocks 0 and 1 are both corrupt, NexFS calculates all replica positions from volume geometry alone and recovers from the highest-generation valid replica.
| Profile | Replicas | Convergence |
|---|---|---|
| Core | 4 (2 scattered) | 1 checkpoint |
| Sovereign | 8 (6 scattered) | 3 checkpoints |
| Mesh | 10 (8 scattered) | 4 checkpoints |
Each checkpoint writes blocks 0+1 plus K=2 rotating scattered replicas. All replicas converge over ceil((N-2)/K) checkpoints. Replica positions are derived from volume geometry — no stored map, no chicken-and-egg.
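The convergence arithmetic can be sketched in C. This is an illustration only: the even-spacing spread function and all names below are assumptions, not NexFS's actual derivation — the point is that positions come from geometry alone and full convergence takes ceil((N-2)/K) checkpoints.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch: derive a scattered superblock replica position
 * from volume geometry alone (no stored map). Even spacing across the
 * data region is an assumption, not the documented spread function. */
static uint64_t replica_position(uint64_t block_count,
                                 uint64_t data_start,
                                 uint32_t replica_index,
                                 uint32_t scattered_count)
{
    uint64_t data_blocks = block_count - data_start;
    return data_start + (data_blocks / (scattered_count + 1)) * (replica_index + 1);
}

/* Checkpoints needed for all N replicas to converge when each
 * checkpoint rewrites blocks 0+1 plus K rotating scattered copies:
 * ceil((N - 2) / K). */
static uint32_t convergence_checkpoints(uint32_t total_replicas, uint32_t k)
{
    return (total_replicas - 2 + k - 1) / k;
}
```

With K=2 this reproduces the table above: Core (4 replicas) converges in 1 checkpoint, Sovereign (8) in 3, Mesh (10) in 4.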
Per-Inode Checksums
Every inode carries a checksum computed over all metadata fields. The hash algorithm is selectable per individual inode — not per-filesystem. A sensor logging temperature data can use XXH3 for speed, while a signing key stored on the same volume uses BLAKE3 for cryptographic integrity.
| Hash | Enum | Output | Use Case |
|---|---|---|---|
| BLAKE3-256 | 0x00 | 256-bit | Default. CAS addressing, Merkle DAG, provenance chains |
| BLAKE3-128 | 0x01 | 128-bit (padded) | Constrained devices needing crypto guarantees |
| XXH3-64 | 0x02 | 64-bit (padded) | Ultra-fast integrity for high-throughput embedded |
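The enum values from the table translate directly into a per-inode selector. A minimal sketch (the enum constants come from the table; the type and accessor names are illustrative, not the NexFS identifiers):

```c
#include <assert.h>
#include <stdint.h>

/* Per-inode hash algorithm selector; values match the table above. */
typedef enum {
    NEXFS_HASH_BLAKE3_256 = 0x00, /* default: CAS, Merkle DAG, provenance */
    NEXFS_HASH_BLAKE3_128 = 0x01, /* constrained devices needing crypto */
    NEXFS_HASH_XXH3_64    = 0x02, /* ultra-fast embedded integrity */
} nexfs_hash_algo;

/* Output width in bits; sub-256-bit digests are padded on disk. */
static uint32_t hash_output_bits(nexfs_hash_algo a)
{
    switch (a) {
    case NEXFS_HASH_BLAKE3_256: return 256;
    case NEXFS_HASH_BLAKE3_128: return 128;
    case NEXFS_HASH_XXH3_64:    return 64;
    }
    return 0;
}
```

Because the selector lives in the inode, the sensor-log file and the signing key from the example above can coexist on one volume with different algorithms.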
Graph-Native Directories
NexFS directories are not flat lists of names — they are typed edge sets in a directed graph. Every directory entry carries an explicit edge type that defines the relationship between parent and child:
| Edge Type | Value | Semantics |
|---|---|---|
| primary | 0x00 | Normal parent-child ownership. GC follows these. |
| reference | 0x01 | Non-owning "also lives here" — like a symlink that doesn't break |
| pin | 0x02 | Anchor edge. Prevents garbage collection even if all other edges are removed |
| projection | 0x03 | Auto-generated. Smart folder / query result |
Edge Tags
Each directory entry can carry an arbitrary tag (up to 63 bytes) — metadata attached to the relationship, not the file. Use cases:
- Capability tokens on mount edges
- Version labels on DAG edges
- Role annotations ("owner", "reviewer") on reference edges
RevMap: Reverse Edge Index
The RevMap is an in-memory reverse index built lazily by scanning the directory tree. It maps every inode to all edges pointing at it — enabling O(1) answers to "who references this file?" without a full tree walk.
- Orphan detection: `revmap.isOrphan(inode_id)` — zero incoming edges
- Edge queries: `revmap.queryEdges(inode_id)` — all parents and edge types
- GC safety: `removeEdge()` refuses to create orphans unless `force=true`
- Depth-limited scan: recursion capped at 64 levels to prevent stack overflow from cycles
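The core of a reverse index like this is a single pass over all directory entries that counts incoming edges per inode. A minimal sketch, assuming a fixed inode-id space (the struct layout and names are illustrative, not the NexFS structures):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_INODES 1024

/* Illustrative reverse index: incoming-edge count per inode id,
 * built lazily by one scan of the directory tree. */
typedef struct {
    uint32_t incoming[MAX_INODES];
} RevMap;

static void revmap_init(RevMap *rm) { memset(rm, 0, sizeof *rm); }

/* Called once per directory entry during the tree scan. */
static void revmap_add_edge(RevMap *rm, uint32_t child_inode)
{
    if (child_inode < MAX_INODES)
        rm->incoming[child_inode]++;
}

/* O(1) orphan test after the scan: zero incoming edges. */
static int revmap_is_orphan(const RevMap *rm, uint32_t inode_id)
{
    return inode_id < MAX_INODES && rm->incoming[inode_id] == 0;
}
```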
File Operations
Extent-based file I/O with seek support:
| Operation | Function | Notes |
|---|---|---|
| Open | FileOps.open() | Returns handle with position tracking |
| Read | FileOps.read() | Extent-resolved block reads. Returns 0 at EOF. |
| Write | FileOps.write() | Allocates blocks via BAM, extends inode |
| Seek | FileOps.seek() | Absolute, relative, or from-end positioning |
| Truncate | truncateInode() | Shrink or extend. Frees released blocks. |
| Close | FileOps.close() | Releases handle state |
Copy-on-Write Cloning
cloneInode() creates a deep copy of a file's extent tree. The clone gets its own inode with FLAG_COW set. Data blocks are physically copied (not shared), so the clone is fully independent. Maximum 64 extents per clone operation.
Subsystems
Checkpoint
Atomic metadata flush to persistent storage:
- Seal all open allocation buckets
- Flush the Block Allocation Map
- Increment the superblock generation counter
- Write both superblocks (primary + backup)
- Write K=2 rotating scattered replicas (if `REQ_SCATTERED_SB` is set)
- Sync to flash — errors propagate (not silently swallowed)
Write-Ahead Journal
Intent-logging journal for crash-safe multi-step operations:
```
begin() → recordIntent(addr, data) → commit() → recover()
```

Recovery semantics: entries in the committed state represent writes that were already executed before `commit()` was called. The journal is cleared on recovery.
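The lifecycle can be sketched as a small state machine. This is a toy illustration of the begin/record/commit/recover cycle — all names and the in-memory layout are assumptions, not the NexFS journal:

```c
#include <assert.h>
#include <stdint.h>

#define MAX_INTENTS 16

typedef enum { J_IDLE, J_OPEN, J_COMMITTED } JState;

typedef struct {
    JState   state;
    uint64_t addr[MAX_INTENTS]; /* block addresses of intended writes */
    uint32_t count;
} Journal;

static void jbegin(Journal *j) { j->count = 0; j->state = J_OPEN; }

/* Record an intent; fails if no transaction is open or the log is full. */
static int jrecord(Journal *j, uint64_t addr)
{
    if (j->state != J_OPEN || j->count == MAX_INTENTS) return -1;
    j->addr[j->count++] = addr;
    return 0;
}

/* Writes are executed first; commit then marks them durable. */
static void jcommit(Journal *j) { j->state = J_COMMITTED; }

/* Recovery: committed entries were already written before commit(),
 * so the journal is simply cleared. */
static void jrecover(Journal *j) { j->count = 0; j->state = J_IDLE; }
```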
Integrity Scrub
Background integrity scanner that walks every allocated inode in the table:
- Validates inode checksums against stored hashes
- Counts total inodes scanned and checksum errors found
- Returns `ScrubResult` with error counts — zero errors means a clean volume
- Available via C FFI as `nexfs_scrub()`
Extended Attributes
Typed attribute slots on inodes (not POSIX xattr — purpose-built for Nexus capabilities):
| Type | Value | Purpose |
|---|---|---|
| `capability_perms` | 0x01 | Capability permission bitfield (SPEC-051) |
| `cas_cid` | 0x02 | CAS content ID, 32 bytes |
| `encryption_flags` | 0x03 | Encryption configuration |
| `provenance_hash` | 0x04 | Provenance chain hash |
| `custom` | 0xFF | Arbitrary key-value (63-byte key, 255-byte value) |
Block Allocation
Bitmap-based allocator with bucket lifecycle tracking:
| Bucket State | Value | Meaning |
|---|---|---|
| `free` | 0x00 | Available for allocation |
| `writing` | 0x01 | Currently being written to |
| `full` | 0x02 | Sealed, contains live data |
| `evacuating` | 0x03 | Being emptied by GC (Hydra phase) |
| `parity` | 0x04 | Erasure coding data (Hydra phase) |
Additional Modules
| Module | Purpose |
|---|---|
| Compress | ZSTD (1–22) + RLE compression with per-node/per-chunk level selection and double checksum (FLAG_COMPRESSED) |
| Lock | Advisory inode locking with table-based tracking |
| Watch | Inode change notification (create, modify, delete events) |
| Health | Flash health statistics from BAM erase counts |
| Path | Full path resolution with loadSafe() bounds checking |
Volume Profiles
NexFS scales via the Baukasten (building-block) model — three profiles activate different module sets:
| Profile | Storage Class | Footprint | Features |
|---|---|---|---|
| Core | 0x00 | ~40 KB | Block I/O, inodes, BAM, checksums, scrub, per-bucket RLE block compression |
| Sovereign | 0x01 | ~400 KB | + CAS, CDC, DAG versioning, TimeWarp snapshots, ZSTD compression (1–22) |
| Mesh | 0x02 | ~480 KB | + Wire protocol, peer sync, gossip, ZSTD compression (1–22) |
Compression (v1.1.0)
NexFS supports four compression granularity modes, selectable at format time:
| Mode | Profile | Granularity | Use Case |
|---|---|---|---|
| per_bucket | Core | Per-block via BAM entry | IoT/satellite – RLE per block, radiation-tolerant |
| per_dag_node | Sovereign | Per-file via DAG metadata | File-level default algo+level |
| per_cas_chunk | Sovereign | Per-CAS-chunk | Chunk-level granularity |
| per_dag_and_chunk | Sovereign | DAG default + chunk override | ZSTD:2 on system, ZSTD:18 on archives – same volume |
Double-checksum integrity (ZSTD only): XXH3-64 of compressed data is computed during the streaming compress pass (piped, zero extra cost) and verified before decompression. Bit flips are caught before ZSTD touches the data.
Per-bucket block compression operates at the file I/O layer. Each block's BAM entry carries comp_algo and comp_level. On write, the full block is compressed and stored with a 2-byte length prefix. On read, the block is decompressed before extracting the requested portion. Partial-block writes use read-modify-write to preserve existing data.
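The 2-byte length prefix framing can be sketched as follows. The RLE codec here is a toy stand-in chosen for self-containment; the actual NexFS encoding and on-flash layout may differ:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 4096

/* Toy RLE: (count, byte) pairs, runs capped at 255. */
static uint32_t rle_compress(const uint8_t *src, uint32_t n, uint8_t *dst)
{
    uint32_t o = 0;
    for (uint32_t i = 0; i < n; ) {
        uint32_t run = 1;
        while (i + run < n && src[i + run] == src[i] && run < 255) run++;
        dst[o++] = (uint8_t)run;
        dst[o++] = src[i];
        i += run;
    }
    return o;
}

static uint32_t rle_decompress(const uint8_t *src, uint32_t n, uint8_t *dst)
{
    uint32_t o = 0;
    for (uint32_t i = 0; i + 1 < n; i += 2)
        for (uint8_t r = 0; r < src[i]; r++) dst[o++] = src[i + 1];
    return o;
}

/* On write: compress the full block, store with a 2-byte
 * little-endian length prefix. Returns flash bytes used. */
static uint32_t block_write(uint8_t *flash, const uint8_t *block, uint32_t n)
{
    uint8_t tmp[2 * BLOCK_SIZE]; /* RLE worst case is 2x expansion */
    uint32_t clen = rle_compress(block, n, tmp);
    flash[0] = (uint8_t)(clen & 0xFF);
    flash[1] = (uint8_t)(clen >> 8);
    memcpy(flash + 2, tmp, clen);
    return 2 + clen;
}

/* On read: decompress the whole block before extracting any portion. */
static uint32_t block_read(const uint8_t *flash, uint8_t *block)
{
    uint32_t clen = (uint32_t)flash[0] | ((uint32_t)flash[1] << 8);
    return rle_decompress(flash + 2, clen, block);
}
```

A partial-block write under this scheme is the read-modify-write the text describes: `block_read`, patch the range in the decompressed buffer, `block_write` the whole block back.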
Feature Flags & Runtime Tuning (v1.3.0)
Two 32-bit bitmasks in the superblock control mount-time compatibility:
- `required_features` — unknown set bits prevent mounting (forward compatibility)
- `optional_features` — unknown set bits are safe to ignore
| Required Flag | Bit | Effect |
|---|---|---|
| `REQ_COMPRESSION` | 0 | Block-level compression active |
| `REQ_ZSTD` | 1 | ZSTD compression (implies REQ_COMPRESSION) |
| `REQ_DOUBLE_CHECKSUM` | 2 | XXH3-64 secondary hash on compressed blocks |
| `REQ_CAS_DEDUP` | 3 | Content-addressable deduplication |
| `REQ_SCATTERED_SB` | 6 | Superblock scattered across volume |
| Optional Flag | Bit | Effect |
|---|---|---|
| `OPT_RECOMPRESSION` | 0 | Background recompression enabled |
| `OPT_MESH_SYNC` | 1 | Mesh synchronisation active |
| `OPT_TIMEWARP` | 4 | TimeWarp snapshot layer |
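The mount-time check reduces to two mask operations. A minimal sketch — bit positions come from the tables above, while the macro and function names are illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Required-feature bits this build understands (from the table). */
#define REQ_COMPRESSION     (1u << 0)
#define REQ_ZSTD            (1u << 1)
#define REQ_DOUBLE_CHECKSUM (1u << 2)
#define REQ_CAS_DEDUP       (1u << 3)
#define REQ_SCATTERED_SB    (1u << 6)

#define KNOWN_REQUIRED (REQ_COMPRESSION | REQ_ZSTD | REQ_DOUBLE_CHECKSUM | \
                        REQ_CAS_DEDUP | REQ_SCATTERED_SB)

/* Returns 0 if mountable: any unknown *required* bit refuses the
 * mount; unknown *optional* bits are ignored by design. */
static int check_features(uint32_t required, uint32_t optional)
{
    (void)optional; /* safe to ignore, per the optional_features contract */
    return (required & ~KNOWN_REQUIRED) ? -1 : 0;
}
```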
Feature parameters at superblock offsets 0x64–0x68 provide scalar configuration for active features:
| Param | Offset | Gate | Default (Sovereign) |
|---|---|---|---|
| `sb_replica_count` | 0x64 | REQ_SCATTERED_SB | 8 |
| `sb_checkpoint_batch` | 0x65 | REQ_SCATTERED_SB | 2 |
| `recompress_target_algo` | 0x66 | OPT_RECOMPRESSION | ZSTD (0x02) |
| `recompress_target_level` | 0x67 | OPT_RECOMPRESSION | 3 |
| `recompress_free_floor` | 0x68 | OPT_RECOMPRESSION | 10% |
All parameters are runtime-tunable via nexfs_tune() without unmounting.
COW Re-Compression (v1.3.0)
Background re-compression upgrades stored chunks to better algorithms without data risk. The filesystem provides the mechanism; policy (aging, scheduling, pacing) lives in the Nim Membrane daemon.
The invariant: never modify a committed block. Old data survives until new data is written and journaled.
- Read chunk → decompress → verify BLAKE3 CID
- Recompress with target algo/level
- COW: allocate new block(s), write, journal, update CAS entry
- Free old block(s)
Batch mode groups up to 32 chunks in a single journal transaction with full rollback support. A free-space floor (default 10%) prevents recompression from filling the volume — the guard is a crash-safety invariant, not optional policy.
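The free-space floor guard can be expressed in a few lines of integer arithmetic. A sketch with illustrative names — the actual check in NexFS may differ, but the invariant is the same: COW needs room for the new blocks before the old ones are freed:

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if a recompression step may proceed: the new blocks must
 * fit, and projected free space must stay at or above the floor
 * (default 10% of the volume). */
static int recompress_allowed(uint64_t free_blocks,
                              uint64_t total_blocks,
                              uint64_t new_blocks_needed,
                              uint32_t floor_percent)
{
    if (new_blocks_needed > free_blocks) return 0;
    uint64_t remaining = free_blocks - new_blocks_needed;
    /* remaining / total >= floor / 100, kept in integer arithmetic */
    return remaining * 100 >= (uint64_t)floor_percent * total_blocks;
}
```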
Sovereign Extensions
When running with the Sovereign profile:
- Content-Addressable Store (CAS): Files addressed by BLAKE3 hash. Automatic deduplication. ZSTD compression with double-checksum integrity on put/get.
- Content-Defined Chunking (CDC): Large files split at content-determined boundaries. Small edits re-store only changed chunks.
- DAG Versioning: File history as a Merkle DAG. Efficient branching, merging, and cryptographic history verification.
- TimeWarp Snapshots: Instant O(1) filesystem snapshots via copy-on-write DAG nodes.
C FFI
NexFS exposes a complete C-compatible API for integration with the Rumpk kernel and other system components:
```c
// Lifecycle
nexfs_format(cfg)                     // Format a flash volume
nexfs_mount(cfg)                      // Mount (dual-SB failover)
nexfs_unmount()                       // Unmount
nexfs_is_mounted()                    // Check mount status
nexfs_sync()                          // Flush pending writes
nexfs_checkpoint()                    // Atomic metadata checkpoint

// File operations
nexfs_create(path, mode)              // Create file, returns inode ID
nexfs_read(path, buf, len)            // Read file data
nexfs_write(path, buf, len)           // Write file data
nexfs_delete(path)                    // Delete file
nexfs_truncate(path, new_size)        // Truncate/extend file
nexfs_rename(old_path, new_path)      // Move/rename
nexfs_clone(src_path, dst_path)       // CoW deep copy
nexfs_shred(path)                     // Secure erase (overwrite + flash erase)

// Directory operations
nexfs_mkdir(path, mode)               // Create directory
nexfs_rmdir(path)                     // Remove empty directory

// System
nexfs_scrub(result)                   // Integrity scan
nexfs_health(stats)                   // Flash health statistics
nexfs_lock(inode_id)                  // Advisory lock (non-blocking)
nexfs_unlock(inode_id)                // Release lock

// Recompression
nexfs_recompress(cid, algo, level)    // COW recompress single chunk
nexfs_recompress_batch_begin()        // Start batch transaction
nexfs_recompress_enqueue(cid, a, l)   // Enqueue chunk
nexfs_recompress_batch_commit()       // Commit all + free old blocks
nexfs_recompress_batch_rollback()     // Undo all

// Tuning
nexfs_tune(param, value)              // Runtime parameter adjustment
```

All functions return 0 on success or a negative error code. Read and write return byte counts.
Wire Protocol
For networked volumes (Mesh profile), NexFS uses a wire protocol over UTCP:
| Message | Purpose |
|---|---|
| `BLOCK_WANT` | Request a block by hash |
| `BLOCK_PUT` | Deliver a block |
| `DAG_SYNC` | Synchronize DAG heads between peers |
Design Principles
| Principle | Implementation |
|---|---|
| Zero dynamic allocation | All buffers caller-provided. No malloc, no heap. |
| No-std compatible | Zig with no libc. Runs on bare metal. |
| Flash-native | Designed for NOR/NAND characteristics. No FTL assumption. |
| Graph-first | Directories are edge sets, not flat lists. |
| Integrity by default | Per-inode checksums. Dual superblock. Scrub. |
| Fail-safe writes | Checkpoint propagates errors. Journal for multi-step ops. |
Comparison
For an honest, detailed comparison of NexFS against ext4, F2FS, XFS, ZFS, Btrfs, bcachefs, and HAMMER2, see NexFS vs. The Field.
License
LSL-1.0 (Libertaria Sovereign License)