
Storage & Network Decisions

Six decisions spanning how Nexus stores data and moves packets — and why neither looks like Linux.

ST1: Graph-Native Filesystem

Status: Accepted

Context

POSIX filesystems are hierarchical trees: one parent per directory entry, metadata coupled to allocation. Content-addressed systems (IPFS, Git) decouple storage from naming but lack directory hierarchy. NexFS needed both — a filesystem that's content-addressed at the block level but navigable as a directory graph.

Decision

NexFS uses content-addressed blocks with graph-native directories:

  • Every block is BLAKE3-hashed at write time (per-inode selectable: BLAKE3-256, BLAKE3-128, XXH3-64)
  • Directories store typed edges: primary (ownership), reference (hard link), pin (immutable ref), projection (computed view)
  • Identical blocks share storage transparently (deduplication)
  • Dual superblock with automatic failover on corruption
  • Merkle tree from superblock to blocks enables incremental verification
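A rough sketch of the first three points, in plain Python. `hashlib.blake2b` stands in for BLAKE3 (which is not in the standard library), and the class and field names are illustrative, not NexFS APIs:

```python
import hashlib

def block_id(data: bytes) -> str:
    # BLAKE2 stands in for BLAKE3 here; only the content-addressing idea matters.
    return hashlib.blake2b(data, digest_size=32).hexdigest()

class BlockStore:
    """Content-addressed store: identical blocks are stored exactly once."""
    def __init__(self):
        self.blocks = {}                     # block_id -> data
    def put(self, data: bytes) -> str:
        bid = block_id(data)
        self.blocks.setdefault(bid, data)    # dedup: no-op if already present
        return bid

class Directory:
    """Graph-native directory: typed edges instead of a single parent link."""
    def __init__(self):
        self.edges = []                      # (edge_type, name, target_id)
    def link(self, edge_type: str, name: str, target: str):
        assert edge_type in ("primary", "reference", "pin", "projection")
        self.edges.append((edge_type, name, target))

store = BlockStore()
a = store.put(b"hello world")
b = store.put(b"hello world")                # identical content -> same ID
assert a == b and len(store.blocks) == 1     # stored once (deduplication)

root = Directory()
root.link("primary", "greeting.txt", a)      # ownership edge
clone = Directory()
clone.link("reference", "greeting.txt", a)   # O(1) clone: just add an edge
```

The COW-clone consequence below falls out directly: cloning never copies blocks, it only adds a `reference` edge pointing at existing content.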

Alternatives Rejected

  • ext4 / BTRFS: Block allocation opaque, no content addressing, no typed edges
  • ZFS: Feature-rich but 100MB+ kernel footprint, CDDL licensing concerns
  • IPFS-style flat store: No directory hierarchy, slow metadata lookup
  • SQLite-as-filesystem: ACID overhead per write, not suitable for flash-native operation

Consequences

  • Deduplication automatic and transparent (identical data stored once)
  • Corruption detectable on every read (hash mismatch)
  • COW cloning is O(1) — just add a reference edge
  • RevMap enables reverse traversal (who links to this inode?)
  • Block allocation requires hash computation (10-50 µs per write)
  • No POSIX semantics (Membrane translates for legacy apps)
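The "corruption detectable on every read" and Merkle-verification consequences can be shown together in a small sketch (again with `hashlib.blake2b` standing in for BLAKE3; the layout is illustrative, not the on-disk NexFS format):

```python
import hashlib

def h(data: bytes) -> bytes:
    # BLAKE2 stands in for BLAKE3 (not in the stdlib).
    return hashlib.blake2b(data, digest_size=32).digest()

blocks = [b"block-0", b"block-1", b"block-2", b"block-3"]
leaves = [h(b) for b in blocks]

# Fold leaf hashes pairwise up to a single root, as stored in the superblock.
level = leaves
while len(level) > 1:
    level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
root = level[0]

def read_block(i: int) -> bytes:
    data = blocks[i]
    if h(data) != leaves[i]:                 # every read re-hashes the block
        raise IOError(f"corruption detected in block {i}")
    return data

assert read_block(2) == b"block-2"
blocks[1] = b"bit-rot"                       # simulate silent corruption
try:
    read_block(1)
except IOError:
    pass                                     # mismatch caught on the next read
```

Incremental verification means only the path from a changed block up to the root needs re-hashing, not the whole tree.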

ST2: No /dev, /proc, /sys

Status: Accepted

Context

Linux pseudo-filesystems (/dev, /proc, /sys) evolved organically over 30 years. They're fragile to parse (heuristic text scraping), prone to information leaks (rootkits hide in /proc), and inconsistent (each subsystem formats differently). They are also a leading source of container-escape bugs.

Decision

Replace pseudo-filesystems with explicit interfaces:

  • /Bus/: Device discovery via symlinks to driver NPI objects (e.g., /Bus/Net/eth0/Cell/Driver/net-intel/npi)
  • ProvChain ledger: Process introspection via formal append-only event log (not file scraping)
  • KDL manifests: System configuration via declarative files (not sysctl)
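The contrast with /proc scraping can be sketched concretely. The event shapes below are invented for illustration (not the actual ProvChain schema); the point is that introspection replays structured records rather than parsing text:

```python
# Hypothetical ProvChain-style entries: structured events, not text to scrape.
ledger = [
    {"seq": 0, "kind": "cell.start", "cell": "net-intel", "ts": 100},
    {"seq": 1, "kind": "cell.start", "cell": "shell",     "ts": 105},
    {"seq": 2, "kind": "cell.exit",  "cell": "shell",     "ts": 230},
]

def live_cells(events):
    """Replay the append-only log; no heuristic parsing of /proc text."""
    alive = set()
    for e in sorted(events, key=lambda e: e["seq"]):
        if e["kind"] == "cell.start":
            alive.add(e["cell"])
        elif e["kind"] == "cell.exit":
            alive.discard(e["cell"])
    return alive

assert live_cells(ledger) == {"net-intel"}
```

Because the log is append-only and ordered, any observer replaying it reaches the same answer, which is what makes the "observable via formal ledger" consequence below verifiable rather than heuristic.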

Alternatives Rejected

  • Keep /dev: Device node permission management is fragile, hot-plug handling awkward
  • Enhanced /proc: Encourages heuristic monitoring, harder to formalize and verify
  • Plan 9 namespaces: Elegant but requires discipline across the entire stack; hard in mixed systems

Consequences

  • Service discovery is declarative (stat() on /Bus/ reveals available hardware)
  • Device state observable via formal ledger (structured, not heuristic)
  • No pseudo-filesystem to spoof (eliminates entire class of rootkit techniques)
  • Legacy tools expecting /proc/cpuinfo need Membrane translation
  • Standard monitoring tools (htop, ps) need reimplementation

ST3: CBOR Wire Format

Status: Accepted

Context

Object serialization needs a binary format. Protocol Buffers require code generation and .proto files. MessagePack lacks schema semantics. JSON is text (2x larger, slower parsing). Nexus serializes objects extensively — CAS objects, ION Ring messages, ProvChain entries.

Decision

CBOR (RFC 8949) for all serialization:

  • Binary CIDs (BLAKE3 hashes) embedded as native CBOR bstr (byte strings)
  • Objects tagged with CBOR semantic tags (e.g., tag 42 for Fiber, tag 43 for Event)
  • Optional fields map to CBOR null (backward compatible without schema versioning)
  • Self-describing: readers auto-detect types via tags
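The two CBOR primitives this decision leans on (byte strings, major type 2, and semantic tags, major type 6) encode per RFC 8949's header rules. A minimal sketch restricted to short lengths, using tag 42 only because the bullet above uses it as an example:

```python
def cbor_bstr(data: bytes) -> bytes:
    # Major type 2 (byte string); this sketch handles short lengths only.
    n = len(data)
    if n <= 23:
        return bytes([0x40 | n]) + data
    if n <= 0xFF:
        return bytes([0x58, n]) + data
    raise ValueError("sketch handles short byte strings only")

def cbor_tag(tag: int, item: bytes) -> bytes:
    # Major type 6 (semantic tag) followed by the tagged item.
    if tag <= 23:
        return bytes([0xC0 | tag]) + item
    if tag <= 0xFF:
        return bytes([0xD8, tag]) + item
    raise ValueError("sketch handles small tags only")

cid = bytes(32)                         # 32-byte BLAKE3 digest placeholder
encoded = cbor_tag(42, cbor_bstr(cid))
# Header: 0xD8 0x2A (tag 42), 0x58 0x20 (byte string, length 32)
assert encoded[:4] == bytes([0xD8, 0x2A, 0x58, 0x20])
assert len(encoded) == 4 + 32
```

A reader that hits the `0xD8 0x2A` prefix knows the payload's type without any external schema, which is the self-describing property claimed above.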

Alternatives Rejected

  • Protocol Buffers: Requires .proto files, code generation build step, Google ecosystem dependency
  • MessagePack: No schema semantics, no standard way to embed cryptographic hashes
  • JSON: Text format, ~2x larger, slower parsing, no binary types
  • FlatBuffers: Zero-copy but complex schema management, limited language support

Consequences

  • No code generation needed (type information travels with the data)
  • Self-describing (new fields added without breaking old readers)
  • Backward compatible (optional fields = null)
  • Native binary type for CIDs (no hex encoding overhead)
  • Larger than hand-optimized binary formats (~10-15% overhead)
  • CBOR tooling less mature than Protocol Buffers ecosystem

N1: TCP/IP in Userland

Status: Accepted

Context

Linux puts TCP/IP in the kernel: complex, hard to modify, context switch overhead for every socket operation. Dedicated TCP offload hardware is expensive and inflexible. LwIP (Lightweight IP) is a BSD-licensed embedded stack with a core footprint around 20 KB.

Decision

LwIP runs in userland as part of the Membrane (libnexus.a):

  • Linked into every application binary as a static library
  • ION Rings carry raw Ethernet frames: NIC RX → kernel (NetSwitch) → app's LwIP instance
  • Kernel never parses IP headers — only L2 frame forwarding
  • Each application controls its own TCP behavior (congestion, windows, timeouts)
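The data path in the second bullet can be mocked in a few lines. The ring is modeled as a plain queue and the stack as a recording stub (a real build would link LwIP; these names are illustrative, not the libnexus.a API):

```python
from collections import deque

class AppNetStack:
    """Stand-in for a per-app LwIP instance: each app owns its own state."""
    def __init__(self):
        self.delivered = []
    def input_frame(self, frame: bytes):
        # A real stack would parse Ethernet/IP/TCP here; we just record it.
        self.delivered.append(frame)

# ION Ring stand-in: the kernel's NetSwitch enqueues raw L2 frames and the
# application drains them into its private stack, with no syscall per packet.
ion_ring = deque()
stack = AppNetStack()

ion_ring.append(b"\x00" * 14 + b"payload-1")   # fake Ethernet frames
ion_ring.append(b"\x00" * 14 + b"payload-2")

while ion_ring:
    stack.input_frame(ion_ring.popleft())

assert len(stack.delivered) == 2
```

Because each application holds its own `stack` instance, corrupting one app's TCP state cannot affect any other app, which is the isolation consequence listed below.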

Alternatives Rejected

  • Kernel TCP (Linux model): Context switch overhead, complex kernel code, one-size-fits-all configuration
  • Custom TCP in Nim: 5000+ lines of careful state machine; LwIP is proven and maintained
  • eBPF TCP: Bleeding edge, ecosystem immature, licensing complexity

Consequences

  • Per-app networking isolation (one app's corruption can't crash another's TCP stack)
  • No kernel context switch for network I/O
  • Apps can tune TCP independently (different congestion algorithms per workload)
  • Memory overhead: 50-100 KB TCP/IP code per application
  • Requires LwIP integration with fiber scheduler (no blocking syscalls)

N2: UTCP Over QUIC

Status: Accepted

Context

QUIC is designed for Internet-scale web traffic: 5000+ lines of state machine, complex connection migration, certificate management. Nexus clusters operate on local networks with known peers, exchanging messages (not streams). QUIC is massive overkill.

Decision

UTCP (sovereign transport) for Nexus-to-Nexus communication:

  • Identity-centric: SipHash-128 CellID addressing, not IP:port
  • Message-native: Datagram framing, not byte streams
  • NACK-based: Assume good network, retransmit only on gap detection
  • L2 fork: EtherType 0x88B5 (UTCP) vs. 0x0800 (IPv4) at NetSwitch
  • Survives interface migration (WiFi→Ethernet) without session loss

TCP/IP (via LwIP) remains available for Internet-facing traffic. UTCP handles intra-cluster only.
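The NACK-based bullet inverts TCP's ACK model: nothing is sent on the happy path, and retransmission is requested only when a sequence gap appears. A simplified receiver sketch (the framing and NACK transport are abstracted away; none of this is the UTCP wire format):

```python
# NACK-based delivery sketch: assume a good network, retransmit only on gaps.
def receive(frames, retransmit):
    """frames: iterable of (seq, payload); retransmit(seq) -> payload."""
    got = {}
    highest = -1
    for seq, payload in frames:
        got[seq] = payload
        if seq > highest + 1:
            for missing in range(highest + 1, seq):   # gap detected
                got[missing] = retransmit(missing)    # NACK + await resend
        highest = max(highest, seq)
    return [got[i] for i in range(highest + 1)]

sent = {0: b"a", 1: b"b", 2: b"c", 3: b"d"}
arrived = [(0, b"a"), (2, b"c"), (3, b"d")]           # frame 1 was lost
nacked = []

def retransmit(seq):
    nacked.append(seq)
    return sent[seq]

assert receive(arrived, retransmit) == [b"a", b"b", b"c", b"d"]
assert nacked == [1]                                  # only the gap was re-sent
```

On a low-loss local network this costs nothing in the common case; the corresponding consequence below is that the model degrades when loss is frequent.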

Alternatives Rejected

  • Full QUIC: 10x complexity for cluster communication; certificate management overhead
  • Raw TCP/UDP: Requires IP configuration, routing, ARP; not message-native
  • gRPC/HTTP: Application-layer complexity, inefficient binary serialization
  • Custom over UDP: Still requires an IP stack; UTCP operates at L2 directly

Consequences

  • Native message boundaries (no stream reassembly)
  • Direct CellID addressing (no DNS, no IP routing for local peers)
  • Far simpler state machine (~100 LOC vs. 5000+ for QUIC)
  • Session survives network interface changes
  • Limited to Nexus ecosystem (can't talk to legacy Internet services)
  • NACK model assumes low-loss network (degrades on congested WAN)

N3: L2 Switching Only

Status: Accepted

Context

Kernel-level IP routing adds complexity: routing tables, path selection, ARP caching, ICMP handling. Each piece is a security surface and a maintenance burden. In Nexus, each cell has its own LwIP instance — it handles its own IP.

Decision

The kernel (NetSwitch) operates at Layer 2 only:

  • Frame forwarding based on EtherType
  • 0x0800 (IPv4) / 0x86DD (IPv6) → route to cell's Membrane
  • 0x88B5 (UTCP) → route to UTCP handler fiber
  • 0x4C57 (LWF) → route to Libertaria Wire Frame handler
  • No IP parsing, no routing tables, no ARP cache in kernel
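The whole forwarding decision fits in one lookup on bytes 12-13 of the Ethernet header. A sketch of that dispatch, with a hypothetical handler registry standing in for NetSwitch's internal routing:

```python
import struct

# Hypothetical handler registry keyed by EtherType, mirroring the bullets above.
HANDLERS = {
    0x0800: "membrane-ipv4",
    0x86DD: "membrane-ipv6",
    0x88B5: "utcp-fiber",
    0x4C57: "lwf-handler",
}

def switch(frame: bytes) -> str:
    """Dispatch on the EtherType at offset 12; never parse beyond L2."""
    if len(frame) < 14:
        raise ValueError("runt frame")
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype not in HANDLERS:
        raise ValueError(f"no handler for EtherType {ethertype:#06x}")
    return HANDLERS[ethertype]

# dst MAC (6) + src MAC (6) + EtherType 0x88B5 + payload
frame = b"\xff" * 6 + b"\xaa" * 6 + struct.pack("!H", 0x88B5) + b"payload"
assert switch(frame) == "utcp-fiber"
```

Everything after byte 13 is opaque to the switch, which is exactly why the kernel needs no routing tables, ARP cache, or IP parser.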

Alternatives Rejected

  • Kernel L3 routing: Couples cells together, harder to isolate, adds kernel complexity
  • No switching (direct NIC per cell): Requires multiple NICs or SR-IOV; not always available
  • Virtual switch (OVS-style): Learning tables, spanning tree, broadcast handling — overkill for a single host

Consequences

  • Kernel is IP-unaware (smaller, easier to verify, smaller attack surface)
  • Each cell's networking is fully independent
  • No routing table consistency problems
  • L2 broadcast domains may be large (ARP overhead on many-cell systems)
  • Multi-host networking requires a management plane (not kernel's job)