Building software for places where the internet barely exists

Posted on Jun 15

By Mathéo Delbarre, 2nd year CS student at EPITECH Nancy, France

TL;DR

Problem: district health clinics in Bhutan need to sync patient data to a central hospital hub over 2G connections that drop constantly, with nodes that lose power mid-transfer.
Solution: a Rust sync engine built around an encrypted append-only log, delta sync via version vectors, hybrid logical clocks for clock-drift tolerance, and mTLS for identity.
Constraints: under 10 MB RAM on a Raspberry Pi 3, static binary, no stable connection required.
Result: 5,000 events replicated with zero loss across a simulated 2G link with a mid-transfer TCP cut.
Code: github.com/Etoile-Bleu/ZamSync

I want to tell you about a problem I discovered last year that I couldn't stop thinking about. Not a LeetCode problem, not a framework benchmark: a real situation where data gets lost, patients get hurt, and the existing software world has basically nothing useful to offer.

The problem, in one picture

Before anything else, here is the system we are talking about:

District clinic (Raspberry Pi 3, 1GB RAM)
  WAL: [event_1][event_2]...[event_N]   <- encrypted on disk
  |
  | 2G link: 23 kbps, 600ms latency, drops every few hours
  | (or offline for days during monsoon season)
  |
Hub hospital server
  WAL: [all events from all clinics]    <- encrypted, segregated by origin
  |
  REST API / dashboard / reporting

Each clinic writes events locally. When the link comes up, it syncs the delta to the hub. If the link drops mid-transfer, the next sync resumes exactly where it stopped. No re-transmission, no duplicates, no corrupted state.

That is ZamSync. Now let me explain why I built it.

The thing I found out about Bhutan

Bhutan is a landlocked kingdom in the Himalayas with about 750,000 people spread across terrain that is some of the most difficult on Earth. The country runs a fully free public healthcare system: every citizen, regardless of location, has the right to healthcare.

The problem is the word "location."

Bhutan is divided into 20 administrative districts called dzongkhags. Many district hospitals and Basic Health Units (BHUs) sit at altitude, accessible only by mountain roads that wash out in monsoon season. The government has been expanding its Rural Connectivity Programme: as of 2022, over 850 villages have 2G or 4G access. But "access" and "reliable connectivity" are not the same sentence.

In the highlands, the dominant backhaul option is satellite. Kacific-1, the high-throughput satellite serving Bhutan, delivers latency of around 600ms. In areas on 2G, the average rural broadband speed is 23 kilobits per second. That is not a typo. Twenty-three kbps. And it drops. Monsoons knock out towers. Terrain blocks signal. Power cuts happen.

Into this landscape, the Bhutanese government started rolling out the electronic Patient Information System, or ePIS.

What ePIS is, and what happened when it launched

ePIS is described as "a comprehensive, integrated Electronic Health Record system" managing patient demographics, diagnoses, medications, allergies, lab results, and imaging studies. It launched at the Jigme Dorji Wangchuck National Referral Hospital (JDWNRH) on April 18, 2023, as reported by The Bhutanese.

On launch day, when all staff logged in simultaneously, the network couldn't handle the load. Many people could not log in at all. Those who did get in were left waiting to upload information for even a single patient. The hospital leadership decided in real time: have every doctor see two or three patients through ePIS, and continue with paper prescriptions for the rest.

Thirty to forty patients had to repeat blood tests because the lab integration failed and the system couldn't capture their phone numbers to notify them.

This is a national referral hospital, in the capital, with the best connectivity in the country.

They fixed it. By early 2024, ePIS had expanded to 58 hospitals and 186 Primary Health Centres nationwide, with over 604,000 patient records and 6,400 registered health professionals. Real progress.

But reporting from BBS (the Bhutan Broadcasting Service) also notes that "extreme weather conditions and challenging terrain affect reliable internet connectivity", and that power and network issues continue to limit full utilization in remote areas. They are exploring Starlink. The previous system, BHMIS, used distributed MS Access databases: districts entered paper records and physically forwarded database files to the ministry for integration. No local analysis was possible.

ePIS is a generational improvement over that. But the underlying challenge has not disappeared: you still need to get data from a clinic node to a hub node, over a connection that may be slow, unreliable, or completely absent for stretches of time.

That is the problem I wanted to solve. I am a student, not the Ministry of Health. But the class of problem is one a student can work on: reliable, efficient, low-footprint data sync between nodes where connectivity is a privilege, not a given.

Why I didn't just use an existing tool

Before writing a single line of Rust, I spent weeks looking at what already existed. The short version:

CouchDB / PouchDB replication is mature and battle-tested, but CouchDB carries ~150 MB of memory overhead and sends entire document revision trees. On a 23 kbps link, that tree overhead alone is unacceptable.

IPFS comes up often in "offline-first" conversations, and I want to address it directly because the mismatch with this use case runs deep:

IPFS is built around immutable content addressing. Updating a patient record means creating a new block, getting a new CID, and updating a mutable pointer. Significant overhead for a high-frequency append-only log.
The go-ipfs daemon consumes 200-400 MB at rest. A Raspberry Pi 3 with 1GB RAM cannot sustain that.
IPLD graph traversal requires many round-trips per sync, and every round-trip costs 600ms on a satellite link. IPFS's own issue tracker notes this as a bottleneck serious enough to consider "ditching the IPFS DAG sync" in favor of vector-clock-based protocols.
IPFS assumes a semi-open DHT network. A clinic-to-hospital topology is controlled: known nodes, fixed addresses, mTLS identity. DHT discovery is overhead you don't need and a security surface you don't want in a medical deployment.

SQLite with manual sync logic is closest, but it pushes the entire sync problem onto the application: no standard protocol, no connection-drop handling, no ordering guarantees across nodes with drifting clocks.

Here is the comparison across the axes that actually matter:

                        ZamSync          CouchDB/PouchDB     IPFS           SQLite (manual)
----------------------  ---------------  ------------------  -------------  ---------------
Memory at rest          ~4 MB            ~150 MB             ~200-400 MB    ~2 MB
Offline-first native    Yes              Yes                 Partial        No protocol
Delta sync              Version vectors  Partial (rev tree)  No (full DAG)  You build it
ARM static binary       Yes (~9 MB)      No                  No             Yes
mTLS built-in           Yes              No                  No             No
Connection drop resume  Exact position   Restarts            Restarts       You build it
Clock drift handling    HLC              NTP required        NTP required   You build it

How ZamSync works: the three core ideas

Before going deep, here is a quick map. ZamSync is built on three ideas that compound on each other. Each one solves a specific failure mode:

Failure mode                 Solution in ZamSync
--------------------------   -------------------
Data lost on power cut    -> Write-Ahead Log (WAL): encrypted, append-only, CRC32 per record
Clock drift across nodes  -> Hybrid Logical Clocks: stays near wall time but tolerates drift
Bandwidth wasted on resync -> Version Vectors: exact delta, zero re-transmission

Let me go through each one.

1. The Write-Ahead Log

An append-only log is one of the oldest ideas in databases. When you want durability without the cost of random-access writes, you write everything sequentially to a file. The WAL in ZamSync stores events as fixed-format binary records:

┌────────────────────────────────────────────────────────────────────────┐
│  Record wire format                                                    │
│                                                                        │
│  [4 bytes: magic]  [8 bytes: HLC timestamp]  [8 bytes: sequence]      │
│  [4 bytes: payload_len]  [PAYLOAD]  [4 bytes: CRC32]                  │
│                                                                        │
│  Nonce (96 bits, random per record) prepended when encryption enabled  │
└────────────────────────────────────────────────────────────────────────┘

Every record has a CRC32 integrity check. If the process dies mid-write, the partial record is detected on recovery and truncated. This is the same principle PostgreSQL's WAL uses, just much simpler because we don't need transactions or rollback.

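Encryption is ChaCha20-Poly1305 with a fresh random 96-bit nonce per record. ChaCha20 was chosen over AES-GCM deliberately: it is fast in plain software and needs no hardware acceleration. That matters on the Raspberry Pi 3, which has no AES instructions: AES-NI is x86-only, and the Pi 3's Cortex-A53 ships without the optional ARMv8 Cryptography Extensions.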
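To make the record format and the torn-write recovery concrete, here is a minimal sketch of an unencrypted record encoder and a recovery scan. It is not ZamSync's actual code: the magic value `ZSWL`, the little-endian layout, the choice to have the CRC cover every byte before it, and the names `Record`, `encode`, and `recover` are all assumptions made for illustration.

```rust
// Hypothetical magic value and little-endian layout; the post specifies neither.
const MAGIC: [u8; 4] = *b"ZSWL";
const HEADER_LEN: usize = 4 + 8 + 8 + 4; // magic + HLC + sequence + payload_len
const TRAILER_LEN: usize = 4; // CRC32

/// Bitwise CRC-32 (IEEE 802.3, reflected polynomial 0xEDB88320).
pub fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            let mask = (crc & 1).wrapping_neg(); // all ones if the low bit is set
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Record {
    pub hlc: u64,
    pub sequence: u64,
    pub payload: Vec<u8>,
}

fn read_u32(b: &[u8]) -> u32 {
    let mut a = [0u8; 4];
    a.copy_from_slice(&b[..4]);
    u32::from_le_bytes(a)
}

fn read_u64(b: &[u8]) -> u64 {
    let mut a = [0u8; 8];
    a.copy_from_slice(&b[..8]);
    u64::from_le_bytes(a)
}

/// Serialize one record. Assumption: the CRC covers every byte before it.
pub fn encode(rec: &Record) -> Vec<u8> {
    let mut buf = Vec::with_capacity(HEADER_LEN + rec.payload.len() + TRAILER_LEN);
    buf.extend_from_slice(&MAGIC);
    buf.extend_from_slice(&rec.hlc.to_le_bytes());
    buf.extend_from_slice(&rec.sequence.to_le_bytes());
    buf.extend_from_slice(&(rec.payload.len() as u32).to_le_bytes());
    buf.extend_from_slice(&rec.payload);
    let crc = crc32(&buf);
    buf.extend_from_slice(&crc.to_le_bytes());
    buf
}

/// Scan a log image and return the intact records plus the byte offset where
/// the valid prefix ends. A caller would truncate the file to that offset.
pub fn recover(log: &[u8]) -> (Vec<Record>, usize) {
    let mut records = Vec::new();
    let mut pos = 0;
    while log.len() - pos >= HEADER_LEN + TRAILER_LEN {
        let rest = &log[pos..];
        if rest[..4] != MAGIC {
            break; // not a record boundary: stop at the last good record
        }
        let hlc = read_u64(&rest[4..]);
        let sequence = read_u64(&rest[12..]);
        let end = HEADER_LEN + read_u32(&rest[20..]) as usize;
        if rest.len() < end + TRAILER_LEN {
            break; // torn write: the record was cut short
        }
        if crc32(&rest[..end]) != read_u32(&rest[end..]) {
            break; // corrupted record
        }
        records.push(Record { hlc, sequence, payload: rest[HEADER_LEN..end].to_vec() });
        pos += end + TRAILER_LEN;
    }
    (records, pos)
}
```

The scan stops at the first record that fails any check and reports how many bytes are trustworthy, so recovery is a single sequential pass followed by one truncate. With encryption enabled, the payload would be ciphertext and the nonce would sit in front of the record, which this sketch leaves out.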

2. Hybrid Logical Clocks: solving the clock drift problem

Here is a thing that breaks almost every naive sync implementation: clocks drift.

A quick primer on the problem for those not deep in distributed systems: in a system where multiple nodes each generate events independently, you need a way to order those events globally. The obvious approach is to use the node's wall clock. The problem: clocks drift. A clinic offline for three days will have a clock that is slightly off from the hub's clock. Event A arrives with timestamp T+5, event B arrives with timestamp T+3, and now your log is out of order even though B happened after A.

Lamport clocks (Leslie Lamport, 1978) solve causality by using a logical counter that advances on every event and communication. But they lose all connection to physical time, making debugging and forensic analysis painful.

ZamSync uses Hybrid Logical Clocks (HLC), from the 2014 paper "Logical Physical Clocks and Consistent Snapshots in Globally Distributed Databases" by Kulkarni, Demirbas, Madappa, Avva, and Leone. The idea: maintain a logical component like a Lamport clock, but keep it anchored close to physical time so values remain human-readable. The logical part only advances beyond physical time when it has to: when a received timestamp is ahead of the local clock, or when several events share the same physical reading.

HLC = max(physical_now, last_seen_hlc) + ticks_if_collision

Normal case (clocks in sync):
  Wall clock:  |--T1-----T2--T3--------T4--|
  HLC:         |--T1-----T2--T3--------T4--|  <- stays near wall clock

Clock drift (clinic B is 2 seconds behind):
  Wall clock B:|--T1'----T2'------------|
  HLC B:       |--T1-----T2------------|  <- corrected by received HLC from A

A node that was offline for a week with a drifted clock still produces events whose order is consistent with causality once it reconnects. The HLC is monotonically increasing within a node, and every exchange with a peer pulls the nodes toward one consistent order across the system.
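The update rules can be sketched in a few lines of Rust. This follows the send and receive rules from the HLC paper, but it is an illustration rather than ZamSync's implementation: the names `Hlc`, `Clock`, `tick`, and `observe` are hypothetical, the wall clock is passed in as a parameter so the behaviour is easy to test, and the 48-bit/16-bit packing into the 8-byte timestamp field is an assumption.

```rust
/// Hybrid logical clock value: physical milliseconds plus a logical counter.
/// Field order matters: the derived `Ord` compares `wall_ms` first, then `logical`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Hlc {
    pub wall_ms: u64,
    pub logical: u16,
}

impl Hlc {
    /// Hypothetical packing into 8 bytes: 48 bits of time, 16 bits of counter.
    /// Order-preserving as long as `wall_ms` fits in 48 bits.
    pub fn pack(self) -> u64 {
        (self.wall_ms << 16) | self.logical as u64
    }
}

pub struct Clock {
    last: Hlc,
}

impl Clock {
    pub fn new() -> Self {
        Clock { last: Hlc { wall_ms: 0, logical: 0 } }
    }

    /// Timestamp a local event. `physical_ms` is the node's wall clock reading.
    pub fn tick(&mut self, physical_ms: u64) -> Hlc {
        if physical_ms > self.last.wall_ms {
            self.last = Hlc { wall_ms: physical_ms, logical: 0 };
        } else {
            // Wall clock stalled or jumped backward: keep the old time and
            // bump the counter. (A real implementation must handle overflow.)
            self.last.logical += 1;
        }
        self.last
    }

    /// Merge a timestamp received from a peer, then timestamp the receipt.
    pub fn observe(&mut self, remote: Hlc, physical_ms: u64) -> Hlc {
        let wall = physical_ms.max(self.last.wall_ms).max(remote.wall_ms);
        let logical = if wall == self.last.wall_ms && wall == remote.wall_ms {
            self.last.logical.max(remote.logical) + 1
        } else if wall == self.last.wall_ms {
            self.last.logical + 1
        } else if wall == remote.wall_ms {
            remote.logical + 1
        } else {
            0 // the local physical clock is ahead of both
        };
        self.last = Hlc { wall_ms: wall, logical };
        self.last
    }
}
```

If the wall clock reads 1000 ms and then 990 ms after an NTP correction, `tick` returns (1000, 0) followed by (1000, 1): the timestamp never goes backward. A peer whose clock is behind adopts the larger time as soon as it calls `observe` on a received timestamp.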

3. Version Vectors: knowing exactly what you don't know

The sync protocol is built around Version Vectors. If you haven't encountered them before: a version vector is a map of node_id -> max_sequence_seen. Each node maintains one. When two nodes connect, they exchange vectors and compute the gap.

Hub vector:      { clinic_a: 847, clinic_b: 312, hub: 5001 }
Clinic A vector: { clinic_a: 923, clinic_b: 0,   hub: 4800 }

Gap analysis:
  Hub is missing:      clinic_a events 848..923
  Clinic A is missing: clinic_b events 1..312
                       hub events 4801..5001

The find_gaps function computes the exact missing ranges. Only those are transmitted. No full state transfer, no polling, no "did you get this?" handshakes. Bandwidth usage is proportional to the actual delta.
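A minimal sketch of the gap computation, using the vectors from the example above. The post names a find_gaps function but does not show it, so the signature here is a guess, and the sketch assumes each node holds a contiguous prefix of every origin's sequence (no holes), which is what lets a single number per origin describe its state.

```rust
use std::collections::BTreeMap;
use std::ops::RangeInclusive;

/// node_id -> highest sequence number seen from that node.
pub type VersionVector = BTreeMap<String, u64>;

/// For each origin, the range of sequences that `ours` holds and `theirs` lacks.
/// Hypothetical signature; assumes sequences start at 1 and have no holes.
pub fn find_gaps(
    ours: &VersionVector,
    theirs: &VersionVector,
) -> BTreeMap<String, RangeInclusive<u64>> {
    let mut gaps = BTreeMap::new();
    for (node, &our_max) in ours {
        // A node missing from the peer's vector means "seen nothing yet".
        let their_max = theirs.get(node).copied().unwrap_or(0);
        if our_max > their_max {
            gaps.insert(node.clone(), (their_max + 1)..=our_max);
        }
    }
    gaps
}
```

Calling `find_gaps(&clinic_a, &hub)` with the vectors above yields `clinic_a: 848..=923`, and `find_gaps(&hub, &clinic_a)` yields `clinic_b: 1..=312` and `hub: 4801..=5001`. Each side runs the same function with the arguments swapped, so one exchange of vectors is enough to plan both directions. Whether a computed range is actually sent is a separate decision: under the --policy own setting described later, the hub would withhold the clinic_b range from Clinic A.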

Here is what this means concretely on a constrained link:

5000-event backlog, 30 KB/s link:

Without delta sync:
  Full transfer: ~18 MB
  At 30 KB/s:   613 seconds = 10+ minutes
  On drop:      start over from zero

With ZamSync:
  Session 1 (cut at 8s): events 1-240 transferred, 24 KB
  Session 2 (resume):    gap vector shows 241-5000 missing
                         picks up at 241, zero re-transmission
  Re-transmitted:        0 bytes

mTLS and access control

In a health deployment, you cannot let arbitrary nodes connect to the hub. ZamSync uses mutual TLS: both sides present a certificate during the handshake.

The hub generates a CA certificate and signs node certificates for each clinic. If a clinic doesn't have a certificate signed by the hub's CA, the connection is rejected at the TLS layer before any data is read.

Hub CA (ECDSA P-256)
   |
   +-- hub node cert    (signed by CA)
   +-- clinic_a cert    (signed by CA, issued via: zamsync sign)
   +-- clinic_b cert    (signed by CA, issued via: zamsync sign)

clinic_c with a self-signed cert: rejected before any payload is read.

Beyond authentication, ZamSync implements --policy own: when Clinic A syncs, the hub only returns events that originate from Clinic A. Clinic B's data is stored at the hub but never disclosed to Clinic A. This is enforced at the protocol level, not in middleware.
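The filtering rule itself is small. A minimal sketch, assuming events carry their origin node ID and that the peer identity comes from the verified client certificate rather than from anything the peer claims in a message. The `Event` and `Policy` types, the `visible_to` function, and the `All` variant are hypothetical; the post only documents the `own` policy.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Event {
    pub origin: String,
    pub sequence: u64,
    pub payload: Vec<u8>,
}

/// `Own` mirrors the documented --policy own; `All` is a hypothetical
/// counterpart included only to make the contrast visible.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Policy {
    All,
    Own,
}

/// Select the events the hub may send to `peer`.
/// `peer` should be the node ID taken from the authenticated TLS session.
pub fn visible_to<'a>(events: &'a [Event], peer: &str, policy: Policy) -> Vec<&'a Event> {
    events
        .iter()
        .filter(|e| match policy {
            Policy::All => true,
            Policy::Own => e.origin == peer,
        })
        .collect()
}
```

Applying the filter where the outgoing batch is built, before anything is serialized, is what makes this a protocol-level guarantee: there is no code path that first loads Clinic B's events into a response for Clinic A and removes them afterwards.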

What running it actually looks like

# Hub: generate CA and hub cert, then serve
zamsync keygen /var/lib/zamsync/hub
zamsync serve --data-dir /var/lib/zamsync/hub --bind 0.0.0.0:7700 --policy own

# Clinic: generate keypair, get signed by hub
zamsync keygen /var/lib/zamsync/clinic_a
zamsync sign /var/lib/zamsync/clinic_a/tls/node.csr --ca /var/lib/zamsync/hub

# Push an event and sync
zamsync push --data-dir /var/lib/zamsync/clinic_a \
  --payload '{"patient_id":"BT-0042","event":"registration"}'

zamsync sync --data-dir /var/lib/zamsync/clinic_a --hub 192.168.1.10:7700

# Check status
zamsync status --data-dir /var/lib/zamsync/clinic_a

  Node ID:     clinic_a
  WAL records: 1247
  Last sync:   2 minutes ago  (hub: 192.168.1.10:7700)
  Pending:     0 events unsynced
  WAL size:    124 KB (encrypted)

The binary is a single static musl executable (~9 MB) with no shared library dependencies. Installation on a remote node is one curl command, no package manager, no dependency resolution.

The test that actually matters

I did not want to claim "this works on bad networks" based on localhost benchmarks.

The integration test suite uses Docker Compose and Toxiproxy, Shopify's network condition simulator. The setup:

Hub and clinic in separate containers, Toxiproxy between them.
Link configured: 600ms latency, 100ms jitter, 30 KB/s bandwidth cap.
5,000 events generated on the clinic.
Sync starts.
Mid-transfer: TCP connection cut for several seconds.
Reconnect, sync resumes.

Clinic                Toxiproxy (2G sim)           Hub
  |                        |                         |
  |------ connect -------->|------- connect -------->|
  |<-- version vectors ----|<-- version vectors ------|
  |--- events 1..240 ----->|---- events 1..240 ------>|
  |                        |                         |
  |      [TCP cut injected by Toxiproxy here]        |
  |                        |                         |
  |------ reconnect ------>|------- reconnect ------->|
  |<-- version vectors ----|<-- {clinic_a: 240} ------|
  |--- events 241..5000 -->|--- events 241..5000 ----->|
  |<---- sync complete ----|<----- 5000/5000. 0 dup ---|

Result: all 5,000 events replicated, zero loss, zero duplicates, no corrupted state.

git clone https://github.com/Etoile-Bleu/ZamSync
docker compose -f tests/docker-compose.test.yml up --build --abort-on-container-exit
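The property the test checks (resume at the exact position, no duplicates) can also be illustrated in-process, without Docker. The sketch below is a toy model and not the project's test code: `HubLog`, `apply`, and `sync_session` are hypothetical names, events are reduced to their sequence numbers, and the TCP cut is simulated by stopping the loop after a fixed number of events.

```rust
/// Hub-side state for one origin: applied sequences and the high-water mark.
pub struct HubLog {
    pub applied: Vec<u64>,
    pub high_water: u64,
}

impl HubLog {
    pub fn new() -> Self {
        HubLog { applied: Vec::new(), high_water: 0 }
    }

    /// Apply an event only if it is the next expected sequence.
    /// Returns false for duplicates and out-of-order events.
    pub fn apply(&mut self, seq: u64) -> bool {
        if seq == self.high_water + 1 {
            self.applied.push(seq);
            self.high_water = seq;
            true
        } else {
            false
        }
    }
}

/// One sync session: start right after the hub's high-water mark and stream
/// events up to `clinic_max`. `cut_after` simulates the link dropping after
/// that many events. Returns the number of events sent in this session.
pub fn sync_session(clinic_max: u64, hub: &mut HubLog, cut_after: Option<u64>) -> u64 {
    let start = hub.high_water + 1;
    let mut sent = 0;
    for seq in start..=clinic_max {
        if cut_after == Some(sent) {
            break; // simulated TCP cut
        }
        hub.apply(seq);
        sent += 1;
    }
    sent
}
```

A first session cut after 240 events followed by an uninterrupted second session sends 240 + 4,760 = 5,000 events in total, each exactly once. Because `apply` rejects anything that is not the next expected sequence, a replayed event is ignored instead of duplicated.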

Memory footprint

Phase 14 benchmark: 5-clinic simulation, all Toxiproxy constraints active.

Hub node:      ~7.8 MB RSS at peak   (5000 events, 5 concurrent clinic connections)
Clinic node:   ~4.2 MB RSS at peak   (1000 local events, 1 outbound sync)

For comparison:
  CouchDB:        ~150 MB at rest
  PostgreSQL:     ~120 MB at rest
  go-ipfs daemon: ~200-400 MB at rest

Under 10 MB is a hard design requirement, not an accident.

What I learned the hard way

Clock synchronization is weirder than you expect. Raspberry Pis that reconnect to a network after being offline sometimes have their clocks jump backward by several seconds when NTP kicks in. If the HLC does not handle the monotonicity invariant during that window, you produce events with timestamps lower than previously issued ones, and the version vector logic breaks silently. I found this the first time I ran the simulation on actual hardware, not on localhost.

Upstream library breaking changes are a real thing. The rcgen crate (TLS certificate generation) changed CertificateParams::signed_by from three arguments to two in version 0.14: instead of passing (node_key, ca_cert, ca_key) separately, you now construct an Issuer struct first with Issuer::from_params(&ca_params, ca_key), which moves the key. This means serializing the CA key PEM before the move, or losing access to it. Forty-line fix, not obvious, failed CI until I read the source.

Release automation and Cargo.lock. The release workflow bumped the version in Cargo.toml via sed, then tried to cargo publish. Failed every time with a "dirty working tree" error. The reason: sed updated Cargo.toml but never ran cargo, so Cargo.lock still had the old version. Fix: cargo generate-lockfile in the release job before committing. Obvious in hindsight.

Docker ARM emulation vs real hardware. QEMU's ARM emulation does not accurately represent CPU scheduling on real ARMv7 silicon. A race condition in the connection accept loop appeared only on real hardware under load and was masked by QEMU's single-threaded emulation. The Phase 14 test is designed to eventually run on real hardware, not just CI containers, for this reason.

Codebase structure

The project is a Rust workspace with four crates:

ZamSync/
├── crates/
│   ├── zamsync-core/       Pure logic: events, HLC, version vectors, port traits
│   │                       No I/O. Tested with fake backends.
│   ├── zamsync-storage/    WAL, SQLite metadata, encryption
│   ├── zamsync-network/    TCP transport, frame protocol, mTLS
│   └── zamsync-testing/    Shared test helpers, in-process test nodes
└── src/                    CLI: serve, sync, keygen, sign, status, push

The architecture is hexagonal: zamsync-core defines port traits, the other crates implement them. Core logic has zero I/O dependencies and can be fully tested without disk or network.
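As an illustration of what a port and a fake backend look like in this style, here is a minimal sketch. None of these names come from the ZamSync codebase: the `EventStore` trait, the `MemoryStore` fake, and the `replicate` function are hypothetical, and sequence numbers are assumed to start at 1 and be contiguous per origin.

```rust
use std::collections::BTreeMap;

/// Port: what the core logic needs from storage. Hypothetical trait.
pub trait EventStore {
    /// Append a payload for `origin` and return its sequence number.
    fn append(&mut self, origin: &str, payload: Vec<u8>) -> u64;
    /// Highest sequence stored for `origin`, or 0 if there is none.
    fn max_sequence(&self, origin: &str) -> u64;
    /// Payloads with sequence numbers in `from..=to` (1-based, inclusive).
    fn range(&self, origin: &str, from: u64, to: u64) -> Vec<Vec<u8>>;
}

/// Fake backend for tests: everything lives in memory, no disk, no network.
#[derive(Default)]
pub struct MemoryStore {
    logs: BTreeMap<String, Vec<Vec<u8>>>,
}

impl EventStore for MemoryStore {
    fn append(&mut self, origin: &str, payload: Vec<u8>) -> u64 {
        let log = self.logs.entry(origin.to_string()).or_default();
        log.push(payload);
        log.len() as u64
    }

    fn max_sequence(&self, origin: &str) -> u64 {
        self.logs.get(origin).map_or(0, |log| log.len() as u64)
    }

    fn range(&self, origin: &str, from: u64, to: u64) -> Vec<Vec<u8>> {
        match self.logs.get(origin) {
            Some(log) if from >= 1 && from <= to => {
                let end = (to as usize).min(log.len());
                let start = (from as usize - 1).min(end);
                log[start..end].to_vec()
            }
            _ => Vec::new(),
        }
    }
}

/// Core logic written only against the port: copy what `dst` lacks from `src`.
/// Returns the number of events copied.
pub fn replicate<S: EventStore, D: EventStore>(src: &S, dst: &mut D, origin: &str) -> u64 {
    let have = dst.max_sequence(origin);
    let want = src.max_sequence(origin);
    let missing = src.range(origin, have + 1, want);
    let count = missing.len() as u64;
    for payload in missing {
        dst.append(origin, payload);
    }
    count
}
```

Because `replicate` only sees the trait, the same function can be exercised against two in-memory stores in a unit test and against a WAL-backed store in production, which is the property the hexagonal layout is meant to provide.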

Where this is going

ZamSync is a sync engine, not an application. It does not know what a "patient" is. An application layer sits on top: the thing that renders a patient form, validates data, calls push_event(payload) and pull_events(). That layer is out of scope for me alone.

The vision is to present ZamSync to the Ministry of Health of Bhutan as a building block for exactly this: when a BHU loses connectivity for hours or days, local data survives and syncs cleanly when the link returns. That is a solved problem with ZamSync.

The roadmap:

Phase 15 (in progress): async Tokio runtime for better concurrency on single-core hardware.
Phase 16: mDNS peer discovery for local clinic networks.
Phase 17: conflict resolution primitives (last-write-wins default, application-level merge hooks).
Phase 18: bandwidth budgeting, sync_budget_kbps to leave capacity for actual clinical use.
Phase 21: structured error codes Z1xx-Z6xx with colored CLI output and JSON formatting.

Full roadmap: ROADMAP.md.

References

The core ideas in ZamSync are not mine. I implemented them:

Kulkarni, Demirbas, Madappa, Avva, Leone. "Logical Physical Clocks and Consistent Snapshots in Globally Distributed Databases." OPODIS 2014. Springer. The HLC paper.
Mattern, F. "Virtual Time and Global States of Distributed Systems." 1989. Vector clocks, the mechanism behind version vectors.
Ghemawat, Gobioff, Leung. "The Google File System." SOSP 2003. Append-only log design.
Rust crates: rustls, rcgen, rusqlite, zstd, tokio, chacha20poly1305. Production-grade, audited.
Shopify's Toxiproxy for network simulation.

Full citations: ACKNOWLEDGEMENTS.md.

Bhutan sources:

How you can help

I can write Rust. I cannot market a project. I want to be direct about what kind of help would actually move this forward.

If you want to give technical feedback:
The areas I am least confident about are the receive() fairness behavior under concurrent clinic connections and whether the Version Vector gap detection handles all edge cases. Open an issue, leave a comment, send a message. Code review is the most valuable thing anyone can offer right now.

If you want to contribute code:
Good First Issues are labeled. The current one is implementing the zamsync setup --hub interactive wizard: keygen, systemd unit installation, and setup checklist in a guided flow. Self-contained, no prior codebase knowledge required.

If you work in health tech, field data systems, or low-resource computing:
I would genuinely like to know if this kind of problem is something you have encountered in other contexts: humanitarian data collection, field research, rural logistics, anything that involves nodes that sync intermittently over constrained links. And if you have any idea how to get this in front of the people at health ministries who actually make infrastructure decisions, I would love to hear it.

A star on the repository also helps more than you might think.

GitHub: https://github.com/Etoile-Bleu/ZamSync

Thanks for reading.

Mathéo
