Small MongoDB, big ambitions -- a local-first document engine with WiredTiger and Atlas sync

smongo

Small MongoDB. Big ambitions.

MongoDB's document model and MQL are the most productive way to work with data -- but only if you can use them everywhere. Cloud, edge, laptop, airplane mode, CI pipeline, embedded device. smongo makes that real: a local-first MongoDB engine in Python, powered by WiredTiger (the same storage engine family that runs MongoDB itself), with bidirectional sync to Atlas when you're ready.

Write your app once. Run it against a local B-Tree. Ship it against Atlas. The query language never changes. The "S" stands for Small. The rest is all Mongo.

"Same everywhere" -- the architectural bet that the local engine, the query
language, the wire protocol, and the cloud database should all be the same
thing, with no translation layer in between.

from smongo import MongoClient

# Flip the URI -- same code, different backend
client = MongoClient("local://data")                              # embedded WiredTiger
# client = MongoClient("mongodb+srv://...")                        # Atlas / any mongod
# client = MongoClient("local://data", sync="mongodb+srv://...")   # local-first + auto sync

db = client["myapp"]
users = db["users"]

users.insert_one({"name": "Alice", "age": 34, "city": "NYC"})
users.create_index([("city", 1), ("age", -1)])

for doc in users.find({"city": "NYC", "age": {"$gt": 30}}):
    print(doc["name"])

results = users.aggregate([
    {"$group": {"_id": "$city", "avg_age": {"$avg": "$age"}}},
    {"$sort": {"avg_age": -1}},
])

Why smongo?

Problem | How smongo fixes it
--- | ---
Local dev requires a running mongod or Docker container | Embedded WiredTiger -- Rust extension with direct WiredTiger FFI. No mongod required
mongomock doesn't support real aggregation pipelines | Full pipeline engine: 25+ stages incl. $facet, $merge, $out, $vectorSearch, and $lookup, plus 17 group accumulators
Edge / offline-first apps need a different DB and query language | Same MQL everywhere -- one codebase, portable across environments
Syncing local state to the cloud is a custom nightmare | Built-in oplog-driven bidirectional sync with metrics, backoff, selective filters, and conflict resolution
Mock databases don't have indexes or query planners | Real B-Tree indexes with a heuristic prefix-scoring query planner that accelerates reads and writes
Embedded databases lack ACID writes or thread safety | WiredTiger transactions wrap every write (data + indexes + oplog); a per-collection ReadWriteLock allows concurrent reads while serializing writes

Architecture

┌────────────────────────────────────────────────────────┐
│                    Your Application                     │
│              from smongo import MongoClient              │
└────────────────────┬───────────────────────────────────┘
                     │  URI routing
          ┌──────────┴──────────┐
          ▼                     ▼
   local://path          mongodb://host
          │                     │
   ┌──────┴──────┐       ┌─────┴─────┐
   │  Rust Engine│       │  PyMongo  │
   │ (_smongo_   │       │  Driver   │
   │   core)     │       └───────────┘
   │  ┌───────┐  │
   │  │ MQL   │  │  ◄── compile_query, apply_update (Rust)
   │  │Compiler│  │      $gt $lt $in $ne $or $and ...
   │  └───┬───┘  │
   │      │      │
   │  ┌───┴───┐  │
   │  │ Query │  │  ◄── RustQueryPlanner: prefix-scoring
   │  │Planner│  │      index scan / pk lookup / coll scan
   │  └───┬───┘  │
   │      │      │
   │  ┌───┴───┐  │
   │  │B-Tree │  │  ◄── RustIndexManager: WiredTiger tables
   │  │Indexes│  │      single, compound, unique, sparse
   │  └───┬───┘  │
   │      │      │
   │  ┌───┴───┐  │
   │  │WiredTi│  │  ◄── Direct C FFI via wiredtiger-sys
   │  │  ger  │  │      key=_id, value=BSON (transactional)
   │  └───┬───┘  │
   │      │      │
   │  ┌───┴───┐  │       ┌──────────────┐
   │  │ Oplog │  │──────►│  SyncManager  │──► Atlas
   │  └───────┘  │       │  push / pull  │
   └─────────────┘       │  conflict res │
                         └──────────────┘

Rust-Powered Engine (Required)

The compiled Rust extension (_smongo_core) is required and provides all performance-critical paths via PyO3. MongoClient("local://...") creates a Python LocalClient that delegates all storage operations, query compilation, expression evaluation, and update application to Rust:

  • Storage Engine -- RustLocalClient, RustLocalDB, RustLocalCollection with direct WiredTiger C FFI (wiredtiger-sys sub-crate, dlopen). Every insert, find, update, delete, and index operation flows through Rust.
  • B-Tree Indexes & Query Planner -- RustIndexManager and RustQueryPlanner manage all index types (single, compound, unique, sparse, text, hashed, wildcard) with Rust-native key encoding and plan scoring.
  • Streaming Cursors -- RustStreamingCursor lazily iterates WiredTiger cursors for collection scan, PK lookup, index-backed, and OR-union paths.
  • ACID Transactions -- RustTransactionSession with thread-local session override ensures all operations within a transaction route through the same WiredTiger session.
  • BSON Serialization -- encode/decode documents using the Rust bson crate, eliminating Python tree walks (roughly 60% of write time).
  • MQL Query Compiler -- compile_query with all 18 query operators, compiled predicate evaluation
  • Expression Engine -- resolve_expr with all 72 aggregation expression operators
  • Update Engine -- apply_update with all 14 update operators, positional operators, and pipeline updates
  • Aggregation Pipeline -- Full pipeline dispatch in Rust via aggregate_pipeline. All 25+ stages including $group (17 accumulators), $lookup (equality + sub-pipeline), $graphLookup, $facet. I/O-dominated stages ($out, $merge, $unionWith) and $vectorSearch delegate to Python.
  • Wire Protocol -- Tokio-based async TCP server with Rust command handlers for all ~77 commands. BSON boundary normalization, cursor registry, session management, and profiler all in Rust. On the wire, find applies sort, skip, limit, and projection in Rust; aggregate dispatches straight into the Rust pipeline (aggregate_pipeline). Oplog and admin/metadata WiredTiger work uses typed Rust session/cursor borrow (no Python dispatch on those WT hot paths).
  • Schema Validation -- $jsonSchema document validation runs entirely in Rust (schema.rs). Supports required, properties, type/bsonType, numeric/string/array constraints, enum, pattern, additionalProperties, and nested objects with ReDoS-safe regex matching.

The Python modules that remain are high-level orchestration (aggregation Cursor for the Python API, SyncManager) that calls into the Rust storage layer. See BYE-BYE-GIL.md for the full story.

  • Free-Threaded Python -- smongo supports Python 3.13+ free-threaded builds (python3.13t). The extension declares gil_used = false and uses PyOnceLock for deadlock-free initialization. All unsafe impl Send/Sync are backed by Rust-native locks, not the GIL. Under the free-threaded interpreter, the wire protocol server can handle concurrent connections with true thread parallelism.

Features

Storage -- WiredTiger B-Trees with Streaming Reads

MongoDB acquired WiredTiger in 2014 and made it the default storage engine in MongoDB 3.2. smongo uses the same technology locally: documents are stored as native BSON bytes in WiredTiger B-Tree tables keyed by _id. Every write is wrapped in a WiredTiger transaction (data + indexes + oplog in a single atomic unit), and a per-collection ReadWriteLock ensures thread safety while allowing concurrent readers. The query planner also accelerates writes: an update or delete by _id or by an indexed field is O(log n), not O(n). The result is ACID atomicity, crash recovery, and efficient disk I/O with no extra work from the application.
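To make the locking rule concrete, here is a minimal sketch of a reader-writer lock that admits many readers or exactly one writer. It is an illustration only, not smongo's ReadWriteLock (the runtime uses a Rust implementation); the method names are assumptions chosen for this example.

```python
import threading


class ReadWriteLockSketch:
    """Many concurrent readers, or exactly one writer (illustrative only)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0       # number of readers currently inside
        self._writer = False    # True while a writer holds the lock

    def acquire_read(self):
        with self._cond:
            while self._writer:             # readers wait only for a writer
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()     # wake a waiting writer

    def acquire_write(self):
        with self._cond:
            while self._writer or self._readers:   # writers need exclusivity
                self._cond.wait()
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()         # wake readers and writers
```

This simple version can starve writers under a steady stream of readers; a production lock would normally add a writer-preference or fairness policy.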

Reads are lazy. Collection.find() returns a chainable Cursor backed by a RustStreamingCursor that pulls documents from WiredTiger one at a time. The streaming cursor consults the query planner and executes the optimal strategy (PK lookup, index scan, $in multi-point scan, $or-union, or collection scan) -- all lazily. Chained .limit(10) without .sort() deserializes only 10 documents from BSON regardless of how many match. find_one() and count_documents() use the same streaming path so they never build intermediate lists.
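The short-circuit behaviour of a lazy cursor can be sketched with a plain Python generator. This is not RustStreamingCursor; `scan` and the `decoded` list are hypothetical names used only to show that a limit of 10 touches 10 rows.

```python
from itertools import islice

decoded = []  # records which rows were actually "deserialized"


def scan(raw_rows, predicate):
    """Lazily decode rows one at a time and yield the ones that match."""
    for row in raw_rows:
        decoded.append(row["_id"])   # stand-in for decoding one BSON document
        if predicate(row):
            yield row


rows = [{"_id": i, "city": "NYC"} for i in range(1000)]

# Equivalent in spirit to find({"city": "NYC"}).limit(10) with no sort.
first_ten = list(islice(scan(rows, lambda d: d["city"] == "NYC"), 10))
```

Only the first 10 rows are decoded even though all 1000 match. Adding a sort would force the whole input to be read first, which is why the text singles out .limit() without .sort().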

MQL Compiler

A Rust-accelerated compiler translates MongoDB query dictionaries into executable predicates. Supported query operators: $gt, $lt, $gte, $lte, $eq, $ne, $in, $nin, $exists, $regex, $not, $nor, $all, $elemMatch, $size, $type, $or, $and. Supported update operators include $set, $inc, $push, $unset, $addToSet, $pull, $pop, $min, $max, $rename, $currentDate, and $mul. Dot-notation paths such as "address.city" work everywhere.
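The idea of compiling a query dictionary into a predicate can be sketched in a few lines of Python. This is a simplified illustration covering a handful of operators, not smongo's compile_query; `compile_sketch` and `get_path` are hypothetical names.

```python
import operator

# Comparison operators handled by this sketch (a small subset of MQL).
_COMPARATORS = {
    "$gt": operator.gt, "$gte": operator.ge,
    "$lt": operator.lt, "$lte": operator.le,
    "$eq": operator.eq, "$ne": operator.ne,
}
_MISSING = object()  # sentinel for "field not present"


def get_path(doc, path):
    """Resolve a dot-notation path such as "address.city"."""
    current = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return _MISSING
        current = current[part]
    return current


def _field_check(path, op, operand):
    """Build a predicate for one (field, operator, operand) triple."""
    if op == "$in":
        return lambda doc: get_path(doc, path) in operand
    compare = _COMPARATORS[op]

    def check(doc):
        value = get_path(doc, path)
        if value is _MISSING:
            return op == "$ne"       # a missing field only satisfies $ne
        try:
            return compare(value, operand)
        except TypeError:            # mixed types: treated as no match here
            return False

    return check


def compile_sketch(query):
    """Turn a query dict into a callable doc -> bool."""
    checks = []
    for key, cond in query.items():
        if key in ("$and", "$or"):
            subs = [compile_sketch(q) for q in cond]
            combine = all if key == "$and" else any
            checks.append(
                lambda doc, subs=subs, combine=combine: combine(p(doc) for p in subs)
            )
        elif isinstance(cond, dict) and cond and all(k.startswith("$") for k in cond):
            checks.extend(_field_check(key, op, arg) for op, arg in cond.items())
        else:
            checks.append(_field_check(key, "$eq", cond))   # implicit equality
    return lambda doc: all(check(doc) for check in checks)
```

Compiling once and reusing the predicate means the query dictionary is walked a single time rather than once per document.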

Aggregation Pipeline

In-memory pipeline execution with 25+ stages: $match, $group, $project, $sort, $limit, $skip, $unwind, $lookup, $graphLookup, $unionWith, $addFields/$set, $count, $replaceRoot/$replaceWith, $sample, $bucket, $bucketAuto, $sortByCount, $redact, $setWindowFields, $unset, $vectorSearch, $facet, $out, $merge. Memory-bounded with spill-to-disk for $sort and $group when allowDiskUse=True. Group accumulators: $sum, $avg, $min, $max, $push, $addToSet, $first, $last, $firstN, $lastN, $stdDevPop, $stdDevSamp, $mergeObjects, $top, $bottom, $topN, $bottomN.

$vectorSearch runs fully in memory with:

  • USearch (usearch) for fast RAM-native vector indexing/search
  • NumPy fallback when USearch is unavailable

$facet runs independent sub-pipelines against the same input. $out replaces a target collection's contents. $merge upserts into a target collection with whenMatched/whenNotMatched semantics.

Build analytics and similarity queries that run locally with no external vector DB.
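The $facet semantics described above (independent sub-pipelines over the same input, one output document) can be illustrated with a toy interpreter. This is a sketch under simplifying assumptions, not smongo's aggregate_pipeline: `run_pipeline` is a hypothetical name, and $match here supports equality only.

```python
def run_pipeline(docs, pipeline):
    """Run a tiny subset of pipeline stages over an iterable of dicts."""
    docs = list(docs)   # copy, so sub-pipelines never reorder the caller's list
    for stage in pipeline:
        (name, spec), = stage.items()   # each stage is a single-key dict
        if name == "$match":
            docs = [d for d in docs if all(d.get(k) == v for k, v in spec.items())]
        elif name == "$sort":
            # Apply keys from last to first; Python's sort is stable.
            for key, direction in reversed(list(spec.items())):
                docs.sort(key=lambda d: d[key], reverse=direction < 0)
        elif name == "$limit":
            docs = docs[:spec]
        elif name == "$facet":
            # Every sub-pipeline sees the same input; the result is ONE document.
            docs = [{label: run_pipeline(docs, sub) for label, sub in spec.items()}]
        else:
            raise NotImplementedError(name)
    return docs


employees = [
    {"dept": "eng", "salary": 3},
    {"dept": "ops", "salary": 1},
    {"dept": "eng", "salary": 2},
]
result = run_pipeline(employees, [
    {"$facet": {
        "eng_only": [{"$match": {"dept": "eng"}}],
        "top_1": [{"$sort": {"salary": -1}}, {"$limit": 1}],
    }},
])
```

Each facet label maps to the full output list of its sub-pipeline, so `result` always contains exactly one document.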

B-Tree Indexes & Query Planner

Create single-field, compound, unique, and sparse indexes backed by dedicated WiredTiger tables. The query planner scores candidate indexes and picks the optimal execution path:

  • Index Scan -- range or equality scan on the best-matching index
  • PK Lookup -- O(log n) direct _id fetch
  • Collection Scan -- fallback full-table scan
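One plausible reading of prefix scoring is: count how many leading fields of each index the query constrains, and pick the highest count. The sketch below encodes that reading; it is an assumption for illustration, not RustQueryPlanner's actual scoring, and `choose_plan` is a hypothetical name.

```python
def prefix_score(index_fields, query):
    """Number of leading index fields that the query constrains."""
    score = 0
    for field in index_fields:
        if field not in query:
            break            # a gap in the prefix ends the usable part
        score += 1
    return score


def choose_plan(query, indexes):
    """Return (plan, index_name) for a query and {index_name: [fields]}."""
    # A direct equality on _id is the cheapest path.
    if "_id" in query and not isinstance(query["_id"], dict):
        return ("PK_LOOKUP", None)
    best_name, best_score = None, 0
    for name, fields in indexes.items():
        score = prefix_score(fields, query)
        if score > best_score:
            best_name, best_score = name, score
    if best_name is not None:
        return ("INDEX_SCAN", best_name)
    return ("COLL_SCAN", None)


indexes = {"city_1_age_-1": ["city", "age"], "age_1": ["age"]}
plan = choose_plan({"city": "NYC", "age": {"$gt": 30}}, indexes)
```

Here the compound index wins with a score of 2 over the single-field index's 1. A query on a field with no index falls through to a collection scan.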

Sortable key encoding (IEEE 754 bit-flipping for numbers, hex inversion for descending fields) ensures correct lexicographic ordering across mixed types.
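The IEEE 754 bit-flipping trick can be shown directly. The sketch below encodes a number as 8 big-endian bytes whose byte order matches numeric order; it illustrates the general technique and is not smongo's exact key format. Inverting every bit for descending fields is this sketch's stand-in for the inversion step the text mentions.

```python
import struct

_ALL_ONES = 0xFFFFFFFFFFFFFFFF
_SIGN_BIT = 1 << 63


def encode_number(value, descending=False):
    """Encode a number so that comparing bytes matches comparing numbers."""
    # Reinterpret the double's 64 bits as an unsigned integer.
    bits = struct.unpack(">Q", struct.pack(">d", float(value)))[0]
    if bits & _SIGN_BIT:
        bits ^= _ALL_ONES    # negative: flip everything to reverse the order
    else:
        bits |= _SIGN_BIT    # non-negative: set the sign bit to sort above negatives
    if descending:
        bits ^= _ALL_ONES    # invert the key for a descending field
    return struct.pack(">Q", bits)


values = [3, -2.5, 0, 1e9, -1e9, 0.25]
ascending = sorted(values, key=encode_number)
descending = sorted(values, key=lambda v: encode_number(v, descending=True))
```

Because the keys are fixed-width byte strings, a B-Tree that compares keys lexicographically returns them in numeric order with no custom comparator.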

Oplog (Operations Log)

Every mutation (insert, update, delete, index create/drop) is append-logged to a dedicated WiredTiger table with timestamps, version counters, and checksums. The oplog supports compaction (compact_oplog(keep=N)) to bound growth in long-running deployments, and auto-compacts after successful sync push cycles.
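A minimal in-memory sketch of such a log, assuming a CRC32 checksum and a JSON payload (neither is confirmed by the text; smongo stores its oplog in a WiredTiger table). `OplogSketch` and its method names are hypothetical, with `compact(keep)` loosely mirroring compact_oplog(keep=N).

```python
import json
import time
import zlib


class OplogSketch:
    """Append-only mutation log with version counters and checksums."""

    def __init__(self):
        self.entries = []
        self._version = 0

    def append(self, op, namespace, doc):
        """Record one mutation and return the stored entry."""
        self._version += 1
        payload = json.dumps({"op": op, "ns": namespace, "doc": doc}, sort_keys=True)
        entry = {
            "v": self._version,                          # monotonically increasing
            "ts": time.time(),                           # wall-clock timestamp
            "payload": payload,
            "crc": zlib.crc32(payload.encode("utf-8")),  # integrity check
        }
        self.entries.append(entry)
        return entry

    @staticmethod
    def verify(entry):
        """True if the payload still matches its checksum."""
        return zlib.crc32(entry["payload"].encode("utf-8")) == entry["crc"]

    def compact(self, keep):
        """Drop the oldest entries, keeping at most `keep`; return how many were removed."""
        removed = max(len(self.entries) - keep, 0)
        if removed:
            del self.entries[:removed]
        return removed
```

Version numbers keep increasing after compaction, so a consumer that remembers the last version it processed can still tell where to resume.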

Bidirectional Sync

SyncManager syncs local state to any MongoDB-compatible remote:

  • Push: tail the oplog, batch bulk_write to remote, auto-compact after checkpoint
  • Pull: change streams (preferred) or timestamp-based polling, merge remote changes locally
  • Index sync: index definitions flow both directions
  • Conflict resolution: Last-Write-Wins, local-wins, remote-wins, field-level merge, or a custom callable
  • Checkpointing: survives crashes and restarts via a WiredTiger checkpoint table
  • Auto-sync: background thread with configurable interval
  • Hybrid mode: MongoClient("local://...", sync="mongodb+srv://...") auto-registers and starts sync
  • Exponential backoff: on consecutive failures, backoff doubles up to 300s
  • Sync metrics: status() returns pushed, pulled, conflicts, errors counters and a state field
  • Selective sync filters: per-collection MQL filters control which documents are pushed/pulled
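Two of the behaviours listed above are easy to sketch in isolation: backoff that doubles up to the 300s cap, and Last-Write-Wins conflict resolution. The 1-second base delay, the `updated_at` field, and both function names are assumptions made for this example; SyncManager's internals may differ.

```python
def backoff_delay(consecutive_failures, base=1.0, cap=300.0):
    """Seconds to wait before the next sync attempt."""
    if consecutive_failures <= 0:
        return 0.0
    # Clamp the exponent so very long outages cannot overflow the float.
    exponent = min(consecutive_failures - 1, 32)
    return min(base * 2 ** exponent, cap)


def last_write_wins(local_doc, remote_doc, ts_field="updated_at"):
    """Keep whichever version carries the newer timestamp (ties keep local)."""
    if remote_doc[ts_field] > local_doc[ts_field]:
        return remote_doc
    return local_doc


delays = [backoff_delay(n) for n in range(6)]
winner = last_write_wins(
    {"_id": 1, "name": "local", "updated_at": 100},
    {"_id": 1, "name": "remote", "updated_at": 250},
)
```

Resetting the failure counter to zero after a successful cycle returns the delay to the normal sync interval.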

Wire Protocol Server

smongo speaks the real MongoDB binary protocol (OP_MSG, OP_COMPRESSED, OP_QUERY). Point mongosh, PyMongo, Compass, or any MongoDB driver at localhost:27017 and they'll talk to the embedded engine as if it were a real mongod. The Docker Compose setup exposes the wire server on port 27018 alongside the web dashboard -- docker compose up and connect Compass immediately. Small database, real protocol.
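Every message in the MongoDB wire protocol starts with a 16-byte header of four little-endian int32 values: messageLength, requestID, responseTo, and opCode. The sketch below packs and parses that header using the standard opcode numbers; the helper names are hypothetical and this is not smongo's server code.

```python
import struct

# Standard MongoDB wire protocol opcodes.
OP_QUERY = 2004
OP_COMPRESSED = 2012
OP_MSG = 2013

# messageLength, requestID, responseTo, opCode (all little-endian int32)
HEADER = struct.Struct("<iiii")


def build_header(body_length, request_id, response_to, op_code):
    """Pack a header; messageLength counts the header itself plus the body."""
    return HEADER.pack(HEADER.size + body_length, request_id, response_to, op_code)


def parse_header(data):
    """Unpack the first 16 bytes of a message into a dict."""
    length, request_id, response_to, op_code = HEADER.unpack_from(data)
    return {
        "message_length": length,
        "request_id": request_id,
        "response_to": response_to,
        "op_code": op_code,
    }


raw = build_header(body_length=42, request_id=7, response_to=0, op_code=OP_MSG)
header = parse_header(raw)
```

A server reads these 16 bytes first, then reads `message_length - 16` more bytes to obtain the body before dispatching on `op_code`.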

Interactive Web Dashboard

A full-featured GUI at localhost:5000 with:

Tab | What it does
--- | ---
Shell | mongosh-compatible terminal -- db.users.find({}), db.users.aggregate([...]), arrow-key history, execution timing
Documents | Browse, insert, delete docs in a rich table with formatted values
Find & Query | Clickable query chips, plan badges (INDEX SCAN / COLL SCAN / PK LOOKUP), timing
Aggregation | Visual pipeline builder with drag stages, pre-built example pipelines
Indexes | List, create, drop B-Tree indexes; index template chips; query plan tester
Sync | Live visualization of local <-> remote, push/pull controls, remote client simulator, conflict metrics
Oplog | Color-coded mutation log with timestamps and version numbers

Quick Start

Docker Compose (recommended)

docker compose up --build
# open http://localhost:5000         -- web dashboard
# Compass: mongodb://localhost:27018 -- wire protocol (browse with Compass)

This starts a MongoDB container (stands in for Atlas), the smongo dashboard, and a wire protocol server. Compass connects to localhost:27018 out of the box. Sample data is auto-seeded on first run: 10 employees, 5 indexes, everything synced. See SMONGO-COMPASS.md for the full Compass guide.

Standalone (no Docker, no network)

pip install -e ".[all]"       # installs smongo + builds the Rust extension via maturin
python demo.py

Runs the full embedded engine locally -- indexes, queries, aggregation, oplog -- no MongoDB server. The Rust extension is built automatically by the maturin build backend.


Wire Protocol Server

smongo includes a wire protocol server so that real drivers can connect to the embedded engine over TCP.

# Start the server on the default port
python -m smongo.wire --port 27017

Then connect with any standard MongoDB client:

mongosh mongodb://localhost:27017/mydb

# ...or from Python with PyMongo
from pymongo import MongoClient
client = MongoClient("mongodb://localhost:27017")
db = client["mydb"]
db["things"].insert_one({"hello": "wire protocol"})

Or use the WireServer API directly in Python:

from smongo.wire import WireServer

with WireServer("./data", port=27017) as srv:
    input("Press Enter to stop...")

Security features (Rust wire server):

  • TLS via rustls -- available when using the Rust-native RustWireServer
  • SCRAM-SHA-256 authentication (RFC 7677) -- PBKDF2-hashed credentials persisted in WiredTiger (table:__users)
  • Auth gate enforces authentication on all commands (handshake commands exempted)

Note: TLS and SCRAM authentication are implemented in the Rust wire server (RustWireServer). The default Python WireServer provides plain TCP without auth. See WIRE-PROTOCOL.md for details on both server paths.
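For context on what "PBKDF2-hashed credentials" means in SCRAM-SHA-256, the sketch below derives the StoredKey and ServerKey that a server persists instead of the password, following the key definitions in RFC 5802 with SHA-256 as the hash. It skips SASLprep normalization of the password, the iteration count shown is an arbitrary example, and the function name is hypothetical; it is not the code in scram.rs.

```python
import hashlib
import hmac
import os


def scram_sha256_keys(password, salt, iterations):
    """Derive (stored_key, server_key) from a password, salt, and iteration count."""
    # SaltedPassword := Hi(password, salt, i), where Hi is PBKDF2-HMAC.
    salted = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    # ClientKey := HMAC(SaltedPassword, "Client Key")
    client_key = hmac.new(salted, b"Client Key", hashlib.sha256).digest()
    # StoredKey := H(ClientKey); the server never stores ClientKey itself.
    stored_key = hashlib.sha256(client_key).digest()
    # ServerKey := HMAC(SaltedPassword, "Server Key")
    server_key = hmac.new(salted, b"Server Key", hashlib.sha256).digest()
    return stored_key, server_key


salt = os.urandom(16)                 # a fresh random salt per user
stored_key, server_key = scram_sha256_keys("s3cret", salt, iterations=4096)
```

Storing only the salt, iteration count, StoredKey, and ServerKey lets the server verify a client proof without ever holding the plaintext password.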


Project Structure

smongo/
  __init__.py        MongoClient, SyncManager, DuplicateKeyError,
                     InsertOne, UpdateOne, UpdateMany,
                     DeleteOne, DeleteMany, ReplaceOne, BulkWriteResult
  _smongo_core/      Compiled Rust extension (PyO3) -- the actual engine
  client.py          URI-based routing, bulk_write, find_one_and_* facade
  storage/           Storage layer (Python + Rust bridge)
    engine.py          LocalClient/LocalDB (Python interface; delegates to Rust)
    collection.py      TTLReaper (used by RustLocalCollection)
    locking.py         ReadWriteLock (Python fallback; runtime uses Rust)
    results.py         InsertResult, UpdateResult, DeleteResult
    streaming.py       StreamingCursor (Python fallback; runtime uses RustStreamingCursor)
    helpers.py         BSON encode/decode helpers
  query/             MQL compiler package (Rust-accelerated)
    compiler.py        compile_query, query operators
    update.py          apply_update, positional operators
    expressions.py     resolve_expr, 60+ expression operators
    paths.py           get_value, set_value, unset_value
  aggregation/       Pipeline engine package (25+ stages, Rust-accelerated)
    cursor.py          Cursor class (lazy Iterable input), aggregate dispatch
    stages.py          Core stages: $match, $group, $sort, etc.
    joins.py           $lookup, $graphLookup, $unionWith
    output.py          $facet, $out, $merge
    vector.py          $vectorSearch (NumPy / USearch)
  index.py           Index key encoding, helpers, DuplicateKeyError (runtime: RustIndexManager, RustQueryPlanner)
  oplog.py           Append-only operations log with compaction
  sync.py            Bidirectional sync with metrics, backoff, selective filters
  objectid.py        MongoDB-style ObjectId implementation
  schema.py          $jsonSchema validation layer (delegates to Rust)
  wire/              MongoDB binary protocol server (OP_MSG, OP_COMPRESSED)
    commands/          ~77 Rust command handlers (Python fallback for extensions)
    sessions.py        Session registry
    transactions.py    Transaction state, undo journal
    profiler.py        Profiler, OpTracker, TopStats

rust/                Rust crate (smongo-core) -- the engine
  src/
    storage_engine.rs    RustLocalClient, RustLocalDB
    local_collection.rs  RustLocalCollection (CRUD, txns, streaming)
    index_manager.rs     RustIndexManager, RustQueryPlanner
    streaming_cursor.rs  RustStreamingCursor (lazy WiredTiger iteration)
    transaction.rs       RustTransactionSession (thread-local session override)
    wt_bridge.rs         PyO3 bridge for WiredTiger FFI types
    wt_safe.rs           Safe RAII wrappers for WiredTiger C API
    wire_commands/       Rust command handlers (~77 commands, typed HandlerFn)
    wire_dispatch.rs     Single-downcast command dispatch (ConnectionContext)
    wire_server.rs       Tokio async TCP server (TLS via rustls)
    wire_context.rs      ConnectionContext, CachedImports (Arc-shared, OnceLock modules)
    cached_modules.rs    Process-wide OnceLock cache for stdlib Python modules
    schema.rs            $jsonSchema validation engine (ValidationError, validate_document)
    scram.rs             SCRAM-SHA-256 authentication (RFC 7677)
  wiredtiger-sys/      Raw FFI bindings for WiredTiger C API (dlopen)

web_app.py           Flask API + shell endpoint
templates/
  index.html         Single-page dashboard
static/              CSS, JS assets for dashboard

examples/
  basic/
    01_crud.py           Insert, find, update, delete, cursor chaining
    02_indexes.py        B-tree indexes, query planner, unique constraints
    03_aggregation.py    $group, $sort, $project, $unwind, $lookup, $facet
    04_streaming.py      Lazy reads: find_one, count, limit short-circuit
    05_schema_validation.py  $jsonSchema enforcement on insert and update
    06_bulk_write.py     Batch InsertOne, UpdateOne, ReplaceOne, DeleteOne
    07_change_streams.py Real-time watch() + raw oplog inspection
    08_advanced_queries.py $or, $regex, $elemMatch, dot-notation, $not, $all
  patterns/
    ecommerce.py         Shopping cart, orders, revenue analytics, dashboards
    iot_timeseries.py    1000+ sensor readings, anomaly detection, facility stats
    content_cms.py       Blog CMS: tagging, search, author leaderboard, facets

demo.py              Standalone CLI demo (no Docker needed)
Dockerfile           Python 3.11 + WiredTiger build deps
docker-compose.yml   App + MongoDB for the full sync experience

Dev Commands

make install-test   # install test/lint dependencies
make lint           # ruff checks
make format         # ruff formatter
make test           # unit suite (1,090 tests)
make integration    # docker-backed integration suite
make perf           # benchmark suite
make coverage       # coverage report (70% enforced)
make typecheck      # mypy strict

The API

from smongo import MongoClient, InsertOne, UpdateOne, DeleteOne

client = MongoClient("local://data")
db = client["mydb"]
coll = db["things"]

# CRUD
coll.insert_one({"x": 1})
coll.insert_many([{"x": 2}, {"x": 3}])
coll.find({"x": {"$gt": 1}})
coll.find_one({"x": 2})
coll.update_one({"x": 1}, {"$set": {"x": 10}})
coll.update_many({}, {"$inc": {"x": 1}})
coll.delete_one({"x": 2})
coll.delete_many({"x": {"$lt": 5}})
coll.count_documents({"x": {"$gte": 1}})

# Atomic find-and-modify
coll.find_one_and_update({"x": 1}, {"$set": {"x": 10}}, return_document="after")
coll.find_one_and_replace({"x": 1}, {"x": 99, "replaced": True})
coll.find_one_and_delete({"x": 99})

# Bulk writes
coll.bulk_write([
    InsertOne({"x": 100}),
    UpdateOne({"x": 100}, {"$set": {"x": 200}}),
    DeleteOne({"x": 3}),
])

# Indexes
coll.create_index([("x", 1)])
coll.create_index("name", unique=True)
coll.create_index([("city", 1), ("age", -1)])
coll.list_indexes()
coll.drop_index("x_1")
coll.explain({"x": {"$gt": 5}})

# Aggregation
coll.aggregate([
    {"$match": {"status": "active"}},
    {"$group": {"_id": "$dept", "total": {"$sum": "$salary"}}},
    {"$sort": {"total": -1}},
    {"$limit": 10},
])

# $facet -- run parallel sub-pipelines
coll.aggregate([
    {"$facet": {
        "by_dept": [{"$group": {"_id": "$dept", "count": {"$sum": 1}}}],
        "top_5":   [{"$sort": {"salary": -1}}, {"$limit": 5}],
    }},
])

# $merge -- upsert results into another collection
coll.aggregate([
    {"$group": {"_id": "$dept", "avg_salary": {"$avg": "$salary"}}},
    {"$merge": {"into": "dept_stats", "on": "_id", "whenMatched": "replace"}},
])

# Transparent hybrid sync
hybrid = MongoClient("local://data", sync="mongodb+srv://user:pass@cluster.mongodb.net")
hybrid.sync.status()   # includes pushed, pulled, conflicts, errors, state
hybrid.sync.sync_now()

License

See LICENSE.
