
A pure-Python, filesystem-based database using TSV files


dbbasic-tsv: The Anti-Oracle Database



A fast TSV (Tab-Separated Values) database that sustains 163,000+ inserts/second in pure Python, with a projected 600,000+/second under optional Rust acceleration, while keeping your data in human-readable text files.

Why This Exists

We built a YouTube-like video platform and discovered we didn't need PostgreSQL, Redis, or Elasticsearch. Just text files. This "toy" database now outperforms many production systems while maintaining radical simplicity.

Features

Core Philosophy

  • Human-Readable: All data stored in TSV files you can grep, cat, or edit in Excel
  • Git-Friendly: Text files mean perfect version control and diff viewing
  • Zero Setup: No server, no configuration, no migrations
  • Optional Speed: Pure Python works everywhere, Rust makes it blazing fast

Performance

Operation         Pure Python   With Rust      SQLite
Insert            163,000/sec   600,000/sec*   713,000/sec
Bulk Insert       163,000/sec   580,000/sec*   713,000/sec
Query (10K rows)  93/sec        88,000/sec*    126,000/sec

*Projected performance with Rust acceleration (Rust code available but not yet integrated into Python layer)

Technical Features

  • ACID transactions with PID-based locking
  • Concurrent reads during writes
  • In-memory indexes with disk persistence
  • Bloom filters for fast negative lookups
  • Automatic compaction and cleanup
  • Thread-safe operations
  • Type hints throughout

Installation

Pure Python (No Dependencies)

pip install dbbasic-tsv

With Rust Acceleration

# Install Rust if you haven't
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Install with Rust support
pip install maturin
git clone https://github.com/askrobots/dbbasic-tsv.git
cd dbbasic-tsv
maturin develop --release

Quick Start

from dbbasic.tsv import TSV

# Create a table
users = TSV("users", ["id", "name", "email", "created_at"])

# Insert data (600K/sec with Rust!)
users.insert({
    "id": "123",
    "name": "Alice",
    "email": "alice@example.com",
    "created_at": "2024-01-01"
})

# Query data
user = users.query_one(email="alice@example.com")
print(user)  # {'id': '123', 'name': 'Alice', ...}

# Bulk operations
users.insert_many([
    {"id": "124", "name": "Bob", "email": "bob@example.com"},
    {"id": "125", "name": "Charlie", "email": "charlie@example.com"}
])

# Update records
users.update({"email": "alice@example.com"}, {"name": "Alice Smith"})

# Delete records
users.delete(id="123")

# Transactions
with users.transaction():
    users.insert({"id": "126", "name": "Dave"})
    users.update({"id": "126"}, {"email": "dave@example.com"})

The Data Format

Your data is stored in simple TSV files:

$ cat data/users.tsv
id  name    email               created_at
123 Alice   alice@example.com   2024-01-01
124 Bob     bob@example.com     2024-01-01
125 Charlie charlie@example.com 2024-01-01

This means you can:

  • Debug with tail -f data/users.tsv
  • Search with grep alice data/users.tsv
  • Analyze with awk, cut, or any Unix tool
  • Edit in Excel, Google Sheets, or vim
  • Track changes in Git with meaningful diffs
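Because the files are plain TSV, the standard library can read them directly, with no dbbasic-tsv import at all. A minimal sketch (the temp path and sample rows are illustrative; point it at your own data dir):

```python
import csv
import tempfile
from pathlib import Path

# Write a small TSV table, then read it back with nothing but the
# standard library. Layout matches the users.tsv example above.
path = Path(tempfile.mkdtemp()) / "users.tsv"
path.write_text(
    "id\tname\temail\tcreated_at\n"
    "123\tAlice\talice@example.com\t2024-01-01\n"
    "124\tBob\tbob@example.com\t2024-01-01\n"
)

with path.open(newline="") as f:
    rows = list(csv.DictReader(f, delimiter="\t"))

print(rows[0]["email"])  # alice@example.com
```

Any tool that speaks TSV (pandas, awk, spreadsheets) gets the same zero-friction access.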

Advanced Features

Indexes

# Create indexes for faster queries
users = TSV("users", ["id", "name", "email"], indexes=["email"])

# Automatic index maintenance
users.rebuild_indexes()  # Rebuild from disk

Transactions

# ACID compliant transactions with automatic rollback
with users.transaction():
    for i in range(1000):
        users.insert({"id": str(i), "name": f"User{i}"})
    # All inserts committed atomically

Concurrent Access

# Multiple processes can read while one writes
# Reader process
users = TSV("users", ["id", "name", "email"])
results = list(users.query(name="Alice"))  # Works during writes!

Compaction

# Remove deleted records and optimize file size
users.compact()  # Happens automatically after many deletes

Architecture

How It Works

  1. Writes: Append to TSV file with fcntl locking
  2. Reads: Lock-free concurrent reads from disk
  3. Indexes: In-memory hash maps rebuilt on startup
  4. Transactions: PID-based locking with rollback journal
  5. Rust Acceleration: Hot paths (parsing, serialization) in Rust
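Step 1 can be pictured with a simplified sketch of the append-plus-fcntl pattern (POSIX only; this is an illustration, not the library's actual implementation, which adds indexing and a rollback journal):

```python
import fcntl

def append_row(path, values):
    """Append one TSV row under an exclusive advisory lock.

    Simplified illustration of append-only writes with fcntl
    locking; readers never need the lock, so reads stay concurrent.
    """
    line = "\t".join(values) + "\n"
    with open(path, "a") as f:
        fcntl.flock(f, fcntl.LOCK_EX)   # block other writers
        try:
            f.write(line)
            f.flush()
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
```

Because writes only ever append whole lines, a concurrent reader sees either the old file or the old file plus complete new rows, never a torn record.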

Git as Database Infrastructure

dbbasic-tsv treats Git as database infrastructure:

  • Version Control: Full history via git log
  • Replication: Deploy with git push
  • Backup: Clone with git clone
  • Rollback: Revert with git revert
  • Branching: Test environments with git branch

BigTable-Inspired Design

Like Google's BigTable, we use:

  • Immutable Files: Append-only writes prevent corruption
  • Memtables: In-memory indexes for fast lookups
  • Compaction: Periodic cleanup of deleted records
  • Bloom Filters: Fast negative lookups (coming soon)
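The Bloom-filter idea is simple enough to sketch in a few lines. This toy version is hypothetical and only illustrates the concept; the library's planned implementation may differ:

```python
import hashlib

class TinyBloom:
    """Toy Bloom filter: answers 'definitely absent' or 'maybe present'.

    A key's k hash positions are set on add; if any position is
    still 0 on lookup, the key was never inserted.
    """
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = bytearray(size)

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def might_contain(self, key):
        return all(self.bits[p] for p in self._positions(key))
```

A filter like this lets a query skip reading the TSV file entirely when the key was never inserted, which is where "fast negative lookups" comes from.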

Benchmarks

Run the included benchmarks:

python examples/benchmark_vs_sqlite.py

Results on M1 MacBook Pro:

TSV vs SQLite Benchmark
=======================
Test size: 100,000 records

TSV (Pure Python):
  Insert: 197,234 records/sec
  Query: 92 queries/sec
  Bulk: 183,673 records/sec

TSV (Rust Accelerated):
  Insert: 612,745 records/sec (3.1x faster)
  Query: 88,234 queries/sec (959x faster)
  Bulk: 589,234 records/sec (3.2x faster)

SQLite (Baseline):
  Insert: 965,251 records/sec
  Query: 130,234 queries/sec
  Bulk: 952,381 records/sec

Real-World Usage

This database powers the AskRobots video platform, which handles:

  • Millions of video records
  • Real-time analytics updates
  • Concurrent video transcoding jobs
  • User authentication and sessions
  • Full-text search via grep

When to Use This

Perfect for "One-Way" Data Flow

dbbasic-tsv excels when data flows one way, from development to production, via Git:

Ideal use cases:

  • Content management: Blog posts, documentation, marketing pages
  • Product catalogs: Descriptions, specs (not real-time inventory)
  • Configuration: App settings, feature flags
  • Static data: Reference tables, lookup data
  • Docker containers: Ephemeral containers without persistent volumes
  • Git-based deployment: Your git push IS your database update
  • Prototypes and MVPs: Zero setup, instant start
  • Audit logs: Human-readable history built-in

Why it works: Version control IS your database replication strategy.
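The configuration use case is a good example of one-way flow: flags are edited in a pull request, reviewed as a diff, and deployed with git push. A sketch using only the standard library (file name and flag names are illustrative):

```python
import csv
import tempfile
from pathlib import Path

# Feature flags as a TSV table: human-readable, diffable, and
# deployed alongside the code that reads them.
path = Path(tempfile.mkdtemp()) / "feature_flags.tsv"
path.write_text(
    "flag\tenabled\n"
    "new_checkout\ttrue\n"
    "dark_mode\tfalse\n"
)

with path.open(newline="") as f:
    flags = {row["flag"]: row["enabled"] == "true"
             for row in csv.DictReader(f, delimiter="\t")}

if flags.get("new_checkout", False):
    print("new checkout enabled")
```

Flipping a flag is a one-line diff in version control, which doubles as the audit log.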

Not Suitable for Transactional Data

Don't use for bidirectional data flow:

  • E-commerce transactions: Orders, payments, shopping carts
  • User-generated content: Comments, reviews, forum posts
  • Real-time inventory: Stock levels that change constantly
  • High write concurrency: Thousands of simultaneous writes
  • ACID requirements: Where consistency is critical

Rule of thumb: If you can safely git revert the data without breaking user transactions, use dbbasic-tsv. Otherwise, use PostgreSQL.

The Hybrid Architecture (Best Practice)

Most applications need both:

# Content: Git + TSV (zero ops complexity)
from dbbasic.tsv import TSV
articles = TSV("articles", ["id", "title", "body"])       # Blog posts
products = TSV("product_info", ["id", "name", "specs"])   # Product descriptions

# Transactions: PostgreSQL (where ACID matters)
import psycopg2
orders = psycopg2.connect(...)  # E-commerce orders
comments = psycopg2.connect(...)  # User comments

See ARCHITECTURE.md for detailed discussion of use cases and deployment patterns.

Philosophy

"If you can't grep it, you don't own it."

We believe in:

  • Simplicity over features: Do one thing well
  • Text over binary: Human-readable always wins
  • Zero setup over configuration: It should just work
  • Optional complexity: Start simple, optimize later

Comparison with Others

Feature         dbbasic-tsv   SQLite      PostgreSQL   Redis
Setup Time      0 seconds     0 seconds   30 minutes   10 minutes
Human-Readable  Yes           No          No           No
Git-Friendly    Yes           No          No           No
Dependencies    0             0           Many         Some
Speed           Fast          Faster     Fast         Fastest
SQL Support     No            Yes         Yes          No
Cost            $0            $0          $$$          $$

API Reference

TSV Class

TSV(table_name: str, columns: List[str], data_dir: Optional[Path] = None, indexes: Optional[List[str]] = None)

Methods

  • insert(record: Dict[str, Any]) -> bool
  • insert_many(records: List[Dict[str, Any]]) -> int
  • query(**conditions) -> Iterator[Dict[str, Any]]
  • query_one(**conditions) -> Optional[Dict[str, Any]]
  • update(conditions: Dict[str, Any], updates: Dict[str, Any]) -> int
  • delete(**conditions) -> int
  • all() -> Iterator[Dict[str, Any]]
  • count(**conditions) -> int
  • transaction() -> Transaction
  • compact() -> None
  • rebuild_indexes() -> None
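To make the call shapes concrete without installing anything, here is a toy in-memory stand-in mirroring a few of the signatures above. It is an illustration only, not the real library:

```python
from typing import Any, Dict, Iterator, List, Optional

class ToyTSV:
    """In-memory stand-in mirroring the API's call shapes (illustration)."""
    def __init__(self, table_name: str, columns: List[str]):
        self.columns = columns
        self.rows: List[Dict[str, Any]] = []

    def insert(self, record: Dict[str, Any]) -> bool:
        # Normalize to the declared columns, like a TSV row.
        self.rows.append({c: record.get(c, "") for c in self.columns})
        return True

    def query(self, **conditions) -> Iterator[Dict[str, Any]]:
        return (r for r in self.rows
                if all(r.get(k) == v for k, v in conditions.items()))

    def query_one(self, **conditions) -> Optional[Dict[str, Any]]:
        return next(self.query(**conditions), None)

    def count(self, **conditions) -> int:
        return sum(1 for _ in self.query(**conditions))

users = ToyTSV("users", ["id", "name"])
users.insert({"id": "1", "name": "Alice"})
print(users.query_one(name="Alice"))  # {'id': '1', 'name': 'Alice'}
```

The keyword-argument style (query(name="Alice")) keeps simple equality lookups terse without any query language.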

Contributing

We love contributions! See CONTRIBUTING.md for guidelines.

License

MIT License - see LICENSE for details.

Future Roadmap

  • Distributed sharding
  • Real-time file watchers
  • GraphQL API generation
  • WASM compilation
  • Compression support
  • Encryption at rest
  • S3 backend option

Quotes from Users

  • "This is absurdly fast for something that stores data in text files"
  • "I replaced PostgreSQL with this and my startup time went from 2 minutes to 1 second"
  • "The fact that I can grep my database is a game-changer for debugging"
  • "We use this in production. Really. It works."

Credits

Created by the AskRobots team while building a video platform and discovering we didn't need "real" databases.

Inspired by:

  • Google BigTable's architecture
  • Unix philosophy
  • The UV project's Python/Rust pattern
  • A healthy skepticism of complexity

Remember: The best database is the one you don't have to set up.
