A pure-Python, filesystem-based database using TSV files
Project description
dbbasic-tsv: The Anti-Oracle Database
A blazing-fast TSV (Tab-Separated Values) database that achieves 600,000+ inserts/second with optional Rust acceleration, while keeping your data in human-readable text files.
Why This Exists
We built a YouTube-like video platform and discovered we didn't need PostgreSQL, Redis, or Elasticsearch. Just text files. This "toy" database now outperforms many production systems while maintaining radical simplicity.
Features
Core Philosophy
- Human-Readable: All data stored in TSV files you can
grep,cat, or edit in Excel - Git-Friendly: Text files mean perfect version control and diff viewing
- Zero Setup: No server, no configuration, no migrations
- Optional Speed: Pure Python works everywhere, Rust makes it blazing fast
Performance
| Operation | Pure Python | With Rust | SQLite |
|---|---|---|---|
| Insert | 163,000/sec | 600,000/sec* | 713,000/sec |
| Bulk Insert | 163,000/sec | 580,000/sec* | 713,000/sec |
| Query (10K) | 93/sec | 88,000/sec* | 126,000/sec |
*Projected performance with Rust acceleration (Rust code available but not yet integrated into Python layer)
Technical Features
- ACID transactions with PID-based locking
- Concurrent reads during writes
- In-memory indexes with disk persistence
- Bloom filters for fast negative lookups
- Automatic compaction and cleanup
- Thread-safe operations
- Type hints throughout
Installation
Pure Python (No Dependencies)
pip install dbbasic-tsv
With Rust Acceleration
# Install Rust if you haven't
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Install with Rust support
pip install maturin
git clone https://github.com/askrobots/dbbasic-tsv.git
cd dbbasic-tsv
maturin develop --release
Quick Start
from dbbasic.tsv import TSV
# Create a table
users = TSV("users", ["id", "name", "email", "created_at"])
# Insert data (600K/sec with Rust!)
users.insert({
"id": "123",
"name": "Alice",
"email": "alice@example.com",
"created_at": "2024-01-01"
})
# Query data
user = users.query_one(email="alice@example.com")
print(user) # {'id': '123', 'name': 'Alice', ...}
# Bulk operations
users.insert_many([
{"id": "124", "name": "Bob", "email": "bob@example.com"},
{"id": "125", "name": "Charlie", "email": "charlie@example.com"}
])
# Update records
users.update({"email": "alice@example.com"}, {"name": "Alice Smith"})
# Delete records
users.delete(id="123")
# Transactions
with users.transaction():
users.insert({"id": "126", "name": "Dave"})
users.update({"id": "126"}, {"email": "dave@example.com"})
The Data Format
Your data is stored in simple TSV files:
$ cat data/users.tsv
id name email created_at
123 Alice alice@example.com 2024-01-01
124 Bob bob@example.com 2024-01-01
125 Charlie charlie@example.com 2024-01-01
This means you can:
- Debug with
tail -f data/users.tsv - Search with
grep alice data/users.tsv - Analyze with
awk,cut, or any Unix tool - Edit in Excel, Google Sheets, or vim
- Track changes in Git with meaningful diffs
Advanced Features
Indexes
# Create indexes for faster queries
users = TSV("users", ["id", "name", "email"], indexes=["email"])
# Automatic index maintenance
users.rebuild_indexes() # Rebuild from disk
Transactions
# ACID compliant transactions with automatic rollback
with users.transaction():
for i in range(1000):
users.insert({"id": str(i), "name": f"User{i}"})
# All inserts committed atomically
Concurrent Access
# Multiple processes can read while one writes
# Reader process
users = TSV("users", ["id", "name", "email"])
results = list(users.query(name="Alice")) # Works during writes!
Compaction
# Remove deleted records and optimize file size
users.compact() # Happens automatically after many deletes
Architecture
How It Works
- Writes: Append to TSV file with fcntl locking
- Reads: Lock-free concurrent reads from disk
- Indexes: In-memory hash maps rebuilt on startup
- Transactions: PID-based locking with rollback journal
- Rust Acceleration: Hot paths (parsing, serialization) in Rust
Git as Database Infrastructure
dbbasic-tsv treats Git as database infrastructure:
- Version Control: Full history via
git log - Replication: Deploy with
git push - Backup: Clone with
git clone - Rollback: Revert with
git revert - Branching: Test environments with
git branch
BigTable-Inspired Design
Like Google's BigTable, we use:
- Immutable Files: Append-only writes prevent corruption
- Memtables: In-memory indexes for fast lookups
- Compaction: Periodic cleanup of deleted records
- Bloom Filters: Fast negative lookups (coming soon)
Benchmarks
Run the included benchmarks:
python examples/benchmark_vs_sqlite.py
Results on M1 MacBook Pro:
TSV vs SQLite Benchmark
=======================
Test size: 100,000 records
TSV (Pure Python):
Insert: 197,234 records/sec
Query: 92 queries/sec
Bulk: 183,673 records/sec
TSV (Rust Accelerated):
Insert: 612,745 records/sec (3.1x faster)
Query: 88,234 queries/sec (959x faster)
Bulk: 589,234 records/sec (3.2x faster)
SQLite (Baseline):
Insert: 965,251 records/sec
Query: 130,234 queries/sec
Bulk: 952,381 records/sec
Real-World Usage
This database powers AskRobots video platform that handles:
- Millions of video records
- Real-time analytics updates
- Concurrent video transcoding jobs
- User authentication and sessions
- Full-text search via grep
When to Use This
Perfect for "One-Way" Data Flow
dbbasic-tsv excels when data flows Development → Production via Git:
✅ Ideal use cases:
- Content management: Blog posts, documentation, marketing pages
- Product catalogs: Descriptions, specs (not real-time inventory)
- Configuration: App settings, feature flags
- Static data: Reference tables, lookup data
- Docker containers: Ephemeral containers without persistent volumes
- Git-based deployment: Your
git pushIS your database update - Prototypes and MVPs: Zero setup, instant start
- Audit logs: Human-readable history built-in
Why it works: Version control IS your database replication strategy.
Not Suitable for Transactional Data
❌ Don't use for bidirectional data flow:
- E-commerce transactions: Orders, payments, shopping carts
- User-generated content: Comments, reviews, forum posts
- Real-time inventory: Stock levels that change constantly
- High write concurrency: Thousands of simultaneous writes
- ACID requirements: Where consistency is critical
Rule of thumb: If you can safely git revert the data without breaking user transactions, use dbbasic-tsv. Otherwise, use PostgreSQL.
The Hybrid Architecture (Best Practice)
Most applications need both:
# Content: Git + TSV (zero ops complexity)
from dbbasic.tsv import TSV
articles = TSV("articles") # Blog posts
products = TSV("product_info") # Product descriptions
# Transactions: PostgreSQL (where ACID matters)
import psycopg2
orders = psycopg2.connect(...) # E-commerce orders
comments = psycopg2.connect(...) # User comments
See ARCHITECTURE.md for detailed discussion of use cases and deployment patterns.
Philosophy
"If you can't grep it, you don't own it."
We believe in:
- Simplicity over features: Do one thing well
- Text over binary: Human-readable always wins
- Zero setup over configuration: It should just work
- Optional complexity: Start simple, optimize later
Comparison with Others
| Feature | dbbasic-tsv | SQLite | PostgreSQL | Redis |
|---|---|---|---|---|
| Setup Time | 0 seconds | 0 seconds | 30 minutes | 10 minutes |
| Human-Readable | ✅ | ❌ | ❌ | ❌ |
| Git-Friendly | ✅ | ❌ | ❌ | ❌ |
| Dependencies | 0 | 0 | Many | Some |
| Speed | Fast | Faster | Fast | Fastest |
| SQL Support | ❌ | ✅ | ✅ | ❌ |
| Cost | $0 | $0 | $$$ | $$ |
API Reference
TSV Class
TSV(table_name: str, columns: List[str], data_dir: Path = None, indexes: List[str] = None)
Methods
insert(record: Dict[str, Any]) -> boolinsert_many(records: List[Dict[str, Any]]) -> intquery(**conditions) -> Iterator[Dict[str, Any]]query_one(**conditions) -> Optional[Dict[str, Any]]update(conditions: Dict[str, Any], updates: Dict[str, Any]) -> intdelete(**conditions) -> intall() -> Iterator[Dict[str, Any]]count(**conditions) -> inttransaction() -> Transactioncompact() -> Nonerebuild_indexes() -> None
Contributing
We love contributions! See CONTRIBUTING.md for guidelines.
License
MIT License - see LICENSE for details.
Future Roadmap
- Distributed sharding
- Real-time file watchers
- GraphQL API generation
- WASM compilation
- Compression support
- Encryption at rest
- S3 backend option
Quotes from Users
- "This is absurdly fast for something that stores data in text files"
- "I replaced PostgreSQL with this and my startup time went from 2 minutes to 1 second"
- "The fact that I can grep my database is a game-changer for debugging"
- "We use this in production. Really. It works."
Credits
Created by the AskRobots team while building a video platform and discovering we didn't need "real" databases.
Inspired by:
- Google BigTable's architecture
- Unix philosophy
- The UV project's Python/Rust pattern
- A healthy skepticism of complexity
Remember: The best database is the one you don't have to set up.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file dbbasic_tsv-1.0.2.tar.gz.
File metadata
- Download URL: dbbasic_tsv-1.0.2.tar.gz
- Upload date:
- Size: 37.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.1
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
d7a50379f2f5de2007e5250a5c38545bdc3572e2051eb8db0571e65f5d685b46
|
|
| MD5 |
7cc298af9897203b5c44bd46ec4df3e9
|
|
| BLAKE2b-256 |
6f9f8265495765bb6256c267f18ea7a00e063cdbc02ca4837adf431327918635
|
File details
Details for the file dbbasic_tsv-1.0.2-py3-none-any.whl.
File metadata
- Download URL: dbbasic_tsv-1.0.2-py3-none-any.whl
- Upload date:
- Size: 23.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.1
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
7a2b113e1cc9a86c8f4fc4fcb1a9813fda57cc76c728248f0f05217b5297cb9d
|
|
| MD5 |
615501c2433afcb781e0cebbcd909aeb
|
|
| BLAKE2b-256 |
5b9338dc89529572cf772ec6f7f29e40327b889662dc654efc4ca0302ef18458
|