ToucanDB - ML-first vector database for AI applications, LLM integration, and semantic search of unstructured data

🦜 ToucanDB - Micro ML-First Vector DB Engine

Store, index, and search high-dimensional vector embeddings — built for RAG systems, semantic search, and LLM applications.

ToucanDB is a lightweight, ML-native vector database written in Python. It transforms unstructured data (text, images, audio) into searchable vector embeddings and retrieves them with sub-millisecond latency, without the overhead of a full server deployment.

✨ Key Features

Semantic Search — find by meaning, not keywords, using cosine / dot-product / Euclidean distance
HNSW & IVF Indexing — fast approximate nearest-neighbour search, auto-tuned
AES-256-GCM Encryption — all vectors and metadata encrypted at rest
Rich Metadata Filtering — attach and query arbitrary JSON metadata alongside vectors
Async Python API — fully async/await, type-safe with Pydantic schemas
Embedding-model agnostic — works with OpenAI, Sentence Transformers, Cohere, Hugging Face, or any custom model
Batch operations — bulk insert / search for high-throughput pipelines
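
The three distance metrics named above can be illustrated with a minimal, standard-library sketch. This is not ToucanDB's implementation; the helper names (`dot`, `cosine_similarity`, `euclidean_distance`) are hypothetical and exist only to show what each metric measures.

```python
import math

def dot(a, b):
    """Dot product: grows with both alignment and vector length."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; ignores vector length."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    return dot(a, b) / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

query = [1.0, 0.0]
same_direction = [2.0, 0.0]   # same direction as the query, twice the length
orthogonal = [0.0, 3.0]       # at a right angle to the query

print(cosine_similarity(query, same_direction))   # 1.0
print(cosine_similarity(query, orthogonal))       # 0.0
print(dot(query, same_direction))                 # 2.0
print(euclidean_distance(query, same_direction))  # 1.0
```

Cosine similarity treats `same_direction` as a perfect match even though it is twice as long as the query, while Euclidean distance still reports a gap of 1.0. That difference is usually what decides which metric suits a given embedding model.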

📦 Installation

pip install toucandb            # core (vector storage + FAISS search)
pip install "toucandb[ml]"      # + sentence-transformers, OpenAI, LangChain…
pip install "toucandb[gpu]"     # + GPU-accelerated FAISS
pip install "toucandb[dev]"     # development / testing

🚀 Quick Start

from sentence_transformers import SentenceTransformer
from toucandb import ToucanDB, VectorSchema, DistanceMetric, IndexType
import asyncio

async def semantic_search_demo():
    # Load a pre-trained sentence transformer model
    model = SentenceTransformer('all-MiniLM-L6-v2')
    
    # Sample documents to embed
    documents = [
        "Python is a versatile programming language used in AI and data science.",
        "Machine learning algorithms can predict patterns from historical data.",
        "Vector databases enable semantic search and similarity matching.",
        "Natural language processing helps computers understand human language.",
        "Deep learning models require large datasets for training.",
    ]
    
    # Generate embeddings
    embeddings = model.encode(documents)
    
    # Initialize ToucanDB ('demo-key' is a placeholder; use your own secret outside demos)
    db = await ToucanDB.create('./semantic_search.tdb', encryption_key='demo-key')
    schema = VectorSchema(
        name='semantic_docs', 
        dimensions=embeddings.shape[1],  # Auto-detect dimensions
        metric=DistanceMetric.COSINE, 
        index_type=IndexType.HNSW
    )
    collection = await db.create_collection(schema)
    
    # Store documents with embeddings
    vectors = []
    for i, (doc, embedding) in enumerate(zip(documents, embeddings)):
        vectors.append({
            'id': f'doc_{i}',
            'vector': embedding.tolist(),
            'metadata': {'text': doc, 'doc_id': i}
        })
    
    await collection.insert_many(vectors)
    
    # Semantic search
    query = "How does AI process language?"
    query_embedding = model.encode([query])[0]
    
    results = await collection.search(query_embedding.tolist(), k=3)
    
    print(f"🔍 Query: '{query}'")
    print("\n📋 Most similar documents:")
    for i, result in enumerate(results, 1):
        print(f"{i}. {result.metadata['text']}")
        print(f"   📊 Similarity: {result.score:.3f}")
        print()

asyncio.run(semantic_search_demo())

See the examples/ directory for more examples (RAG pipeline, document search, semantic search).
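
For larger corpora, the record-building loop in the Quick Start can be split into batches. The sketch below uses only the standard library and does not call ToucanDB; `build_records` and `batched` are hypothetical helper names, and the record shape simply mirrors the dictionaries used in the Quick Start.

```python
from itertools import islice

def build_records(documents, embeddings):
    """Pair each document with its embedding, using the Quick Start dict shape."""
    return [
        {"id": f"doc_{i}", "vector": list(vec), "metadata": {"text": doc, "doc_id": i}}
        for i, (doc, vec) in enumerate(zip(documents, embeddings))
    ]

def batched(records, batch_size):
    """Yield consecutive lists of at most batch_size records."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(records)
    while batch := list(islice(it, batch_size)):
        yield batch

# Toy data standing in for real documents and embeddings
docs = ["a", "b", "c", "d", "e"]
vecs = [[float(i), 0.0] for i in range(5)]

batches = list(batched(build_records(docs, vecs), batch_size=2))
print([len(b) for b in batches])  # [2, 2, 1]
```

Each batch could then be passed to `collection.insert_many(batch)` as in the Quick Start. A suitable batch size depends on vector dimension and available memory, which the project description does not specify.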

🔗 Compatible Embedding Models

| Provider | Model | Dimensions |
|---|---|---|
| Sentence Transformers | `all-MiniLM-L6-v2` | 384 |
| Sentence Transformers | `all-mpnet-base-v2` | 768 |
| OpenAI | `text-embedding-3-small` | 1536 |
| Cohere | `embed-english-v3.0` | 1024 |
| Hugging Face | any `AutoModel` | varies |

ToucanDB adapts to any embedding dimension — just set dimensions in your schema.
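
The embedding dimension also drives storage cost. The following back-of-the-envelope sketch estimates raw vector storage, assuming uncompressed 32-bit floats. That is an assumption: the project description does not state ToucanDB's storage format, and index, metadata, and encryption overhead are not included. `raw_vector_bytes` is a hypothetical helper.

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors stored as uncompressed float32

def raw_vector_bytes(num_vectors, dimensions, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw storage for the vectors alone, excluding index and metadata."""
    return num_vectors * dimensions * bytes_per_value

models = [
    ("all-MiniLM-L6-v2", 384),
    ("all-mpnet-base-v2", 768),
    ("text-embedding-3-small", 1536),
]

for model, dims in models:
    gib = raw_vector_bytes(1_000_000, dims) / 2**30
    print(f"{model}: {dims} dims -> {gib:.2f} GiB per 1M vectors")
# all-MiniLM-L6-v2: 384 dims -> 1.43 GiB per 1M vectors
# all-mpnet-base-v2: 768 dims -> 2.86 GiB per 1M vectors
# text-embedding-3-small: 1536 dims -> 5.72 GiB per 1M vectors
```

Under these assumptions, doubling the dimension doubles the raw footprint, so the choice of embedding model affects capacity planning as well as search quality.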

📈 Performance

| Dataset Size | Search Latency | Throughput | Recall |
|---|---|---|---|
| 1M vectors | 0.2 ms | 150K QPS | 97.5% |
| 10M vectors | 0.4 ms | 120K QPS | 96.8% |
| 100M vectors | 0.8 ms | 80K QPS | 95.2% |
| 1B vectors | 1.2 ms | 50K QPS | 94.5% |

Benchmark setup: AWS m5.4xlarge (16 vCPU, 64 GB RAM), 384-dim vectors, HNSW index.
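
Recall for an approximate index is commonly measured as recall@k: the share of the true k nearest neighbours that the approximate search returns. The sketch below shows that calculation on made-up IDs. The project description does not say which k or which ground truth was used for the figures above, so `recall_at_k` is a hypothetical helper and not the benchmark's actual method.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true top-k neighbours found by the approximate search."""
    truth = set(exact_ids[:k])
    found = truth.intersection(approx_ids[:k])
    return len(found) / len(truth)

# Illustrative IDs only: exact comes from a brute-force scan,
# approx from an approximate index such as HNSW.
exact = ["doc_3", "doc_7", "doc_1", "doc_9"]
approx = ["doc_3", "doc_1", "doc_4", "doc_9"]

print(recall_at_k(approx, exact, k=4))  # 0.75
```

Here the approximate search missed `doc_7`, so three of the four true neighbours were found. Averaging this value over many queries gives a dataset-level recall figure.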

🦜 Why "ToucanDB"? Exploration is everywhere

Just like the vibrant toucan bird, ToucanDB embodies the perfect combination of precision, adaptability, and intelligence that makes it exceptional for ML applications. Birds are my favourite animals, and toucans are among the ones I love most! I've always been inspired by toucans. My grandfather was an ethnologist and explorer in the Amazon rainforest. He also discovered the Jora tribe. My grandparents even spent their honeymoon in the Amazon rainforest, and he lived in various Latin American countries for quite some time with my grandmother and my mum. His life has deeply inspired me since I was little, and my love for toucans is part of this beautiful legacy. Nature has always played a hugely positive role in my success.

The Toucan Inspiration

🎯 Precision: Toucans have incredibly precise beaks that can reach exactly where they need to go - just like ToucanDB's vector search, which finds the right data points with sub-millisecond latency.

🔄 Adaptability: These remarkable birds adapt to diverse environments - mirroring how ToucanDB handles any type of unstructured data (text, images, audio, code).

🧠 Intelligence: Toucans are highly intelligent creatures with excellent memory - reflecting ToucanDB's smart caching, adaptive indexing, and ML-first design that learns and optimizes performance.

🌈 Vibrancy: The toucan's colorful nature represents ToucanDB's rich feature set and the diverse, multimodal data it can process and understand.

Just as toucans navigate complex forest ecosystems with ease, ToucanDB navigates the complex landscape of high-dimensional vector spaces, making ML applications soar! 🚀

👨‍💻 Who Built This Vector Database Engine?

Pierre-Henry Soria — a super passionate engineer who loves building cutting-edge AI infrastructure and automating intelligent systems efficiently!

Enthusiast of Machine Learning, Vector Databases, AI, and writing performant code!

Find me at pH7.me

Enjoying this project? Buy me a coffee (spoiler: I love almond extra-hot flat white coffees while coding ML algorithms).

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

📄 License

ToucanDB is released under the MIT License. See license for further details.

🎯 Roadmap

Distributed clustering support
Real-time streaming updates
Multi-modal search (text + image)
Integration with popular ML frameworks
Cloud-native deployment options
GraphQL API support

Built with ❤️ for the AI community

This version

1.0.0

Mar 20, 2026

Download files

Source Distribution

toucandb-1.0.0.tar.gz (2.6 MB view details)

Uploaded Mar 20, 2026 Source

Built Distribution

toucandb-1.0.0-py3-none-any.whl (28.5 kB view details)

Uploaded Mar 20, 2026 Python 3

File details

Details for the file toucandb-1.0.0.tar.gz.

File metadata

Download URL: toucandb-1.0.0.tar.gz
Upload date: Mar 20, 2026
Size: 2.6 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for toucandb-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`f3f622fcb5c1656ab8ecab96f7d784520907ed06579d40251b2bbf1407f6ffbc`
MD5	`8f950353782d55f68e1c4acfb3f4695b`
BLAKE2b-256	`d7fc689a4a3edb4fecc017e4a0f1b66156a868ebd0e6e9bb60d15caffb9c2334`

Provenance

The following attestation bundles were made for toucandb-1.0.0.tar.gz:

Publisher: publish.yml on ToucanDB/ToucanDB

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: toucandb-1.0.0.tar.gz
- Subject digest: f3f622fcb5c1656ab8ecab96f7d784520907ed06579d40251b2bbf1407f6ffbc
- Sigstore transparency entry: 1152152811
- Sigstore integration time: Mar 20, 2026
Source repository:
- Permalink: ToucanDB/ToucanDB@8adfac94e0249e5af97e17ddd1d66dca4172d20e
- Branch / Tag: refs/heads/main
- Owner: https://github.com/ToucanDB
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@8adfac94e0249e5af97e17ddd1d66dca4172d20e
- Trigger Event: workflow_dispatch

File details

Details for the file toucandb-1.0.0-py3-none-any.whl.

File metadata

Download URL: toucandb-1.0.0-py3-none-any.whl
Upload date: Mar 20, 2026
Size: 28.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for toucandb-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2b2e4a0df474b0274e8d89e99ba3af915bc1ec752b105af697e792492c4b1c68`
MD5	`f5cfce1930aeee7722dc091d99eed447`
BLAKE2b-256	`f52d8c7533bd416ad0d66416a3578a0c6760699e1d9d3a354861da9db67feb87`

Provenance

The following attestation bundles were made for toucandb-1.0.0-py3-none-any.whl:

Publisher: publish.yml on ToucanDB/ToucanDB

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: toucandb-1.0.0-py3-none-any.whl
- Subject digest: 2b2e4a0df474b0274e8d89e99ba3af915bc1ec752b105af697e792492c4b1c68
- Sigstore transparency entry: 1152153021
- Sigstore integration time: Mar 20, 2026
Source repository:
- Permalink: ToucanDB/ToucanDB@8adfac94e0249e5af97e17ddd1d66dca4172d20e
- Branch / Tag: refs/heads/main
- Owner: https://github.com/ToucanDB
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@8adfac94e0249e5af97e17ddd1d66dca4172d20e
- Trigger Event: workflow_dispatch
