CapraISIS

Modern CDS/ISIS Implementation Using SQLite FTS5

CapraISIS brings UNESCO's revolutionary CDS/ISIS text database principles to modern Python. CDS/ISIS (Computerised Documentation Service / Integrated Set of Information Systems, 1985–2005) pioneered inverted file indexing for bibliographic records. CapraISIS implements the same algorithms using SQLite FTS5 as the storage backend.

Why CDS/ISIS Principles Still Matter

The CDS/ISIS architecture solved a fundamental problem: finding matching records among millions of text records with O(log k) term lookup, where k is the number of distinct indexed terms. Modern systems like Elasticsearch, Lucene, and SQLite FTS5 all implement variations of the same inverted file index that CDS/ISIS popularized for bibliographic databases.
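
To make the idea concrete, here is a toy inverted file in plain Python. It is only an illustration of the principle, not CapraISIS code: the function names are hypothetical, tokenization is a bare whitespace split, and a Python dict gives hash lookups rather than the B-tree lookups a real inverted file uses.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase token to the sorted list of record ids containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            postings[token].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def search_all(index, query):
    """Return ids of records containing every query term (Boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        # Intersect posting lists; the records themselves are never scanned.
        result &= set(index.get(term, []))
    return sorted(result)

docs = {
    1: "Quantum mechanics and consciousness",
    2: "Classical mechanics for engineers",
    3: "Quantum computing basics",
}
index = build_inverted_index(docs)
print(search_all(index, "quantum mechanics"))  # [1]
```

The point of the structure is that a query touches only the posting lists of its own terms, so cost is driven by the term lookup and the size of those lists rather than by the size of the corpus.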

CapraISIS is:

  • Portable: Single SQLite file, no server required
  • Fast: Sub-100ms queries on 70M+ records
  • Simple: Pure Python, no dependencies beyond stdlib
  • Proven: Based on 40 years of CDS/ISIS architecture

Installation

pip install capraisis

Or install from source:

git clone https://github.com/capraCoder/capraisis
cd capraisis
pip install -e .

Quick Start

Python API

from capraisis import CapraIndex

# Create or open an index
index = CapraIndex("my_corpus.db")

# Add records
index.add(
    id="10.5281/zenodo.12345",
    title="Quantum Mechanics and Consciousness",
    content="A paper exploring the relationship between...",
    year=2024,
    prefix="10.5281"
)

# Bulk add
records = [
    ("10.1234/a", "Title A", "Content A", 2023, "10.1234"),
    ("10.1234/b", "Title B", "Content B", 2024, "10.1234"),
]
index.bulk_add(records)

# Search
results = index.search("quantum consciousness")
for r in results:
    print(f"{r['year']} | {r['id']} | {r['title']}")

# Boolean queries (FTS5 syntax)
results = index.search("quantum NOT classical")  # NOT is a binary operator in FTS5
results = index.search('"exact phrase"')
results = index.search("title:quantum")  # Field-specific

# Statistics
stats = index.stats()
print(f"Total records: {stats['total_records']:,}")
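
For readers curious how calls like these could map onto SQLite, here is a minimal sketch using only the standard library. The table name `records`, the column layout, and the helper functions are assumptions made for illustration; they are not taken from CapraISIS's actual internals.

```python
import sqlite3

def open_index(path=":memory:"):
    """Open a database and create a hypothetical FTS5 table if it is missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS records USING fts5("
        "id UNINDEXED, title, content, year UNINDEXED, doi_prefix UNINDEXED)"
    )
    return conn

def add_record(conn, record_id, title, content, year, doi_prefix):
    """Insert one record; only title and content are full-text indexed."""
    conn.execute(
        "INSERT INTO records (id, title, content, year, doi_prefix) "
        "VALUES (?, ?, ?, ?, ?)",
        (record_id, title, content, year, doi_prefix),
    )

def search(conn, query, limit=10):
    """Run an FTS5 MATCH query and return rows as dicts, best match first."""
    cur = conn.execute(
        "SELECT id, title, year FROM records "
        "WHERE records MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return [{"id": r[0], "title": r[1], "year": r[2]} for r in cur]

conn = open_index()
add_record(conn, "10.5281/zenodo.12345", "Quantum Mechanics and Consciousness",
           "A paper exploring the relationship between...", 2024, "10.5281")
add_record(conn, "10.1234/a", "Title A", "Content A", 2023, "10.1234")
for r in search(conn, "quantum consciousness"):
    print(f"{r['year']} | {r['id']} | {r['title']}")
```

Marking `id`, `year`, and `doi_prefix` as `UNINDEXED` keeps them retrievable without adding their values to the full-text index.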

Command Line

# Build index from JSONL files
python -m capraisis build "data/*.jsonl" --output corpus.db

# Search
python -m capraisis search corpus.db "quantum mechanics"
python -m capraisis search corpus.db "neural network" --year 2024

# Show statistics
python -m capraisis stats corpus.db

# Benchmark
python -m capraisis benchmark corpus.db

Building Large Indices (e.g., DataCite)

For large datasets (millions of records), use the IndexBuilder:

from capraisis import IndexBuilder

builder = IndexBuilder(
    "datacite.db",
    batch_size=100_000,      # Commit every 100K records
    progress_interval=1_000_000  # Report every 1M records
)

# Build from DataCite JSONL files
stats = builder.add_jsonl_files(
    "/path/to/DataCite/**/*.jsonl",  # Adjust to your extraction path
    resume=True  # Skip already processed files
)

print(f"Indexed {stats['total_records']:,} records in {stats['elapsed_hours']:.2f} hours")
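
The `batch_size` idea, committing once per batch instead of once per record, can be sketched with plain `sqlite3`. This is an assumption about the general technique, not a description of how `IndexBuilder` is implemented; the `docs` table and the helper names are hypothetical.

```python
import sqlite3
from itertools import islice

def chunks(iterable, size):
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def bulk_insert(conn, rows, batch_size=100_000):
    """Insert (id, title, content) rows, committing once per batch."""
    total = 0
    for chunk in chunks(rows, batch_size):
        conn.executemany(
            "INSERT INTO docs (id, title, content) VALUES (?, ?, ?)", chunk
        )
        conn.commit()  # one transaction per batch keeps memory and journal size bounded
        total += len(chunk)
    return total

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(id UNINDEXED, title, content)")
rows = ((f"10.1234/{i}", f"Title {i}", f"Content {i}") for i in range(250))
print(bulk_insert(conn, rows, batch_size=100))  # 250
```

Because `rows` is consumed lazily, the same pattern works when the records come from a generator that streams a large file.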

Obtaining DataCite Data

The DataCite Public Data File contains metadata for 70M+ DOIs. Download options:

Source               URL
Official Repository  https://datafiles.datacite.org/
Internet Archive     Search for "DataCite Public Data File"
DataCite API         https://support.datacite.org/docs/api

Documentation: https://support.datacite.org/docs/datacite-public-data-file

File Format:

  • Large .tar file containing compressed NDJSON (newline-delimited JSON)
  • Each line = one bibliographic record with DOI, title, description, year, etc.
  • Uncompressed size: ~350 GB
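
Reading NDJSON one line at a time keeps memory use flat regardless of file size. The sketch below assumes the extracted files are either gzip-compressed (ending in `.gz`) or plain text; the compression format is an assumption, and `iter_ndjson` is a hypothetical helper, not part of CapraISIS.

```python
import gzip
import json

def iter_ndjson(path):
    """Yield one dict per non-empty line of a plain or gzip-compressed NDJSON file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines between records
                yield json.loads(line)
```

Since the function is a generator, a caller can stop early or feed the records straight into a batched insert without holding the whole file in memory.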

Processing:

# Extract the tar file
tar -xf DataCite_Public_Data_File_2024.tar -C /path/to/extraction/

# Then build with CapraISIS
python -m capraisis build "/path/to/extraction/**/*.jsonl" --output datacite.db

Usage: Subject to DataCite Data File Use Policy.

Custom Record Extractors

Define your own extractor for non-DataCite formats:

from capraisis import IndexBuilder

def my_extractor(rec: dict):
    """Extract fields from your JSON format."""
    return (
        rec.get('identifier'),      # id
        rec.get('name'),            # title
        rec.get('abstract', ''),    # content
        str(rec.get('year', '')),   # year
        rec.get('source', '')       # prefix
    )

builder = IndexBuilder("my_index.db")
builder.add_jsonl_files("my_data/*.jsonl", extractor=my_extractor)
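
Real-world records are often nested and incomplete, so an extractor usually needs to be defensive. The sketch below assumes a layout modelled on the DataCite REST API (`attributes.titles`, `attributes.descriptions`, `attributes.publicationYear`); treat that layout as an assumption and check it against your own files. It returns the same five-field tuple as the example above.

```python
def datacite_style_extractor(rec: dict):
    """Return (id, title, content, year, prefix) from a nested record.

    Assumed (hypothetical) layout:
    {"id": ..., "attributes": {"titles": [{"title": ...}],
     "descriptions": [{"description": ...}], "publicationYear": ...}}
    """
    attrs = rec.get("attributes") or {}
    doi = rec.get("id") or attrs.get("doi") or ""
    titles = attrs.get("titles") or []
    descriptions = attrs.get("descriptions") or []
    # Take the first title/description if present, else fall back to "".
    title = titles[0].get("title", "") if titles else ""
    content = descriptions[0].get("description", "") if descriptions else ""
    year = str(attrs.get("publicationYear") or "")
    # A DOI prefix is the part before the first slash, e.g. "10.5281".
    prefix = doi.split("/", 1)[0] if "/" in doi else ""
    return (doi, title, content, year, prefix)

sample = {
    "id": "10.5281/zenodo.12345",
    "attributes": {
        "titles": [{"title": "Quantum Mechanics and Consciousness"}],
        "descriptions": [{"description": "A paper exploring the relationship between..."}],
        "publicationYear": 2024,
    },
}
print(datacite_style_extractor(sample))
```

A function like this could then be passed as `extractor=datacite_style_extractor` in the same way as `my_extractor`.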

FTS5 Query Syntax

CapraISIS supports the full SQLite FTS5 query syntax:

Query                       Meaning
quantum mechanics           Both terms (implicit AND)
quantum OR mechanics        Either term
quantum NOT classical       Exclude term
"quantum mechanics"         Exact phrase
quant*                      Prefix match
title:quantum               Field-specific
NEAR(quantum mechanics, 5)  Within 5 tokens
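
These query forms can be tried directly against SQLite without CapraISIS. The sketch below uses a throwaway in-memory table; the table name `docs`, its two columns, and the sample rows are made up for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, content)")
conn.executemany(
    "INSERT INTO docs (title, content) VALUES (?, ?)",
    [
        ("Quantum mechanics", "An introduction to quantum theory"),
        ("Classical mechanics", "Newtonian motion"),
        ("Quantization of signals", "Sampling for digital audio"),
        ("Field theory", "Where classical and quantum descriptions meet"),
    ],
)

def titles(query):
    """Return the titles matching an FTS5 query, in insertion order."""
    cur = conn.execute(
        "SELECT title FROM docs WHERE docs MATCH ? ORDER BY rowid", (query,)
    )
    return [row[0] for row in cur]

queries = [
    "quantum mechanics",           # implicit AND
    "quantum OR mechanics",        # either term
    "quantum NOT classical",       # binary NOT: left side minus right side
    '"quantum mechanics"',         # exact phrase
    "quant*",                      # prefix match
    "title:quantum",               # restrict to one column
    "NEAR(quantum mechanics, 5)",  # terms close together
]
for q in queries:
    print(f"{q!r:32} -> {titles(q)}")
```

Note that FTS5 treats `NOT` as a binary operator, so a query of the form `a AND NOT b` is rejected with a syntax error; write `a NOT b` instead.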

Performance

Benchmarks on 70M DataCite records (15GB index):

Query            Results  Time
polysemanticity  847      12ms
neural network   2.3M     45ms
climate change   890K     38ms

Target: <100ms per query

Scaling Proof: O(log k) Complexity

Search time remains nearly constant as corpus size increases 500×:

Records  Avg Search (ms)  Build (s)
1,000    0.39             0.0
10,000   0.36             0.1
100,000  0.29             0.7
500,000  0.26             3.8

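These measurements are consistent with O(log k) retrieval, the defining characteristic of inverted file indexing. Extrapolating the trend to 70M records gives roughly 0.4 ms per search; the 12–45 ms measured on the 70M-record DataCite index are higher, likely because total query time also depends on how many records match.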
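
A measurement of this kind can be approximated with a few lines of standard-library Python. The sketch below is not the benchmark that produced the table: the synthetic corpus, the `LIMIT 10`, and the function names are all assumptions, and absolute timings will vary by machine.

```python
import sqlite3
import time

def build_corpus(n):
    """Create an in-memory FTS5 table holding n short synthetic records."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE VIRTUAL TABLE docs USING fts5(content)")
    conn.executemany(
        "INSERT INTO docs (content) VALUES (?)",
        ((f"record {i} token{i % 1000} filler text",) for i in range(n)),
    )
    conn.commit()
    return conn

def avg_search_ms(conn, query, repeats=20):
    """Average wall-clock time of one MATCH query, in milliseconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        conn.execute(
            "SELECT rowid FROM docs WHERE docs MATCH ? LIMIT 10", (query,)
        ).fetchall()
    return (time.perf_counter() - start) * 1000 / repeats

if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000):
        conn = build_corpus(n)
        print(f"{n:>8,} records: {avg_search_ms(conn, 'token42'):.3f} ms")
```

Capping the result set with `LIMIT` isolates the cost of the term lookup; removing it shows how query time grows with the number of matching records.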

Architecture

CapraISIS implements CDS/ISIS principles using modern tools:

CDS/ISIS (1985)         CapraISIS (2026)
Master File (.MST)      SQLite database
Inverted File (.IFX)    FTS5 virtual table
Cross-Reference (.XRF)  B-tree index
ISIS Pascal             Python + sqlite3

The fundamental insight: FTS5 IS an inverted file index with B-tree organization. We're not emulating CDS/ISIS — we're using its spiritual successor.
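
One way to look at the inverted file directly is SQLite's `fts5vocab` virtual table, which exposes the term dictionary of an existing FTS5 table. The sketch below uses a made-up `docs` table; it shows standard SQLite behaviour and says nothing about how CapraISIS names its own tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title)")
conn.executemany(
    "INSERT INTO docs (title) VALUES (?)",
    [("Quantum mechanics",), ("Classical mechanics",), ("Quantum computing",)],
)
# A 'row' vocab table has one row per distinct term in the index.
conn.execute("CREATE VIRTUAL TABLE docs_vocab USING fts5vocab(docs, 'row')")

def term_stats():
    """Return {term: (rows containing it, total occurrences)}, sorted by term."""
    cur = conn.execute("SELECT term, doc, cnt FROM docs_vocab ORDER BY term")
    return {term: (doc, cnt) for term, doc, cnt in cur}

for term, (doc, cnt) in term_stats().items():
    print(f"{term:10} rows={doc} occurrences={cnt}")
```

Each row corresponds to one dictionary entry of the inverted file: a term plus the number of records in its posting list.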

History

CDS/ISIS was developed by UNESCO in 1985 as a text database system for libraries and documentation centres. It introduced several innovations:

  • Inverted file indexing: O(log k) term lookup
  • Variable-length fields: No fixed schema
  • Boolean retrieval: AND, OR, NOT operators
  • Repeatable fields: Multiple authors, subjects

These principles remain the foundation of modern search engines. CapraISIS honours this heritage while providing a modern, portable implementation.

Why Not Other Libraries?

Several Python full-text search libraries exist. Here's why CapraISIS chose SQLite FTS5:

Library        Pros                              Cons
Whoosh         Pure Python, feature-rich         Unmaintained since 2015, memory-heavy, no async
Elasticsearch  Powerful, scalable                Server-based, operational complexity, overkill for local use
Xapian         Fast, mature                      C++ bindings, installation complexity
SQLite FTS5    Zero-config, stdlib, single-file  Less feature-rich than Elasticsearch

Decision rationale:

  • Zero dependencies: CapraISIS uses only Python's built-in sqlite3 module
  • Single-file portability: Copy one .db file, search anywhere
  • Proven at scale: SQLite supports database files up to about 281 TB
  • Active maintenance: SQLite is actively developed (unlike Whoosh)

For small to medium corpora (<100M records), FTS5 delivers Elasticsearch-class performance with shell-script simplicity.
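
The zero-dependency claim rests on the SQLite library bundled with Python having been compiled with FTS5, which is common but not guaranteed. A quick check, using a hypothetical helper name:

```python
import sqlite3

def fts5_available():
    """Return True if this Python's SQLite build can create FTS5 tables."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE VIRTUAL TABLE probe USING fts5(x)")
        return True
    except sqlite3.OperationalError:
        # Raised as "no such module: fts5" when the extension is missing.
        return False
    finally:
        conn.close()

print(f"SQLite {sqlite3.sqlite_version}, FTS5 available: {fts5_available()}")
```

If this reports `False`, the usual remedy is a Python build linked against a newer or fuller SQLite library.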

License

MIT License. Use at your own risk.

Citation

If you use CapraISIS in research, please cite:

@software{capraisis2026,
  author = {Caprazli, Kafkas M.},
  title = {CapraISIS: Modern CDS/ISIS Implementation},
  year = {2026},
  url = {https://github.com/capraCoder/capraisis}
}

Author ORCID: 0000-0002-5744-8944

See Also

Historical Reference: Original WinISIS

The original UNESCO CDS/ISIS software is preserved in the historical/ folder for educational purposes:

  • Winisis1_4.zip — Original WinISIS 1.4 installer (UNESCO, ~1995)
  • ctl3d.dll — Required Windows dependency

Copyright: CDS/ISIS is a UNESCO product, provided free for non-commercial use. UNESCO retains all intellectual property rights.

Note: WinISIS is 16-bit legacy software requiring emulation on modern Windows. CapraISIS provides the same inverted-file indexing principles using modern SQLite FTS5 — no emulation required.


"The path forward is the Neural-to-Symbolic Bridge: using the semantic brilliance of modern AI to populate the structural perfection of classical indexing."
