Modern CDS/ISIS implementation using SQLite FTS5

CapraISIS

Modern CDS/ISIS Implementation Using SQLite FTS5

CapraISIS brings UNESCO's revolutionary CDS/ISIS text database principles to modern Python. CDS/ISIS (Computerised Documentation Service / Integrated Set of Information Systems, 1985–2005) pioneered inverted file indexing for bibliographic records. CapraISIS implements the same principles using SQLite FTS5 as the storage backend.

Why CDS/ISIS Principles Still Matter

The CDS/ISIS architecture solved a fundamental problem: finding matching records among millions of text records with an O(log k) lookup per search term, where k is the number of distinct indexed terms. Modern systems like Elasticsearch, Lucene, and SQLite FTS5 all implement variations of the inverted file index that CDS/ISIS was built around.
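
To make the idea concrete, here is a minimal sketch of an inverted index in plain Python. It is a toy illustration, not CapraISIS code: the function names are hypothetical, tokenisation is a naive whitespace split, and the term dictionary is a Python dict (hash lookup) where CDS/ISIS and FTS5 use B-tree structures.

```python
from collections import defaultdict


def build_inverted_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map each lowercased token to the set of record IDs that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search_all(index: dict[str, set[int]], query: str) -> set[int]:
    """Return IDs of records containing every query term (implicit AND)."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()


docs = {
    1: "Quantum mechanics and consciousness",
    2: "Classical mechanics of rigid bodies",
    3: "Quantum field theory",
}
index = build_inverted_index(docs)
print(sorted(search_all(index, "quantum mechanics")))
```

A query never scans the record texts: it fetches one posting set per term and intersects them, which is why search cost depends mainly on the term lookup and the size of the posting lists rather than on corpus size.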

CapraISIS is:

  • Portable: Single SQLite file, no server required
  • Fast: Sub-100ms queries on 70M+ records
  • Simple: Pure Python, no dependencies beyond stdlib
  • Proven: Based on 40 years of CDS/ISIS architecture

Installation

pip install capraisis

Or install from source:

git clone https://github.com/capraCoder/capraisis
cd capraisis
pip install -e .

Quick Start

Python API

from capraisis import CapraIndex

# Create or open an index
index = CapraIndex("my_corpus.db")

# Add records
index.add(
    id="10.5281/zenodo.12345",
    title="Quantum Mechanics and Consciousness",
    content="A paper exploring the relationship between...",
    year=2024,
    prefix="10.5281"
)

# Bulk add
records = [
    ("10.1234/a", "Title A", "Content A", 2023, "10.1234"),
    ("10.1234/b", "Title B", "Content B", 2024, "10.1234"),
]
index.bulk_add(records)

# Search
results = index.search("quantum consciousness")
for r in results:
    print(f"{r['year']} | {r['id']} | {r['title']}")

# Boolean queries (FTS5 syntax)
results = index.search("quantum NOT classical")  # FTS5 NOT is a binary operator; "AND NOT" is a syntax error
results = index.search('"exact phrase"')
results = index.search("title:quantum")  # Field-specific

# Statistics
stats = index.stats()
print(f"Total records: {stats['total_records']:,}")

Command Line

# Build index from JSONL files
python -m capraisis build "data/*.jsonl" --output corpus.db

# Search
python -m capraisis search corpus.db "quantum mechanics"
python -m capraisis search corpus.db "neural network" --year 2024

# Show statistics
python -m capraisis stats corpus.db

# Benchmark
python -m capraisis benchmark corpus.db

Building Large Indices (e.g., DataCite)

For large datasets (millions of records), use the IndexBuilder:

from capraisis import IndexBuilder

builder = IndexBuilder(
    "datacite.db",
    batch_size=100_000,      # Commit every 100K records
    progress_interval=1_000_000  # Report every 1M records
)

# Build from DataCite JSONL files
stats = builder.add_jsonl_files(
    "/path/to/DataCite/**/*.jsonl",  # Adjust to your extraction path
    resume=True  # Skip already processed files
)

print(f"Indexed {stats['total_records']:,} records in {stats['elapsed_hours']:.2f} hours")
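
The batch_size setting above trades memory for commit overhead. As a rough illustration of that pattern using only sqlite3, the sketch below inserts records in fixed-size batches and commits once per batch. It is not the IndexBuilder implementation: the table layout, the column names, and the bulk_insert helper are assumptions made for this example.

```python
import sqlite3
from itertools import islice
from typing import Iterable

# Hypothetical schema: only title and content are tokenised and searchable.
CREATE_SQL = (
    "CREATE VIRTUAL TABLE docs USING fts5("
    "doc_id UNINDEXED, title, content, year UNINDEXED, prefix UNINDEXED)"
)
INSERT_SQL = (
    "INSERT INTO docs (doc_id, title, content, year, prefix) "
    "VALUES (?, ?, ?, ?, ?)"
)


def bulk_insert(conn: sqlite3.Connection, records: Iterable[tuple],
                batch_size: int = 100_000) -> int:
    """Insert records in batches, committing once per batch."""
    it = iter(records)
    total = 0
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return total
        with conn:  # commits on success, rolls back if the batch fails
            conn.executemany(INSERT_SQL, batch)
        total += len(batch)


conn = sqlite3.connect(":memory:")
conn.execute(CREATE_SQL)
records = (
    (f"10.1234/{i}", f"Title {i}", f"Content {i}", 2024, "10.1234")
    for i in range(2_500)
)
inserted = bulk_insert(conn, records, batch_size=1_000)
print(f"Inserted {inserted:,} records")
```

Because the records are consumed from a generator, only one batch is held in memory at a time, which matters when the input is hundreds of gigabytes.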

Obtaining DataCite Data

The DataCite Public Data File contains metadata for 70M+ DOIs. Download options:

| Source | URL |
| --- | --- |
| Official Repository | https://datafiles.datacite.org/ |
| Internet Archive | Search for "DataCite Public Data File" |
| DataCite API | https://support.datacite.org/docs/api |

Documentation: https://support.datacite.org/docs/datacite-public-data-file

File Format:

  • Large .tar file containing compressed NDJSON (newline-delimited JSON)
  • Each line = one bibliographic record with DOI, title, description, year, etc.
  • Uncompressed size: ~350 GB
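
If you want to inspect the records yourself before indexing, NDJSON can be streamed line by line with the standard library. The sketch below is an assumption-laden example rather than part of CapraISIS: iter_ndjson is a hypothetical helper, and it guesses gzip compression from a .gz suffix, which may not match the compression used in your copy of the data file.

```python
import gzip
import json
import tempfile
from pathlib import Path
from typing import Iterator


def iter_ndjson(path: Path) -> Iterator[dict]:
    """Yield one parsed record per non-empty line, skipping invalid JSON."""
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                continue  # tolerate a damaged line rather than abort


# Demo on a throwaway gzip file: two valid records and one damaged line.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.jsonl.gz"
    with gzip.open(sample, "wt", encoding="utf-8") as fh:
        fh.write('{"id": "10.1234/a", "year": 2023}\n')
        fh.write("not json\n")
        fh.write('{"id": "10.1234/b", "year": 2024}\n')
    records = list(iter_ndjson(sample))

print([r["id"] for r in records])
```

Streaming one line at a time keeps memory use flat regardless of file size.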

Processing:

# Extract the tar file
tar -xf DataCite_Public_Data_File_2024.tar -C /path/to/extraction/

# Then build with CapraISIS
python -m capraisis build "/path/to/extraction/**/*.jsonl" --output datacite.db

Usage: Subject to DataCite Data File Use Policy.

Custom Record Extractors

Define your own extractor for non-DataCite formats:

from capraisis import IndexBuilder

def my_extractor(rec: dict):
    """Extract fields from your JSON format."""
    return (
        rec.get('identifier'),      # id
        rec.get('name'),            # title
        rec.get('abstract', ''),    # content
        str(rec.get('year', '')),   # year
        rec.get('source', '')       # prefix
    )

builder = IndexBuilder("my_index.db")
builder.add_jsonl_files("my_data/*.jsonl", extractor=my_extractor)

FTS5 Query Syntax

CapraISIS supports the full SQLite FTS5 query syntax:

| Query | Meaning |
| --- | --- |
| quantum mechanics | Both terms (implicit AND) |
| quantum OR mechanics | Either term |
| quantum NOT classical | Exclude term |
| "quantum mechanics" | Exact phrase |
| quant* | Prefix match |
| title:quantum | Field-specific |
| NEAR(quantum mechanics, 5) | Within 5 tokens |
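
The query forms above can be tried directly against SQLite, independent of CapraISIS. The following sketch uses an in-memory FTS5 table with a hypothetical two-column schema (title, content) and a small helper, titles, introduced only for this example. It assumes your Python's SQLite build includes FTS5.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, content)")
conn.executemany(
    "INSERT INTO docs (title, content) VALUES (?, ?)",
    [
        ("Quantum mechanics", "An introduction to quantum theory"),
        ("Classical mechanics", "Newtonian motion, contrasted with quantum ideas"),
        ("Quantization of neural networks", "Reducing model precision"),
    ],
)


def titles(query: str) -> list[str]:
    """Run an FTS5 MATCH query and return matching titles in insertion order."""
    rows = conn.execute(
        "SELECT title FROM docs WHERE docs MATCH ? ORDER BY rowid", (query,)
    )
    return [row[0] for row in rows]


for query in (
    "quantum mechanics",           # both terms, in any column
    '"quantum mechanics"',         # exact phrase
    "quantum NOT classical",       # exclude rows containing "classical"
    "quant*",                      # prefix match
    "title:quantum",               # restrict to the title column
    "NEAR(quantum mechanics, 5)",  # terms close together
):
    print(f"{query!r:32} -> {titles(query)}")
```

Passing the query as a bound parameter keeps the SQL fixed, but the query string itself is still parsed as FTS5 syntax, so malformed user input raises sqlite3.OperationalError and should be caught by the caller.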

Performance

Benchmarks on 70M DataCite records (15GB index):

| Query | Results | Time |
| --- | --- | --- |
| polysemanticity | 847 | 12ms |
| neural network | 2.3M | 45ms |
| climate change | 890K | 38ms |

Target: <100ms per query

Scaling Proof: O(log k) Complexity

Search time remains nearly constant as corpus size increases 500×:

| Records | Avg Search (ms) | Build (s) |
| --- | --- | --- |
| 1,000 | 0.39 | 0.0 |
| 10,000 | 0.36 | 0.1 |
| 100,000 | 0.29 | 0.7 |
| 500,000 | 0.26 | 3.8 |

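These measurements are consistent with O(log k) retrieval, the defining characteristic of inverted file indexing, although four data points up to 500,000 records do not prove it. Extrapolated to 70M records, the trend suggests roughly 0.4ms per search on this workload. The times measured on the real 70M-record DataCite index (12–45ms, see Performance) are higher, presumably because large result sets and disk I/O dominate there.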
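
A scaling check of this kind is easy to reproduce. The sketch below is not the benchmark used for the table above; it is a small stand-in that builds synthetic in-memory FTS5 corpora of two sizes and times single-term queries. The vocabulary, corpus sizes, and helper names are all assumptions, and absolute timings will vary by machine.

```python
import random
import sqlite3
import time

VOCAB = [f"term{i}" for i in range(5_000)]  # synthetic vocabulary


def build_corpus(n_records: int, seed: int = 0) -> sqlite3.Connection:
    """Create an in-memory FTS5 table holding n_records random 20-word rows."""
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE VIRTUAL TABLE docs USING fts5(content)")
    conn.executemany(
        "INSERT INTO docs (content) VALUES (?)",
        ((" ".join(rng.choices(VOCAB, k=20)),) for _ in range(n_records)),
    )
    conn.commit()
    return conn


def avg_search_ms(conn: sqlite3.Connection, queries: list[str],
                  limit: int = 10) -> float:
    """Average wall-clock time per query in milliseconds."""
    start = time.perf_counter()
    for query in queries:
        conn.execute(
            "SELECT rowid FROM docs WHERE docs MATCH ? LIMIT ?", (query, limit)
        ).fetchall()
    return (time.perf_counter() - start) * 1000 / len(queries)


queries = random.Random(1).sample(VOCAB, 200)
timings = {}
for size in (1_000, 10_000):
    timings[size] = avg_search_ms(build_corpus(size), queries)
    print(f"{size:>7,} records: {timings[size]:.3f} ms/query")
```

Note that the LIMIT clause caps the work done per query. Without it, queries on common terms spend most of their time returning rows, which is one reason end-to-end times on a large real index can exceed what a pure lookup extrapolation predicts.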

Architecture

CapraISIS implements CDS/ISIS principles using modern tools:

| CDS/ISIS (1985) | CapraISIS (2026) |
| --- | --- |
| Master File (.MST) | SQLite database |
| Inverted File (.IFX) | FTS5 virtual table |
| Cross-Reference (.XRF) | B-tree index |
| ISIS Pascal | Python + sqlite3 |

The fundamental insight: FTS5 IS an inverted file index with B-tree organization. We're not emulating CDS/ISIS — we're using its spiritual successor.
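
One way to see the inverted file inside FTS5 is SQLite's fts5vocab virtual table, which exposes the term dictionary of an FTS5 index. The sketch below is a standalone illustration with a hypothetical one-column table, not a view into CapraISIS's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title)")
conn.executemany(
    "INSERT INTO docs (title) VALUES (?)",
    [
        ("quantum mechanics",),
        ("classical mechanics",),
        ("quantum field theory",),
    ],
)

# 'row' mode yields one row per distinct term:
#   term = the indexed token
#   doc  = number of rows containing the term
#   cnt  = total occurrences of the term across all rows
conn.execute("CREATE VIRTUAL TABLE docs_vocab USING fts5vocab(docs, row)")

vocab = conn.execute(
    "SELECT term, doc, cnt FROM docs_vocab ORDER BY term"
).fetchall()
for term, doc, cnt in vocab:
    print(f"{term:<10} rows={doc} occurrences={cnt}")
```

The sorted term list with per-term document counts plays the role that the dictionary of the CDS/ISIS inverted file played: a query looks up a term here first, then follows it to the matching records.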

History

CDS/ISIS was developed by UNESCO in 1985 as a text database system for libraries and documentation centres. It introduced several innovations:

  • Inverted file indexing: O(log k) term lookup
  • Variable-length fields: No fixed schema
  • Boolean retrieval: AND, OR, NOT operators
  • Repeatable fields: Multiple authors, subjects

These principles remain the foundation of modern search engines. CapraISIS honours this heritage while providing a modern, portable implementation.

Why Not Other Libraries?

Several Python full-text search libraries exist. Here's why CapraISIS chose SQLite FTS5:

| Library | Pros | Cons |
| --- | --- | --- |
| Whoosh | Pure Python, feature-rich | Unmaintained since 2015, memory-heavy, no async |
| Elasticsearch | Powerful, scalable | Server-based, operational complexity, overkill for local use |
| Xapian | Fast, mature | C++ bindings, installation complexity |
| SQLite FTS5 | Zero-config, stdlib, single-file | Less feature-rich than Elasticsearch |

Decision rationale:

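  • Zero dependencies: CapraISIS uses only Python's built-in sqlite3 module, which must be linked against an SQLite build with FTS5 enabled (typical for current CPython builds)
  • Single-file portability: Copy one .db file, search anywhere
  • Headroom at scale: SQLite supports database files up to about 281 TB, far above the 15GB index benchmarked here
  • Active maintenance: SQLite is actively developed (unlike Whoosh)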
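
Because FTS5 is a compile-time option of SQLite, it is worth confirming that your interpreter's SQLite build includes it before building an index. A minimal check, using a hypothetical helper name, could look like this:

```python
import sqlite3


def fts5_available() -> bool:
    """Return True if this Python's SQLite library can create FTS5 tables."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE VIRTUAL TABLE probe USING fts5(x)")
        return True
    except sqlite3.OperationalError:
        return False  # typically "no such module: fts5"
    finally:
        conn.close()


print(f"SQLite {sqlite3.sqlite_version}, FTS5 available: {fts5_available()}")
```

If the check fails, the usual remedy is a Python distribution built against a newer or more fully featured SQLite.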

For small to medium corpora (<100M records), FTS5 delivers Elasticsearch-class performance with shell-script simplicity.

License

MIT License. Use at your own risk.

Citation

If you use CapraISIS in research, please cite:

@software{capraisis2026,
  author = {Caprazli, Kafkas M.},
  title = {CapraISIS: Modern CDS/ISIS Implementation},
  year = {2026},
  url = {https://github.com/capraCoder/capraisis}
}

Author ORCID: 0000-0002-5744-8944

See Also

Historical Reference: Original WinISIS

The original UNESCO CDS/ISIS software is preserved in the historical/ folder for educational purposes:

  • Winisis1_4.zip — Original WinISIS 1.4 installer (UNESCO, ~1995)
  • ctl3d.dll — Required Windows dependency

Copyright: CDS/ISIS is a UNESCO product, provided free for non-commercial use. UNESCO retains all intellectual property rights.

Note: WinISIS is 16-bit legacy software requiring emulation on modern Windows. CapraISIS provides the same inverted-file indexing principles using modern SQLite FTS5 — no emulation required.


"The path forward is the Neural-to-Symbolic Bridge: using the semantic brilliance of modern AI to populate the structural perfection of classical indexing."
