CapraISIS

Modern CDS/ISIS Implementation Using SQLite FTS5

CapraISIS brings the principles of UNESCO's CDS/ISIS text database to modern Python. CDS/ISIS (Computerised Documentation Service / Integrated Set of Information Systems, 1985–2005) made inverted file indexing widely available for bibliographic records. CapraISIS applies the same indexing principles, using SQLite FTS5 as the storage backend.

Why CDS/ISIS Principles Still Matter

The CDS/ISIS architecture solved a fundamental problem: fast retrieval from millions of text records, with term lookup in O(log k) time (taking k as the number of distinct indexed terms). Modern systems like Elasticsearch, Lucene, and SQLite FTS5 all implement variations of the same inverted file index that CDS/ISIS relied on.
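
To make the idea concrete, here is a minimal sketch of an inverted file in plain Python. It is an illustration only, not CapraISIS or FTS5 code; the names `build_inverted_index` and `search_all` are hypothetical. Each term maps to a sorted list of record ids (its postings), and a multi-term query intersects those lists.

```python
from collections import defaultdict

def build_inverted_index(records):
    """Map each lowercased term to the sorted list of record ids containing it."""
    postings = defaultdict(set)
    for rec_id, text in records.items():
        for term in text.lower().split():
            postings[term].add(rec_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def search_all(index, query):
    """Return ids of records containing every query term (implicit AND)."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)

records = {
    1: "Quantum mechanics and consciousness",
    2: "Classical mechanics for engineers",
    3: "Quantum field theory",
}
index = build_inverted_index(records)
print(search_all(index, "quantum mechanics"))  # [1]
```

The sketch uses a Python dict for the term dictionary, so it does not show the O(log k) behaviour itself. A disk-based system would keep the terms in a B-tree, which is where the logarithmic lookup comes from.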

CapraISIS is:

Portable: Single SQLite file, no server required
Fast: Sub-100ms queries on 70M+ records
Simple: Pure Python, no dependencies beyond stdlib
Proven: Based on 40 years of CDS/ISIS architecture

Installation

pip install capraisis

Or install from source:

git clone https://github.com/capraCoder/capraisis
cd capraisis
pip install -e .

Quick Start

Python API

from capraisis import CapraIndex

# Create or open an index
index = CapraIndex("my_corpus.db")

# Add records
index.add(
    id="10.5281/zenodo.12345",
    title="Quantum Mechanics and Consciousness",
    content="A paper exploring the relationship between...",
    year=2024,
    prefix="10.5281"
)

# Bulk add
records = [
    ("10.1234/a", "Title A", "Content A", 2023, "10.1234"),
    ("10.1234/b", "Title B", "Content B", 2024, "10.1234"),
]
index.bulk_add(records)

# Search
results = index.search("quantum consciousness")
for r in results:
    print(f"{r['year']} | {r['id']} | {r['title']}")

# Boolean queries (FTS5 syntax)
results = index.search("quantum NOT classical")  # NOT is a binary operator in FTS5
results = index.search('"exact phrase"')
results = index.search("title:quantum")  # Field-specific

# Statistics
stats = index.stats()
print(f"Total records: {stats['total_records']:,}")

Command Line

# Build index from JSONL files
python -m capraisis build "data/*.jsonl" --output corpus.db

# Search
python -m capraisis search corpus.db "quantum mechanics"
python -m capraisis search corpus.db "neural network" --year 2024

# Show statistics
python -m capraisis stats corpus.db

# Benchmark
python -m capraisis benchmark corpus.db

Building Large Indices (e.g., DataCite)

For large datasets (millions of records), use the IndexBuilder:

from capraisis import IndexBuilder

builder = IndexBuilder(
    "datacite.db",
    batch_size=100_000,      # Commit every 100K records
    progress_interval=1_000_000  # Report every 1M records
)

# Build from DataCite JSONL files
stats = builder.add_jsonl_files(
    "/path/to/DataCite/**/*.jsonl",  # Adjust to your extraction path
    resume=True  # Skip already processed files
)

print(f"Indexed {stats['total_records']:,} records in {stats['elapsed_hours']:.2f} hours")
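
The batching idea behind `batch_size` can be sketched with the standard library alone. This is not the `IndexBuilder` source: the table name `docs`, its columns, and the helper `insert_in_batches` are assumptions made for illustration. Rows are inserted with `executemany` and committed once per batch rather than once per row.

```python
import sqlite3
from itertools import islice

def insert_in_batches(conn, rows, batch_size=2):
    """Insert rows into the hypothetical `docs` table, committing once per batch."""
    it = iter(rows)
    total = 0
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        conn.executemany(
            "INSERT INTO docs (id, title, content) VALUES (?, ?, ?)", batch
        )
        conn.commit()  # one commit per batch, not per row
        total += len(batch)
    return total

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id TEXT PRIMARY KEY, title TEXT, content TEXT)")
rows = [(f"10.1234/{i}", f"Title {i}", f"Content {i}") for i in range(5)]
inserted = insert_in_batches(conn, rows, batch_size=2)
print(inserted)  # 5
```

Larger batches mean fewer commits and usually faster builds, but more work is lost if the process is interrupted mid-batch.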

Obtaining DataCite Data

The DataCite Public Data File contains metadata for 70M+ DOIs. Download options:

Source	URL
Official Repository	https://datafiles.datacite.org/
Internet Archive	Search for "DataCite Public Data File"
DataCite API	https://support.datacite.org/docs/api

Documentation: https://support.datacite.org/docs/datacite-public-data-file

File Format:

Large .tar file containing compressed NDJSON (newline-delimited JSON)
Each line = one bibliographic record with DOI, title, description, year, etc.
Uncompressed size: ~350 GB
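
Reading NDJSON needs nothing beyond the standard library. The sketch below parses one record per line and skips blank lines. The helper name `iter_ndjson` and the sample field names are hypothetical and do not reflect the actual DataCite schema.

```python
import io
import json

def iter_ndjson(lines):
    """Yield one dict per non-empty line of newline-delimited JSON."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

# In practice `lines` would be an open file (or gzip.open(path, "rt") for
# compressed input); an in-memory buffer keeps the example self-contained.
sample = io.StringIO(
    '{"id": "10.1234/a", "title": "Title A"}\n'
    '\n'
    '{"id": "10.1234/b", "title": "Title B"}\n'
)
records = list(iter_ndjson(sample))
print(len(records), records[0]["id"])  # 2 10.1234/a
```

Because the function is a generator, records are parsed one at a time, so memory use stays flat regardless of file size.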

Processing:

# Extract the tar file
tar -xf DataCite_Public_Data_File_2024.tar -C /path/to/extraction/

# Then build with CapraISIS
python -m capraisis build "/path/to/extraction/**/*.jsonl" --output datacite.db

Usage: Subject to DataCite Data File Use Policy.

Custom Record Extractors

Define your own extractor for non-DataCite formats:

from capraisis import IndexBuilder

def my_extractor(rec: dict):
    """Extract fields from your JSON format."""
    return (
        rec.get('identifier'),      # id
        rec.get('name'),            # title
        rec.get('abstract', ''),    # content
        str(rec.get('year', '')),   # year
        rec.get('source', '')       # prefix
    )

builder = IndexBuilder("my_index.db")
builder.add_jsonl_files("my_data/*.jsonl", extractor=my_extractor)
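
Before launching a long build, it may help to run the extractor on a sample record. The sketch repeats the extractor above so that it runs standalone. The sample record and the `check_row` helper are invented for illustration, and whether `IndexBuilder` itself rejects rows without an id is not stated here.

```python
def my_extractor(rec: dict):
    """Same extractor as above, repeated so the example is self-contained."""
    return (
        rec.get('identifier'),      # id
        rec.get('name'),            # title
        rec.get('abstract', ''),    # content
        str(rec.get('year', '')),   # year
        rec.get('source', ''),      # prefix
    )

def check_row(row):
    """Hypothetical sanity check: five fields and a non-empty id."""
    return len(row) == 5 and bool(row[0])

sample = {"identifier": "abc-1", "name": "A Title", "year": 2021}
row = my_extractor(sample)
print(row)             # ('abc-1', 'A Title', '', '2021', '')
print(check_row(row))  # True
```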

FTS5 Query Syntax

CapraISIS supports the full SQLite FTS5 query syntax:

Query	Meaning
`quantum mechanics`	Both terms (implicit AND)
`quantum OR mechanics`	Either term
`quantum NOT classical`	Exclude term
`"quantum mechanics"`	Exact phrase
`quant*`	Prefix match
`title:quantum`	Field-specific
`NEAR(quantum mechanics, 5)`	Within 5 tokens
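
The query forms in the table can be tried directly against SQLite with Python's `sqlite3` module. The sketch below assumes an FTS5-enabled SQLite build and a hypothetical two-column table named `docs`; the real CapraISIS schema may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical two-column schema; CapraISIS's real schema may differ.
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, content)")
conn.executemany(
    "INSERT INTO docs (title, content) VALUES (?, ?)",
    [
        ("Quantum mechanics", "An introduction to quantum theory"),
        ("Classical mechanics", "Newtonian motion and quantum contrasts"),
        ("Quantization of fields", "Advanced topics"),
    ],
)

def titles(query):
    """Return matching titles in rowid order for an FTS5 MATCH query."""
    rows = conn.execute(
        "SELECT title FROM docs WHERE docs MATCH ? ORDER BY rowid", (query,)
    )
    return [r[0] for r in rows]

for q in (
    "quantum mechanics",            # implicit AND
    "quantum NOT classical",        # exclude term
    '"quantum mechanics"',          # exact phrase
    "quant*",                       # prefix match
    "title:quantum",                # column filter
    "NEAR(quantum mechanics, 5)",   # proximity
):
    print(f"{q!r:32} -> {titles(q)}")
```

Note that phrase and `NEAR` queries match within a single column, so the second record (with "mechanics" in the title and "quantum" in the content) matches the implicit AND but not the phrase or proximity forms.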

Performance

Benchmarks on 70M DataCite records (15GB index):

Query	Results	Time
`polysemanticity`	847	12ms
`neural network`	2.3M	45ms
`climate change`	890K	38ms

Target: <100ms per query ✓

Scaling Proof: O(log k) Complexity

Search time remains nearly constant as corpus size increases 500×:

Records	Avg Search (ms)	Build (s)
1,000	0.39	0.0
10,000	0.36	0.1
100,000	0.29	0.7
500,000	0.26	3.8

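These timings are consistent with O(log k) retrieval, the defining characteristic of inverted file indexing, although four data points on small corpora do not prove it. Extrapolated to 70M records, the trend suggests ~0.4ms per search; the measured query times in the Performance section (12–45ms) are higher, so treat the extrapolation as a best case.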
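
A table like the one above could be produced with a small timing harness. The sketch below is not the project's benchmark: it uses synthetic records, an in-memory database, and hypothetical helper names (`build_corpus`, `avg_search_ms`), and it assumes an FTS5-enabled SQLite build. Absolute timings will vary by machine.

```python
import sqlite3
import time

def build_corpus(n):
    """Create an in-memory FTS5 table with n synthetic records."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE VIRTUAL TABLE docs USING fts5(content)")
    conn.executemany(
        "INSERT INTO docs (content) VALUES (?)",
        ((f"record{i} topic{i % 100} filler text",) for i in range(n)),
    )
    conn.commit()
    return conn

def avg_search_ms(conn, query, repeats=20):
    """Average wall-clock time of a MATCH query, in milliseconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        conn.execute(
            "SELECT rowid FROM docs WHERE docs MATCH ? LIMIT 10", (query,)
        ).fetchall()
    return (time.perf_counter() - start) * 1000 / repeats

for n in (1_000, 10_000):
    conn = build_corpus(n)
    print(f"{n:>7,} records: {avg_search_ms(conn, 'record42'):.3f} ms")
    conn.close()
```

Searching for a term that occurs once (`record42`) mostly measures dictionary lookup. Terms with millions of postings cost more, which is one plausible reason real-world query times exceed a lookup-only extrapolation.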

Architecture

CapraISIS implements CDS/ISIS principles using modern tools:

CDS/ISIS (1985)	CapraISIS (2026)
Master File (.MST)	SQLite database
Inverted File (.IFP)	FTS5 virtual table
Cross-Reference (.XRF)	B-tree index
ISIS Pascal	Python + sqlite3

The fundamental insight: FTS5 IS an inverted file index with B-tree organization. We're not emulating CDS/ISIS — we're using its spiritual successor.
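
The inverted-file nature of FTS5 can be observed through its `fts5vocab` companion table, which exposes the term dictionary. The sketch below assumes an FTS5-enabled SQLite build; the table names `docs` and `docs_vocab` are illustrative and not taken from CapraISIS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title)")
conn.executemany(
    "INSERT INTO docs (title) VALUES (?)",
    [("Quantum mechanics",), ("Classical mechanics",), ("Quantum fields",)],
)
# fts5vocab exposes the term dictionary of an FTS5 table:
# with the 'row' type, one row per term plus the number of rows containing it.
conn.execute("CREATE VIRTUAL TABLE docs_vocab USING fts5vocab(docs, 'row')")
vocab = conn.execute("SELECT term, doc FROM docs_vocab ORDER BY term").fetchall()
print(vocab)
# [('classical', 1), ('fields', 1), ('mechanics', 2), ('quantum', 2)]
```

Each row is a dictionary entry with its document count, which is the same term-to-postings structure that the CDS/ISIS inverted file maintained.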

History

CDS/ISIS was released by UNESCO in 1985 as a text database system for libraries and documentation centres. It made several techniques widely accessible:

Inverted file indexing: O(log k) term lookup
Variable-length fields: No fixed schema
Boolean retrieval: AND, OR, NOT operators
Repeatable fields: Multiple authors, subjects

These principles remain the foundation of modern search engines. CapraISIS honours this heritage while providing a modern, portable implementation.

Why Not Other Libraries?

Several Python full-text search libraries exist. Here's why CapraISIS chose SQLite FTS5:

Library	Pros	Cons
Whoosh	Pure Python, feature-rich	No releases since 2016, memory-heavy, no async
Elasticsearch	Powerful, scalable	Server-based, operational complexity, overkill for local use
Xapian	Fast, mature	C++ library with Python bindings, installation complexity
SQLite FTS5	Zero-config, stdlib, single-file	Less feature-rich than Elasticsearch

Decision rationale:

Zero dependencies: CapraISIS uses only Python's built-in sqlite3 module
Single-file portability: Copy one .db file, search anywhere
Proven at scale: SQLite supports database files up to roughly 281 TB (a theoretical limit)
Active maintenance: SQLite is actively developed (unlike Whoosh)
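
The 281 TB figure follows from SQLite's documented limits on page count and page size. A quick check of the arithmetic, with constants as published in SQLite's "Limits In SQLite" page:

```python
# Limits taken from the SQLite documentation ("Limits In SQLite").
MAX_PAGE_COUNT = 4_294_967_294   # maximum number of pages in a database file
MAX_PAGE_SIZE = 65_536           # maximum page size in bytes

max_bytes = MAX_PAGE_COUNT * MAX_PAGE_SIZE
print(f"{max_bytes:,} bytes = {max_bytes / 1e12:.0f} TB")
# 281,474,976,579,584 bytes = 281 TB
```

This is an upper bound on file size, not a tested operating point; practical limits are set by the filesystem and hardware long before it is reached.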

For small to medium corpora (<100M records), FTS5 delivers Elasticsearch-class performance with shell-script simplicity.

License

MIT License. Use at your own risk.

Citation

If you use CapraISIS in research, please cite:

@software{capraisis2026,
  author = {Caprazli, Kafkas M.},
  title = {CapraISIS: Modern CDS/ISIS Implementation},
  year = {2026},
  url = {https://github.com/capraCoder/capraisis}
}

Author ORCID: 0000-0002-5744-8944

Historical Reference: Original WinISIS

The original UNESCO CDS/ISIS software is preserved in the historical/ folder for educational purposes:

Winisis1_4.zip — Original WinISIS 1.4 installer (UNESCO, ~1995)
ctl3d.dll — Required Windows dependency

Copyright: CDS/ISIS is a UNESCO product, provided free for non-commercial use. UNESCO retains all intellectual property rights.

Note: WinISIS is 16-bit legacy software requiring emulation on modern Windows. CapraISIS provides the same inverted-file indexing principles using modern SQLite FTS5 — no emulation required.

"The path forward is the Neural-to-Symbolic Bridge: using the semantic brilliance of modern AI to populate the structural perfection of classical indexing."

Release history

0.1.1 (this version): Jan 12, 2026
0.1.0: Jan 12, 2026
