Modern CDS/ISIS implementation using SQLite FTS5
CapraISIS
Modern CDS/ISIS Implementation Using SQLite FTS5
CapraISIS brings UNESCO's revolutionary CDS/ISIS text database principles to modern Python. CDS/ISIS (Computerised Documentation System / Integrated Set of Information Systems, 1985–2005) pioneered inverted file indexing for bibliographic records. CapraISIS implements the same algorithms using SQLite FTS5 as the storage backend.
Why CDS/ISIS Principles Still Matter
The CDS/ISIS architecture solved a fundamental problem: O(log k) retrieval from millions of text records. Modern systems like Elasticsearch, Lucene, and SQLite FTS5 all implement variations of the same inverted file index that CDS/ISIS pioneered.
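To make the idea concrete, here is a minimal in-memory sketch of an inverted file index. It is an illustration only, not CapraISIS code: the helper names `build_inverted_index` and `search_all` are hypothetical, tokenization is a plain whitespace split, and a Python dict stands in for the on-disk B-tree that a real system would use for the term dictionary.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_all(index, query):
    """Return ids of documents containing every term in the query (AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Look up one posting set per term, then intersect them.
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


docs = {
    1: "quantum mechanics and consciousness",
    2: "classical mechanics of rigid bodies",
    3: "quantum field theory",
}
index = build_inverted_index(docs)
print(sorted(search_all(index, "quantum mechanics")))
```

The key property is that a query touches only the posting sets of its own terms and never scans the documents themselves.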
CapraISIS is:
- Portable: Single SQLite file, no server required
- Fast: Sub-100ms queries on 70M+ records
- Simple: Pure Python, no dependencies beyond stdlib
- Proven: Based on 40 years of CDS/ISIS architecture
Installation
pip install capraisis
Or install from source:
git clone https://github.com/capraCoder/capraisis
cd capraisis
pip install -e .
Quick Start
Python API
from capraisis import CapraIndex
# Create or open an index
index = CapraIndex("my_corpus.db")
# Add records
index.add(
    id="10.5281/zenodo.12345",
    title="Quantum Mechanics and Consciousness",
    content="A paper exploring the relationship between...",
    year=2024,
    prefix="10.5281"
)
# Bulk add
records = [
    ("10.1234/a", "Title A", "Content A", 2023, "10.1234"),
    ("10.1234/b", "Title B", "Content B", 2024, "10.1234"),
]
index.bulk_add(records)
# Search
results = index.search("quantum consciousness")
for r in results:
    print(f"{r['year']} | {r['id']} | {r['title']}")
# Boolean queries (FTS5 syntax)
results = index.search("quantum NOT classical")  # NOT is a binary operator in FTS5
results = index.search('"exact phrase"')
results = index.search("title:quantum") # Field-specific
# Statistics
stats = index.stats()
print(f"Total records: {stats['total_records']:,}")
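For readers curious what such an API could look like underneath, the following is a rough sketch built directly on the standard-library `sqlite3` module. `TinyIndex` is a hypothetical stand-in written for this example and is not the actual `CapraIndex` class; the table name `records` and the choice of which columns are `UNINDEXED` are assumptions. It also assumes your Python's SQLite build includes FTS5, which is common but not guaranteed.

```python
import sqlite3


class TinyIndex:
    """Hypothetical stand-in for a CapraIndex-like wrapper (not the real class)."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.row_factory = sqlite3.Row
        # id, year and prefix are stored but not tokenized (UNINDEXED).
        self.conn.execute(
            "CREATE VIRTUAL TABLE IF NOT EXISTS records USING fts5("
            "id UNINDEXED, title, content, year UNINDEXED, prefix UNINDEXED)"
        )

    def add(self, id, title, content, year, prefix):
        self.conn.execute(
            "INSERT INTO records (id, title, content, year, prefix) "
            "VALUES (?, ?, ?, ?, ?)",
            (id, title, content, year, prefix),
        )
        self.conn.commit()

    def search(self, query, limit=10):
        # 'rank' is FTS5's built-in relevance column (bm25 by default).
        rows = self.conn.execute(
            "SELECT id, title, year FROM records "
            "WHERE records MATCH ? ORDER BY rank LIMIT ?",
            (query, limit),
        )
        return [dict(row) for row in rows]


index = TinyIndex()
index.add(
    "10.5281/zenodo.12345",
    "Quantum Mechanics and Consciousness",
    "A paper exploring the relationship between...",
    2024,
    "10.5281",
)
for r in index.search("quantum consciousness"):
    print(f"{r['year']} | {r['id']} | {r['title']}")
```

Marking identifier-like columns as `UNINDEXED` keeps them retrievable without adding their contents to the full-text index.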
Command Line
# Build index from JSONL files
python -m capraisis build "data/*.jsonl" --output corpus.db
# Search
python -m capraisis search corpus.db "quantum mechanics"
python -m capraisis search corpus.db "neural network" --year 2024
# Show statistics
python -m capraisis stats corpus.db
# Benchmark
python -m capraisis benchmark corpus.db
Building Large Indices (e.g., DataCite)
For large datasets (millions of records), use the IndexBuilder:
from capraisis import IndexBuilder
builder = IndexBuilder(
    "datacite.db",
    batch_size=100_000,           # Commit every 100K records
    progress_interval=1_000_000   # Report every 1M records
)
# Build from DataCite JSONL files
stats = builder.add_jsonl_files(
    "/path/to/DataCite/**/*.jsonl",  # Adjust to your extraction path
    resume=True                      # Skip already processed files
)
print(f"Indexed {stats['total_records']:,} records in {stats['elapsed_hours']:.2f} hours")
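The `batch_size` idea (commit once per batch rather than once per record) can be sketched with plain `sqlite3`. This is an assumption about the general technique, not a description of how `IndexBuilder` is implemented; `batched`, `bulk_load`, and the three-column `records` table are hypothetical names chosen for the example.

```python
import sqlite3
from itertools import islice


def batched(iterable, size):
    """Yield lists of up to `size` items from any iterable."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def bulk_load(conn, rows, batch_size=100_000):
    """Insert (id, title, content) rows, committing once per batch."""
    total = 0
    for batch in batched(rows, batch_size):
        conn.executemany(
            "INSERT INTO records (id, title, content) VALUES (?, ?, ?)", batch
        )
        conn.commit()  # one transaction per batch, not per row
        total += len(batch)
    return total


conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE records USING fts5(id UNINDEXED, title, content)")
# A generator keeps memory use flat: only one batch is materialized at a time.
rows = ((f"10.1234/{i}", f"Title {i}", f"Content {i}") for i in range(250))
loaded = bulk_load(conn, rows, batch_size=100)
print(loaded)
```

Larger batches mean fewer commits, at the cost of redoing more work if a build is interrupted mid-batch.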
Obtaining DataCite Data
The DataCite Public Data File contains metadata for 70M+ DOIs. Download options:
| Source | URL |
|---|---|
| Official Repository | https://datafiles.datacite.org/ |
| Internet Archive | Search for "DataCite Public Data File" |
| DataCite API | https://support.datacite.org/docs/api |
Documentation: https://support.datacite.org/docs/datacite-public-data-file
File Format:
- Large .tar file containing compressed NDJSON (newline-delimited JSON)
- Each line = one bibliographic record with DOI, title, description, year, etc.
- Uncompressed size: ~350 GB
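Reading NDJSON is a matter of parsing one JSON object per line. The sketch below assumes gzip compression (the text above does not say which compression is used) and uses made-up `id` and `year` keys purely for the demo; it does not reflect the actual DataCite record schema. `iter_ndjson` is a hypothetical helper.

```python
import gzip
import io
import json


def iter_ndjson(stream):
    """Yield one dict per non-empty line of a text stream of NDJSON."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            # Skip malformed lines rather than aborting a long build.
            print(f"skipping malformed line {line_number}")


# Self-contained demo: compress two records, a blank line and a bad line
# in memory, then stream them back.
raw = (
    b'{"id": "10.1234/a", "year": 2023}\n'
    b"\n"
    b"not json\n"
    b'{"id": "10.1234/b", "year": 2024}\n'
)
compressed = io.BytesIO(gzip.compress(raw))
with gzip.open(compressed, mode="rt", encoding="utf-8") as stream:
    records = list(iter_ndjson(stream))
print([r["id"] for r in records])
```

Because the reader is a generator, a file far larger than memory can be processed one line at a time.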
Processing:
# Extract the tar file
tar -xf DataCite_Public_Data_File_2024.tar -C /path/to/extraction/
# Then build with CapraISIS
python -m capraisis build "/path/to/extraction/**/*.jsonl" --output datacite.db
Usage: Subject to DataCite Data File Use Policy.
Custom Record Extractors
Define your own extractor for non-DataCite formats:
from capraisis import IndexBuilder
def my_extractor(rec: dict):
    """Extract fields from your JSON format."""
    return (
        rec.get('identifier'),        # id
        rec.get('name'),              # title
        rec.get('abstract', ''),      # content
        str(rec.get('year', '')),     # year
        rec.get('source', '')         # prefix
    )
builder = IndexBuilder("my_index.db")
builder.add_jsonl_files("my_data/*.jsonl", extractor=my_extractor)
FTS5 Query Syntax
CapraISIS supports the full SQLite FTS5 query syntax:
| Query | Meaning |
|---|---|
| quantum mechanics | Both terms (implicit AND) |
| quantum OR mechanics | Either term |
| quantum NOT classical | Exclude term |
| "quantum mechanics" | Exact phrase |
| quant* | Prefix match |
| title:quantum | Field-specific |
| NEAR(quantum mechanics, 5) | Within 5 tokens |
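The query forms in the table can be tried directly against SQLite, independent of CapraISIS. This sketch uses a throwaway in-memory table named `docs` with three made-up rows; the `titles` helper is hypothetical, and the example assumes an SQLite build with FTS5 enabled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, content)")
conn.executemany(
    "INSERT INTO docs (title, content) VALUES (?, ?)",
    [
        ("Quantum mechanics primer", "An introduction to quantum theory"),
        ("Classical mechanics", "Newtonian motion, with a short quantum aside"),
        ("Quantization of fields", "Mechanics is not mentioned here at length"),
    ],
)


def titles(query):
    """Return the titles matching an FTS5 query, in insertion order."""
    rows = conn.execute(
        "SELECT title FROM docs WHERE docs MATCH ? ORDER BY rowid", (query,)
    )
    return [row[0] for row in rows]


# One query per row of the syntax table.
for query in [
    "quantum mechanics",
    "quantum OR mechanics",
    "quantum NOT classical",
    '"quantum mechanics"',
    "quant*",
    "title:quantum",
    "NEAR(quantum mechanics, 5)",
]:
    print(f"{query!r:32} -> {titles(query)}")
```

Note that the default tokenizer does no stemming, so `quantum` does not match "Quantization" while the prefix query `quant*` does.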
Performance
Benchmarks on 70M DataCite records (15GB index):
| Query | Results | Time |
|---|---|---|
| polysemanticity | 847 | 12ms |
| neural network | 2.3M | 45ms |
| climate change | 890K | 38ms |
Target: <100ms per query ✓
Scaling Proof: O(log k) Complexity
Search time remains nearly constant as corpus size increases 500×:
| Records | Avg Search (ms) | Build (s) |
|---|---|---|
| 1,000 | 0.39 | 0.0 |
| 10,000 | 0.36 | 0.1 |
| 100,000 | 0.29 | 0.7 |
| 500,000 | 0.26 | 3.8 |
These timings are consistent with O(log k) retrieval, the defining characteristic of inverted file indexing. Extrapolated to 70M records: ~0.4ms per search (an estimate from the trend above, not a measurement).
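As a rough sanity check on what logarithmic scaling would imply, the arithmetic below compares how much the corpus grows between the largest row of the table and a 70M-record corpus against how much a cost proportional to the logarithm of corpus size would grow. It is a back-of-the-envelope model under that one assumption, not a benchmark, and it ignores the cost of reading long posting lists for very common terms.

```python
import math

measured_size = 500_000      # largest corpus in the table above
target_size = 70_000_000     # DataCite-scale corpus

# How much bigger the corpus gets.
size_ratio = target_size / measured_size

# How much a cost proportional to log(corpus size) would grow.
log_ratio = math.log(target_size) / math.log(measured_size)

print(f"corpus grows {size_ratio:.0f}x")
print(f"a log-cost lookup grows about {log_ratio:.2f}x")
```

The ratio of logarithms is independent of the base, so the same figure holds for any B-tree fan-out.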
Architecture
CapraISIS implements CDS/ISIS principles using modern tools:
| CDS/ISIS (1985) | CapraISIS (2026) |
|---|---|
| Master File (.MST) | SQLite database |
| Inverted File (.IFP postings plus .N01/.N02, .L01/.L02, .CNT) | FTS5 virtual table |
| Cross-Reference (.XRF) | B-tree index |
| ISIS Pascal | Python + sqlite3 |
The fundamental insight: FTS5 IS an inverted file index with B-tree organization. We're not emulating CDS/ISIS — we're using its spiritual successor.
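One way to see the inverted file inside FTS5 is the `fts5vocab` virtual table, which exposes the index's term dictionary. The sketch below is independent of CapraISIS; the table names `docs` and `docs_terms` are arbitrary choices for the demo, and it assumes an SQLite build with FTS5 enabled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title)")
conn.executemany(
    "INSERT INTO docs (title) VALUES (?)",
    [("Quantum mechanics",), ("Classical mechanics",), ("Quantum field theory",)],
)

# fts5vocab exposes the term dictionary; 'row' mode gives one row per term.
conn.execute("CREATE VIRTUAL TABLE docs_terms USING fts5vocab('docs', 'row')")

# doc = number of rows containing the term, cnt = total occurrences.
vocab = conn.execute(
    "SELECT term, doc, cnt FROM docs_terms ORDER BY term"
).fetchall()
for term, doc, cnt in vocab:
    print(f"{term:10} docs={doc} occurrences={cnt}")
```

Each row is one dictionary entry with its document frequency, which is the same term-to-postings mapping that an inverted file maintains.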
History
CDS/ISIS was developed by UNESCO in 1985 as a text database system for libraries and documentation centres. It introduced several innovations:
- Inverted file indexing: O(log k) term lookup
- Variable-length fields: No fixed schema
- Boolean retrieval: AND, OR, NOT operators
- Repeatable fields: Multiple authors, subjects
These principles remain the foundation of modern search engines. CapraISIS honours this heritage while providing a modern, portable implementation.
Why Not Other Libraries?
Several Python full-text search libraries exist. Here's why CapraISIS chose SQLite FTS5:
| Library | Pros | Cons |
|---|---|---|
| Whoosh | Pure Python, feature-rich | No release since 2016, memory-heavy, no async |
| Elasticsearch | Powerful, scalable | Server-based, operational complexity, overkill for local use |
| Xapian | Fast, mature | C++ bindings, installation complexity |
| SQLite FTS5 | Zero-config, stdlib, single-file | Less feature-rich than Elasticsearch |
Decision rationale:
- Zero dependencies: CapraISIS uses only Python's built-in sqlite3 module
- Single-file portability: Copy one .db file, search anywhere
- Proven at scale: SQLite handles databases up to 281 TB
- Active maintenance: SQLite is actively developed (unlike Whoosh)
For small to medium corpora (<100M records), FTS5 delivers Elasticsearch-class performance with shell-script simplicity.
License
MIT License. Use at your own risk.
Citation
If you use CapraISIS in research, please cite:
@software{capraisis2026,
author = {Caprazli, Kafkas M.},
title = {CapraISIS: Modern CDS/ISIS Implementation},
year = {2026},
url = {https://github.com/capraCoder/capraisis}
}
Author ORCID: 0000-0002-5744-8944
See Also
- UNESCO CDS/ISIS - Original system (archived)
- SQLite FTS5 - Backend engine
- Caprazli, K. M. (2025). Achieving the Neural Frontier: The LLM Race for Scalable Retrieval. Zenodo. https://doi.org/10.5281/zenodo.18202850
Historical Reference: Original WinISIS
The original UNESCO CDS/ISIS software is preserved in the historical/ folder for educational purposes:
- Winisis1_4.zip: Original WinISIS 1.4 installer (UNESCO, ~1995)
- ctl3d.dll: Required Windows dependency
Copyright: CDS/ISIS is a UNESCO product, provided free for non-commercial use. UNESCO retains all intellectual property rights.
Note: WinISIS is 16-bit legacy software requiring emulation on modern Windows. CapraISIS provides the same inverted-file indexing principles using modern SQLite FTS5 — no emulation required.
"The path forward is the Neural-to-Symbolic Bridge: using the semantic brilliance of modern AI to populate the structural perfection of classical indexing."