# CaproneISIS

**Industrial-Scale CDS/ISIS Implementation Using Elasticsearch**
CaproneISIS is the "big brother" of capraISIS, designed for massive datasets (100M–10B records) using Elasticsearch as the backend.
Both packages implement the inverted-file indexing principles of UNESCO's CDS/ISIS (Computerised Documentation Service / Integrated Set of Information Systems, 1985–2005): capraISIS for portable single-file use, caproneISIS for distributed enterprise scale.
## Family Architecture
| Package | Backend | Scale | Use Case |
|---|---|---|---|
| capraISIS | SQLite FTS5 | <100M records | Research, portable, zero dependencies |
| caproneISIS | Elasticsearch | 100M–10B records | Enterprise, distributed, horizontally scalable |
Same API, different scale. Code written for capraISIS migrates to caproneISIS by changing one import.
## Installation

```bash
pip install caproneisis
```

Requires an Elasticsearch 8.x cluster (local or remote).
## Quick Start with Docker

```bash
# Start a local single-node Elasticsearch (testing only; security disabled)
docker run -d --name es-test \
  -p 9200:9200 \
  -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  elasticsearch:8.11.0
```
## Quick Start

### Python API

```python
from caproneisis import CaproneIndex

# Connect to local Elasticsearch
index = CaproneIndex("my_corpus")

# Add records (same API as capraISIS)
index.add(
    id="10.5281/zenodo.12345",
    title="Quantum Mechanics and Consciousness",
    content="A paper exploring the relationship between...",
    year=2024,
    prefix="10.5281",
)

# Bulk add
records = [
    ("10.1234/a", "Title A", "Content A", 2023, "10.1234"),
    ("10.1234/b", "Title B", "Content B", 2024, "10.1234"),
]
index.bulk_add(records)

# Search
results = index.search("quantum consciousness")
for r in results:
    print(f"{r['year']} | {r['id']} | {r['title']}")

# Statistics
stats = index.stats()
print(f"Total records: {stats['total_records']:,}")
print(f"Size: {stats['size_gb']:.2f} GB")
print(f"Shards: {stats['shards']}")
```
## Production Cluster

```python
from caproneisis import CaproneIndex

# Connect to a production cluster with authentication
index = CaproneIndex(
    "production_corpus",
    hosts=["https://es1.example.com:9200", "https://es2.example.com:9200"],
    api_key="your-api-key",
    verify_certs=True,
    shards=10,   # more shards for larger clusters
    replicas=2,  # higher redundancy for production
)
```
## Command Line

```bash
# Cluster management
python -m caproneisis cluster health
python -m caproneisis cluster indices

# Create index
python -m caproneisis create myindex --shards 5 --replicas 1

# Build index from JSONL files
python -m caproneisis build "data/*.jsonl" --index myindex

# Search
python -m caproneisis search myindex "quantum mechanics"
python -m caproneisis search myindex "neural network" --year 2024

# Interactive search
python -m caproneisis interactive myindex

# Statistics
python -m caproneisis stats myindex

# Benchmark
python -m caproneisis benchmark myindex
```
## Large-Scale Indexing

For massive datasets (millions of records), use the `IndexBuilder`:

```python
from caproneisis import IndexBuilder

builder = IndexBuilder(
    "datacite",
    hosts=["http://localhost:9200"],
    batch_size=5000,   # bulk batch size tuned for Elasticsearch
    thread_count=4,    # parallel ingestion threads
    shards=10,         # scale horizontally
    replicas=1,
)

# Build from DataCite JSONL files
stats = builder.add_jsonl_files(
    "/path/to/DataCite/**/*.jsonl",
    resume=True,  # skip already-processed files
)
print(f"Indexed {stats['total_records']:,} records")
print(f"Rate: {stats['rate_per_second']:,.0f} records/second")
```
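The exact JSONL schema the builder expects is not spelled out above. A plausible shape, mirroring the fields used by `index.add()` in the Quick Start, is one JSON object per line (field names and sample values here are assumptions for illustration, not the package's documented format):

```python
import json

# One JSON object per line, mirroring the add() fields shown earlier
# (id, title, content, year, prefix) -- an assumed layout.
line = ('{"id": "10.1234/a", "title": "Title A", '
        '"content": "Content A", "year": 2023, "prefix": "10.1234"}')

record = json.loads(line)
assert record["year"] == 2023

def parse_jsonl(text):
    """Yield one dict per non-empty line of JSONL text."""
    for raw in text.splitlines():
        if raw.strip():
            yield json.loads(raw)

rows = list(parse_jsonl(line + "\n" + line.replace("/a", "/b")))
print(len(rows))  # 2 records parsed
```

With `resume=True`, files already ingested in a previous run are skipped, so an interrupted bulk load can be restarted safely.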
## Cluster Management

```python
from caproneisis import ClusterManager

manager = ClusterManager(hosts=["http://localhost:9200"])

# Check cluster health
health = manager.health()
print(f"Status: {health['status']}")

# List indices
for idx in manager.indices():
    print(f"{idx['name']}: {idx['docs_count']:,} docs, {idx['size']}")

# Optimize index for queries
manager.optimize("myindex", max_segments=1)

# Create alias for zero-downtime reindexing
manager.alias("myindex_v2", "myindex")
```
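Elasticsearch applies alias changes atomically through its `_aliases` actions endpoint, which is what makes zero-downtime reindexing work. A sketch of the request body such a swap boils down to (the wrapper's internals are an assumption; the add/remove action format is standard Elasticsearch):

```python
# Atomic alias swap: a single _aliases call both removes the alias from
# the old index and adds it to the new one, so searches against the
# alias never observe an intermediate state.
def alias_swap_body(alias, old_index, new_index):
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }

body = alias_swap_body("myindex", "myindex_v1", "myindex_v2")
# POST this body to /_aliases (e.g. via the official client's
# indices.update_aliases) to cut searches over to myindex_v2.
```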
## Search Syntax

CaproneISIS supports Elasticsearch query string syntax:

| Query | Meaning |
|---|---|
| `quantum mechanics` | Both terms (AND) |
| `quantum OR mechanics` | Either term |
| `quantum -classical` | Exclude term |
| `"quantum mechanics"` | Exact phrase |
| `quant*` | Prefix match |
| `title:quantum` | Field-specific |
| `year:[2020 TO 2024]` | Range query |
## Performance
Designed for enterprise scale:
| Metric | Capability |
|---|---|
| Records | 100M–10B |
| Ingestion | 10,000+ records/second |
| Search | <100ms at billion scale |
| Shards | Configurable (5 default) |
| Replicas | Configurable (1 default) |
## Scaling Guidelines
| Records | Recommended Shards | Notes |
|---|---|---|
| <10M | 1–3 | Single node sufficient |
| 10M–100M | 5–10 | Small cluster |
| 100M–1B | 10–50 | Multi-node cluster |
| >1B | 50–100+ | Large cluster, consider aliases |
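The guidelines above can be encoded as a small helper (thresholds taken directly from the table; the function is illustrative and not part of the package API):

```python
def recommended_shards(n_records):
    """Return a (low, high) shard-count range per the scaling table."""
    if n_records < 10_000_000:
        return (1, 3)      # single node sufficient
    if n_records < 100_000_000:
        return (5, 10)     # small cluster
    if n_records < 1_000_000_000:
        return (10, 50)    # multi-node cluster
    return (50, 100)       # large cluster; consider aliases

assert recommended_shards(5_000_000) == (1, 3)
assert recommended_shards(2_000_000_000) == (50, 100)
```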
## Why Not capraISIS?

Use capraISIS when:

- You need zero dependencies (stdlib only)
- Portability matters (single `.db` file)
- Records <100M
- Local/research use

Use caproneISIS when:

- You need horizontal scaling
- Records >100M
- Production/enterprise deployment
- You already have Elasticsearch infrastructure
## Migration from capraISIS

The APIs are compatible. Migration is straightforward:

```python
# Before (capraISIS)
from capraisis import CapraIndex
index = CapraIndex("corpus.db")

# After (caproneISIS)
from caproneisis import CaproneIndex
index = CaproneIndex("corpus")  # now an Elasticsearch index name
```

Most code works unchanged. Key differences:

- File path → index name
- Single file → cluster connection
- Automatic refresh → configurable refresh
## License
MIT License. Use at your own risk.
## Citation

If you use CaproneISIS in research, please cite:

```bibtex
@software{caproneisis2026,
  author = {Caprazli, Kafkas M.},
  title  = {CaproneISIS: Industrial-Scale CDS/ISIS Implementation},
  year   = {2026},
  url    = {https://github.com/capraCoder/caproneisis}
}
```

Author ORCID: 0000-0002-5744-8944
## See Also
- capraISIS — Little sibling (SQLite FTS5)
- Elasticsearch — Backend engine
- UNESCO CDS/ISIS — Original system (archived)
"Scale the mountain, honour the heritage."
## File details

### caproneisis-0.1.1.tar.gz

- Download URL: caproneisis-0.1.1.tar.gz
- Upload date:
- Size: 20.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7

| Algorithm | Hash digest |
|---|---|
| SHA256 | `469ccc4e3b78a273357fb4eee83a54b3f03dc5a99a632e2ec8d481b04b1f6c9e` |
| MD5 | `6a674735b7fa445581a27a49b1a2850e` |
| BLAKE2b-256 | `7f71b20d95cbcb198a39678abc8040b064f6f5647854c151cc09c6ee535e28fd` |
### caproneisis-0.1.1-py3-none-any.whl

- Download URL: caproneisis-0.1.1-py3-none-any.whl
- Upload date:
- Size: 20.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7

| Algorithm | Hash digest |
|---|---|
| SHA256 | `a7002460054160d21cb77ceb58a21460c083f40f979e1fd7de9cb4bed9cc222e` |
| MD5 | `0a823354e08c813ad9d18a1a7aa205b5` |
| BLAKE2b-256 | `82014c04afd0ee441ace9d7a26da7a873a7cd7bb0fdf950bc5f2e653ff66297c` |