
CaproneISIS

Industrial-Scale CDS/ISIS Implementation Using Elasticsearch

CaproneISIS is the "big brother" of capraISIS, designed for massive datasets (100M–10B records) using Elasticsearch as the backend.

Both packages implement UNESCO's CDS/ISIS (Computerised Documentation System / Integrated Set of Information Systems, 1985–2005) inverted file indexing principles — capraISIS for portable single-file use, caproneISIS for distributed enterprise scale.
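The core idea both packages inherit is the inverted file: a map from each term to the records that contain it. A minimal, illustrative sketch (real CDS/ISIS and Elasticsearch indexes also store field tags, positions, and term statistics):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of record ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {
    "rec1": "quantum mechanics and consciousness",
    "rec2": "classical mechanics revisited",
}
idx = build_inverted_index(docs)
print(sorted(idx["mechanics"]))  # appears in both records
print(sorted(idx["quantum"]))    # appears only in rec1
```

Lookup is then a set operation per query term, which is why inverted files scale from a 1985 mainframe to a modern cluster.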

Family Architecture

| Package | Backend | Scale | Use Case |
|---------|---------|-------|----------|
| capraISIS | SQLite FTS5 | <100M records | Research, portable, zero dependencies |
| caproneISIS | Elasticsearch | 100M–10B records | Enterprise, distributed, horizontally scalable |

Same API, different scale. Code written for capraISIS migrates to caproneISIS by changing one import.

Installation

pip install caproneisis

Requires: Elasticsearch 8.x cluster (local or remote)

Quick Start with Docker

# Start local Elasticsearch (for testing)
docker run -d --name es-test \
  -p 9200:9200 \
  -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  elasticsearch:8.11.0

Quick Start

Python API

from caproneisis import CaproneIndex

# Connect to local Elasticsearch
index = CaproneIndex("my_corpus")

# Add records (same API as capraISIS)
index.add(
    id="10.5281/zenodo.12345",
    title="Quantum Mechanics and Consciousness",
    content="A paper exploring the relationship between...",
    year=2024,
    prefix="10.5281"
)

# Bulk add
records = [
    ("10.1234/a", "Title A", "Content A", 2023, "10.1234"),
    ("10.1234/b", "Title B", "Content B", 2024, "10.1234"),
]
index.bulk_add(records)

# Search
results = index.search("quantum consciousness")
for r in results:
    print(f"{r['year']} | {r['id']} | {r['title']}")

# Statistics
stats = index.stats()
print(f"Total records: {stats['total_records']:,}")
print(f"Size: {stats['size_gb']:.2f} GB")
print(f"Shards: {stats['shards']}")

Production Cluster

from caproneisis import CaproneIndex

# Connect to production cluster with authentication
index = CaproneIndex(
    "production_corpus",
    hosts=["https://es1.example.com:9200", "https://es2.example.com:9200"],
    api_key="your-api-key",
    verify_certs=True,
    shards=10,      # More shards for larger clusters
    replicas=2      # Higher redundancy for production
)

Command Line

# Cluster management
python -m caproneisis cluster health
python -m caproneisis cluster indices

# Create index
python -m caproneisis create myindex --shards 5 --replicas 1

# Build index from JSONL files
python -m caproneisis build "data/*.jsonl" --index myindex

# Search
python -m caproneisis search myindex "quantum mechanics"
python -m caproneisis search myindex "neural network" --year 2024

# Interactive search
python -m caproneisis interactive myindex

# Statistics
python -m caproneisis stats myindex

# Benchmark
python -m caproneisis benchmark myindex

Large-Scale Indexing

For massive datasets (millions of records), use the IndexBuilder:

from caproneisis import IndexBuilder

builder = IndexBuilder(
    "datacite",
    hosts=["http://localhost:9200"],
    batch_size=5000,         # Optimal for Elasticsearch
    thread_count=4,          # Parallel ingestion threads
    shards=10,               # Scale horizontally
    replicas=1
)

# Build from DataCite JSONL files
stats = builder.add_jsonl_files(
    "/path/to/DataCite/**/*.jsonl",
    resume=True              # Skip already processed files
)

print(f"Indexed {stats['total_records']:,} records")
print(f"Rate: {stats['rate_per_second']:,.0f} records/second")
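A minimal sketch of the JSONL input format, assuming one JSON object per line with the same fields as the add() call shown earlier (id, title, content, year, prefix) — the exact schema expected by add_jsonl_files is an assumption here:

```python
import json
import os
import tempfile

# Hypothetical records mirroring the add() fields from the Quick Start.
records = [
    {"id": "10.1234/a", "title": "Title A", "content": "Content A",
     "year": 2023, "prefix": "10.1234"},
    {"id": "10.1234/b", "title": "Title B", "content": "Content B",
     "year": 2024, "prefix": "10.1234"},
]

path = os.path.join(tempfile.mkdtemp(), "batch-0001.jsonl")
with open(path, "w", encoding="utf-8") as fh:
    for rec in records:
        fh.write(json.dumps(rec) + "\n")  # one JSON object per line

# Round-trip check: each line parses back to one record.
with open(path, encoding="utf-8") as fh:
    loaded = [json.loads(line) for line in fh]
print(len(loaded), loaded[0]["id"])
```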

Cluster Management

from caproneisis import ClusterManager

manager = ClusterManager(hosts=["http://localhost:9200"])

# Check cluster health
health = manager.health()
print(f"Status: {health['status']}")

# List indices
for idx in manager.indices():
    print(f"{idx['name']}: {idx['docs_count']:,} docs, {idx['size']}")

# Optimize index for queries
manager.optimize("myindex", max_segments=1)

# Create alias for zero-downtime reindexing
manager.alias("myindex_v2", "myindex")
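Under the hood, zero-downtime reindexing rests on Elasticsearch's atomic alias swap: both the remove and the add execute in a single _aliases request, so readers of the alias never see an intermediate state. A sketch of that actions payload (how ClusterManager implements it internally is an assumption):

```python
def alias_swap_actions(alias, old_index, new_index):
    """Build the atomic actions payload for Elasticsearch's _aliases API.

    Both actions apply in one request, so queries against `alias`
    switch from old_index to new_index with no gap.
    """
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }

payload = alias_swap_actions("myindex", "myindex_v1", "myindex_v2")
```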

Search Syntax

CaproneISIS supports Elasticsearch query string syntax:

| Query | Meaning |
|-------|---------|
| quantum mechanics | Both terms (AND) |
| quantum OR mechanics | Either term |
| quantum -classical | Exclude term |
| "quantum mechanics" | Exact phrase |
| quant* | Prefix match |
| title:quantum | Field-specific |
| year:[2020 TO 2024] | Range query |
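These queries pass through to Elasticsearch's query_string query. A sketch of the request body such a search translates to (the field names are assumptions based on the record schema above; field-prefixed terms like year:[…] override the default fields):

```python
def query_string_body(query, size=10):
    """Wrap a user query in an Elasticsearch query_string request body."""
    return {
        "query": {
            "query_string": {
                "query": query,
                "fields": ["title", "content"],  # assumed default search fields
            }
        },
        "size": size,
    }

body = query_string_body("title:quantum AND year:[2020 TO 2024]")
```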

Performance

Designed for enterprise scale:

| Metric | Capability |
|--------|------------|
| Records | 100M–10B |
| Ingestion | 10,000+ records/second |
| Search | <100ms at billion scale |
| Shards | Configurable (5 default) |
| Replicas | Configurable (1 default) |
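The ingestion figure gives a back-of-envelope bulk-load time. At the quoted sustained rate, 100M records take roughly 3 hours and 1B roughly a day (real runs vary with mapping complexity, refresh settings, and thread count):

```python
def ingest_hours(n_records, rate_per_second=10_000):
    """Rough time to bulk-load n_records at a sustained ingestion rate."""
    return n_records / rate_per_second / 3600

print(f"{ingest_hours(100_000_000):.1f} h")    # 100M records ≈ 2.8 h
print(f"{ingest_hours(1_000_000_000):.1f} h")  # 1B records ≈ 27.8 h
```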

Scaling Guidelines

| Records | Recommended Shards | Notes |
|---------|--------------------|-------|
| <10M | 1–3 | Single node sufficient |
| 10M–100M | 5–10 | Small cluster |
| 100M–1B | 10–50 | Multi-node cluster |
| >1B | 50–100+ | Large cluster, consider aliases |

capraISIS or caproneISIS?

Use capraISIS when:

  • You need zero dependencies (stdlib only)
  • Portability matters (single .db file)
  • Records <100M
  • Local/research use

Use caproneISIS when:

  • You need horizontal scaling
  • Records >100M
  • Production/enterprise deployment
  • You already have Elasticsearch infrastructure

Migration from capraISIS

The APIs are compatible. Migration is straightforward:

# Before (capraISIS)
from capraisis import CapraIndex
index = CapraIndex("corpus.db")

# After (caproneISIS)
from caproneisis import CaproneIndex
index = CaproneIndex("corpus")  # Now an ES index name

Most code works unchanged. Key differences:

  • File path → Index name
  • Single file → Cluster connection
  • Automatic refresh → Configurable refresh

License

MIT License. Use at your own risk.

Citation

If you use CaproneISIS in research, please cite:

@software{caproneisis2026,
  author = {Caprazli, Kafkas M.},
  title = {CaproneISIS: Industrial-Scale CDS/ISIS Implementation},
  year = {2026},
  url = {https://github.com/capraCoder/caproneisis}
}

Author ORCID: 0000-0002-5744-8944

"Scale the mountain, honour the heritage."
