
CaproneISIS

Industrial-Scale CDS/ISIS Implementation Using Elasticsearch

CaproneISIS is the "big brother" of capraISIS, designed for massive datasets (100M–10B records) using Elasticsearch as the backend.

Both packages implement the inverted-file indexing principles of UNESCO's CDS/ISIS (Computerised Documentation Service / Integrated Set of Information Systems, 1985–2005): capraISIS for portable single-file use, caproneISIS for distributed enterprise scale.

Family Architecture

| Package | Backend | Scale | Use Case |
|---|---|---|---|
| capraISIS | SQLite FTS5 | <100M records | Research, portable, zero dependencies |
| caproneISIS | Elasticsearch | 100M–10B records | Enterprise, distributed, horizontally scalable |

Same API, different scale. Code written for capraISIS migrates to caproneISIS by changing one import.

Installation

pip install caproneisis

Requires: Elasticsearch 8.x cluster (local or remote)

Quick Start with Docker

# Start local Elasticsearch (for testing)
docker run -d --name es-test \
  -p 9200:9200 \
  -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  elasticsearch:8.11.0

Quick Start

Python API

from caproneisis import CaproneIndex

# Connect to local Elasticsearch
index = CaproneIndex("my_corpus")

# Add records (same API as capraISIS)
index.add(
    id="10.5281/zenodo.12345",
    title="Quantum Mechanics and Consciousness",
    content="A paper exploring the relationship between...",
    year=2024,
    prefix="10.5281"
)

# Bulk add
records = [
    ("10.1234/a", "Title A", "Content A", 2023, "10.1234"),
    ("10.1234/b", "Title B", "Content B", 2024, "10.1234"),
]
index.bulk_add(records)

# Search
results = index.search("quantum consciousness")
for r in results:
    print(f"{r['year']} | {r['id']} | {r['title']}")

# Statistics
stats = index.stats()
print(f"Total records: {stats['total_records']:,}")
print(f"Size: {stats['size_gb']:.2f} GB")
print(f"Shards: {stats['shards']}")

Production Cluster

from caproneisis import CaproneIndex

# Connect to production cluster with authentication
index = CaproneIndex(
    "production_corpus",
    hosts=["https://es1.example.com:9200", "https://es2.example.com:9200"],
    api_key="your-api-key",
    verify_certs=True,
    shards=10,      # More shards for larger clusters
    replicas=2      # Higher redundancy for production
)

Command Line

# Cluster management
python -m caproneisis cluster health
python -m caproneisis cluster indices

# Create index
python -m caproneisis create myindex --shards 5 --replicas 1

# Build index from JSONL files
python -m caproneisis build "data/*.jsonl" --index myindex

# Search
python -m caproneisis search myindex "quantum mechanics"
python -m caproneisis search myindex "neural network" --year 2024

# Interactive search
python -m caproneisis interactive myindex

# Statistics
python -m caproneisis stats myindex

# Benchmark
python -m caproneisis benchmark myindex

Large-Scale Indexing

For massive datasets (millions of records), use the IndexBuilder:

from caproneisis import IndexBuilder

builder = IndexBuilder(
    "datacite",
    hosts=["http://localhost:9200"],
    batch_size=5000,         # Optimal for Elasticsearch
    thread_count=4,          # Parallel ingestion threads
    shards=10,               # Scale horizontally
    replicas=1
)

# Build from DataCite JSONL files
stats = builder.add_jsonl_files(
    "/path/to/DataCite/**/*.jsonl",
    resume=True              # Skip already processed files
)

print(f"Indexed {stats['total_records']:,} records")
print(f"Rate: {stats['rate_per_second']:,.0f} records/second")
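The JSONL input format is not specified above; a reasonable assumption is one JSON object per line with the same fields that index.add() takes (id, title, content, year, prefix). The sketch below writes and reads a file in that hypothetical shape:

```python
import json

# Hypothetical record shape, mirroring the keyword arguments of index.add();
# the exact JSONL schema caproneISIS expects is an assumption here.
records = [
    {"id": "10.1234/a", "title": "Title A", "content": "Content A",
     "year": 2023, "prefix": "10.1234"},
    {"id": "10.1234/b", "title": "Title B", "content": "Content B",
     "year": 2024, "prefix": "10.1234"},
]

# One JSON object per line -- the standard JSONL convention
with open("sample.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Each line round-trips back to an identical dict
with open("sample.jsonl", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
```

Files in this shape could then be passed to builder.add_jsonl_files() via a glob pattern as shown above.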

Cluster Management

from caproneisis import ClusterManager

manager = ClusterManager(hosts=["http://localhost:9200"])

# Check cluster health
health = manager.health()
print(f"Status: {health['status']}")

# List indices
for idx in manager.indices():
    print(f"{idx['name']}: {idx['docs_count']:,} docs, {idx['size']}")

# Optimize index for queries
manager.optimize("myindex", max_segments=1)

# Create alias for zero-downtime reindexing
manager.alias("myindex_v2", "myindex")
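Zero-downtime reindexing works because Elasticsearch applies alias changes atomically. How ClusterManager.alias() is implemented internally is not documented here, but the standard mechanism is a single `_aliases` request whose add/remove actions take effect together. A sketch of building that request body:

```python
def alias_swap_actions(new_index, alias, old_index=None):
    """Build an atomic alias-swap body for the Elasticsearch _aliases API.

    Removing the alias from the old index and adding it to the new one in a
    single request means searches against the alias never see a gap.
    """
    actions = []
    if old_index:
        actions.append({"remove": {"index": old_index, "alias": alias}})
    actions.append({"add": {"index": new_index, "alias": alias}})
    return {"actions": actions}

# Point the "myindex" alias at myindex_v2, dropping it from myindex_v1
body = alias_swap_actions("myindex_v2", "myindex", old_index="myindex_v1")
```

POSTing this body to `_aliases` (e.g. via the official elasticsearch-py client's `indices.update_aliases`) performs the swap in one step.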

Search Syntax

CaproneISIS supports Elasticsearch query string syntax:

| Query | Meaning |
|---|---|
| quantum mechanics | Both terms (AND) |
| quantum OR mechanics | Either term |
| quantum -classical | Exclude term |
| "quantum mechanics" | Exact phrase |
| quant* | Prefix match |
| title:quantum | Field-specific |
| year:[2020 TO 2024] | Range query |
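This syntax corresponds to Elasticsearch's query_string query type. Whether caproneISIS issues exactly this query internally is an assumption, but the sketch below shows the standard request body such a query string maps onto (default_operator AND matches the table's "both terms" behaviour):

```python
def query_string_body(query, size=10):
    """Build a standard Elasticsearch query_string request body.

    Assumption: caproneISIS translates search() calls into something like
    this; the query_string syntax itself is documented Elasticsearch DSL.
    """
    return {
        "query": {
            "query_string": {
                "query": query,
                "default_operator": "AND",  # bare terms are ANDed together
            }
        },
        "size": size,
    }

# Field-specific term combined with a range query, as in the table above
body = query_string_body("title:quantum AND year:[2020 TO 2024]")
```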

Performance

Designed for enterprise scale:

| Metric | Capability |
|---|---|
| Records | 100M–10B |
| Ingestion | 10,000+ records/second |
| Search | <100ms at billion scale |
| Shards | Configurable (5 default) |
| Replicas | Configurable (1 default) |

Scaling Guidelines

| Records | Recommended Shards | Notes |
|---|---|---|
| <10M | 1–3 | Single node sufficient |
| 10M–100M | 5–10 | Small cluster |
| 100M–1B | 10–50 | Multi-node cluster |
| >1B | 50–100+ | Large cluster, consider aliases |
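The table can be read as a simple heuristic; the helper below encodes the upper end of each band (the function name and exact cut-offs are illustrative, not part of the caproneISIS API):

```python
def suggested_shards(n_records):
    """Illustrative shard-count heuristic from the scaling table above.

    Returns the upper end of each recommended band; real deployments
    should also weigh node count, shard size, and query patterns.
    """
    if n_records < 10_000_000:        # <10M: single node sufficient
        return 3
    if n_records < 100_000_000:       # 10M-100M: small cluster
        return 10
    if n_records < 1_000_000_000:     # 100M-1B: multi-node cluster
        return 50
    return 100                        # >1B: large cluster
```

The result could be passed as the `shards` parameter shown in the IndexBuilder and CaproneIndex examples above.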

capraISIS or caproneISIS?

Use capraISIS when:

  • You need zero dependencies (stdlib only)
  • Portability matters (single .db file)
  • Records <100M
  • Local/research use

Use caproneISIS when:

  • You need horizontal scaling
  • Records >100M
  • Production/enterprise deployment
  • You already have Elasticsearch infrastructure

Migration from capraISIS

The APIs are compatible. Migration is straightforward:

# Before (capraISIS)
from capraisis import CapraIndex
index = CapraIndex("corpus.db")

# After (caproneISIS)
from caproneisis import CaproneIndex
index = CaproneIndex("corpus")  # Now an ES index name

Most code works unchanged. Key differences:

  • File path → Index name
  • Single file → Cluster connection
  • Automatic refresh → Configurable refresh

License

MIT License. Use at your own risk.

Citation

If you use CaproneISIS in research, please cite:

@software{caproneisis2026,
  author = {Caprazli, Kafkas M.},
  title = {CaproneISIS: Industrial-Scale CDS/ISIS Implementation},
  year = {2026},
  url = {https://github.com/capraCoder/caproneisis}
}

Author ORCID: 0000-0002-5744-8944

See Also


"Scale the mountain, honour the heritage."

Download files

Download the file for your platform.

Source Distribution

caproneisis-0.1.0.tar.gz (20.4 kB)

Uploaded Source

Built Distribution


caproneisis-0.1.0-py3-none-any.whl (20.8 kB)

Uploaded Python 3

File details

Details for the file caproneisis-0.1.0.tar.gz.

File metadata

  • Download URL: caproneisis-0.1.0.tar.gz
  • Upload date:
  • Size: 20.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for caproneisis-0.1.0.tar.gz

| Algorithm | Hash digest |
|---|---|
| SHA256 | 5ef561e46924ca233c7b44e935b8580bfb8267fed092a2c986c682b27e23a0fd |
| MD5 | b61ed44a82b2237fa083749034759705 |
| BLAKE2b-256 | b2b2846ca1eacafb47f48905eb0f871083a29ecafef33103db26bd0f774e6072 |
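A downloaded file can be checked against the published SHA-256 digest with the standard library alone; the file path below is illustrative:

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 without loading it all into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Compare against the published digest from the table above, e.g.:
# sha256_of("caproneisis-0.1.0.tar.gz") == "5ef561e4...e23a0fd"
```

A mismatch means the download is corrupt or has been tampered with and should not be installed.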


File details

Details for the file caproneisis-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: caproneisis-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 20.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for caproneisis-0.1.0-py3-none-any.whl

| Algorithm | Hash digest |
|---|---|
| SHA256 | ec1fd4d3c74f4526a0b7cd365cf17a151d6997a5998bd45af46f2804834f197c |
| MD5 | dc6fc361d48430b5b0c1c99771951a77 |
| BLAKE2b-256 | df73d303ebf462228132abfa8a2fb83d52abde3fe152643d51215d8629501cd7 |

