
CaproneISIS

Industrial-Scale CDS/ISIS Implementation Using Elasticsearch

CaproneISIS is the "big brother" of capraISIS, designed for massive datasets (100M–10B records) using Elasticsearch as the backend.

Both packages implement the inverted-file indexing principles of UNESCO's CDS/ISIS (Computerised Documentation Service / Integrated Set of Information Systems, 1985–2005): capraISIS for portable single-file use, caproneISIS for distributed enterprise scale.

Family Architecture

| Package | Backend | Scale | Use Case |
|---|---|---|---|
| capraISIS | SQLite FTS5 | <100M records | Research, portable, zero dependencies |
| caproneISIS | Elasticsearch | 100M–10B records | Enterprise, distributed, horizontally scalable |

Same API, different scale. Code written for capraISIS migrates to caproneISIS by changing one import.

Installation

pip install caproneisis

Requires: Elasticsearch 8.x cluster (local or remote)

Quick Start with Docker

# Start local Elasticsearch (for testing)
docker run -d --name es-test \
  -p 9200:9200 \
  -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  elasticsearch:8.11.0

Quick Start

Python API

from caproneisis import CaproneIndex

# Connect to local Elasticsearch
index = CaproneIndex("my_corpus")

# Add records (same API as capraISIS)
index.add(
    id="10.5281/zenodo.12345",
    title="Quantum Mechanics and Consciousness",
    content="A paper exploring the relationship between...",
    year=2024,
    prefix="10.5281"
)

# Bulk add
records = [
    ("10.1234/a", "Title A", "Content A", 2023, "10.1234"),
    ("10.1234/b", "Title B", "Content B", 2024, "10.1234"),
]
index.bulk_add(records)

# Search
results = index.search("quantum consciousness")
for r in results:
    print(f"{r['year']} | {r['id']} | {r['title']}")

# Statistics
stats = index.stats()
print(f"Total records: {stats['total_records']:,}")
print(f"Size: {stats['size_gb']:.2f} GB")
print(f"Shards: {stats['shards']}")

Production Cluster

from caproneisis import CaproneIndex

# Connect to production cluster with authentication
index = CaproneIndex(
    "production_corpus",
    hosts=["https://es1.example.com:9200", "https://es2.example.com:9200"],
    api_key="your-api-key",
    verify_certs=True,
    shards=10,      # More shards for larger clusters
    replicas=2      # Higher redundancy for production
)

Command Line

# Cluster management
python -m caproneisis cluster health
python -m caproneisis cluster indices

# Create index
python -m caproneisis create myindex --shards 5 --replicas 1

# Build index from JSONL files
python -m caproneisis build "data/*.jsonl" --index myindex

# Search
python -m caproneisis search myindex "quantum mechanics"
python -m caproneisis search myindex "neural network" --year 2024

# Interactive search
python -m caproneisis interactive myindex

# Statistics
python -m caproneisis stats myindex

# Benchmark
python -m caproneisis benchmark myindex

Large-Scale Indexing

For massive datasets (millions of records), use the IndexBuilder:

from caproneisis import IndexBuilder

builder = IndexBuilder(
    "datacite",
    hosts=["http://localhost:9200"],
    batch_size=5000,         # Optimal for Elasticsearch
    thread_count=4,          # Parallel ingestion threads
    shards=10,               # Scale horizontally
    replicas=1
)

# Build from DataCite JSONL files
stats = builder.add_jsonl_files(
    "/path/to/DataCite/**/*.jsonl",
    resume=True              # Skip already processed files
)

print(f"Indexed {stats['total_records']:,} records")
print(f"Rate: {stats['rate_per_second']:,.0f} records/second")
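The JSONL input format is not specified above; a reasonable assumption is one JSON object per line with the same fields that index.add() takes (id, title, content, year, prefix). The sketch below writes and reads a file in that hypothetical shape:

```python
import json

# Hypothetical record shape, mirroring the keyword arguments of index.add();
# the exact JSONL schema caproneISIS expects is an assumption here.
records = [
    {"id": "10.1234/a", "title": "Title A", "content": "Content A",
     "year": 2023, "prefix": "10.1234"},
    {"id": "10.1234/b", "title": "Title B", "content": "Content B",
     "year": 2024, "prefix": "10.1234"},
]

# One JSON object per line -- the standard JSONL convention
with open("sample.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Each line round-trips back to an identical dict
with open("sample.jsonl", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
```

Files in this shape could then be passed to builder.add_jsonl_files() via a glob pattern as shown above.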

Cluster Management

from caproneisis import ClusterManager

manager = ClusterManager(hosts=["http://localhost:9200"])

# Check cluster health
health = manager.health()
print(f"Status: {health['status']}")

# List indices
for idx in manager.indices():
    print(f"{idx['name']}: {idx['docs_count']:,} docs, {idx['size']}")

# Optimize index for queries
manager.optimize("myindex", max_segments=1)

# Create alias for zero-downtime reindexing
manager.alias("myindex_v2", "myindex")
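Zero-downtime reindexing works because Elasticsearch applies alias changes atomically. How ClusterManager.alias() is implemented internally is not documented here, but the standard mechanism is a single `_aliases` request whose add/remove actions take effect together. A sketch of building that request body:

```python
def alias_swap_actions(new_index, alias, old_index=None):
    """Build an atomic alias-swap body for the Elasticsearch _aliases API.

    Removing the alias from the old index and adding it to the new one in a
    single request means searches against the alias never see a gap.
    """
    actions = []
    if old_index:
        actions.append({"remove": {"index": old_index, "alias": alias}})
    actions.append({"add": {"index": new_index, "alias": alias}})
    return {"actions": actions}

# Point the "myindex" alias at myindex_v2, dropping it from myindex_v1
body = alias_swap_actions("myindex_v2", "myindex", old_index="myindex_v1")
```

POSTing this body to `_aliases` (e.g. via the official elasticsearch-py client's `indices.update_aliases`) performs the swap in one step.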

Search Syntax

CaproneISIS supports Elasticsearch query string syntax:

| Query | Meaning |
|---|---|
| quantum mechanics | Both terms (AND) |
| quantum OR mechanics | Either term |
| quantum -classical | Exclude term |
| "quantum mechanics" | Exact phrase |
| quant* | Prefix match |
| title:quantum | Field-specific |
| year:[2020 TO 2024] | Range query |
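This syntax corresponds to Elasticsearch's query_string query type. Whether caproneISIS issues exactly this query internally is an assumption, but the sketch below shows the standard request body such a query string maps onto (default_operator AND matches the table's "both terms" behaviour):

```python
def query_string_body(query, size=10):
    """Build a standard Elasticsearch query_string request body.

    Assumption: caproneISIS translates search() calls into something like
    this; the query_string syntax itself is documented Elasticsearch DSL.
    """
    return {
        "query": {
            "query_string": {
                "query": query,
                "default_operator": "AND",  # bare terms are ANDed together
            }
        },
        "size": size,
    }

# Field-specific term combined with a range query, as in the table above
body = query_string_body("title:quantum AND year:[2020 TO 2024]")
```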

Performance

Designed for enterprise scale:

| Metric | Capability |
|---|---|
| Records | 100M–10B |
| Ingestion | 10,000+ records/second |
| Search | <100ms at billion scale |
| Shards | Configurable (5 default) |
| Replicas | Configurable (1 default) |

Scaling Guidelines

| Records | Recommended Shards | Notes |
|---|---|---|
| <10M | 1–3 | Single node sufficient |
| 10M–100M | 5–10 | Small cluster |
| 100M–1B | 10–50 | Multi-node cluster |
| >1B | 50–100+ | Large cluster, consider aliases |
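The table can be read as a simple heuristic; the helper below encodes the upper end of each band (the function name and exact cut-offs are illustrative, not part of the caproneISIS API):

```python
def suggested_shards(n_records):
    """Illustrative shard-count heuristic from the scaling table above.

    Returns the upper end of each recommended band; real deployments
    should also weigh node count, shard size, and query patterns.
    """
    if n_records < 10_000_000:        # <10M: single node sufficient
        return 3
    if n_records < 100_000_000:       # 10M-100M: small cluster
        return 10
    if n_records < 1_000_000_000:     # 100M-1B: multi-node cluster
        return 50
    return 100                        # >1B: large cluster
```

The result could be passed as the `shards` parameter shown in the IndexBuilder and CaproneIndex examples above.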

capraISIS or caproneISIS?

Use capraISIS when:

  • You need zero dependencies (stdlib only)
  • Portability matters (single .db file)
  • Records <100M
  • Local/research use

Use caproneISIS when:

  • You need horizontal scaling
  • Records >100M
  • Production/enterprise deployment
  • You already have Elasticsearch infrastructure

Migration from capraISIS

The APIs are compatible. Migration is straightforward:

# Before (capraISIS)
from capraisis import CapraIndex
index = CapraIndex("corpus.db")

# After (caproneISIS)
from caproneisis import CaproneIndex
index = CaproneIndex("corpus")  # Now an ES index name

Most code works unchanged. Key differences:

  • File path → Index name
  • Single file → Cluster connection
  • Automatic refresh → Configurable refresh

License

MIT License. Use at your own risk.

Citation

If you use CaproneISIS in research, please cite:

@software{caproneisis2026,
  author = {Caprazli, Kafkas M.},
  title = {CaproneISIS: Industrial-Scale CDS/ISIS Implementation},
  year = {2026},
  url = {https://github.com/capraCoder/caproneisis}
}

Author ORCID: 0000-0002-5744-8944

See Also


"Scale the mountain, honour the heritage."

Download files

Download the file for your platform.

Source Distribution

caproneisis-0.1.0.tar.gz (20.4 kB)

Uploaded Source

Built Distribution


caproneisis-0.1.0-py3-none-any.whl (20.8 kB)

Uploaded Python 3

File details

Details for the file caproneisis-0.1.0.tar.gz.

File metadata

  • Download URL: caproneisis-0.1.0.tar.gz
  • Upload date:
  • Size: 20.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for caproneisis-0.1.0.tar.gz

| Algorithm | Hash digest |
|---|---|
| SHA256 | 5ef561e46924ca233c7b44e935b8580bfb8267fed092a2c986c682b27e23a0fd |
| MD5 | b61ed44a82b2237fa083749034759705 |
| BLAKE2b-256 | b2b2846ca1eacafb47f48905eb0f871083a29ecafef33103db26bd0f774e6072 |
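A downloaded file can be checked against the published SHA-256 digest with the standard library alone; the file path below is illustrative:

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 without loading it all into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Compare against the published digest from the table above, e.g.:
# sha256_of("caproneisis-0.1.0.tar.gz") == "5ef561e4...e23a0fd"
```

A mismatch means the download is corrupt or has been tampered with and should not be installed.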


File details

Details for the file caproneisis-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: caproneisis-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 20.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for caproneisis-0.1.0-py3-none-any.whl

| Algorithm | Hash digest |
|---|---|
| SHA256 | ec1fd4d3c74f4526a0b7cd365cf17a151d6997a5998bd45af46f2804834f197c |
| MD5 | dc6fc361d48430b5b0c1c99771951a77 |
| BLAKE2b-256 | df73d303ebf462228132abfa8a2fb83d52abde3fe152643d51215d8629501cd7 |

