Lightweight data processing toolkit - algorithms, utilities, and database helpers for ETL pipelines

oet-core

License: MIT · Code style: PEP 8

Lightweight data processing toolkit for Python

oet-core (Outer Element Taxonomy) is a minimal, pure-Python library for building data pipelines and ETL workflows without heavy dependencies. Search, transform, validate, and model your data using simple, readable implementations that prioritize portability over performance.

Originally developed to support the Outer Element Taxonomy research framework, this toolkit is designed for modular research workflows and rapid prototyping where simplicity and reproducibility matter more than raw performance. It aims to be reliable enough for everyday research computing, and it suits labs, experiments, and projects that want to avoid pandas/numpy dependencies.

Features

Data Access & Storage

  • Binary Search: Fast lookups on sorted lists and coordinate pairs
  • HashMap: Pure-Python hash table with automatic resizing and collision handling
  • SQLite Helpers: Simple wrappers for database operations - queries, bulk inserts, schema management
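
To illustrate the kind of lookup the Binary Search feature describes, here is a minimal sketch built on the standard library's `bisect` module. It is not oet-core's implementation: the helper name `find_index` and the choice to return -1 for a missing value are assumptions made for this example. Coordinate pairs work because Python compares tuples lexicographically, so the list must be sorted in that order.

```python
from bisect import bisect_left


def find_index(sorted_items, target):
    """Return the index of target in sorted_items, or -1 if it is absent.

    Hypothetical helper for illustration; oet-core's binary_search may
    behave differently for missing values.
    """
    i = bisect_left(sorted_items, target)  # leftmost insertion point
    if i < len(sorted_items) and sorted_items[i] == target:
        return i
    return -1


numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(find_index(numbers, 7))  # 6

# Tuples compare element by element, so sorted (x, y) pairs work too.
pairs = [(0, 1), (0, 5), (2, 3), (4, 0)]
print(find_index(pairs, (2, 3)))  # 2
```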

Data Transformation

  • Matrix: Operations for numerical data (transpose, get/set, generation)
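
As a rough picture of what pure-Python matrix operations look like, the sketch below stores a matrix as a list of rows and transposes it with `zip`. The function names `make_matrix` and `transpose` are hypothetical and are not part of oet-core's `Matrix` class.

```python
def make_matrix(rows, cols, fill=0):
    """Build a rows x cols matrix as a list of independent row lists."""
    return [[fill for _ in range(cols)] for _ in range(rows)]


def transpose(matrix):
    """Return a new matrix whose rows are the columns of the input."""
    return [list(column) for column in zip(*matrix)]


m = make_matrix(2, 3)
m[0][2] = 7          # set row 0, column 2
t = transpose(m)     # t has 3 rows and 2 columns
print(t[2][0])       # 7
```

Building each row with its own comprehension avoids the classic `[[0] * cols] * rows` aliasing bug, where every row is the same list object.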

Text Analysis (MinText)

  • Text: Tokenization, frequency analysis, entropy, sentiment, and vectorization
  • Corpus: Collection operations, vocabulary building, batch vectorization, and SQLite persistence
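
The sketch below shows one common way to compute the quantities named above: word-level Shannon entropy and a term-frequency vector over a shared vocabulary. It is an illustration under assumptions, not MinText's code. The tokenizer regex, the function names, and the choice of word-level (rather than character-level) entropy are all assumptions for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens (assumed rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def shannon_entropy(tokens):
    """Entropy in bits of the token frequency distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def vectorize(tokens, vocabulary):
    """Count how often each vocabulary term occurs in tokens."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


docs = ["I love this product!", "Terrible experience."]
token_lists = [tokenize(d) for d in docs]
vocabulary = sorted({tok for toks in token_lists for tok in toks})
vectors = [vectorize(toks, vocabulary) for toks in token_lists]
```

Two equally likely tokens give exactly 1 bit of entropy, and a text that repeats a single token gives 0 bits, which is a quick sanity check for any implementation.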

I/O & Validation

  • Text Validators: Validate JSON, YAML, and Markdown content before processing
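
Validating content before processing usually means attempting a parse and reporting where it failed. A minimal sketch for JSON using only the standard library follows; the function name `validate_json` and its `(ok, message)` return shape are assumptions for this example, not oet-core's validator API.

```python
import json


def validate_json(content):
    """Return (True, None) if content parses as JSON, else (False, message)."""
    try:
        json.loads(content)
    except json.JSONDecodeError as exc:
        # lineno, colno, and msg are attributes of JSONDecodeError.
        return False, f"line {exc.lineno}, column {exc.colno}: {exc.msg}"
    return True, None


print(validate_json('{"a": 1}'))  # (True, None)
ok, message = validate_json('{"a": }')
```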

Data Modeling

  • Graph Generation: Build NetworkX graphs programmatically for relationship modeling

Observability

  • Logging Helpers: Lightweight logger factory and inline logging with opt-in verbosity
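
A logger factory of this kind can be sketched with the standard `logging` module. The name `make_logger` and its parameters below are hypothetical, loosely modeled on the idea of an optional output stream and optional timestamps. oet-core's `get_logger` may be implemented differently.

```python
import logging
import sys
from io import StringIO


def make_logger(name, stream=None, timestamps=True, level=logging.INFO):
    """Create (or reuse) a named logger that writes to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.propagate = False  # keep records out of the root logger
    if not logger.handlers:   # avoid stacking handlers on repeated calls
        target = stream if stream is not None else sys.stderr
        handler = logging.StreamHandler(target)
        fmt = "%(levelname)s %(name)s: %(message)s"
        if timestamps:
            fmt = "%(asctime)s " + fmt
        handler.setFormatter(logging.Formatter(fmt))
        logger.addHandler(handler)
    return logger


buffer = StringIO()
demo = make_logger("demo-sketch", stream=buffer, timestamps=False)
demo.info("Pipeline started")
print(buffer.getvalue())  # INFO demo-sketch: Pipeline started
```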

Quality

  • Well Tested: 125+ tests covering algorithms, edge cases, and error handling

Quick Start

Installation

pip install oet-core

For local development with optional features:

git clone https://github.com/markusapplegate/oet-core.git
cd oet-core
pip install -e ".[dev,all]"   # quotes keep shells such as zsh from expanding the brackets

Basic Usage

from oet_core import binary_search, HashMap, Matrix, SQLiteHelper, Text, Corpus

# Binary search
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
index = binary_search(numbers, 7)  # Returns: 6

# HashMap
hmap = HashMap()
hmap.put("name", "Alice")
print(hmap.get("name"))  # Prints: Alice

# Matrix operations
matrix = Matrix(3, 3, fill=0)
matrix.set(0, 0, 1)
transposed = matrix.transpose()

# SQLite database operations
with SQLiteHelper(":memory:") as db:
    db.create_table("users", {"id": "INTEGER PRIMARY KEY", "name": "TEXT"})
    db.execute("INSERT INTO users VALUES (?, ?)", (1, "Alice"))
    users = db.fetch_all("SELECT * FROM users")
    print(users[0]["name"])  # Prints: Alice

# Text analysis (MinText)
text = Text("The quick brown fox jumps over the lazy dog.")
tokens = text.tokenize()  # ['the', 'quick', 'brown', 'fox', ...]
sentiment = text.sentiment()  # {'score': 0, 'positive': 0, 'negative': 0, ...}
entropy = text.entropy()  # 3.17 bits (text diversity measure)

# Corpus operations with persistence
corpus = Corpus()
corpus.add_from_string("I love this product!", metadata={"rating": 5})
corpus.add_from_string("Terrible experience.", metadata={"rating": 1})

vocab = corpus.vocabulary()  # Build shared vocabulary
vectors = corpus.vectorize_all()  # Convert to term-frequency matrix

with SQLiteHelper(":memory:") as db:
    corpus.save_to_db(db, table="reviews")  # Persist to SQLite
    loaded = Corpus.load_from_db(db, table="reviews")

# Logging utilities
from io import StringIO
from oet_core import get_logger, log, set_utils_verbose_logging, generate_matrix

buffer = StringIO()
logger = get_logger("demo", stream=buffer, timestamps=False)
logger.info("Pipeline started")

log("Inline status update", level="warning")

set_utils_verbose_logging(True)
generate_matrix(1, 1)
set_utils_verbose_logging(False)
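
For comparison, the SQLite steps in the usage example above (create a table, insert rows, fetch them as name-addressable records) can be written against the standard library's `sqlite3` module directly. This is a sketch of the underlying technique, not a description of how `SQLiteHelper` is implemented; the function name `demo_users` is hypothetical.

```python
import sqlite3
from contextlib import closing


def demo_users():
    """Create an in-memory table, bulk insert two rows, and return the names."""
    with closing(sqlite3.connect(":memory:")) as conn:
        conn.row_factory = sqlite3.Row  # rows support access by column name
        conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
        conn.executemany(
            "INSERT INTO users VALUES (?, ?)",  # placeholders avoid SQL injection
            [(1, "Alice"), (2, "Bob")],
        )
        conn.commit()
        rows = conn.execute("SELECT * FROM users ORDER BY id").fetchall()
        return [row["name"] for row in rows]


names = demo_users()
print(names)  # ['Alice', 'Bob']
```

`contextlib.closing` is used because a `sqlite3` connection's own context manager only commits or rolls back a transaction; it does not close the connection.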

Project Structure

oet-core/
├── README.md              # This file
├── docs/
│   ├── API_DOCS.md        # Complete API documentation
│   └── MINTEXT_GUIDE.md   # MinText quick reference
├── CONTRIBUTING.md        # Contribution guidelines
├── LICENSE                # MIT License
├── requirements.txt       # Development dependencies
├── pyproject.toml         # Package metadata
├── src/
│   ├── oet_core/
│   │   ├── __init__.py    # Package exports
│   │   ├── algos.py       # Algorithm implementations (binary_search, HashMap)
│   │   ├── mintext.py     # Text analysis (Text, Corpus)
│   │   └── utils.py       # Utility helpers (Matrix, SQLite, logging, graphs)
│   ├── __init__.py        # Compatibility shim for legacy imports
│   └── utils.py           # Compatibility shim for legacy imports
└── tests/
    ├── __init__.py        # Test package
    ├── test_algos.py      # Algorithm tests
    ├── test_mintext.py    # MinText tests
    ├── test_utils.py      # Utility tests
    └── run_all_tests.py   # Test runner

Running Tests

The test suite contains 125+ tests covering all modules.

Run all tests:

python tests/run_all_tests.py

Run specific test modules:

python tests/test_algos.py    # Test algorithms (binary_search, HashMap)
python tests/test_mintext.py  # Test text analysis (Text, Corpus)
python tests/test_utils.py    # Test utilities (Matrix, SQLite, validation, logging, graphs)

Test Coverage:

  • algos.py: Binary search (scalars, pairs, duplicates, edge cases), HashMap (CRUD operations, resizing, collisions)
  • mintext.py: Text tokenization, frequency analysis, entropy, sentiment, vectorization, Corpus operations, SQLite persistence
  • utils.py: Matrix operations, text validation (JSON/YAML/Markdown), SQLite helpers, logging, graph building

Note: Graph tests require networkx to be installed (see requirements.txt).
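
The HashMap behaviors listed in the coverage above (CRUD operations, resizing, collisions) can be pictured with a small separate-chaining table. This is an illustrative sketch only: the class name `ChainedHashMap`, the load-factor threshold, and the doubling strategy are assumptions, and oet-core's `HashMap` may use a different collision scheme.

```python
class ChainedHashMap:
    """Illustrative hash table using separate chaining; not oet-core's HashMap."""

    def __init__(self, capacity=8, max_load=0.75):
        self._buckets = [[] for _ in range(capacity)]
        self._size = 0
        self._max_load = max_load

    def _bucket(self, key):
        # Keys whose hashes share a remainder land in the same bucket.
        return self._buckets[hash(key) % len(self._buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (existing, _) in enumerate(bucket):
            if existing == key:          # update in place
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # a collision just extends the chain
        self._size += 1
        if self._size > self._max_load * len(self._buckets):
            self._resize()

    def get(self, key, default=None):
        for existing, value in self._bucket(key):
            if existing == key:
                return value
        return default

    def remove(self, key):
        bucket = self._bucket(key)
        for i, (existing, _) in enumerate(bucket):
            if existing == key:
                del bucket[i]
                self._size -= 1
                return True
        return False

    def _resize(self):
        # Double the bucket count and rehash every stored pair.
        pairs = [pair for bucket in self._buckets for pair in bucket]
        self._buckets = [[] for _ in range(2 * len(self._buckets))]
        for key, value in pairs:
            self._bucket(key).append((key, value))

    def __len__(self):
        return self._size


table = ChainedHashMap(capacity=2)
for n in range(20):
    table.put(n, n * n)
print(table.get(7), len(table))  # 49 20
```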

Documentation

  • docs/API_DOCS.md: Complete API documentation
  • docs/MINTEXT_GUIDE.md: MinText quick reference

Design Philosophy

Built for research workflows:

oet-core grew out of the Outer Element Taxonomy research framework, and its design follows principles that matter for research software:

  • Simplicity over speed: Readable implementations researchers can understand, modify, and trust
  • Zero core dependencies: Aids reproducibility - the core installs anywhere Python runs, and the YAML and graph features are opt-in extras
  • Pure Python portability: From laptops to HPC clusters to embedded systems
  • Modular design: Mix and match components for rapid prototyping and experimentation
  • Production-ready: Well-tested and documented - reliable enough for daily research use
  • Research-grade engineering: Clear APIs, comprehensive tests, and proper versioning

When to use oet-core:

  • Production research computing - reliable tools for daily research workflows
  • Experimental prototyping and rapid iteration
  • Reproducibility-critical environments where dependencies matter
  • Teaching, learning, and understanding data structures
  • Memory-constrained systems (embedded, serverless, HPC login nodes)
  • Any project prioritizing simplicity and transparency over raw speed

When NOT to use oet-core:

  • High-performance numerical computing at massive scale (use numpy/pandas)
  • Enterprise data warehousing with strict SLAs
  • When you need highly optimized algorithms for production data pipelines

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for your changes
  4. Ensure all tests pass
  5. Submit a pull request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Optional Dependencies

  • YAML validation support: pip install "oet-core[yaml]"
  • Graph utilities: pip install "oet-core[graph]"
  • All extras and dev tooling: pip install "oet-core[all,dev]"

Built with care following the principle of "minimum code, maximum value"
