📦 Ingestion Engine

Ingestion Engine is a typed Python library for building document ingestion pipelines. It gives you small async building blocks for loading data, transforming it, and writing it somewhere useful.

🌟 Highlights

Simple Source, Transformer, and Sink contracts for ingestion workflows.
Async-first interfaces for I/O-heavy data processing.
Built-in PDF parsing with Docling.
LangChain-compatible document objects for retrieval and AI workflows.
Local JSON Lines output for embedded documents.
Pydantic settings models for explicit, typed component configuration.
Extensible design: bring your own source, transformer, embedder, or sink.

ℹ️ Overview

Ingestion Engine helps you structure document ingestion code without locking you into one storage backend, embedding provider, or orchestration framework. A pipeline is built from three concepts:

A Source loads input data.
A Transformer converts that data into another representation.
A Sink writes the final output.
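
The three roles can be sketched with nothing but the standard library. The names below (`text_source`, `uppercase`, `ListSink`, `run_pipeline`) are hypothetical stand-ins, not the library's API; the sketch only illustrates how a source feeds a transformer, which in turn feeds a sink.

```python
import asyncio
from collections.abc import AsyncIterator


async def text_source() -> AsyncIterator[str]:
    """Hypothetical source: yields raw input items one at a time."""
    for item in ("alpha", "beta"):
        yield item


async def uppercase(item: str) -> str:
    """Hypothetical transformer: converts one item into another representation."""
    return item.upper()


class ListSink:
    """Hypothetical sink: collects the final output in memory."""

    def __init__(self) -> None:
        self.items: list[str] = []

    async def write(self, item: str) -> None:
        self.items.append(item)


async def run_pipeline(target: ListSink) -> None:
    # Source -> Transformer -> Sink, one item at a time.
    async for raw in text_source():
        await target.write(await uppercase(raw))


sink = ListSink()
asyncio.run(run_pipeline(sink))
print(sink.items)
```

Keeping each stage behind its own small async interface is what lets one stage be swapped (for example, a different sink) without touching the others.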

The current library focuses on document and retrieval workflows. It includes a Docling-based PDF parser that turns PDFs into page-level documents with metadata, plus a local JSONL sink for storing embedded documents during development, testing, or batch handoff.

This project is useful when you want clear ingestion boundaries before sending data to a vector database, search index, data lake, or downstream AI application.

✍️ Authors

Created by Efysent.

🚀 Usage

Parse a PDF into page-level documents:

import asyncio

from pydantic import BaseModel

from ingestion_engine.transformer.pdf_parser import (
    DoclingPDFParserTransformer,
    RawPDFDocument,
)
from ingestion_engine.transformer.settings import DoclingPDFParserTransformerSettings


class PaperMetadata(BaseModel):
    """Caller-defined metadata carried alongside the PDF."""

    paper_id: str
    title: str


settings = DoclingPDFParserTransformerSettings(
    module_path="ingestion_engine.transformer.pdf_parser.DoclingPDFParserTransformer",
)
transformer = DoclingPDFParserTransformer(settings)

raw_document = RawPDFDocument(
    metadata=PaperMetadata(paper_id="paper-123", title="Example Paper"),
    pdf_path="/path/to/paper.pdf",
)


async def main() -> None:
    # transform() is a coroutine, so it has to be awaited inside an event loop.
    documents = await transformer.transform(raw_document)
    print(documents)


asyncio.run(main())

Write embedded documents to JSON Lines:

import asyncio

from ingestion_engine.sink.local_json import LocalJSONSink
from ingestion_engine.sink.settings import LocalJSONSinkSettings
from ingestion_engine.transformer.embedder import EmbeddedDocument


sink = LocalJSONSink(
    LocalJSONSinkSettings(
        module_path="ingestion_engine.sink.local_json.LocalJSONSink",
        output_path="./data/embedded_documents.jsonl",
    )
)


async def main() -> None:
    # write() is a coroutine, so it has to be awaited inside an event loop.
    await sink.write(
        [
            EmbeddedDocument(
                page_content="first page",
                metadata={"page": 1},
                embedding=[0.1, 0.2, 0.3],
            )
        ]
    )


asyncio.run(main())
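
JSON Lines stores one JSON object per line, which makes the output easy to append to and to stream back. The sketch below uses only the standard library and assumes each record carries `page_content`, `metadata`, and `embedding` keys, mirroring the `EmbeddedDocument` fields above. The exact layout that `LocalJSONSink` writes is not documented here, so treat the record shape and the `write_jsonl` / `read_jsonl` helpers as assumptions.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(path: Path, records: list[dict]) -> None:
    """Write one JSON object per line (hypothetical helper)."""
    with path.open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")


def read_jsonl(path: Path) -> list[dict]:
    """Read records back, skipping blank lines (hypothetical helper)."""
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


records = [
    {"page_content": "first page", "metadata": {"page": 1}, "embedding": [0.1, 0.2, 0.3]},
]

# Round-trip through a temporary file so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    output_path = Path(tmp) / "embedded_documents.jsonl"
    write_jsonl(output_path, records)
    loaded = read_jsonl(output_path)

print(loaded[0]["metadata"])
```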

Define your own components by implementing the protected hooks (_load, _transform, and _write). Pipeline code should call the public methods (load, transform, and write) instead, so that component progress is logged:

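from collections.abc import AsyncGenerator

from ingestion_engine.source import Source
from ingestion_engine.source.settings import SourceSettings
from ingestion_engine.transformer import Transformer
from ingestion_engine.transformer.settings import TransformerSettings
from ingestion_engine.sink import Sink
from ingestion_engine.sink.settings import SinkSettings


class TextSourceSettings(SourceSettings):
    """Settings for TextSource (no extra fields in this example)."""


class UppercaseTransformerSettings(TransformerSettings):
    """Settings for UppercaseTransformer (no extra fields in this example)."""


class PrintSinkSettings(SinkSettings):
    """Settings for PrintSink (no extra fields in this example)."""


class TextSource(Source[TextSourceSettings, str]):
    async def _load(self) -> AsyncGenerator[str]:
        # Hook behind the public load() method: yield each loaded item.
        yield "hello"


class UppercaseTransformer(Transformer[UppercaseTransformerSettings, str, str]):
    async def _transform(self, data: str) -> str:
        # Hook behind the public transform() method.
        return data.upper()


class PrintSink(Sink[PrintSinkSettings, str]):
    async def _write(self, data: str) -> None:
        # Hook behind the public write() method.
        print(data)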
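
The split between public methods and protected hooks is a template-method arrangement: the public method handles cross-cutting work such as logging, then delegates to the hook that a subclass overrides. The sketch below is a stand-alone illustration built around a hypothetical `LoggingTransformer` base class. It is not the library's implementation, and the log messages are invented for the example.

```python
import asyncio
import logging
from abc import ABC, abstractmethod

logger = logging.getLogger("pipeline_sketch")


class LoggingTransformer(ABC):
    """Hypothetical base class: public transform() wraps the protected hook."""

    async def transform(self, data: str) -> str:
        # Cross-cutting concern lives here, once, for every subclass.
        logger.info("%s: transform started", type(self).__name__)
        result = await self._transform(data)
        logger.info("%s: transform finished", type(self).__name__)
        return result

    @abstractmethod
    async def _transform(self, data: str) -> str:
        """Subclasses implement only the actual conversion."""


class Uppercase(LoggingTransformer):
    async def _transform(self, data: str) -> str:
        return data.upper()


result = asyncio.run(Uppercase().transform("hello"))
print(result)
```

In this sketch, calling `_transform` directly would still return the converted value but would skip the log lines, which is the reason pipeline code goes through the public method.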

⬇️ Installation

This project requires Python 3.12 or newer.

Install it with:

pip install ingestion-engine

For local development from this repository, use uv:

uv sync --group dev --all-extras

🧱 Project Structure

ingestion-engine/
|-- src/ingestion_engine/
|   |-- source/          # Source base class and settings
|   |-- transformer/     # Transformer base class, documents, PDF parser, embedder contracts
|   |-- sink/            # Sink base class, settings, local JSONL sink
|   |-- exceptions.py    # Package-level base exception
|   `-- py.typed         # Type information marker
|-- tests/
|   |-- fixtures/        # Shared pytest fixtures
|   |-- unit/            # Unit tests
|   `-- integration/     # Integration tests
|-- pyproject.toml
|-- Makefile
`-- README.md

🧩 Common Use Cases

Parse PDFs into page-level documents for retrieval systems.
Preserve source metadata while adding parser metadata like doc_id, page_number, and total_pages.
Build ingestion pipelines for vector databases, search indexes, data lakes, or local JSONL exports.
Keep embedding logic replaceable behind an EmbedderTransformer implementation.
Test ingestion pieces independently with mocked sources, transformers, and sinks.
Prototype local document workflows before wiring production infrastructure.
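
As an illustration of the metadata point above, the sketch below merges caller-supplied metadata with per-page parser fields. The `page_metadata` helper and the way `doc_id` is passed in are assumptions made for the example; only the field names `doc_id`, `page_number`, and `total_pages` come from the list above.

```python
def page_metadata(source: dict, doc_id: str, page_number: int, total_pages: int) -> dict:
    """Return a new dict: source metadata plus parser fields (hypothetical helper)."""
    return {
        **source,  # keep everything the caller supplied
        "doc_id": doc_id,
        "page_number": page_number,
        "total_pages": total_pages,
    }


source = {"paper_id": "paper-123", "title": "Example Paper"}

# One metadata dict per page; the source dict itself is left unmodified.
pages = [page_metadata(source, "doc-1", n, 2) for n in (1, 2)]
print(pages[1])
```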

🧪 Development

Run tests:

make test

Run formatting and linting:

make format
make lint

Run type checking:

make type-check

Run the local CI equivalent:

make ci

💭 Feedback and Contributing

Bug reports, feature requests, and implementation ideas are welcome. Open an issue or discussion in the repository with:

What you expected to happen.
What actually happened.
A minimal example or test case when possible.
The Python version and relevant dependency versions.

Good contributions for this project include new sources, transformers, sinks, tests, examples, and documentation improvements.

Project details

Release history

2.0.0 (this version): May 21, 2026
1.0.0: May 20, 2026

Download files

Source Distribution

ingestion_engine-2.0.0.tar.gz (5.9 kB), uploaded May 21, 2026 (Source)

Built Distribution

ingestion_engine-2.0.0-py3-none-any.whl (10.0 kB), uploaded May 21, 2026 (Python 3)

File details

Details for the file ingestion_engine-2.0.0.tar.gz.

File metadata

Download URL: ingestion_engine-2.0.0.tar.gz
Upload date: May 21, 2026
Size: 5.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.15 {"installer":{"name":"uv","version":"0.11.15","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Debian GNU/Linux","version":"12","id":"bookworm","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for ingestion_engine-2.0.0.tar.gz
Algorithm	Hash digest
SHA256	`2e4a51bc23c3b9e961257454ceb387293b5da88522b23260bc4e7eec033f833a`
MD5	`1191b2249d56f6ac32795602d2606cef`
BLAKE2b-256	`eb33186f601ace7848e6e6e9acd0491bab8b7a36523e41bd109b23eab87e324b`

File details

Details for the file ingestion_engine-2.0.0-py3-none-any.whl.

File metadata

Download URL: ingestion_engine-2.0.0-py3-none-any.whl
Upload date: May 21, 2026
Size: 10.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.15 {"installer":{"name":"uv","version":"0.11.15","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Debian GNU/Linux","version":"12","id":"bookworm","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for ingestion_engine-2.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`93554cc2bcb2cfdc7ab0620c5e96a3784211f25be86fdc508b60ffe728be37d0`
MD5	`b0608d236c569e486353a58ecb431a16`
BLAKE2b-256	`bcd968d24057246953c9bd276696047ea1bc9be29134a1e8ac857bc071302f09`
