📦 Ingestion Engine

Ingestion Engine is a typed Python library for building document ingestion pipelines. It gives you small async building blocks for loading data, transforming it, and writing it somewhere useful.

🌟 Highlights

  • Simple Source, Transformer, and Sink contracts for ingestion workflows.
  • Async-first interfaces for I/O-heavy data processing.
  • Built-in PDF parsing with Docling.
  • LangChain-compatible document objects for retrieval and AI workflows.
  • Local JSON Lines output for embedded documents.
  • Pydantic settings models for explicit, typed component configuration.
  • Extensible design: bring your own source, transformer, embedder, or sink.

ℹ️ Overview

Ingestion Engine helps you structure document ingestion code without locking you into one storage backend, embedding provider, or orchestration framework. A pipeline is built from three concepts:

  1. A Source loads input data.
  2. A Transformer converts that data into another representation.
  3. A Sink writes the final output.
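
The three roles above can be sketched in plain Python to show how they compose. This is a minimal illustration under stated assumptions, not the library's implementation: `numbers_source`, `square`, `collect`, and `run_pipeline` are hypothetical names, and the real Source, Transformer, and Sink base classes add settings and logging on top of this shape.

```python
import asyncio
from collections.abc import AsyncIterator, Awaitable, Callable


async def numbers_source() -> AsyncIterator[int]:
    """Hypothetical source: yields input items one at a time."""
    for value in (1, 2, 3):
        yield value


async def square(value: int) -> int:
    """Hypothetical transformer: converts one item into another representation."""
    return value * value


collected: list[int] = []


async def collect(value: int) -> None:
    """Hypothetical sink: stores each result in memory."""
    collected.append(value)


async def run_pipeline(
    source: AsyncIterator[int],
    transform: Callable[[int], Awaitable[int]],
    write: Callable[[int], Awaitable[None]],
) -> None:
    """Pull each item from the source, transform it, and hand it to the sink."""
    async for item in source:
        await write(await transform(item))


asyncio.run(run_pipeline(numbers_source(), square, collect))
print(collected)
```

Each stage only sees the output of the previous one, so any stage can be swapped without touching the others.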

The current library focuses on document and retrieval workflows. It includes a Docling-based PDF parser that turns PDFs into page-level documents with metadata, plus a local JSONL sink for storing embedded documents during development, testing, or batch handoff.

This project is useful when you want clear ingestion boundaries before sending data to a vector database, search index, data lake, or downstream AI application.

✍️ Authors

Created by Efysent.

🚀 Usage

Parse a PDF into page-level documents:

import asyncio

from pydantic import BaseModel

from ingestion_engine.transformer.pdf_parser import (
    DoclingPDFParserTransformer,
    RawPDFDocument,
)
from ingestion_engine.transformer.settings import DoclingPDFParserTransformerSettings


# Source metadata carried alongside the PDF.
class PaperMetadata(BaseModel):
    paper_id: str
    title: str


settings = DoclingPDFParserTransformerSettings(
    module_path="ingestion_engine.transformer.pdf_parser.DoclingPDFParserTransformer",
)
transformer = DoclingPDFParserTransformer(settings)

raw_document = RawPDFDocument(
    metadata=PaperMetadata(paper_id="paper-123", title="Example Paper"),
    pdf_path="/path/to/paper.pdf",
)


async def main() -> None:
    # transform() is a coroutine, so it has to run inside an event loop.
    documents = await transformer.transform(raw_document)
    print(documents)


asyncio.run(main())

Write embedded documents to JSON Lines:

import asyncio

from ingestion_engine.sink.local_json import LocalJSONSink
from ingestion_engine.sink.settings import LocalJSONSinkSettings
from ingestion_engine.transformer.embedder import EmbeddedDocument


sink = LocalJSONSink(
    LocalJSONSinkSettings(
        module_path="ingestion_engine.sink.local_json.LocalJSONSink",
        output_path="./data/embedded_documents.jsonl",
    )
)


async def main() -> None:
    # write() is a coroutine, so it has to run inside an event loop.
    await sink.write(
        [
            EmbeddedDocument(
                page_content="first page",
                metadata={"page": 1},
                embedding=[0.1, 0.2, 0.3],
            )
        ]
    )


asyncio.run(main())
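
JSON Lines stores one JSON object per line, so a file like this can be inspected with the standard library alone. The sketch below is an illustration under assumptions: it writes its own sample file to a temporary directory instead of calling LocalJSONSink, `write_jsonl` and `read_jsonl` are hypothetical helpers, and the field names simply mirror the EmbeddedDocument fields shown above. The sink's actual on-disk schema may differ.

```python
import json
import tempfile
from pathlib import Path

# Sample records shaped like the EmbeddedDocument fields above (assumed schema).
records = [
    {"page_content": "first page", "metadata": {"page": 1}, "embedding": [0.1, 0.2, 0.3]},
    {"page_content": "second page", "metadata": {"page": 2}, "embedding": [0.4, 0.5, 0.6]},
]


def write_jsonl(path: Path, rows: list[dict]) -> None:
    """Write one JSON object per line."""
    with path.open("w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row) + "\n")


def read_jsonl(path: Path) -> list[dict]:
    """Parse each non-empty line as its own JSON object."""
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "embedded_documents.jsonl"
    write_jsonl(path, records)
    loaded = read_jsonl(path)

print(len(loaded), loaded[0]["metadata"])
```

Because every line is independent, a file in this format can be streamed or appended to without parsing it as a whole.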

Define your own components by implementing the protected hooks (_load, _transform, and _write). From pipeline code, call the public methods (load, transform, and write) instead, so component progress is logged:

from collections.abc import AsyncGenerator

from ingestion_engine.source import Source
from ingestion_engine.source.settings import SourceSettings
from ingestion_engine.transformer import Transformer
from ingestion_engine.transformer.settings import TransformerSettings
from ingestion_engine.sink import Sink
from ingestion_engine.sink.settings import SinkSettings


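# Component settings; these are assumed to subclass the imported base settings models.
class TextSourceSettings(SourceSettings):
    pass


class UppercaseTransformerSettings(TransformerSettings):
    pass


class PrintSinkSettings(SinkSettings):
    pass


class TextSource(Source[TextSourceSettings, str]):
    # Hook behind the public load() method.
    async def _load(self) -> AsyncGenerator[str, None]:
        yield "hello"


class UppercaseTransformer(Transformer[UppercaseTransformerSettings, str, str]):
    # Hook behind the public transform() method.
    async def _transform(self, data: str) -> str:
        return data.upper()


class PrintSink(Sink[PrintSinkSettings, str]):
    # Hook behind the public write() method.
    async def _write(self, data: str) -> None:
        print(data)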
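
The split between public methods and protected hooks is a template-method arrangement: the public method handles shared bookkeeping and delegates the real work to the hook. The sketch below illustrates that idea only. `LoggedTransformer`, `Uppercase`, and the log messages are hypothetical and are not taken from the library's source.

```python
import asyncio
import logging
from abc import ABC, abstractmethod

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("pipeline_sketch")


class LoggedTransformer(ABC):
    """Hypothetical base class: public transform() wraps the protected hook."""

    async def transform(self, data: str) -> str:
        name = type(self).__name__
        logger.info("%s: transform started", name)
        result = await self._transform(data)
        logger.info("%s: transform finished", name)
        return result

    @abstractmethod
    async def _transform(self, data: str) -> str:
        """Subclasses implement only the actual conversion."""


class Uppercase(LoggedTransformer):
    async def _transform(self, data: str) -> str:
        return data.upper()


result = asyncio.run(Uppercase().transform("hello"))
print(result)
```

Calling the hook directly would still return the converted value, but it would skip the logging done by the public method.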

⬇️ Installation

This project requires Python 3.12 or newer.

Install it with:

pip install ingestion-engine

For local development from this repository, use uv:

uv sync --group dev --all-extras

🧱 Project Structure

ingestion-engine/
|-- src/ingestion_engine/
|   |-- source/          # Source base class and settings
|   |-- transformer/     # Transformer base class, documents, PDF parser, embedder contracts
|   |-- sink/            # Sink base class, settings, local JSONL sink
|   |-- exceptions.py    # Package-level base exception
|   `-- py.typed         # Type information marker
|-- tests/
|   |-- fixtures/        # Shared pytest fixtures
|   |-- unit/            # Unit tests
|   `-- integration/     # Integration tests
|-- pyproject.toml
|-- Makefile
`-- README.md

🧩 Common Use Cases

  • Parse PDFs into page-level documents for retrieval systems.
  • Preserve source metadata while adding parser metadata like doc_id, page_number, and total_pages.
  • Build ingestion pipelines for vector databases, search indexes, data lakes, or local JSONL exports.
  • Keep embedding logic replaceable behind an EmbedderTransformer implementation.
  • Test ingestion pieces independently with mocked sources, transformers, and sinks.
  • Prototype local document workflows before wiring production infrastructure.
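
Preserving source metadata while adding parser fields can be pictured as a per-page dictionary merge. The sketch below is illustrative only: `build_page_metadata` is a hypothetical helper, the field names doc_id, page_number, and total_pages come from the list above, and how the parser actually derives doc_id or handles key collisions is an assumption here.

```python
def build_page_metadata(
    source_metadata: dict,
    doc_id: str,
    page_number: int,
    total_pages: int,
) -> dict:
    """Return a new dict: source fields first, then the parser fields."""
    return {
        **source_metadata,
        "doc_id": doc_id,
        "page_number": page_number,
        "total_pages": total_pages,
    }


source_metadata = {"paper_id": "paper-123", "title": "Example Paper"}
pages = ["first page text", "second page text"]

# One metadata dict per page; the source dict itself is never mutated.
page_metadata = [
    build_page_metadata(source_metadata, "doc-1", number, len(pages))
    for number, _ in enumerate(pages, start=1)
]

print(page_metadata[1])
```

Building a fresh dictionary per page keeps the original metadata reusable across every page of the same document.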

🧪 Development

Run tests:

make test

Run formatting and linting:

make format
make lint

Run type checking:

make type-check

Run the local CI equivalent:

make ci

💭 Feedback and Contributing

Bug reports, feature requests, and implementation ideas are welcome. Open an issue or discussion in the repository with:

  • What you expected to happen.
  • What actually happened.
  • A minimal example or test case when possible.
  • The Python version and relevant dependency versions.

Good contributions for this project include new sources, transformers, sinks, tests, examples, and documentation improvements.
