📦 Ingestion Engine
Ingestion Engine is a typed Python library for building document ingestion pipelines. It gives you small async building blocks for loading data, transforming it, and writing it somewhere useful.
🌟 Highlights
- Simple Source, Transformer, and Sink contracts for ingestion workflows.
- Async-first interfaces for I/O-heavy data processing.
- Built-in PDF parsing with Docling.
- LangChain-compatible document objects for retrieval and AI workflows.
- Local JSON Lines output for embedded documents.
- Pydantic settings models for explicit, typed component configuration.
- Extensible design: bring your own source, transformer, embedder, or sink.
ℹ️ Overview
Ingestion Engine helps you structure document ingestion code without locking you into one storage backend, embedding provider, or orchestration framework. A pipeline is built from three concepts:
- A Source loads input data.
- A Transformer converts that data into another representation.
- A Sink writes the final output.
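The three roles compose into a simple loop: pull items from the source, transform each one, and hand the result to the sink. The sketch below illustrates that shape with standard-library `typing.Protocol` stand-ins. The names `SourceLike`, `TransformerLike`, `SinkLike`, `ListSource`, `Uppercase`, `CollectSink`, and `run_pipeline` are hypothetical and are not part of Ingestion Engine's API.

```python
import asyncio
from collections.abc import AsyncIterator
from typing import Protocol


class SourceLike(Protocol):
    def load(self) -> AsyncIterator[str]: ...


class TransformerLike(Protocol):
    async def transform(self, data: str) -> str: ...


class SinkLike(Protocol):
    async def write(self, data: str) -> None: ...


class ListSource:
    """Hypothetical source that yields items from an in-memory list."""

    def __init__(self, items: list[str]) -> None:
        self.items = items

    async def load(self) -> AsyncIterator[str]:
        for item in self.items:
            yield item


class Uppercase:
    """Hypothetical transformer that upper-cases each item."""

    async def transform(self, data: str) -> str:
        return data.upper()


class CollectSink:
    """Hypothetical sink that keeps written items in memory."""

    def __init__(self) -> None:
        self.written: list[str] = []

    async def write(self, data: str) -> None:
        self.written.append(data)


async def run_pipeline(
    source: SourceLike, transformer: TransformerLike, sink: SinkLike
) -> int:
    # Stream items one at a time: load, transform, write.
    count = 0
    async for item in source.load():
        await sink.write(await transformer.transform(item))
        count += 1
    return count


sink = CollectSink()
processed = asyncio.run(run_pipeline(ListSource(["a", "b"]), Uppercase(), sink))
```

Because `run_pipeline` depends only on the three small interfaces, any stage can be swapped without touching the others.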
The current library focuses on document and retrieval workflows. It includes a Docling-based PDF parser that turns PDFs into page-level documents with metadata, plus a local JSONL sink for storing embedded documents during development, testing, or batch handoff.
This project is useful when you want clear ingestion boundaries before sending data to a vector database, search index, data lake, or downstream AI application.
✍️ Authors
Created by Efysent.
🚀 Usage
Parse a PDF into page-level documents:
import asyncio

from pydantic import BaseModel

from ingestion_engine.transformer.pdf_parser import (
    DoclingPDFParserTransformer,
    RawPDFDocument,
)
from ingestion_engine.transformer.settings import DoclingPDFParserTransformerSettings


class PaperMetadata(BaseModel):
    paper_id: str
    title: str


async def main() -> None:
    settings = DoclingPDFParserTransformerSettings(
        module_path="ingestion_engine.transformer.pdf_parser.DoclingPDFParserTransformer",
    )
    transformer = DoclingPDFParserTransformer(settings)

    raw_document = RawPDFDocument(
        metadata=PaperMetadata(paper_id="paper-123", title="Example Paper"),
        pdf_path="/path/to/paper.pdf",
    )

    # transform() is a coroutine, so it has to be awaited inside an async function.
    documents = await transformer.transform(raw_document)
    print(documents)


asyncio.run(main())
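To make "page-level documents with metadata" more concrete, here is a rough sketch of how source metadata and per-page parser metadata could be merged. It uses plain dataclasses rather than the library's own types. `PageDocument`, `split_into_pages`, the hash-based `doc_id`, and the 1-based `page_number` are all assumptions for illustration; only the key names `doc_id`, `page_number`, and `total_pages` come from this README.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class PageDocument:
    """Hypothetical stand-in for a page-level document."""

    page_content: str
    metadata: dict[str, object] = field(default_factory=dict)


def split_into_pages(
    pdf_path: str, pages: list[str], source_metadata: dict[str, object]
) -> list[PageDocument]:
    # Assumption: doc_id is a short, stable hash of the path string.
    doc_id = hashlib.sha256(pdf_path.encode("utf-8")).hexdigest()[:12]
    total = len(pages)
    return [
        PageDocument(
            page_content=text,
            # Source metadata is copied first, then parser keys are added.
            metadata={
                **source_metadata,
                "doc_id": doc_id,
                "page_number": number,
                "total_pages": total,
            },
        )
        for number, text in enumerate(pages, start=1)
    ]


docs = split_into_pages(
    "/path/to/paper.pdf",
    ["Intro text", "Methods text"],
    {"paper_id": "paper-123", "title": "Example Paper"},
)
```

Every page carries the caller's metadata plus the same `doc_id`, so pages can be regrouped by document after retrieval.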
Write embedded documents to JSON Lines:
import asyncio

from ingestion_engine.sink.local_json import LocalJSONSink
from ingestion_engine.sink.settings import LocalJSONSinkSettings
from ingestion_engine.transformer.embedder import EmbeddedDocument


async def main() -> None:
    sink = LocalJSONSink(
        LocalJSONSinkSettings(
            module_path="ingestion_engine.sink.local_json.LocalJSONSink",
            output_path="./data/embedded_documents.jsonl",
        )
    )

    # write() is a coroutine, so it has to be awaited inside an async function.
    await sink.write(
        [
            EmbeddedDocument(
                page_content="first page",
                metadata={"page": 1},
                embedding=[0.1, 0.2, 0.3],
            )
        ]
    )


asyncio.run(main())
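JSON Lines stores one JSON object per line, which makes the output easy to append to and to stream back. The sketch below uses only the standard library to write and re-read records shaped like the `EmbeddedDocument` fields shown above. The exact on-disk layout produced by `LocalJSONSink` is not documented here, so treat the record shape as an assumption.

```python
import json
import tempfile
from pathlib import Path

# Assumed record shape, mirroring the EmbeddedDocument fields used above.
records = [
    {"page_content": "first page", "metadata": {"page": 1}, "embedding": [0.1, 0.2, 0.3]},
    {"page_content": "second page", "metadata": {"page": 2}, "embedding": [0.4, 0.5, 0.6]},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "embedded_documents.jsonl"

    # Append one JSON object per line.
    with path.open("a", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")

    # Read the file back line by line, skipping blank lines.
    with path.open(encoding="utf-8") as handle:
        loaded = [json.loads(line) for line in handle if line.strip()]
```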
Define your own components by implementing the protected hooks. Call the public methods (load, transform, and write) from pipeline code so component progress is logged:
from collections.abc import AsyncGenerator

from ingestion_engine.sink import Sink
from ingestion_engine.sink.settings import SinkSettings
from ingestion_engine.source import Source
from ingestion_engine.source.settings import SourceSettings
from ingestion_engine.transformer import Transformer
from ingestion_engine.transformer.settings import TransformerSettings


# The base settings classes are used directly here; subclass them when a
# component needs its own configuration fields.
class TextSource(Source[SourceSettings, str]):
    async def _load(self) -> AsyncGenerator[str, None]:
        yield "hello"


class UppercaseTransformer(Transformer[TransformerSettings, str, str]):
    async def _transform(self, data: str) -> str:
        return data.upper()


class PrintSink(Sink[SinkSettings, str]):
    async def _write(self, data: str) -> None:
        print(data)
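The split between public methods and protected hooks is a template-method pattern: the base class owns cross-cutting behavior such as logging, and subclasses supply only the core logic. The following is a minimal, self-contained sketch of that idea. `BaseTransformer`, `ListHandler`, and the log messages are hypothetical stand-ins and do not reproduce the library's actual base classes or log output.

```python
import asyncio
import logging
from abc import ABC, abstractmethod

logger = logging.getLogger("pipeline_sketch")


class BaseTransformer(ABC):
    """Hypothetical base class: public method logs, hook does the work."""

    async def transform(self, data: str) -> str:
        name = type(self).__name__
        logger.info("%s: transform started", name)
        result = await self._transform(data)
        logger.info("%s: transform finished", name)
        return result

    @abstractmethod
    async def _transform(self, data: str) -> str: ...


class UppercaseTransformer(BaseTransformer):
    async def _transform(self, data: str) -> str:
        return data.upper()


class ListHandler(logging.Handler):
    """Collects formatted log messages so they can be inspected."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


handler = ListHandler()
logger.addHandler(handler)
logger.setLevel(logging.INFO)

result = asyncio.run(UppercaseTransformer().transform("hello"))
```

Calling `_transform` directly would still return the result but would skip the logging, which is why pipeline code should go through the public method.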
⬇️ Installation
This project requires Python 3.12 or newer.
Install it with:
pip install ingestion-engine
For local development from this repository, use uv:
uv sync --group dev --all-extras
🧱 Project Structure
ingestion-engine/
|-- src/ingestion_engine/
| |-- source/ # Source base class and settings
| |-- transformer/ # Transformer base class, documents, PDF parser, embedder contracts
| |-- sink/ # Sink base class, settings, local JSONL sink
| |-- exceptions.py # Package-level base exception
| `-- py.typed # Type information marker
|-- tests/
| |-- fixtures/ # Shared pytest fixtures
| |-- unit/ # Unit tests
| `-- integration/ # Integration tests
|-- pyproject.toml
|-- Makefile
`-- README.md
🧩 Common Use Cases
- Parse PDFs into page-level documents for retrieval systems.
- Preserve source metadata while adding parser metadata like doc_id, page_number, and total_pages.
- Build ingestion pipelines for vector databases, search indexes, data lakes, or local JSONL exports.
- Keep embedding logic replaceable behind an EmbedderTransformer implementation.
- Test ingestion pieces independently with mocked sources, transformers, and sinks.
- Prototype local document workflows before wiring production infrastructure.
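Because each stage is an awaitable call behind a small interface, pipeline glue code can be unit-tested with `unittest.mock.AsyncMock` from the standard library. This sketch is illustrative only: `embed_and_store` is a hypothetical helper, and the mocks stand in for any embedder and sink that expose awaitable `transform` and `write` methods.

```python
import asyncio
from unittest.mock import AsyncMock


async def embed_and_store(texts, embedder, sink) -> int:
    # Hypothetical glue code: embed every text, then write the batch once.
    embedded = [await embedder.transform(text) for text in texts]
    await sink.write(embedded)
    return len(embedded)


# The mocked embedder returns a fixed fake embedding for every input.
embedder = AsyncMock()
embedder.transform.side_effect = lambda text: {
    "page_content": text,
    "embedding": [0.0, 1.0],
}
sink = AsyncMock()

count = asyncio.run(embed_and_store(["page one", "page two"], embedder, sink))

# Inspect how the mocks were awaited.
written_batch = sink.write.await_args.args[0]
```

The same approach works inside pytest with an async test plugin; the mocks record every awaited call, so no PDF parser, embedding provider, or storage backend is needed.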
🧪 Development
Run tests:
make test
Run formatting and linting:
make format
make lint
Run type checking:
make type-check
Run the local CI equivalent:
make ci
💭 Feedback and Contributing
Bug reports, feature requests, and implementation ideas are welcome. Open an issue or discussion in the repository with:
- What you expected to happen.
- What actually happened.
- A minimal example or test case when possible.
- The Python version and relevant dependency versions.
Good contributions for this project include new sources, transformers, sinks, tests, examples, and documentation improvements.