Extracts the content of documents, websites, etc., and maps it to a common format.

extractor-api-lib

Content ingestion layer for the STACKIT RAG template. This library exposes a FastAPI extraction service that ingests raw documents (files or remote sources), extracts their content, converts it into internal representations, and hands the output to admin-api-lib.

Responsibilities

Receive binary uploads and remote source descriptors from the admin backend.
Route each request through the appropriate extractor (file, sitemap, Confluence, etc.).
Convert extracted fragments into the shared InformationPiece schema expected by downstream services.

Feature highlights

Broad format coverage – PDFs, DOCX, PPTX, XML/EPUB, Confluence spaces, and sitemap-driven websites.
Consistent output schema – Information pieces are returned in a unified structure with content type (TEXT, TABLE, IMAGE) and metadata.
Swappable extractors – Dependency-injector container makes it easy to add or replace file/source extractors, table converters, etc.
Production-grade plumbing – Built-in S3-compatible file service, LangChain loaders with retry/backoff, optional PDF OCR, and throttling controls for web crawls.
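
To make the "consistent output schema" point concrete, here is a minimal sketch of what a unified information piece could look like. The class name InformationPieceSketch and the field names page_content, type, and metadata are assumptions for illustration only; the README confirms just the three content types and the presence of metadata, so check the library's own model for the real fields.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any


class ContentType(str, Enum):
    """The three content types named in this README."""

    TEXT = "TEXT"
    TABLE = "TABLE"
    IMAGE = "IMAGE"


@dataclass(frozen=True)
class InformationPieceSketch:
    """Hypothetical stand-in for the library's InformationPiece model."""

    page_content: str
    type: ContentType
    metadata: dict[str, Any] = field(default_factory=dict)


pieces = [
    InformationPieceSketch("Quarterly summary", ContentType.TEXT, {"page": 1}),
    InformationPieceSketch("| a | b |\n| --- | --- |\n| 1 | 2 |", ContentType.TABLE, {"page": 2}),
]

# Downstream code can branch on the content type without knowing the source format.
tables = [piece for piece in pieces if piece.type is ContentType.TABLE]
```

Because every extractor returns the same shape, a consumer only needs one code path per content type rather than one per file format.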

Installation

pip install extractor-api-lib

Python 3.13 is required. OCR and computer-vision features expect system packages such as ffmpeg, poppler-utils, and tesseract (see services/document-extractor/README.md for the full list).

Module tour

dependency_container.py – Central dependency-injector wiring. Override providers here to plug in custom extractors, endpoints, etc.
api_endpoints/ & impl/api_endpoints/ – Thin FastAPI endpoint abstractions and implementations for file and source (such as Confluence and sitemap) extractors.
apis/ – Extractor API abstractions and implementations.
extractors/ & impl/extractors/ – Format-specific logic (PDF, DOCX, PPTX, XML, EPUB, Confluence, sitemap) packaged behind the InformationExtractor/InformationFileExtractor interfaces.
mapper/ & impl/mapper/ – Abstractions and implementations that map between LangChain documents and the internal and external information piece representations.
file_services/ – Default S3-compatible storage adapter; replace it if you store files elsewhere.
impl/settings/ – Configuration settings for dependency injection container components.
table_converter/ & impl/table_converter/ – Abstractions and implementations to convert pandas.DataFrame to Markdown and vice versa.
impl/types/ – Enums for content, extractor, and file types.
impl/utils/ – Helper functions for hashed datetimes, sitemap crawling, header injection, and custom metadata parsing.
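
As an illustration of the table conversion idea mentioned for table_converter/, the sketch below round-trips a table between rows and Markdown. It is not the library's converter: it uses plain lists of dicts instead of pandas.DataFrame so that it runs without third-party packages, and the function names rows_to_markdown and markdown_to_rows are hypothetical.

```python
def rows_to_markdown(rows: list[dict[str, str]]) -> str:
    """Render uniform dict rows as a Markdown pipe table."""
    if not rows:
        return ""
    headers = list(rows[0])
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    lines += ["| " + " | ".join(str(row[h]) for h in headers) + " |" for row in rows]
    return "\n".join(lines)


def _split_row(line: str) -> list[str]:
    """Split one pipe-table line into trimmed cells (no escaped pipes supported)."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def markdown_to_rows(table: str) -> list[dict[str, str]]:
    """Parse a simple pipe table back into dict rows; all values come back as strings."""
    lines = [line for line in table.strip().splitlines() if line.strip()]
    if len(lines) < 2:
        return []
    headers = _split_row(lines[0])
    # lines[1] is the "---" separator row and carries no data.
    return [dict(zip(headers, _split_row(line))) for line in lines[2:]]


rows = [{"name": "Ada", "team": "search"}, {"name": "Lin", "team": "ingest"}]
markdown = rows_to_markdown(rows)
restored = markdown_to_rows(markdown)
```

Keeping both directions side by side makes it easy to check that a table survives the trip through its text form, which is what matters when tables are stored as Markdown.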

Endpoints provided

POST /extract_from_file – Downloads the file from S3, extracts its contents, and returns normalized InformationPiece records.
POST /extract_from_source – Pulls content from remote sources (Confluence, sitemap) using the supplied credentials and any further optional keyword arguments.

Both endpoints stream their results back to admin-api-lib, which takes care of enrichment and persistence.

How the file extraction endpoint works

Download the file from S3
Choose a suitable file extractor based on the file extension
Extract the content from the file
Map the internal representation to the external schema
Return the final output
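
The five steps above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the library's endpoint: FileExtractorSketch, PlainTextExtractor, the supports/extract method names, the dict-based piece layout, and the in-memory fake_bucket are all hypothetical stand-ins for the real InformationFileExtractor, mapper, and S3 file service.

```python
from pathlib import Path
from typing import Callable, Protocol


class FileExtractorSketch(Protocol):
    """Hypothetical interface; the real InformationFileExtractor may differ."""

    def supports(self, suffix: str) -> bool: ...

    def extract(self, data: bytes, name: str) -> list[dict]: ...


class PlainTextExtractor:
    """Toy extractor: each non-empty line of a .txt file becomes one piece."""

    def supports(self, suffix: str) -> bool:
        return suffix == ".txt"

    def extract(self, data: bytes, name: str) -> list[dict]:
        lines = data.decode("utf-8").splitlines()
        return [{"content": line, "type": "TEXT", "source": name} for line in lines if line.strip()]


def extract_from_file(
    key: str,
    download: Callable[[str], bytes],
    extractors: list[FileExtractorSketch],
) -> list[dict]:
    data = download(key)  # 1. download the file from storage
    suffix = Path(key).suffix.lower()  # 2. choose an extractor by file extension
    extractor = next((e for e in extractors if e.supports(suffix)), None)
    if extractor is None:
        raise ValueError(f"no extractor registered for {suffix!r}")
    internal = extractor.extract(data, Path(key).name)  # 3. extract the content
    external = [  # 4. map the internal representation to the external schema
        {"page_content": p["content"], "type": p["type"], "metadata": {"document": p["source"]}}
        for p in internal
    ]
    return external  # 5. return the final output


# An in-memory dict stands in for the S3 bucket.
fake_bucket = {"docs/notes.txt": b"first line\n\nsecond line\n", "docs/scan.pdf": b"%PDF"}
result = extract_from_file("docs/notes.txt", fake_bucket.__getitem__, [PlainTextExtractor()])
```

Passing the download function in as an argument keeps the flow testable without network access.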

How the source extraction endpoint works

Choose a suitable source extractor based on the source type
Pull the source content using the provided credentials and parameters
Extract the content from the source
Map the internal representation to the external schema
Return the final output
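
A comparable sketch for the source flow, where the extractor is looked up by source type rather than by file extension. Everything here is illustrative: the extractor functions, the parameter keys url and space_key, and the dict-based piece layout are assumptions, and the toy extractors do not contact any remote system.

```python
from typing import Callable

Params = dict[str, str]
Pieces = list[dict[str, str]]


def extract_sitemap(params: Params) -> Pieces:
    # A real extractor would crawl the sitemap; this toy version only echoes the URL.
    return [{"content": f"pages listed in {params['url']}", "type": "TEXT"}]


def extract_confluence(params: Params) -> Pieces:
    # A real extractor would call Confluence with the forwarded credentials.
    return [{"content": f"pages from space {params['space_key']}", "type": "TEXT"}]


# Hypothetical registry keyed by source type.
SOURCE_EXTRACTORS: dict[str, Callable[[Params], Pieces]] = {
    "sitemap": extract_sitemap,
    "confluence": extract_confluence,
}


def extract_from_source(source_type: str, params: Params) -> list[dict]:
    kind = source_type.lower()
    try:
        extractor = SOURCE_EXTRACTORS[kind]  # 1. choose an extractor by source type
    except KeyError:
        raise ValueError(f"unsupported source type: {source_type!r}") from None
    internal = extractor(params)  # 2. and 3. pull the source and extract its content
    return [  # 4. map to the external schema, 5. return
        {"page_content": p["content"], "type": p["type"], "metadata": {"source_type": kind}}
        for p in internal
    ]


result = extract_from_source("Sitemap", {"url": "https://example.com/sitemap.xml"})
```

A dictionary lookup keeps adding a new source to a single registry entry, which matches the source_extractors dict mentioned under "Extending the library".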

Configuration overview

Two pydantic-settings models ship with this package:

S3 storage (S3Settings) – configure the built-in file service with S3_ACCESS_KEY_ID, S3_SECRET_ACCESS_KEY, S3_ENDPOINT, and S3_BUCKET.
PDF extraction (PDFExtractorSettings) – adjust footer trimming or diagram export via PDF_EXTRACTOR_FOOTER_HEIGHT and PDF_EXTRACTOR_DIAGRAMS_FOLDER_NAME.

Other extractors accept their parameters at runtime through the request payload (ExtractionParameters). For example, the admin backend forwards Confluence credentials, sitemap URLs, or custom headers when it calls /extract_from_source. This keeps the library stateless and makes it easy to plug in additional sources without redeploying.

The Helm chart exposes the environment variables mentioned above under documentExtractor.envs.* so production deployments remain declarative.
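
When running the service outside the Helm chart, a small pre-flight check can catch missing variables before startup. The sketch below is not part of the library and does not use S3Settings; it only reads the variable names listed above, and it assumes all four S3 variables are mandatory, which may not hold if the library defines defaults.

```python
import os
from collections.abc import Mapping

# Variable names are taken from this README; treating them as required is an assumption.
REQUIRED_S3_VARS = ("S3_ACCESS_KEY_ID", "S3_SECRET_ACCESS_KEY", "S3_ENDPOINT", "S3_BUCKET")
OPTIONAL_PDF_VARS = ("PDF_EXTRACTOR_FOOTER_HEIGHT", "PDF_EXTRACTOR_DIAGRAMS_FOLDER_NAME")


def missing_settings(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variable names that are unset or blank."""
    return [name for name in REQUIRED_S3_VARS if not env.get(name, "").strip()]


example_env = {
    "S3_ACCESS_KEY_ID": "example-key",
    "S3_SECRET_ACCESS_KEY": "example-secret",
    "S3_ENDPOINT": "http://localhost:9000",
    "S3_BUCKET": "",
}
missing = missing_settings(example_env)
```

Calling missing_settings() without arguments checks the current process environment.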

Typical usage

from extractor_api_lib.main import app as perfect_extractor_app

admin-api-lib calls /extract_from_file and /extract_from_source to populate the ingestion pipeline.

Extending the library

Implement InformationFileExtractor or InformationExtractor for your new format/source.
Register it in dependency_container.py (append it to the file_extractors list or add it to the source_extractors dict).
Update mapper or metadata handling if additional fields are required.
Add unit tests under libs/extractor-api-lib/tests using fixtures and fake storage providers.
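
The steps above might look roughly like this for a new CSV extractor. This is a sketch under assumptions: FileExtractorBase stands in for InformationFileExtractor, whose actual abstract methods are not shown in this README, the module-level file_extractors list stands in for the container's list, and FakeFileService is a hypothetical in-memory replacement for the S3 file service.

```python
import csv
import io
from abc import ABC, abstractmethod


class FileExtractorBase(ABC):
    """Hypothetical stand-in for InformationFileExtractor."""

    @property
    @abstractmethod
    def compatible_suffixes(self) -> tuple[str, ...]:
        """File extensions this extractor handles."""

    @abstractmethod
    def extract(self, data: bytes) -> list[str]:
        """Return the extracted text fragments."""


class CsvExtractor(FileExtractorBase):
    """Step 1: implement the interface for the new format."""

    @property
    def compatible_suffixes(self) -> tuple[str, ...]:
        return (".csv",)

    def extract(self, data: bytes) -> list[str]:
        rows = csv.reader(io.StringIO(data.decode("utf-8")))
        return [", ".join(row) for row in rows if row]


class FakeFileService:
    """In-memory replacement for the S3 file service, for tests only."""

    def __init__(self, files: dict[str, bytes]) -> None:
        self._files = files

    def download(self, key: str) -> bytes:
        return self._files[key]


# Step 2: register the extractor (a plain list stands in for the container's list).
file_extractors: list[FileExtractorBase] = []
file_extractors.append(CsvExtractor())


# Step 4: a unit test that needs no real bucket.
def test_csv_extractor_reads_rows() -> None:
    storage = FakeFileService({"people.csv": b"name,team\nAda,search\n"})
    extractor = next(e for e in file_extractors if ".csv" in e.compatible_suffixes)
    assert extractor.extract(storage.download("people.csv")) == ["name, team", "Ada, search"]
```

The fake storage keeps the test fast and deterministic, so the extractor logic can be verified in isolation from S3.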

Contributing

Ensure new endpoints or adapters remain thin and defer to rag-core-lib for shared logic. Run poetry run pytest and the configured linters before opening a PR. For further instructions see the Contributing Guide.

License

Licensed under the project license. See the root LICENSE file.

Release history

4.2.0 – Feb 10, 2026
4.1.1 – Feb 3, 2026
4.1.0 – Feb 3, 2026 (yanked; reason: dependency to old version of internal lib rag-core-lib)
4.0.0 – Jan 22, 2026
3.4.0 – Nov 17, 2025
3.3.0 – Nov 7, 2025 (this version)
3.2.0 – Oct 27, 2025

Download files

Source Distribution

extractor_api_lib-3.3.0.tar.gz (26.2 kB)

Upload date: Nov 7, 2025
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.1.3 CPython/3.13.5 Darwin/24.6.0

Algorithm	Hash digest
SHA256	`ff8b9c5612254be3eaab4e77e25483179a1f58dde5499d2888ff465890d21015`
MD5	`80445f211f50ad91ffec18204d61b988`
BLAKE2b-256	`6e02aef5559449cbdc0746bf8e6a58a7da3fd638ea163b831517d82ae03f4764`

Built Distribution

extractor_api_lib-3.3.0-py3-none-any.whl (48.3 kB)

Upload date: Nov 7, 2025
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.1.3 CPython/3.13.5 Darwin/24.6.0

Algorithm	Hash digest
SHA256	`a0a4fc718b064cf06d34a9718f04cd3c2aec15c8d6a98008ef328b629666e18c`
MD5	`006e444ce9a177e0d80f6b685f28fade`
BLAKE2b-256	`7be94eed4c9837fdcf4e5c923dc05747d3565b59816c159e32e4481769e4a5a2`
