Extracts the content of documents, websites, etc. and maps it to a common format.
extractor-api-lib
Content ingestion layer for the STACKIT RAG template. This library exposes a FastAPI extraction service that ingests raw documents (files or remote sources), extracts their content, converts it to internal representations, and hands the output to admin-api-lib.
Responsibilities
- Receive binary uploads and remote source descriptors from the admin backend.
- Route each request through the appropriate extractor (file, sitemap, Confluence, etc.).
- Convert extracted fragments into the shared InformationPiece schema expected by downstream services.
Feature highlights
- Broad format coverage – PDFs, DOCX, PPTX, XML/EPUB, Confluence spaces, and sitemap-driven websites.
- Consistent output schema – Information pieces are returned in a unified structure with content type (TEXT, TABLE, IMAGE) and metadata.
- Swappable extractors – Dependency-injector container makes it easy to add or replace file/source extractors, table converters, etc.
- Production-grade plumbing – Built-in S3-compatible file service, LangChain loaders with retry/backoff, optional PDF OCR, and throttling controls for web crawls.
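To make the "consistent output schema" concrete, here is a minimal sketch of what a unified record could look like. Only the three content types (TEXT, TABLE, IMAGE) and the presence of metadata come from the description above; the class layout and the field names `type`, `page_content`, and `metadata` are assumptions for illustration, not the library's actual model.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any


class ContentType(str, Enum):
    """The three content types named in the feature list."""

    TEXT = "TEXT"
    TABLE = "TABLE"
    IMAGE = "IMAGE"


@dataclass
class InformationPieceSketch:
    """Hypothetical stand-in for the library's InformationPiece record."""

    type: ContentType
    page_content: str  # assumed field name
    metadata: dict[str, Any] = field(default_factory=dict)  # assumed field name


pieces = [
    InformationPieceSketch(ContentType.TEXT, "Hello world", {"page": 1}),
    InformationPieceSketch(ContentType.TABLE, "| a | b |\n|---|---|\n| 1 | 2 |", {"page": 2}),
]

# Downstream code can treat every piece the same way and filter by type.
text_only = [piece for piece in pieces if piece.type is ContentType.TEXT]
```

Because every extractor would return the same shape, consumers only need to branch on the content type, never on the source format.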
Installation
pip install extractor-api-lib
Python 3.13 is required. OCR and computer-vision features expect system packages such as ffmpeg, poppler-utils, and tesseract (see services/document-extractor/README.md for the full list).
Module tour
- dependency_container.py – Central dependency-injector wiring. Override providers here to plug in custom extractors, endpoints, etc.
- api_endpoints/ & impl/api_endpoints/ – Thin FastAPI endpoint abstractions and implementations for file and source (like Confluence & sitemaps) extractors.
- apis/ – Extractor API abstractions and implementations.
- extractors/ & impl/extractors/ – Format-specific logic (PDF, DOCX, PPTX, XML, EPUB, Confluence, sitemap) packaged behind the InformationExtractor / InformationFileExtractor interfaces.
- mapper/ & impl/mapper/ – Abstractions and implementations to map LangChain documents, internal and external information piece representations to each other.
- file_services/ – Default S3-compatible storage adapter; replace it if you store files elsewhere.
- impl/settings/ – Configuration settings for dependency injection container components.
- table_converter/ & impl/table_converter/ – Abstractions and implementations to convert pandas.DataFrame to markdown and vice versa.
- impl/types/ – Enums for content, extractor, and file types.
- impl/utils/ – Helper functions for hashed datetime and sitemap crawling, header injection, and custom metadata parsing.
Endpoints provided
- POST /extract_from_file – Downloads the file from S3, extracts its contents, and returns normalized InformationPiece records.
- POST /extract_from_source – Pulls from remote sources (Confluence, sitemap) using credentials and further optional kwargs.
Both endpoints stream their results back to admin-api-lib, which takes care of enrichment and persistence.
How the file extraction endpoint works
- Download the file from S3
- Choose a suitable file extractor based on the filename extension
- Extract the content from the file
- Map the internal representation to the external schema
- Return the final output
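The file extraction steps above can be sketched in plain Python. This is an illustration of the flow only, not the library's code: the in-memory `FAKE_BUCKET` replaces S3, and every function name, the `.txt` extractor, and the output field names are assumptions.

```python
from pathlib import Path
from typing import Callable

# Hypothetical in-memory stand-in for the S3 bucket.
FAKE_BUCKET: dict[str, bytes] = {"notes.txt": b"first paragraph\n\nsecond paragraph"}


def download(key: str) -> bytes:
    """Step 1: fetch the raw bytes (from the fake bucket, not real S3)."""
    return FAKE_BUCKET[key]


def extract_plain_text(data: bytes) -> list[dict]:
    """Step 3: split a text file into internal pieces, one per paragraph."""
    paragraphs = [p.strip() for p in data.decode("utf-8").split("\n\n") if p.strip()]
    return [{"kind": "text", "body": p} for p in paragraphs]


# Step 2: extractors keyed by filename extension.
EXTRACTORS: dict[str, Callable[[bytes], list[dict]]] = {".txt": extract_plain_text}


def to_external(piece: dict, source: str) -> dict:
    """Step 4: map the internal shape to an external schema (assumed field names)."""
    return {
        "type": piece["kind"].upper(),
        "page_content": piece["body"],
        "metadata": {"document": source},
    }


def extract_from_file(key: str) -> list[dict]:
    data = download(key)
    suffix = Path(key).suffix.lower()
    extractor = EXTRACTORS.get(suffix)
    if extractor is None:
        raise ValueError(f"no extractor registered for {suffix!r}")
    internal = extractor(data)
    # Step 5: return the mapped pieces.
    return [to_external(piece, key) for piece in internal]


result = extract_from_file("notes.txt")
```

Keeping the extension lookup in a single table means that supporting a new format is a matter of registering one more entry.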
How the source extraction endpoint works
- Choose a suitable source extractor based on the source type
- Pull the source content using the provided credentials and parameters
- Extract the content from the source
- Map the internal representation to the external schema
- Return the final output
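The source extraction flow can be sketched the same way. The dispatch on source type follows the steps above; the fetcher functions, the key/value shape of the parameters, and the parameter names `url`, `space_key`, and `token` are hypothetical, and nothing here performs real network access.

```python
from typing import Callable


def fetch_sitemap(params: dict[str, str]) -> list[str]:
    """Hypothetical fetcher: pretends to crawl the sitemap at params['url']."""
    return [f"page listed in {params['url']}"]


def fetch_confluence(params: dict[str, str]) -> list[str]:
    """Hypothetical fetcher: pretends to read a space using the given token."""
    _token = params["token"]  # credentials arrive with the request, not from config
    return [f"page from space {params['space_key']}"]


# Source extractors keyed by source type.
SOURCE_EXTRACTORS: dict[str, Callable[[dict[str, str]], list[str]]] = {
    "sitemap": fetch_sitemap,
    "confluence": fetch_confluence,
}


def extract_from_source(source_type: str, kwargs: list[dict[str, str]]) -> list[dict]:
    # Flatten the key/value pairs from the request into one lookup table.
    params = {item["key"]: item["value"] for item in kwargs}
    fetch = SOURCE_EXTRACTORS.get(source_type.lower())
    if fetch is None:
        raise ValueError(f"unknown source type: {source_type!r}")
    # Map every fetched text to the external schema (assumed field names).
    return [
        {"type": "TEXT", "page_content": text, "metadata": {"source_type": source_type.lower()}}
        for text in fetch(params)
    ]


result = extract_from_source(
    "sitemap", [{"key": "url", "value": "https://example.com/sitemap.xml"}]
)
```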
Configuration overview
Two pydantic-settings models ship with this package:
- S3 storage (S3Settings) – configure the built-in file service with S3_ACCESS_KEY_ID, S3_SECRET_ACCESS_KEY, S3_ENDPOINT, and S3_BUCKET.
- PDF extraction (PDFExtractorSettings) – adjust footer trimming or diagram export via PDF_EXTRACTOR_FOOTER_HEIGHT and PDF_EXTRACTOR_DIAGRAMS_FOLDER_NAME.
Other extractors accept their parameters at runtime through the request payload (ExtractionParameters). For example, the admin backend forwards Confluence credentials, sitemap URLs, or custom headers when it calls /extract_from_source. This keeps the library stateless and makes it easy to plug in additional sources without redeploying.
The Helm chart exposes the environment variables mentioned above under documentExtractor.envs.* so production deployments remain declarative.
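As an illustration of how the S3 environment variables map to a settings object, the sketch below uses a standard-library dataclass in place of pydantic-settings so that it runs without extra dependencies. The variable names are the ones listed above; the class name `S3SettingsSketch`, its field names, and the `from_env` helper are assumptions, and the real S3Settings model may differ.

```python
import os
from dataclasses import dataclass, fields
from typing import Mapping


@dataclass(frozen=True)
class S3SettingsSketch:
    """Stand-in for S3Settings; one field per documented S3_* variable."""

    access_key_id: str
    secret_access_key: str
    endpoint: str
    bucket: str

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "S3SettingsSketch":
        # Field "bucket" is read from "S3_BUCKET", and so on.
        names = {f.name: f"S3_{f.name.upper()}" for f in fields(cls)}
        missing = [var for var in names.values() if var not in env]
        if missing:
            raise KeyError(f"missing environment variables: {', '.join(missing)}")
        return cls(**{name: env[var] for name, var in names.items()})


# Example values only; a real deployment would set these in the environment.
example_env = {
    "S3_ACCESS_KEY_ID": "demo-key",
    "S3_SECRET_ACCESS_KEY": "demo-secret",
    "S3_ENDPOINT": "http://localhost:9000",
    "S3_BUCKET": "documents",
}
settings = S3SettingsSketch.from_env(example_env)
```

Failing fast on missing variables surfaces a misconfigured deployment at startup rather than on the first extraction request.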
Typical usage
from extractor_api_lib.main import app as perfect_extractor_app
admin-api-lib calls /extract_from_file and /extract_from_source to populate the ingestion pipeline.
Extending the library
- Implement InformationFileExtractor or InformationExtractor for your new format/source.
- Register it in dependency_container.py (append to the file_extractors list or the source_extractors dict).
- Update mapper or metadata handling if additional fields are required.
- Add unit tests under libs/extractor-api-lib/tests using fixtures and fake storage providers.
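The extension steps can be sketched as follows. The abstract base class below is a simplified, hypothetical version of the file extractor interface: its method names and signatures are assumptions, and the plain `file_extractors` list only mimics the list registered in dependency_container.py.

```python
import csv
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class FileExtractorSketch(ABC):
    """Hypothetical simplification of the InformationFileExtractor interface."""

    @property
    @abstractmethod
    def compatible_file_types(self) -> set[str]:
        """Filename extensions this extractor can handle."""

    @abstractmethod
    def extract_content(self, path: Path, name: str) -> list[dict]:
        """Return the extracted pieces for the file at `path`."""


class CsvExtractor(FileExtractorSketch):
    """Example new format: one TABLE piece per CSV file."""

    @property
    def compatible_file_types(self) -> set[str]:
        return {".csv"}

    def extract_content(self, path: Path, name: str) -> list[dict]:
        with path.open(newline="", encoding="utf-8") as handle:
            rows = list(csv.reader(handle))
        body = "\n".join(" | ".join(row) for row in rows)
        return [{"type": "TABLE", "page_content": body, "metadata": {"document": name}}]


# Registration: mimics appending to the container's file_extractors list.
file_extractors: list[FileExtractorSketch] = []
file_extractors.append(CsvExtractor())


def pick_extractor(filename: str) -> FileExtractorSketch:
    suffix = Path(filename).suffix.lower()
    for extractor in file_extractors:
        if suffix in extractor.compatible_file_types:
            return extractor
    raise ValueError(f"no extractor for {suffix!r}")


# Usage: write a small CSV to a temporary directory and extract it.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "parts.csv"
    sample.write_text("name,qty\nbolt,4\n", encoding="utf-8")
    pieces = pick_extractor(sample.name).extract_content(sample, sample.name)
```

A unit test for such an extractor can follow the same pattern: create a small fixture file, run the extractor, and assert on the returned pieces.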
Contributing
Ensure new endpoints or adapters remain thin and defer to rag-core-lib for shared logic. Run poetry run pytest and the configured linters before opening a PR. For further instructions see the Contributing Guide.
License
Licensed under the project license. See the root LICENSE file.