Auditable AI content ingestion: safe fetch, extraction, evidence gates, provenance ledgers, RAG datasets, and configurable provider/API exports.
IngestForge
Auditable AI content ingestion for Python — from web/manual sources to evidence-aware articles, provenance ledgers, RAG datasets, and configurable API exports.
IngestForge is a lightweight, profile-driven Python library for building reviewable ingestion pipelines. It does not try to be a giant agent framework. It focuses on one practical workflow:
URL or manual source
-> safe fetch / manual ingest
-> HTML extraction
-> evidence gate
-> structured AI article generation
-> provenance ledger
-> RAG chunks + dataset card
-> optional local or REST export
The core design goal is simple: make AI-assisted content ingestion reproducible, configurable, and auditable instead of hidden inside one-off scripts.
Alpha note: IngestForge 0.4.0a6 is usable for experiments, internal tools, portfolio demos, and controlled alpha workflows. It is not yet a production-grade unrestricted crawler, legal clearance engine, or complete SSRF defense layer.
Why IngestForge exists
Most ingestion tools stop at extraction, markdown conversion, crawling, or framework-level orchestration. IngestForge focuses on the missing middle layer: turning sources into standard packages with evidence, provenance, language-aware structured output, and export contracts that can be tested before any live provider call.
| You need | IngestForge approach |
|---|---|
| Safer web ingestion | URL validation, redirect checks, byte caps, robots-aware policy, conservative defaults |
| Better extraction without heavy defaults | Internal extractor by default; optional Trafilatura backend for noisy pages |
| AI article generation | Provider adapters for OpenAI, DeepSeek, Gemini, and mock/offline mode |
| Configurable languages | ai.source_language and ai.target_languages with BCP 47-style tags; no fixed fa/en lock-in |
| Auditability | Evidence bundle hashes, provenance ledger, run manifest, data card |
| RAG export | Chunked records with language coverage and source metadata |
| Provider confidence before live calls | ingestforge doctor providers --offline validates payload contracts locally |
| Destination flexibility | Local export and generic REST destination with configurable templates/field maps |
What makes it different
IngestForge is not just an HTML cleaner and not just an LLM wrapper. Its value is the connected pipeline:
safe source intake
+ extraction backend policy
+ evidence/support checks
+ prompt registry
+ provider payload contracts
+ provenance ledger
+ RAG/data-card outputs
+ release hygiene tests
That combination is intentionally narrow, testable, and easy to embed in your own products.
Core features in v0.4.0a6
| Area | Status |
|---|---|
| Manual URL ingestion | Implemented |
| Safe HTTP fetch | Implemented with alpha security limits |
| Robots-aware policy | Implemented as a crawling signal, not legal permission |
| HTML extraction | Internal BeautifulSoup-based extractor + optional Trafilatura backend |
| Standard package object | Implemented with typed models |
| Evidence bundle hash | Implemented |
| Provenance ledger | Implemented as an audit/provenance-inspired ledger |
| RAG export | Implemented |
| Data card generation | Implemented |
| Prompt registry | Implemented; unknown prompt_version fails clearly |
| Multi-language output | Configurable BCP 47-style language tags, no allowlist lock-in |
| OpenAI provider | Payload-contract tested; live calls require explicit config |
| DeepSeek provider | JSON output + explicit thinking-control payload; live calls require explicit config |
| Gemini provider | Current and legacy structured-output payload styles tested |
| Provider doctor | Offline contract validation and opt-in live smoke path |
| Generic REST destination | Implemented with configurable payload templates and response maps |
| OCR | Noop default; optional Tesseract route behind extras |
| Vision ranking | Local heuristic; AI vision remains experimental/roadmap |
| File/CSV/JSON ingestion | Roadmap |
| Fully automated publishing | Not default; human review is expected |
Installation
Install the base package:
pip install ingestforge
For high-quality optional HTML extraction with Trafilatura:
pip install "ingestforge[extraction]"
For development:
git clone https://github.com/Parvaz-Jamei/ingestforge.git
cd ingestforge
python -m venv .venv
. .venv/bin/activate # Windows: .venv\Scripts\activate
python -m pip install --upgrade pip
pip install -e ".[dev]"
Quick start: CLI
Create starter files:
ingestforge init
Validate a profile:
ingestforge validate-profile profiles/manual_safe.yaml
Run a safe dry-run ingestion:
ingestforge ingest-url https://example.com/article \
  --profile profiles/manual_safe.yaml \
  --dry-run \
  --external-calls disabled
Validate the generated package and export RAG records:
ingestforge validate-package runs/<job_id>/package.json
ingestforge export-rag runs/<job_id>
Check provider contracts without spending API credits:
ingestforge doctor providers \
  --profile src/ingestforge/profiles/strict_industrial.yaml \
  --offline
Quick start: Python API
Minimal use:
from ingestforge import ingest_url

package = ingest_url(
    "https://example.com/article",
    dry_run=True,
    external_calls="disabled",
    write_dataset=True,
)
print(package.article.title.language_map())
Profile-based use:
from ingestforge import pipeline

pipe = pipeline("profiles/strict_industrial.yaml")
package = pipe.ingest_url(
    "https://example.com/article",
    dry_run=True,
    external_calls="disabled",
    write_dataset=True,
)
Configuration model
IngestForge is designed to be config-driven. Provider names, model IDs, endpoint paths, destination fields, prompt versions, language tags, limits, and safety policies are profile/env values rather than hard-coded runtime assumptions.
Configuration precedence:
library defaults -> profile file / inheritance -> environment variables -> CLI overrides -> Python API overrides
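The precedence chain above can be read as a layered merge in which later layers override earlier ones. The following is a minimal sketch of that idea, not IngestForge's actual loader: `merge_layers` and the layer contents are hypothetical, and the key-by-key merging of nested sections is an assumption.

```python
from typing import Any, Mapping


def merge_layers(*layers: Mapping[str, Any]) -> dict[str, Any]:
    """Merge config layers left to right; later layers win.

    Nested mappings are merged key by key instead of being replaced wholesale
    (an assumption made for this sketch).
    """
    merged: dict[str, Any] = {}
    for layer in layers:
        for key, value in layer.items():
            current = merged.get(key)
            if isinstance(current, dict) and isinstance(value, Mapping):
                merged[key] = merge_layers(current, value)
            elif isinstance(value, Mapping):
                merged[key] = dict(value)  # copy so the input layer is not shared
            else:
                merged[key] = value
    return merged


# Hypothetical layer contents, lowest precedence first.
defaults = {
    "ai": {"provider": "mock", "target_languages": ["en"]},
    "pipeline": {"dry_run": True},
}
profile = {"ai": {"target_languages": ["en", "fa"]}}
env = {"ai": {"provider": "openai"}}
cli = {"pipeline": {"dry_run": False}}
api: dict[str, Any] = {}

resolved = merge_layers(defaults, profile, env, cli, api)
print(resolved)
```

In this sketch the environment layer changes only `ai.provider`, so the profile's `target_languages` survive the merge.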
Example profile fragment:
profile_name: controlled_ingestion
pipeline:
  external_calls: disabled
  dry_run: true
fetch:
  allowed_domains:
    - example.com
  max_bytes: 2000000
  follow_redirects: true
extraction:
  backend: auto # auto | internal | trafilatura
  include_tables: true
  include_comments: false
  min_extracted_chars: 40
ai:
  provider: mock
  model: mock
  prompt_version: article_builder.v1
  source_language: auto
  target_languages: [en, fa]
Environment override example:
export INGESTFORGE_TARGET_LANGUAGES="fa,en,de,pt-BR,zh-Hant,es-419"
export INGESTFORGE_SOURCE_LANGUAGE="auto"
Extraction backends
The default extractor is intentionally dependency-light. For stronger extraction on noisy or complex pages, install the optional Trafilatura backend:
pip install "ingestforge[extraction]"
Then choose one of these profile modes:
extraction:
  backend: auto # use Trafilatura when available, fallback to internal

extraction:
  backend: internal # always use the built-in BeautifulSoup-based extractor

extraction:
  backend: trafilatura # require Trafilatura; no silent internal fallback
auto is recommended for most alpha users because it improves extraction when the optional dependency is installed while preserving a small base install.
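The three modes can be summarized as a small decision function. This is a sketch under assumptions: `select_backend` and `trafilatura_installed` are hypothetical names, and the real library may detect the optional dependency differently.

```python
import importlib.util


def trafilatura_installed() -> bool:
    """Return True when the optional trafilatura package is importable."""
    return importlib.util.find_spec("trafilatura") is not None


def select_backend(mode: str, have_trafilatura: bool) -> str:
    """Pick an extraction backend name for the given profile mode."""
    if mode == "internal":
        return "internal"
    if mode == "trafilatura":
        if not have_trafilatura:
            # Explicit mode: fail instead of silently falling back.
            raise RuntimeError(
                "extraction.backend is 'trafilatura' but the package is not installed"
            )
        return "trafilatura"
    if mode == "auto":
        return "trafilatura" if have_trafilatura else "internal"
    raise ValueError(f"unknown extraction backend: {mode!r}")


print(select_backend("auto", trafilatura_installed()))
```

Passing availability in as an argument keeps the decision testable without installing or uninstalling the optional dependency.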
Provider model policy
Live model IDs are intentionally treated as opaque provider strings. IngestForge does not maintain a hard-coded allowlist of model names.
Model resolution order:
explicit profile model
-> INGESTFORGE_<PROVIDER>_MODEL
-> INGESTFORGE_AI_MODEL
-> mock only when provider is mock
-> clear config error for live providers when external AI calls are enabled
Examples:
export INGESTFORGE_AI_PROVIDER=openai
export INGESTFORGE_OPENAI_MODEL="your-openai-model-id"
export INGESTFORGE_AI_PROVIDER=gemini
export INGESTFORGE_GEMINI_MODEL="your-gemini-model-id"
export INGESTFORGE_AI_PROVIDER=deepseek
export INGESTFORGE_DEEPSEEK_MODEL="your-deepseek-model-id"
No paid provider call is executed by the normal test suite.
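The resolution order above can be sketched as a single function. `resolve_model` is a hypothetical helper rather than part of the documented API, and returning `None` for a live provider while external calls are disabled is an assumption, since the text does not describe that case.

```python
from typing import Mapping, Optional


def resolve_model(
    provider: str,
    profile_model: Optional[str],
    env: Mapping[str, str],
    external_calls_enabled: bool,
) -> Optional[str]:
    """Resolve a model ID; the ID itself is treated as an opaque string."""
    if profile_model:
        return profile_model  # 1. explicit profile model
    provider_key = f"INGESTFORGE_{provider.upper()}_MODEL"
    if env.get(provider_key):
        return env[provider_key]  # 2. provider-specific variable
    if env.get("INGESTFORGE_AI_MODEL"):
        return env["INGESTFORGE_AI_MODEL"]  # 3. generic variable
    if provider == "mock":
        return "mock"  # 4. mock only when the provider is mock
    if external_calls_enabled:
        # 5. live provider without a model: fail with a clear config error
        raise ValueError(
            f"no model configured for provider {provider!r}; "
            f"set ai.model, {provider_key}, or INGESTFORGE_AI_MODEL"
        )
    return None  # assumption: nothing to resolve while external calls are off


print(resolve_model("openai", None, {"INGESTFORGE_OPENAI_MODEL": "your-openai-model-id"}, True))
```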
Provider doctor
Run the provider doctor before making live provider calls:
ingestforge doctor providers --profile profiles/examples/openai_live.yaml --offline
Offline mode checks local profile validity, provider payload shape, prompt resolution, and schema contract behavior.
Live smoke tests are opt-in and require credentials:
export INGESTFORGE_RUN_LIVE_PROVIDER_TESTS=1
export INGESTFORGE_OPENAI_API_KEY="..."
export INGESTFORGE_OPENAI_MODEL="..."
ingestforge doctor providers --profile profiles/examples/openai_live.yaml --live
Output languages
IngestForge supports configurable output languages through BCP 47-style tags.
ai:
  source_language: auto
  target_languages:
    - en
    - fa
    - de
    - pt-BR
    - zh-Hant
    - es-419
The library validates language-tag shape rather than maintaining a fixed language allowlist. This keeps the core future-proof while still catching empty or malformed values.
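To illustrate what a shape check (as opposed to an allowlist) might look like, here is a minimal sketch. The regular expression is a deliberately simplified subset of BCP 47 (language, optional script, optional region) and is an assumption, as is the helper name `parse_target_languages`; the library's actual validation rules may be broader.

```python
import re

# Simplified shape: language[-Script][-REGION]. Variants, extensions, and
# private-use subtags from full BCP 47 are intentionally left out.
_TAG_SHAPE = re.compile(
    r"^[A-Za-z]{2,3}"                  # primary language, e.g. "en", "fa"
    r"(?:-[A-Za-z]{4})?"               # optional script, e.g. "Hant"
    r"(?:-(?:[A-Za-z]{2}|[0-9]{3}))?"  # optional region, e.g. "BR", "419"
    r"$"
)


def parse_target_languages(raw: str) -> list[str]:
    """Split a comma-separated value and reject empty or malformed tags."""
    tags = [part.strip() for part in raw.split(",")]
    bad = [tag for tag in tags if not _TAG_SHAPE.match(tag)]
    if bad:
        raise ValueError(f"malformed language tag(s): {bad!r}")
    return tags


print(parse_target_languages("fa,en,de,pt-BR,zh-Hant,es-419"))
```

Because only the shape is checked, a well-formed tag for a language the sketch has never heard of still passes, while an empty entry or `pt_BR` is rejected.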
Prompt registry
ai.prompt_version resolves to packaged prompt templates:
ai:
  prompt_version: article_builder.v1
article_builder.v1 maps to:
src/ingestforge/prompts/article_builder.j2
Unknown prompt versions fail during profile validation instead of silently falling back to a hidden default.
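The fail-fast behavior can be sketched as a plain lookup table. `PROMPT_REGISTRY`, `UnknownPromptVersion`, and `resolve_prompt` are hypothetical names for illustration; only the `article_builder.v1` mapping comes from the text above.

```python
from typing import Mapping

# The single mapping documented above; a real registry may hold more entries.
PROMPT_REGISTRY: dict[str, str] = {
    "article_builder.v1": "src/ingestforge/prompts/article_builder.j2",
}


class UnknownPromptVersion(ValueError):
    """Raised when a profile names a prompt version that is not registered."""


def resolve_prompt(version: str, registry: Mapping[str, str] = PROMPT_REGISTRY) -> str:
    """Return the template path for a prompt version, with no default fallback."""
    try:
        return registry[version]
    except KeyError:
        known = ", ".join(sorted(registry))
        raise UnknownPromptVersion(
            f"unknown prompt_version {version!r}; known versions: {known}"
        ) from None


print(resolve_prompt("article_builder.v1"))
```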
Destination adapters
The public core contains generic destination adapters only:
- local_export for offline package/dataset output;
- generic_rest for configurable API publishing with endpoint maps, payload templates, field maps, and response maps.
Private project profiles, real production API endpoints, and secrets should stay outside the public repository.
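As an illustration of how a field map can decouple the package structure from a destination's payload, the sketch below maps destination field names to dotted paths in a package-like dictionary. Everything here is hypothetical: the helper names, the dotted-path convention, and the sample package keys are assumptions, not the documented generic_rest format.

```python
from typing import Any, Mapping


def get_path(data: Mapping[str, Any], dotted: str) -> Any:
    """Follow a dotted path such as 'article.title.en' through nested mappings."""
    current: Any = data
    for part in dotted.split("."):
        if not isinstance(current, Mapping) or part not in current:
            raise KeyError(f"missing field path: {dotted}")
        current = current[part]
    return current


def build_payload(package: Mapping[str, Any], field_map: Mapping[str, str]) -> dict[str, Any]:
    """Build a destination payload: destination field name -> value at source path."""
    return {dest: get_path(package, source) for dest, source in field_map.items()}


# Hypothetical package fragment and field map.
package = {
    "article": {"title": {"en": "Example title"}, "body": {"en": "Example body"}},
    "source": {"url": "https://example.com/article"},
}
field_map = {
    "headline": "article.title.en",
    "content": "article.body.en",
    "origin_url": "source.url",
}

print(build_payload(package, field_map))
```

Raising on a missing path, instead of emitting `None`, keeps a misconfigured map from producing a silently incomplete payload.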
Generated artifacts
A typical run can produce:
runs/<job_id>/
  package.json
  data_card.json
  rag_records.jsonl
  provenance_ledger.jsonl
  run_manifest.json
  audit_log.jsonl
These artifacts are designed to make review and downstream dataset construction easier.
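For review, it can help to record a digest of each artifact so that later copies can be compared byte for byte. The helper below is a reviewer-side sketch and not part of IngestForge; `artifact_digests` is a hypothetical name, and the demo uses a throwaway directory with placeholder files.

```python
import hashlib
import tempfile
from pathlib import Path


def artifact_digests(run_dir: Path) -> dict[str, str]:
    """Map each file name directly inside run_dir to its SHA-256 hex digest."""
    digests: dict[str, str] = {}
    for path in sorted(run_dir.iterdir()):
        if not path.is_file():
            continue
        hasher = hashlib.sha256()
        with path.open("rb") as handle:
            # Read in blocks so large JSONL files are not loaded at once.
            for block in iter(lambda: handle.read(65536), b""):
                hasher.update(block)
        digests[path.name] = hasher.hexdigest()
    return digests


# Demo on a temporary directory with placeholder content.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    (run_dir / "package.json").write_text("{}", encoding="utf-8")
    (run_dir / "rag_records.jsonl").write_text("", encoding="utf-8")
    demo = artifact_digests(run_dir)
    for name, digest in demo.items():
        print(name, digest)
```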
Security and safety model
IngestForge is conservative by default:
- no automatic publishing;
- human review is expected;
- source license status defaults to needs_review;
- raw HTML is not sent to AI by default;
- private, localhost, loopback, and link-local network targets are blocked by default;
- response bodies are streamed with byte caps;
- secrets are read from environment variables, not committed profiles.
Important limitation: URL preflight validation does not fully eliminate DNS rebinding / TOCTOU risk because the HTTP client may resolve DNS separately from the validation step. High-security deployments should combine IngestForge checks with network egress controls, strict allowlists, and infrastructure-level protections.
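A preflight check of the kind described above can be sketched with the standard library. This is an illustration under assumptions, not the library's fetch code: `is_blocked_address` and `preflight` are hypothetical names. The comments mark where the TOCTOU gap arises.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_blocked_address(address: str) -> bool:
    """True for private, loopback, and link-local IP addresses."""
    ip = ipaddress.ip_address(address)
    return ip.is_private or ip.is_loopback or ip.is_link_local


def preflight(url: str) -> list[str]:
    """Resolve the URL's host and reject it if any address is blocked.

    The returned addresses are only what this call saw. An HTTP client that
    resolves the host again later may get a different answer, which is the
    DNS rebinding / TOCTOU gap noted above.
    """
    host = urlsplit(url).hostname
    if host is None:
        raise ValueError(f"URL has no host: {url!r}")
    infos = socket.getaddrinfo(host, None)
    addresses = sorted({info[4][0] for info in infos})
    blocked = [addr for addr in addresses if is_blocked_address(addr)]
    if blocked:
        raise ValueError(f"blocked network target(s) for {host!r}: {blocked}")
    return addresses


for candidate in ("127.0.0.1", "10.0.0.5", "169.254.169.254", "::1"):
    print(candidate, is_blocked_address(candidate))
```

Closing the gap generally means connecting to the exact address that was validated or enforcing the policy at the network layer, which is why the text recommends egress controls for high-security deployments.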
Claim and evidence limits
The current evidence gate is intentionally conservative and shallow. Exact support checks are useful for alpha review workflows, but they are not semantic proof and should not be marketed as legal, factual, or scientific verification.
Use IngestForge as an auditable ingestion tool, not as an authority that guarantees truth or reuse rights.
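To make the "exact, not semantic" distinction concrete, here is a minimal sketch of an exact support check. The text does not specify how the evidence gate matches claims, so the whitespace and case normalization and the names `normalize` and `exact_support` are assumptions.

```python
import re


def normalize(text: str) -> str:
    """Collapse whitespace and fold case so trivial differences do not matter."""
    return re.sub(r"\s+", " ", text).strip().casefold()


def exact_support(quote: str, source_text: str) -> bool:
    """True only when the normalized quote appears verbatim in the source."""
    needle = normalize(quote)
    return bool(needle) and needle in normalize(source_text)


source_text = "The bridge opened in 1932.\nIt carries rail and road traffic."

print(exact_support("opened in 1932", source_text))             # verbatim match
print(exact_support("was inaugurated in 1932", source_text))    # paraphrase, no match
```

The second call returns `False` even though a human reader would accept the paraphrase, which shows why this kind of check supports review but is not semantic verification.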
Project status
0.4.0a6 is an alpha release.
Best current uses:
- personal/internal ingestion experiments;
- portfolio and research-software demonstrations;
- controlled RAG dataset preparation;
- provider payload contract experiments;
- audited content workflow prototypes.
Not recommended yet for:
- unrestricted crawling at scale;
- unsupervised publishing;
- legal clearance decisions;
- high-security network environments without extra egress controls;
- claims of semantic fact verification.
Development and release checks
Run the full local check suite:
python -m compileall -q src tests
python -m pytest -q
python -m ruff check .
python -m ruff format --check .
python -m mypy src/ingestforge
python scripts/clean_release_artifacts.py
python scripts/release_hygiene_check.py
python -m build --sdist --wheel
python -m twine check dist/*
The repository also includes GitHub Actions for CI and package publishing. PyPI/TestPyPI publishing should use Trusted Publishing where possible rather than long-lived upload tokens.
Project release links:
- PyPI: https://pypi.org/project/ingestforge/
- TestPyPI: https://test.pypi.org/project/ingestforge/
- Zenodo: enable the GitHub integration and create a GitHub release; then replace the general Zenodo link/badge with the minted DOI record. Do not add a fake DOI before Zenodo creates one.
Repository layout
src/ingestforge/                  library source
src/ingestforge/core/             config, pipeline, prompts, provider doctor
src/ingestforge/datasets/         RAG export, chunking, data card
src/ingestforge/providers/        provider adapters
src/ingestforge/providers/fetch/  safe fetch, robots policy, encoding, extraction
src/ingestforge/destinations/     destination adapters
docs/                             contracts and release notes
profiles/                         example user profiles
tests/                            regression and contract tests
Roadmap
Near-term priorities:
- stronger extraction evaluation fixtures;
- file, CSV, JSON, and PDF ingestion paths;
- deeper semantic support checks without overclaiming;
- more destination examples;
- richer documentation and examples;
- optional live provider smoke-test guides.
Citation
See CITATION.cff. If you use IngestForge in research software, dataset construction, or portfolio demonstrations, cite the repository or release tag.
License
MIT License. See LICENSE.