Auditable AI content ingestion: safe fetch, extraction, evidence gates, provenance ledgers, RAG datasets, and configurable provider/API exports.

IngestForge

Auditable AI content ingestion for Python — from web/manual sources to evidence-aware articles, provenance ledgers, RAG datasets, and configurable API exports.

IngestForge is a lightweight, profile-driven Python library for building reviewable ingestion pipelines. It does not try to be a giant agent framework. It focuses on one practical workflow:

URL or manual source
  -> safe fetch / manual ingest
  -> HTML extraction
  -> evidence gate
  -> structured AI article generation
  -> provenance ledger
  -> RAG chunks + dataset card
  -> optional local or REST export

The core design goal is simple: make AI-assisted content ingestion reproducible, configurable, and auditable instead of hidden inside one-off scripts.

Alpha note: IngestForge 0.4.0a6 is usable for experiments, internal tools, portfolio demos, and controlled alpha workflows. It is not yet a production-grade unrestricted crawler, legal clearance engine, or complete SSRF defense layer.

Why IngestForge exists

Most ingestion tools stop at extraction, markdown conversion, crawling, or framework-level orchestration. IngestForge focuses on the missing middle layer: turning sources into standard packages with evidence, provenance, language-aware structured output, and export contracts that can be tested before any live provider call.

You need                                  IngestForge approach
Safer web ingestion                       URL validation, redirect checks, byte caps, robots-aware policy, conservative defaults
Better extraction without heavy defaults  Internal extractor by default; optional Trafilatura backend for noisy pages
AI article generation                     Provider adapters for OpenAI, DeepSeek, Gemini, and mock/offline mode
Configurable languages                    ai.source_language and ai.target_languages with BCP 47-style tags; no fixed fa/en lock-in
Auditability                              Evidence bundle hashes, provenance ledger, run manifest, data card
RAG export                                Chunked records with language coverage and source metadata
Provider confidence before live calls     ingestforge doctor providers --offline validates payload contracts locally
Destination flexibility                   Local export and generic REST destination with configurable templates/field maps

What makes it different

IngestForge is not just an HTML cleaner and not just an LLM wrapper. Its value is the connected pipeline:

safe source intake
  + extraction backend policy
  + evidence/support checks
  + prompt registry
  + provider payload contracts
  + provenance ledger
  + RAG/data-card outputs
  + release hygiene tests

That combination is intentionally narrow, testable, and easy to embed in your own products.

Core features in v0.4.0a6

Area                        Status
Manual URL ingestion        Implemented
Safe HTTP fetch             Implemented with alpha security limits
Robots-aware policy         Implemented as a crawling signal, not legal permission
HTML extraction             Internal BeautifulSoup-based extractor + optional Trafilatura backend
Standard package object     Implemented with typed models
Evidence bundle hash        Implemented
Provenance ledger           Implemented as an audit/provenance-inspired ledger
RAG export                  Implemented
Data card generation        Implemented
Prompt registry             Implemented; unknown prompt_version fails clearly
Multi-language output       Configurable BCP 47-style language tags, no allowlist lock-in
OpenAI provider             Payload-contract tested; live calls require explicit config
DeepSeek provider           JSON output + explicit thinking-control payload; live calls require explicit config
Gemini provider             Current and legacy structured-output payload styles tested
Provider doctor             Offline contract validation and opt-in live smoke path
Generic REST destination    Implemented with configurable payload templates and response maps
OCR                         Noop default; optional Tesseract route behind extras
Vision ranking              Local heuristic; AI vision remains experimental/roadmap
File/CSV/JSON ingestion     Roadmap
Fully automated publishing  Not default; human review is expected
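
The "Evidence bundle hash" row can be illustrated with a minimal sketch. It assumes the hash is a SHA-256 digest over a canonical JSON serialization; the function name bundle_hash and the bundle fields are hypothetical and are not taken from the IngestForge source.

```python
import hashlib
import json
from typing import Any


def bundle_hash(bundle: dict[str, Any]) -> str:
    """Hash a bundle deterministically (hypothetical helper, not the library API)."""
    # Sorting keys and fixing separators makes the serialization independent
    # of dict insertion order, so equal content always yields an equal hash.
    canonical = json.dumps(
        bundle, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Illustrative bundles with made-up field names.
first = {"source_url": "https://example.com/article", "quotes": ["a", "b"]}
second = {"quotes": ["a", "b"], "source_url": "https://example.com/article"}
changed = {"source_url": "https://example.com/article", "quotes": ["a", "c"]}

print(bundle_hash(first) == bundle_hash(second))   # same content, same hash
print(bundle_hash(first) == bundle_hash(changed))  # different content
```

A reviewer can recompute such a digest later to confirm that the evidence attached to a package has not changed since the run.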

Installation

Install the base package:

pip install ingestforge

To add the optional Trafilatura extraction backend:

pip install "ingestforge[extraction]"

For development:

git clone https://github.com/Parvaz-Jamei/ingestforge.git
cd ingestforge
python -m venv .venv
. .venv/bin/activate  # Windows: .venv\Scripts\activate
python -m pip install --upgrade pip
pip install -e ".[dev]"

Quick start: CLI

Create starter files:

ingestforge init

Validate a profile:

ingestforge validate-profile profiles/manual_safe.yaml

Run a safe dry-run ingestion:

ingestforge ingest-url https://example.com/article \
  --profile profiles/manual_safe.yaml \
  --dry-run \
  --external-calls disabled

Validate the generated package and export RAG records:

ingestforge validate-package runs/<job_id>/package.json
ingestforge export-rag runs/<job_id>

Check provider contracts without spending API credits:

ingestforge doctor providers \
  --profile src/ingestforge/profiles/strict_industrial.yaml \
  --offline

Quick start: Python API

Minimal use:

from ingestforge import ingest_url

package = ingest_url(
    "https://example.com/article",
    dry_run=True,
    external_calls="disabled",
    write_dataset=True,
)

print(package.article.title.language_map())

Profile-based use:

from ingestforge import pipeline

pipe = pipeline("profiles/strict_industrial.yaml")
package = pipe.ingest_url(
    "https://example.com/article",
    dry_run=True,
    external_calls="disabled",
    write_dataset=True,
)

Configuration model

IngestForge is designed to be config-driven. Provider names, model IDs, endpoint paths, destination fields, prompt versions, language tags, limits, and safety policies are profile/env values rather than hard-coded runtime assumptions.

Configuration precedence:

library defaults -> profile file / inheritance -> environment variables -> CLI overrides -> Python API overrides
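
A minimal sketch of layered configuration, assuming the chain above reads from lowest to highest precedence so that later layers override earlier ones. The helper names deep_merge and resolve_config are hypothetical and only illustrate the idea; they are not IngestForge functions.

```python
from typing import Any, Mapping


def deep_merge(base: Mapping[str, Any], override: Mapping[str, Any]) -> dict[str, Any]:
    """Return a new dict in which values from `override` win over `base`."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, Mapping) and isinstance(merged.get(key), Mapping):
            # Merge nested sections key by key instead of replacing them whole.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def resolve_config(*layers: Mapping[str, Any]) -> dict[str, Any]:
    """Merge layers left to right; later layers take precedence."""
    result: dict[str, Any] = {}
    for layer in layers:
        result = deep_merge(result, layer)
    return result


# Illustrative layers; the values are made up for the example.
defaults = {"fetch": {"max_bytes": 1_000_000, "follow_redirects": True}}
profile = {"fetch": {"max_bytes": 2_000_000}}
env = {"ai": {"target_languages": ["fa", "en"]}}
cli = {"pipeline": {"dry_run": True}}

config = resolve_config(defaults, profile, env, cli)
print(config["fetch"])
```

Here the profile raises max_bytes while follow_redirects is still inherited from the defaults layer.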

Example profile fragment:

profile_name: controlled_ingestion
pipeline:
  external_calls: disabled
  dry_run: true

fetch:
  allowed_domains:
    - example.com
  max_bytes: 2000000
  follow_redirects: true

extraction:
  backend: auto        # auto | internal | trafilatura
  include_tables: true
  include_comments: false
  min_extracted_chars: 40

ai:
  provider: mock
  model: mock
  prompt_version: article_builder.v1
  source_language: auto
  target_languages: [en, fa]

Environment override example:

export INGESTFORGE_TARGET_LANGUAGES="fa,en,de,pt-BR,zh-Hant,es-419"
export INGESTFORGE_SOURCE_LANGUAGE="auto"
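
As a rough illustration of how a comma-separated value like the one above could be turned into a list, here is a small sketch. The parsing rules (trim whitespace, drop empty entries, drop duplicates while keeping order) are assumptions, and both function names are hypothetical.

```python
from typing import Mapping


def parse_language_list(raw: str) -> list[str]:
    """Split a comma-separated string into unique, trimmed tags."""
    tags: list[str] = []
    for part in raw.split(","):
        tag = part.strip()
        if tag and tag not in tags:
            tags.append(tag)
    return tags


def target_languages_from_env(environ: Mapping[str, str], default: list[str]) -> list[str]:
    """Read INGESTFORGE_TARGET_LANGUAGES, falling back to `default` when unset or empty."""
    tags = parse_language_list(environ.get("INGESTFORGE_TARGET_LANGUAGES", ""))
    return tags or list(default)


sample_env = {"INGESTFORGE_TARGET_LANGUAGES": "fa,en,de,pt-BR,zh-Hant,es-419"}
print(target_languages_from_env(sample_env, default=["en"]))
```

Passing the environment as a mapping (for example os.environ) keeps the function easy to test without touching the real process environment.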

Extraction backends

The default extractor is intentionally dependency-light. For stronger extraction on noisy or complex pages, install the optional Trafilatura backend:

pip install "ingestforge[extraction]"

Then choose one of these profile modes:

extraction:
  backend: auto         # use Trafilatura when available, fall back to internal

extraction:
  backend: internal     # always use the built-in BeautifulSoup-based extractor

extraction:
  backend: trafilatura  # require Trafilatura; no silent internal fallback

auto is recommended for most alpha users because it improves extraction when the optional dependency is installed while preserving a small base install.
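
The three modes can be sketched as a small selection function. This is an illustration under stated assumptions rather than the library's implementation: choose_backend is a hypothetical name, and availability is detected with the standard-library importlib.util.find_spec.

```python
import importlib.util
from typing import Optional


def choose_backend(mode: str, trafilatura_available: Optional[bool] = None) -> str:
    """Map a profile mode (auto | internal | trafilatura) to a concrete backend name."""
    if trafilatura_available is None:
        # find_spec returns None when the top-level package is not installed.
        trafilatura_available = importlib.util.find_spec("trafilatura") is not None

    if mode == "internal":
        return "internal"
    if mode == "trafilatura":
        if not trafilatura_available:
            # Strict mode: fail loudly instead of silently falling back.
            raise RuntimeError("backend 'trafilatura' requested but it is not installed")
        return "trafilatura"
    if mode == "auto":
        return "trafilatura" if trafilatura_available else "internal"
    raise ValueError(f"unknown extraction backend mode: {mode!r}")


print(choose_backend("auto", trafilatura_available=False))
print(choose_backend("auto", trafilatura_available=True))
```

The explicit trafilatura_available parameter exists only so the behavior can be exercised without installing the optional dependency.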

Provider model policy

Live model IDs are intentionally treated as opaque provider strings. IngestForge does not maintain a hard-coded allowlist of model names.

Model resolution order:

explicit profile model
  -> INGESTFORGE_<PROVIDER>_MODEL
  -> INGESTFORGE_AI_MODEL
  -> mock only when provider is mock
  -> clear config error for live providers when external AI calls are enabled
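
The resolution order above could look roughly like the following sketch. The function resolve_model and the ConfigError class are hypothetical names; the behavior for a live provider with external calls disabled is not specified in this document, so returning None in that case is an assumption.

```python
from typing import Mapping, Optional


class ConfigError(ValueError):
    """Raised when a live provider has no model configured (hypothetical class)."""


def resolve_model(
    provider: str,
    profile_model: Optional[str],
    environ: Mapping[str, str],
    external_calls_enabled: bool,
) -> Optional[str]:
    # 1. An explicit profile value wins.
    if profile_model:
        return profile_model
    # 2. Provider-specific variable, then 3. the generic variable.
    for key in (f"INGESTFORGE_{provider.upper()}_MODEL", "INGESTFORGE_AI_MODEL"):
        value = environ.get(key, "").strip()
        if value:
            return value
    # 4. The mock provider may fall back to the mock model.
    if provider == "mock":
        return "mock"
    # 5. Live providers with external calls enabled must be configured explicitly.
    if external_calls_enabled:
        raise ConfigError(f"no model configured for live provider {provider!r}")
    return None  # assumption: nothing to resolve when external calls are disabled


env = {"INGESTFORGE_OPENAI_MODEL": "your-openai-model-id", "INGESTFORGE_AI_MODEL": "generic-id"}
print(resolve_model("openai", None, env, external_calls_enabled=True))
```

Model IDs are passed through as opaque strings, which matches the policy of not keeping an allowlist of model names.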

Examples:

# OpenAI
export INGESTFORGE_AI_PROVIDER=openai
export INGESTFORGE_OPENAI_MODEL="your-openai-model-id"

# Gemini
export INGESTFORGE_AI_PROVIDER=gemini
export INGESTFORGE_GEMINI_MODEL="your-gemini-model-id"

# DeepSeek
export INGESTFORGE_AI_PROVIDER=deepseek
export INGESTFORGE_DEEPSEEK_MODEL="your-deepseek-model-id"

No paid provider call is executed by the normal test suite.

Provider doctor

Run the provider doctor before making live provider calls:

ingestforge doctor providers --profile profiles/examples/openai_live.yaml --offline

Offline mode checks local profile validity, provider payload shape, prompt resolution, and schema contract behavior.

Live smoke tests are opt-in and require credentials:

export INGESTFORGE_RUN_LIVE_PROVIDER_TESTS=1
export INGESTFORGE_OPENAI_API_KEY="..."
export INGESTFORGE_OPENAI_MODEL="..."
ingestforge doctor providers --profile profiles/examples/openai_live.yaml --live

Output languages

IngestForge supports configurable output languages through BCP 47-style tags.

ai:
  source_language: auto
  target_languages:
    - en
    - fa
    - de
    - pt-BR
    - zh-Hant
    - es-419

The library validates language-tag shape rather than maintaining a fixed language allowlist. This keeps the core future-proof while still catching empty or malformed values.
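
A shape check of this kind might be sketched as follows. The regular expression is a deliberate simplification (a 2 to 8 letter primary subtag followed by hyphen-separated alphanumeric subtags of 1 to 8 characters) and is not the full BCP 47 grammar; the pattern and the function name are assumptions, not the library's validator.

```python
import re

# Simplified shape: letters-only primary subtag, then optional subtags.
_TAG_SHAPE = re.compile(r"[A-Za-z]{2,8}(-[A-Za-z0-9]{1,8})*")


def has_language_tag_shape(tag: str) -> bool:
    """Return True when `tag` looks like a BCP 47-style tag (shape only)."""
    return _TAG_SHAPE.fullmatch(tag) is not None


for candidate in ["en", "pt-BR", "zh-Hant", "es-419", "", "en_US", "en-"]:
    print(repr(candidate), has_language_tag_shape(candidate))
```

A check like this accepts tags the library has never seen before while still rejecting empty strings, underscores, and dangling hyphens.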

Prompt registry

ai.prompt_version resolves to packaged prompt templates:

ai:
  prompt_version: article_builder.v1

article_builder.v1 maps to:

src/ingestforge/prompts/article_builder.j2

Unknown prompt versions fail during profile validation instead of silently falling back to a hidden default.
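
The fail-fast behavior can be illustrated with a tiny registry. This is a sketch, assuming a plain mapping from version string to template path; PROMPT_REGISTRY and resolve_prompt are hypothetical names, and only the article_builder.v1 entry comes from this document.

```python
PROMPT_REGISTRY = {
    "article_builder.v1": "src/ingestforge/prompts/article_builder.j2",
}


def resolve_prompt(version: str) -> str:
    """Return the template path for `version`, or raise with the known versions listed."""
    try:
        return PROMPT_REGISTRY[version]
    except KeyError:
        known = ", ".join(sorted(PROMPT_REGISTRY))
        # No hidden default: an unknown version is a configuration error.
        raise ValueError(f"unknown prompt_version {version!r}; known versions: {known}") from None


print(resolve_prompt("article_builder.v1"))
```

Raising at validation time means a typo in ai.prompt_version is caught before any provider call is attempted.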

Destination adapters

The public core contains generic destination adapters only:

  • local_export for offline package/dataset output;
  • generic_rest for configurable API publishing with endpoint maps, payload templates, field maps, and response maps.
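
To make the idea of a field map concrete, here is a minimal sketch that builds a destination payload from a package-like dict. The dotted-path syntax, the field names, and the helpers get_path and apply_field_map are all hypothetical; the actual generic_rest configuration format may differ.

```python
from typing import Any, Mapping


def get_path(data: Mapping[str, Any], dotted: str) -> Any:
    """Walk nested mappings using a dotted path such as 'article.title.en'."""
    current: Any = data
    for part in dotted.split("."):
        if not isinstance(current, Mapping) or part not in current:
            raise KeyError(f"path {dotted!r} not found (missing {part!r})")
        current = current[part]
    return current


def apply_field_map(package: Mapping[str, Any], field_map: Mapping[str, str]) -> dict[str, Any]:
    """Build a payload whose keys are destination fields and whose values come from the package."""
    return {dest: get_path(package, source) for dest, source in field_map.items()}


# Illustrative data only; this is not the real package schema.
package = {
    "article": {"title": {"en": "Hello"}},
    "source": {"url": "https://example.com/article"},
}
field_map = {"headline": "article.title.en", "canonical_url": "source.url"}

payload = apply_field_map(package, field_map)
print(payload)
```

Keeping the mapping in configuration rather than code is what lets the same package be published to differently shaped APIs.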

Private project profiles, real production API endpoints, and secrets should stay outside the public repository.

Generated artifacts

A typical run can produce:

runs/<job_id>/
  package.json
  data_card.json
  rag_records.jsonl
  provenance_ledger.jsonl
  run_manifest.json
  audit_log.jsonl

These artifacts are designed to make review and downstream dataset construction easier.
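
Because several of these artifacts are JSON Lines files, a downstream consumer can stream them one record at a time. The sketch below is generic JSONL handling with the standard library; the record fields in the demo are made up, since the actual rag_records.jsonl schema is not described here.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Iterator


def iter_jsonl(path: Path) -> Iterator[dict[str, Any]]:
    """Yield one parsed record per non-empty line of a JSON Lines file."""
    with path.open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_number}: invalid JSON") from exc


# Demo with a temporary file standing in for runs/<job_id>/rag_records.jsonl.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "rag_records.jsonl"
    sample.write_text(
        '{"id": 1, "text": "first chunk"}\n\n{"id": 2, "text": "second chunk"}\n',
        encoding="utf-8",
    )
    records = list(iter_jsonl(sample))

print(len(records))
```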

Security and safety model

IngestForge is conservative by default:

  • no automatic publishing;
  • human review is expected;
  • source license status defaults to needs_review;
  • raw HTML is not sent to AI by default;
  • private, localhost, loopback, and link-local network targets are blocked by default;
  • response bodies are streamed with byte caps;
  • secrets are read from environment variables, not committed profiles.
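
The byte-cap item in the list above can be sketched independently of any HTTP client. The function below consumes an iterable of byte chunks, such as a streamed response body, and stops as soon as the cap would be exceeded; read_capped and BodyTooLarge are hypothetical names, and failing (rather than truncating) is an assumption about the desired policy.

```python
from typing import Iterable


class BodyTooLarge(Exception):
    """Raised when a streamed body exceeds the configured byte cap."""


def read_capped(chunks: Iterable[bytes], max_bytes: int) -> bytes:
    """Accumulate chunks, refusing to buffer more than `max_bytes` in total."""
    buffer = bytearray()
    for chunk in chunks:
        if len(buffer) + len(chunk) > max_bytes:
            # Stop before buffering the oversized chunk.
            raise BodyTooLarge(f"response body exceeds {max_bytes} bytes")
        buffer.extend(chunk)
    return bytes(buffer)


print(read_capped([b"ab", b"cd"], max_bytes=4))
```

Checking the size while streaming avoids trusting a Content-Length header, which may be absent or wrong.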

Important limitation: URL preflight validation does not fully eliminate DNS rebinding / TOCTOU risk because the HTTP client may resolve DNS separately from the validation step. High-security deployments should combine IngestForge checks with network egress controls, strict allowlists, and infrastructure-level protections.
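
A preflight check of the kind described can be sketched with the standard-library ipaddress module. The function names are hypothetical and the sketch covers only the address classes named above. It classifies addresses that have already been resolved, so on its own it has exactly the limitation described: a later, separate DNS lookup by the HTTP client may return a different address.

```python
import ipaddress
from typing import Iterable


def is_blocked_address(value: str) -> bool:
    """Return True for private, loopback, or link-local IP addresses."""
    ip = ipaddress.ip_address(value)
    return ip.is_private or ip.is_loopback or ip.is_link_local


def all_addresses_allowed(resolved: Iterable[str]) -> bool:
    """A host passes only if every address it resolved to is allowed."""
    addresses = list(resolved)
    # An empty resolution result is treated as a failure, not a pass.
    return bool(addresses) and not any(is_blocked_address(a) for a in addresses)


for address in ["127.0.0.1", "10.0.0.5", "169.254.1.1", "::1", "8.8.8.8"]:
    print(address, is_blocked_address(address))
```

Requiring every resolved address to pass, instead of just the first, avoids accepting a host that mixes public and internal records.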

Claim and evidence limits

The current evidence gate is intentionally conservative and shallow. Exact support checks are useful for alpha review workflows, but they are not semantic proof and should not be marketed as legal, factual, or scientific verification.
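
To show why an exact check is shallow, here is a minimal sketch of one possible form: a claim counts as supported only if its normalized text appears verbatim in an evidence passage. The normalization rules and the function names are assumptions, not the library's evidence gate.

```python
import re


def normalize(text: str) -> str:
    """Collapse whitespace and ignore case so trivial differences do not matter."""
    return re.sub(r"\s+", " ", text).strip().casefold()


def exact_support(claim: str, passages: list[str]) -> list[int]:
    """Return indexes of passages that contain the claim verbatim after normalization."""
    needle = normalize(claim)
    if not needle:
        return []
    return [i for i, passage in enumerate(passages) if needle in normalize(passage)]


passages = ["We observed that the  sky is BLUE today.", "Grass is green."]
print(exact_support("The sky is blue", passages))
print(exact_support("The sky was blue", passages))
```

The second call finds nothing even though a human reader would see the claims as closely related, which is why such checks are a review aid rather than semantic verification.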

Use IngestForge as an auditable ingestion tool, not as an authority that guarantees truth or reuse rights.

Project status

0.4.0a6 is an alpha release.

Best current uses:

  • personal/internal ingestion experiments;
  • portfolio and research-software demonstrations;
  • controlled RAG dataset preparation;
  • provider payload contract experiments;
  • audited content workflow prototypes.

Not recommended yet for:

  • unrestricted crawling at scale;
  • unsupervised publishing;
  • legal clearance decisions;
  • high-security network environments without extra egress controls;
  • claims of semantic fact verification.

Development and release checks

Run the full local check suite:

python -m compileall -q src tests
python -m pytest -q
python -m ruff check .
python -m ruff format --check .
python -m mypy src/ingestforge
python scripts/clean_release_artifacts.py
python scripts/release_hygiene_check.py
python -m build --sdist --wheel
python -m twine check dist/*

The repository also includes GitHub Actions for CI and package publishing. PyPI/TestPyPI publishing should use Trusted Publishing where possible rather than long-lived upload tokens.

Repository layout

src/ingestforge/                  library source
src/ingestforge/core/             config, pipeline, prompts, provider doctor
src/ingestforge/datasets/         RAG export, chunking, data card
src/ingestforge/providers/        provider adapters
src/ingestforge/providers/fetch/  safe fetch, robots policy, encoding, extraction
src/ingestforge/destinations/     destination adapters
docs/                             contracts and release notes
profiles/                         example user profiles
tests/                            regression and contract tests

Roadmap

Near-term priorities:

  • stronger extraction evaluation fixtures;
  • file, CSV, JSON, and PDF ingestion paths;
  • deeper semantic support checks without overclaiming;
  • more destination examples;
  • richer documentation and examples;
  • optional live provider smoke-test guides.

Citation

See CITATION.cff. If you use IngestForge in research software, dataset construction, or portfolio demonstrations, cite the repository or release tag.

License

MIT License. See LICENSE.
