Pure-Python article extraction library and HTTP service - Drop-in replacement for readability-js-server

Maintainers

psjinx

Project description

Article Extractor

Article Extractor turns arbitrary HTML into deterministic Markdown ready for ingestion pipelines.

Problem: brittle scrapers collapse when paywalls or inline scripts mutate markup.
Why now: one fetcher abstraction keeps Playwright and httpx output identical across the CLI, FastAPI server, and Python API.
Outcome: verified tutorials, a single operations runbook, and concise reference tables keep teams unblocked in production.

Audience: Engineers shipping ingestion pipelines, doc search, or automation that needs stable Markdown from HTML.

Prerequisites: Python 3.12+ (optional: uv for tooling, Docker for server demos).

Time: 2–10 minutes depending on whether you use CLI, server, or library.

What you'll learn: How to run the CLI once, start the FastAPI server, or embed the library in your app.

Value At a Glance

Deterministic Readability-style scoring tuned for long-form docs, blogs, and knowledge bases.
GFM-compatible Markdown and sanitized HTML identical across the CLI, FastAPI server, and Python API.
Runtime knobs for Playwright storage, cache sizing, proxies, diagnostics, and StatsD metrics.
Test suite coverage above 93% plus documentation that records the exact commands and outputs.

See the Docs Home for the consolidated Tutorials, Operations, Reference, and Style sections.

Choose Your Surface

Goal	Start Here	Time	Verified Commands
Run the CLI once	CLI Fast Path	< 2 min	`uv pip install "article-extractor==0.5.8"`, `uv run article-extractor …`, `head ./tmp/article-extractor-cli.md`
Ship the FastAPI server in Docker	Docker Service	~5 min	`docker run ghcr.io/pankaj28843/article-extractor:latest`, `curl http://localhost:3000/health`, `curl -XPOST …`
Embed the library	Python Embedding	~5 min	`uv run python - <<'PY' …`, `asyncio.run(fetch_remote())`
Tune caches, networking, diagnostics, releases	Operations Runbook	task-specific	Env vars, Docker overrides, StatsD flags, `gh` CLI
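
For scripts that wait on the Docker service before sending work, the `curl http://localhost:3000/health` check from the table can be mirrored in Python. This is a minimal sketch using only the standard library; the helper names `health_url` and `is_healthy` are hypothetical, and it assumes only that a healthy server answers `GET /health` with HTTP 200.

```python
import urllib.error
import urllib.request


def health_url(base_url: str) -> str:
    """Join the server base URL with the /health path used in the table."""
    return base_url.rstrip("/") + "/health"


def is_healthy(base_url: str = "http://localhost:3000", timeout: float = 2.0) -> bool:
    """Return True when GET /health answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(health_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or a non-2xx response.
        return False
```

Calling `is_healthy()` in a retry loop after `docker run` is one way to gate follow-up requests on the container being ready.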

Install (Any Environment)

pip install "article-extractor==0.5.8"           # CLI + library
pip install "article-extractor[server]==0.5.8"   # FastAPI server extras
pip install "article-extractor[all]==0.5.8"      # Playwright, httpx, FastAPI, fake-useragent

Prefer uv? Run uv pip install "article-extractor==0.5.8" or add it to pyproject.toml via uv add "article-extractor[all]==0.5.8".

Safer Automated Installs

For unattended installs, use an exact package version plus a cooldown window so your resolver does not pick versions uploaded minutes ago:

export UV_EXCLUDE_NEWER="7 days"
uv pip install "article-extractor==0.5.8"
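
The idea behind a cooldown window can be sketched in a few lines of Python. This is an illustration of the rule, not uv's implementation: the function name `past_cooldown` is hypothetical, and the upload time of day for 0.5.8 is assumed to be midnight UTC because this page lists only the date.

```python
from datetime import datetime, timedelta, timezone


def past_cooldown(uploaded: datetime, now: datetime, days: int = 7) -> bool:
    """Return True when a release was uploaded at least `days` days before `now`."""
    return uploaded <= now - timedelta(days=days)


# Upload date of 0.5.8 as listed on this page (time of day is an assumption).
UPLOADED_0_5_8 = datetime(2026, 3, 25, tzinfo=timezone.utc)

# Three days after upload the release is still inside the window.
too_early = past_cooldown(UPLOADED_0_5_8, datetime(2026, 3, 28, tzinfo=timezone.utc))
# Eight days after upload it has cleared the 7-day window.
old_enough = past_cooldown(UPLOADED_0_5_8, datetime(2026, 4, 2, tzinfo=timezone.utc))
```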

If you use pip-based automation instead of uv, prefer a locked requirements file with hashes and an upload cutoff timestamp rather than floating latest.
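
As a rough sketch of what a hash-locked entry looks like, the helper below renders a pinned requirement in pip's requirements-file format, using the SHA256 digests published for 0.5.8 on this page. The function name `pinned_requirement` is hypothetical; in practice a lock tool would normally generate these lines for you.

```python
def pinned_requirement(name: str, version: str, sha256_digests: list[str]) -> str:
    """Render one exact-version requirement with a --hash option per artifact."""
    lines = [f"{name}=={version}"]
    lines += [f"    --hash=sha256:{digest}" for digest in sha256_digests]
    # Backslash-newline continues the requirement onto the next line.
    return " \\\n".join(lines)


# SHA256 digests of the sdist and wheel as listed under "File hashes".
ENTRY = pinned_requirement(
    "article-extractor",
    "0.5.8",
    [
        "57802435ad5abe7950911baadc74a964e8963bebdc42a80f8264332afff49249",
        "442c92becae69e3835bf62e7ceaa9bbc737dd60c764f66f1139eac5884f3a7d0",
    ],
)
print(ENTRY)
```

Saving the printed text to a requirements file and installing it with `pip install --require-hashes -r requirements.txt` makes pip reject any artifact whose digest differs. Note that the package's own dependencies would also need hash-pinned entries for that mode to succeed.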

Developer / Active Development Install

When contributing or debugging locally, install as an editable tool so changes to src/ take effect immediately:

# Clone and install editable
git clone https://github.com/pankaj28843/article-extractor.git
cd article-extractor
uv tool install --editable --force --refresh --reinstall ".[all]"

# Now `article-extractor` CLI reflects local changes instantly
article-extractor https://example.com

See CONTRIBUTING.md for the full development workflow.

Crawl an Entire Site

Extract every page under a domain in one command:

# CLI: crawl up to 50 pages, output to ./output/
uv run article-extractor crawl https://example.com --max-pages 50 --output ./output

# Server: start a background crawl job
curl -X POST http://localhost:3000/crawl \
  -H "Content-Type: application/json" \
  -d '{"start_url": "https://example.com", "max_pages": 50}'
# Returns {"job_id": "abc123", "status": "running", ...}
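
The same request can be issued from Python. The sketch below uses only the standard library and the endpoint, payload, and response fields shown in the curl example; the helper names `build_crawl_request` and `start_crawl` are hypothetical, and error handling is left out for brevity.

```python
import json
import urllib.request


def build_crawl_request(base_url: str, start_url: str, max_pages: int) -> urllib.request.Request:
    """Build the POST /crawl request with the JSON body from the curl example."""
    payload = json.dumps({"start_url": start_url, "max_pages": max_pages}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/crawl",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def start_crawl(base_url: str, start_url: str, max_pages: int, timeout: float = 10.0) -> dict:
    """Send the request and return the decoded JSON response body."""
    request = build_crawl_request(base_url, start_url, max_pages)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

With the server running, `start_crawl("http://localhost:3000", "https://example.com", 50)` should return a dictionary that includes the `job_id` and `status` fields shown above.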

The crawler follows internal links breadth-first (BFS), respects robots.txt and sitemaps, and writes one Markdown file per page. Pass --workers 3 (the default is 1) to run three crawl workers concurrently; --concurrency still caps the number of simultaneous fetch slots. See the Crawling Guide for rate limiting, headed mode, and output structure.
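
To make the traversal order concrete, here is a small breadth-first sketch over an in-memory link graph. It is an illustration of BFS with a page cap and a same-host filter, not the library's crawler: `bfs_crawl_order` and `links_of` are hypothetical names, and robots.txt, sitemaps, workers, and fetching are all left out.

```python
from collections import deque
from typing import Callable, Iterable
from urllib.parse import urljoin, urlparse


def bfs_crawl_order(
    start_url: str,
    links_of: Callable[[str], Iterable[str]],
    max_pages: int,
) -> list[str]:
    """Return the order in which a BFS visits same-host pages, capped at max_pages."""
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    order: list[str] = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for href in links_of(url):
            # Resolve relative links and drop fragments so /c and /c#top dedupe.
            target = urljoin(url, href).split("#", 1)[0]
            if urlparse(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append(target)
    return order


# A tiny fake site: each URL maps to the hrefs found on that page.
SITE = {
    "https://example.com/": ["/a", "/b", "https://other.test/x"],
    "https://example.com/a": ["/c", "/"],
    "https://example.com/b": ["/c#top"],
    "https://example.com/c": [],
}

ORDER = bfs_crawl_order("https://example.com/", lambda u: SITE.get(u, []), max_pages=50)
```

Pages one link away from the start URL are visited before pages two links away, and the external link to `other.test` is never queued.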

Observability & Operations

All runtimes honor diagnostics toggles (ARTICLE_EXTRACTOR_LOG_DIAGNOSTICS, ARTICLE_EXTRACTOR_METRICS_*).
Playwright storage is opt-in: CLI/server runs stay ephemeral unless you pass --storage-state /path/to/storage_state.json or set ARTICLE_EXTRACTOR_STORAGE_STATE_FILE plus a bind-mounted volume. The Operations Runbook walks through mounting, warming caches, and inspecting the queue.
The Docker debug harness still exercises persistent storage for regression coverage; add --disable-storage when you want the smoke test to mirror the default ephemeral behavior.
Networking, diagnostics, StatsD, validation loops, and release automation live in a single Operations Runbook.
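
When launching the CLI or server from another Python process, the opt-in storage setting can be passed through the environment. The helper below is a sketch: `extractor_env` is a hypothetical name, and the only detail taken from this page is the `ARTICLE_EXTRACTOR_STORAGE_STATE_FILE` variable. Removing the variable keeps the run ephemeral, which matches the default described above.

```python
import os
from typing import Mapping, Optional


def extractor_env(
    storage_state_file: Optional[str] = None,
    base_env: Optional[Mapping[str, str]] = None,
) -> dict[str, str]:
    """Return a child-process environment with storage opted in or left ephemeral."""
    env = dict(os.environ if base_env is None else base_env)
    if storage_state_file is None:
        # No path given: make sure an inherited value does not enable storage.
        env.pop("ARTICLE_EXTRACTOR_STORAGE_STATE_FILE", None)
    else:
        env["ARTICLE_EXTRACTOR_STORAGE_STATE_FILE"] = str(storage_state_file)
    return env
```

The result can be handed to `subprocess.run([...], env=extractor_env("/path/to/storage_state.json"))` so the setting applies to that one invocation only.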

Documentation

The MkDocs site (Overview, Tutorials, Operations, Reference, Explanations) lives at https://pankaj28843.github.io/article-extractor/. If the site is unavailable, read the Markdown sources in docs/ including style-guide.md, operations.md, and content-inventory.md.

Contributing

We welcome pull requests paired with docs. Follow the Operations Runbook for validation and include real command output in the PR description when you update documentation.

License

MIT — see LICENSE.

Release history

Version	Released
0.5.9	Apr 19, 2026
0.5.8	Mar 25, 2026 (this version)
0.5.7	Mar 25, 2026
0.5.6	Mar 5, 2026
0.5.5	Jan 20, 2026
0.5.4	Jan 19, 2026
0.5.3	Jan 9, 2026
0.5.2	Jan 8, 2026
0.5.1	Jan 7, 2026
0.5.0	Jan 7, 2026
0.4.2	Jan 6, 2026
0.4.1	Jan 4, 2026
0.4.0	Jan 3, 2026
0.3.2	Jan 2, 2026
0.3.1	Jan 2, 2026
0.3.0	Jan 2, 2026
0.2.1	Jan 1, 2026
0.2.0	Jan 1, 2026
0.1.2	Jan 1, 2026
0.1.1	Dec 29, 2025
0.1.0	Dec 29, 2025

Download files

Source Distribution

article_extractor-0.5.8.tar.gz (2.0 MB)

Uploaded Mar 25, 2026 Source

Built Distribution

article_extractor-0.5.8-py3-none-any.whl (84.8 kB)

Uploaded Mar 25, 2026 Python 3

File details

Details for the file article_extractor-0.5.8.tar.gz.

File metadata

Download URL: article_extractor-0.5.8.tar.gz
Upload date: Mar 25, 2026
Size: 2.0 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for article_extractor-0.5.8.tar.gz
Algorithm	Hash digest
SHA256	`57802435ad5abe7950911baadc74a964e8963bebdc42a80f8264332afff49249`
MD5	`6b6c702d5f504c8d9b558ac522842692`
BLAKE2b-256	`cbb1202c88568745cdce9b5eb1255d5614a63c89244b48a7587ac78ca190e414`
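
A downloaded artifact can be checked against the published SHA256 digest before installation. The sketch below uses only the standard library; the function names `sha256_of` and `matches_published_hash` are hypothetical, and the expected digests are the ones listed for 0.5.8 on this page.

```python
import hashlib
import hmac
from pathlib import Path

# SHA256 digests published for the 0.5.8 sdist and wheel.
EXPECTED_SHA256 = {
    "article_extractor-0.5.8.tar.gz":
        "57802435ad5abe7950911baadc74a964e8963bebdc42a80f8264332afff49249",
    "article_extractor-0.5.8-py3-none-any.whl":
        "442c92becae69e3835bf62e7ceaa9bbc737dd60c764f66f1139eac5884f3a7d0",
}


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives are never loaded whole into memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: str) -> bool:
    """Return True only when the file name is known and its digest matches."""
    expected = EXPECTED_SHA256.get(Path(path).name)
    return expected is not None and hmac.compare_digest(sha256_of(path), expected)
```

For example, `matches_published_hash("article_extractor-0.5.8.tar.gz")` returns `True` only for a byte-identical copy of the published source distribution.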

Provenance

The following attestation bundles were made for article_extractor-0.5.8.tar.gz:

Publisher: publish.yml on pankaj28843/article-extractor

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: article_extractor-0.5.8.tar.gz
- Subject digest: 57802435ad5abe7950911baadc74a964e8963bebdc42a80f8264332afff49249
- Sigstore transparency entry: 1181850919
- Sigstore integration time: Mar 25, 2026
Source repository:
- Permalink: pankaj28843/article-extractor@bb082f21b707e87511e03eda0dec85a750489b9b
- Branch / Tag: refs/tags/v0.5.8
- Owner: https://github.com/pankaj28843
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@bb082f21b707e87511e03eda0dec85a750489b9b
- Trigger Event: release
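
The statement fields above map onto the in-toto Statement v1 JSON layout, where each subject carries a name and a digest map. The sketch below checks that a decoded statement covers a given file and digest. The function name `statement_covers` is hypothetical, the example dictionary is hand-built from the values on this page, and the check does not verify the Sigstore signature, which requires a dedicated verification tool.

```python
STATEMENT_TYPE = "https://in-toto.io/Statement/v1"


def statement_covers(statement: dict, filename: str, sha256_hex: str) -> bool:
    """Return True when the statement lists `filename` with the given SHA256 digest."""
    if statement.get("_type") != STATEMENT_TYPE:
        return False
    for subject in statement.get("subject", []):
        if subject.get("name") == filename:
            recorded = subject.get("digest", {}).get("sha256", "")
            return recorded.lower() == sha256_hex.lower()
    return False


# Hand-built from the statement details listed above (predicate body omitted).
EXAMPLE_STATEMENT = {
    "_type": "https://in-toto.io/Statement/v1",
    "predicateType": "https://docs.pypi.org/attestations/publish/v1",
    "subject": [
        {
            "name": "article_extractor-0.5.8.tar.gz",
            "digest": {
                "sha256": "57802435ad5abe7950911baadc74a964e8963bebdc42a80f8264332afff49249"
            },
        }
    ],
}
```

Pairing this with a locally computed file digest ties the attestation subject to the exact bytes you downloaded.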

File details

Details for the file article_extractor-0.5.8-py3-none-any.whl.

File metadata

Download URL: article_extractor-0.5.8-py3-none-any.whl
Upload date: Mar 25, 2026
Size: 84.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for article_extractor-0.5.8-py3-none-any.whl
Algorithm	Hash digest
SHA256	`442c92becae69e3835bf62e7ceaa9bbc737dd60c764f66f1139eac5884f3a7d0`
MD5	`1a22026707006c94ab999d13f3a689ab`
BLAKE2b-256	`97132b2b0f30df981f4c5eec003a3fd07109f05dd5fde10cecf544088271ee69`

Provenance

The following attestation bundles were made for article_extractor-0.5.8-py3-none-any.whl:

Publisher: publish.yml on pankaj28843/article-extractor

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: article_extractor-0.5.8-py3-none-any.whl
- Subject digest: 442c92becae69e3835bf62e7ceaa9bbc737dd60c764f66f1139eac5884f3a7d0
- Sigstore transparency entry: 1181850955
- Sigstore integration time: Mar 25, 2026
Source repository:
- Permalink: pankaj28843/article-extractor@bb082f21b707e87511e03eda0dec85a750489b9b
- Branch / Tag: refs/tags/v0.5.8
- Owner: https://github.com/pankaj28843
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@bb082f21b707e87511e03eda0dec85a750489b9b
- Trigger Event: release
