Pure-Python article extraction library and HTTP service - Drop-in replacement for readability-js-server

Article Extractor

Article Extractor turns arbitrary HTML into deterministic Markdown ready for ingestion pipelines.

Problem: brittle scrapers collapse when paywalls or inline scripts mutate markup.
Why now: one fetcher abstraction keeps Playwright and httpx output identical across the CLI, FastAPI server, and Python API.
Outcome: verified tutorials, a single operations runbook, and concise reference tables keep teams unblocked in production.

Audience: Engineers shipping ingestion pipelines, doc search, or automation that needs stable Markdown from HTML.

Prerequisites: Python 3.12+ (optional: uv for tooling, Docker for server demos).

Time: 2–10 minutes depending on whether you use CLI, server, or library.

What you'll learn: How to run the CLI once, start the FastAPI server, or embed the library in your app.

Value At a Glance

  • Deterministic Readability-style scoring tuned for long-form docs, blogs, and knowledge bases.
  • GFM-compatible Markdown and sanitized HTML identical across the CLI, FastAPI server, and Python API.
  • Runtime knobs for Playwright storage, cache sizing, proxies, diagnostics, and StatsD metrics.
  • Test suite coverage above 93% plus documentation that records the exact commands and outputs.
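
To make "Readability-style scoring" concrete, here is a toy sketch of the general idea: reward blocks with long, comma-rich text and penalize blocks that are mostly links. It is not Article Extractor's actual algorithm; the class name `BlockScorer`, the helper `best_block`, and the weights are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser

CANDIDATES = {"div", "article", "section"}  # tags treated as candidate blocks


class BlockScorer(HTMLParser):
    """Toy scorer: text length plus a comma bonus, scaled down by link density."""

    def __init__(self):
        super().__init__()
        self.open_blocks = []  # candidate blocks currently open, innermost last
        self.scored = []       # (score, text) pairs in the order blocks close
        self.link_depth = 0    # > 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag in CANDIDATES:
            self.open_blocks.append({"chunks": [], "link_chars": 0})
        elif tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a":
            self.link_depth = max(0, self.link_depth - 1)
        elif tag in CANDIDATES and self.open_blocks:
            block = self.open_blocks.pop()
            text = " ".join(block["chunks"])
            if text:
                link_density = block["link_chars"] / len(text)
                score = (len(text) + 10 * text.count(",")) * (1.0 - link_density)
                self.scored.append((score, text))

    def handle_data(self, data):
        chunk = " ".join(data.split())  # collapse whitespace runs
        if chunk and self.open_blocks:
            block = self.open_blocks[-1]  # text counts toward the innermost block only
            block["chunks"].append(chunk)
            if self.link_depth:
                block["link_chars"] += len(chunk)


def best_block(html):
    """Return the text of the highest-scoring block, or "" if there is none."""
    scorer = BlockScorer()
    scorer.feed(html)
    scorer.close()
    if not scorer.scored:
        return ""
    # max() keeps the first maximal item, so ties resolve by document order.
    return max(scorer.scored, key=lambda item: item[0])[1]


PAGE = """
<div><a href="/">Home</a> <a href="/docs">Docs</a> <a href="/blog">Blog</a></div>
<article><p>Stable extraction matters, because pipelines break quietly, and nobody notices.</p></article>
"""
print(best_block(PAGE))
```

Because the score depends only on the input markup and ties resolve by document order, the same HTML always selects the same block, which is the property "deterministic" refers to.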

See the Docs Home for the consolidated Tutorials, Operations, Reference, and Style sections.

Choose Your Surface

| Goal | Start Here | Time | Verified Commands |
| --- | --- | --- | --- |
| Run the CLI once | CLI Fast Path | < 2 min | uv pip install article-extractor, uv run article-extractor …, head ./tmp/article-extractor-cli.md |
| Ship the FastAPI server in Docker | Docker Service | ~5 min | docker run ghcr.io/pankaj28843/article-extractor:latest, curl http://localhost:3000/health, curl -XPOST … |
| Embed the library | Python Embedding | ~5 min | uv run python - <<'PY' …, asyncio.run(fetch_remote()) |
| Tune caches, networking, diagnostics, releases | Operations Runbook | task-specific | Env vars, Docker overrides, StatsD flags, gh CLI |

Install (Any Environment)

pip install article-extractor             # CLI + library
pip install "article-extractor[server]"   # FastAPI server extras
pip install "article-extractor[all]"      # Playwright, httpx, FastAPI, fake-useragent

The quotes keep shells such as zsh from treating the square brackets as a glob pattern.

Prefer uv? Run uv pip install article-extractor or add it to pyproject.toml via uv add "article-extractor[all]".

Developer / Active Development Install

When contributing or debugging locally, install as an editable tool so changes to src/ take effect immediately:

# Clone and install editable
git clone https://github.com/pankaj28843/article-extractor.git
cd article-extractor
uv tool install --editable --force --refresh --reinstall ".[all]"

# Now `article-extractor` CLI reflects local changes instantly
article-extractor https://example.com

See CONTRIBUTING.md for the full development workflow.

Crawl an Entire Site

Extract every page under a domain in one command:

# CLI: crawl up to 50 pages, output to ./output/
uv run article-extractor crawl https://example.com --max-pages 50 --output ./output

# Server: start a background crawl job
curl -X POST http://localhost:3000/crawl \
  -H "Content-Type: application/json" \
  -d '{"start_url": "https://example.com", "max_pages": 50}'
# Returns {"job_id": "abc123", "status": "running", ...}
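
For callers that would rather start the crawl job from Python than from curl, the same request can be built with the standard library. This is a minimal sketch: the endpoint and JSON fields come from the curl example, while the helper name `build_crawl_request` is hypothetical and not part of the package.

```python
import json
import urllib.request


def build_crawl_request(base_url: str, start_url: str, max_pages: int) -> urllib.request.Request:
    """Build (but do not send) the POST /crawl request from the curl example."""
    payload = json.dumps({"start_url": start_url, "max_pages": max_pages}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/crawl",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_crawl_request("http://localhost:3000", "https://example.com", 50)

# Sending needs a running server, so the call is left commented out:
# with urllib.request.urlopen(request, timeout=30) as response:
#     job = json.load(response)  # expected to include "job_id" and "status"
```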

The crawler follows internal links breadth-first (BFS), respects robots.txt and sitemaps, and writes one Markdown file per page. Pass --workers 3 (the default is 1) to run three crawl workers concurrently; --concurrency still caps the number of simultaneous fetch slots. See the Crawling Guide for rate limiting, headed mode, and output structure.
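
The traversal described above can be sketched in a few lines of standard-library Python. This illustrates breadth-first crawling with a robots.txt check and a page cap; it is not the crawler's real implementation. The function `crawl_order`, the in-memory `SITE` link map, and the `ROBOTS` rules are hypothetical stand-ins for real fetching, and sitemaps are left out.

```python
from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser


def crawl_order(start_url, links_by_page, robots_lines, max_pages):
    """Return the URLs a breadth-first crawl would visit, in visit order."""
    robots = RobotFileParser()
    robots.parse(robots_lines)  # parse rules from memory, no network access
    host = urlparse(start_url).netloc

    queue = deque([start_url])
    seen = {start_url}
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if not robots.can_fetch("*", url):
            continue  # disallowed by robots.txt
        visited.append(url)
        for href in links_by_page.get(url, []):
            target = urljoin(url, href)
            # Only internal links (same host) are queued, each at most once.
            if urlparse(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append(target)
    return visited


# Hypothetical site: page URL -> links found on that page.
SITE = {
    "https://example.com/": ["/docs", "/blog", "https://other.example/x"],
    "https://example.com/docs": ["/docs/install", "/private/keys"],
    "https://example.com/blog": ["/docs"],
}
ROBOTS = ["User-agent: *", "Disallow: /private/"]

for page in crawl_order("https://example.com/", SITE, ROBOTS, max_pages=50):
    print(page)
```

The queue gives pages close to the start URL priority, so a low page cap still covers the top of the site before deeper pages.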

Observability & Operations

  • All runtimes honor diagnostics toggles (ARTICLE_EXTRACTOR_LOG_DIAGNOSTICS, ARTICLE_EXTRACTOR_METRICS_*).
  • Playwright storage is opt-in: CLI/server runs stay ephemeral unless you pass --storage-state /path/to/storage_state.json or set ARTICLE_EXTRACTOR_STORAGE_STATE_FILE plus a bind-mounted volume. The Operations Runbook walks through mounting, warming caches, and inspecting the queue.
  • The Docker debug harness still exercises persistent storage for regression coverage; add --disable-storage when you want the smoke test to mirror the default ephemeral behavior.
  • Networking, diagnostics, StatsD, validation loops, and release automation live in a single Operations Runbook.
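
The opt-in storage rule can be pictured as a small resolution step: no value means an ephemeral run, and a value means persistent Playwright storage at that path. The sketch below is illustrative only. The helper `resolve_storage_state` is hypothetical, and the precedence of the --storage-state flag over the environment variable is an assumption, not documented behavior.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

ENV_VAR = "ARTICLE_EXTRACTOR_STORAGE_STATE_FILE"  # name taken from the bullet above


def resolve_storage_state(
    cli_value: Optional[str],
    env: Mapping[str, str] = os.environ,
) -> Optional[Path]:
    """Return the storage-state path, or None for an ephemeral run.

    Assumption: an explicit --storage-state value wins over the environment.
    """
    raw = cli_value or env.get(ENV_VAR, "")
    return Path(raw) if raw.strip() else None


print(resolve_storage_state(None, {}))  # None: nothing configured, run stays ephemeral
print(resolve_storage_state(None, {ENV_VAR: "/data/storage_state.json"}))
```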

Documentation

The MkDocs site (Overview, Tutorials, Operations, Reference, Explanations) lives at https://pankaj28843.github.io/article-extractor/. If the site is unavailable, read the Markdown sources in docs/ including style-guide.md, operations.md, and content-inventory.md.

Contributing

We welcome pull requests paired with docs. Follow the Operations Runbook for validation and include real command output in the PR description when you update documentation.

License

MIT — see LICENSE.
