Pure-Python article extraction library and HTTP service - Drop-in replacement for readability-js-server

Article Extractor

Article Extractor turns arbitrary HTML into deterministic Markdown ready for ingestion pipelines.

Problem: brittle scrapers collapse when paywalls or inline scripts mutate markup.
Why now: one fetcher abstraction keeps Playwright and httpx output identical across the CLI, FastAPI server, and Python API.
Outcome: verified tutorials, a single operations runbook, and concise reference tables keep teams unblocked in production.

Value At a Glance

  • Deterministic Readability-style scoring tuned for long-form docs, blogs, and knowledge bases.
  • GFM-compatible Markdown and sanitized HTML identical across the CLI, FastAPI server, and Python API.
  • Runtime knobs for Playwright storage, cache sizing, proxies, diagnostics, and StatsD metrics.
  • Test suite coverage above 93% plus documentation that records the exact commands and outputs.
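To make "Readability-style scoring" concrete, here is a minimal sketch of the general technique using only the standard library. It is not Article Extractor's actual algorithm: the container tags, the 25-character threshold, and the scoring weights are illustrative assumptions, and the sketch expects well-nested markup.

```python
from html.parser import HTMLParser

# Illustrative assumptions, not values taken from Article Extractor.
CONTAINER_TAGS = {"article", "body", "div", "main", "section"}
MIN_PARAGRAPH_CHARS = 25


class ContainerScorer(HTMLParser):
    """Scores containers by paragraph length and commas, penalized by link density."""

    def __init__(self):
        super().__init__()
        self.open_containers = []     # stack of containers currently open
        self.results = []             # (name, final_score) in closing order
        self.paragraph_chunks = None  # list of text chunks while inside <p>
        self.link_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in CONTAINER_TAGS:
            name = dict(attrs).get("id") or tag
            self.open_containers.append(
                {"name": name, "score": 0.0, "text": 0, "link_text": 0}
            )
        elif tag == "p":
            self.paragraph_chunks = []
        elif tag == "a":
            self.link_depth += 1

    def handle_data(self, data):
        length = len(data.strip())
        # Every open container sees the text, so link density covers descendants.
        for node in self.open_containers:
            node["text"] += length
            if self.link_depth:
                node["link_text"] += length
        if self.paragraph_chunks is not None:
            self.paragraph_chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a":
            self.link_depth = max(0, self.link_depth - 1)
        elif tag == "p" and self.paragraph_chunks is not None:
            text = " ".join("".join(self.paragraph_chunks).split())
            self.paragraph_chunks = None
            if self.open_containers and len(text) >= MIN_PARAGRAPH_CHARS:
                # Base point + one per comma + up to three for length.
                points = 1 + text.count(",") + min(len(text) // 100, 3)
                self.open_containers[-1]["score"] += points
        elif tag in CONTAINER_TAGS and self.open_containers:
            node = self.open_containers.pop()
            link_density = node["link_text"] / max(node["text"], 1)
            self.results.append((node["name"], node["score"] * (1 - link_density)))


def rank_containers(html):
    """Return (container name, score) pairs, best candidate first."""
    scorer = ContainerScorer()
    scorer.feed(html)
    scorer.close()
    return sorted(scorer.results, key=lambda item: item[1], reverse=True)


SAMPLE = """
<body>
  <div id="nav"><a href="/a">Home</a> <a href="/b">About</a> <a href="/c">Contact</a></div>
  <div id="story">
    <p>Extraction libraries score blocks of markup, reward long paragraphs, and penalize link-heavy regions.</p>
    <p>Commas, length, and low link density all push a container toward the top of the ranking.</p>
  </div>
</body>
"""

if __name__ == "__main__":
    for name, score in rank_containers(SAMPLE):
        print(name, score)
```

Because the ranking depends only on the input markup and a stable sort, repeated runs on the same HTML return the same order, which is the property the "deterministic" claim refers to.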

See the Docs Home for the consolidated Tutorials, Operations, Reference, and Style sections.

Choose Your Surface

| Goal | Start Here | Time | Verified Commands |
| --- | --- | --- | --- |
| Run the CLI once | CLI Fast Path | < 2 min | uv pip install article-extractor, uv run article-extractor …, head ./tmp/article-extractor-cli.md |
| Ship the FastAPI server in Docker | Docker Service | ~5 min | docker run ghcr.io/pankaj28843/article-extractor:latest, curl http://localhost:3000/health, curl -XPOST … |
| Embed the library | Python Embedding | ~5 min | uv run python - <<'PY' …, asyncio.run(fetch_remote()) |
| Tune caches, networking, diagnostics, releases | Operations Runbook | task-specific | Env vars, Docker overrides, StatsD flags, gh CLI |

Install (Any Environment)

pip install article-extractor             # CLI + library
pip install "article-extractor[server]"   # FastAPI server extras
pip install "article-extractor[all]"      # Playwright, httpx, FastAPI, fake-useragent

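Prefer uv? Run uv pip install article-extractor, or add it to pyproject.toml with uv add "article-extractor[all]". Quoting the extras keeps shells such as zsh from treating the square brackets as a glob pattern.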
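After installing, you can confirm which version landed in the active environment without importing the package. The sketch below uses the standard library's importlib.metadata; the helper name installed_version is hypothetical.

```python
from importlib import metadata


def installed_version(distribution="article-extractor"):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    print(version if version else "article-extractor is not installed")
```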

Observability & Operations

  • All runtimes honor diagnostics toggles (ARTICLE_EXTRACTOR_LOG_DIAGNOSTICS, ARTICLE_EXTRACTOR_METRICS_*).
  • Docker image ships Chromium + Playwright state persistence; the Operations Runbook shows how to mount storage, warm caches, and inspect the queue.
  • Networking, diagnostics, StatsD, validation loops, and release automation live in a single Operations Runbook.
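When debugging a deployment it helps to see which of these toggles are actually set. The sketch below filters an environment mapping for ARTICLE_EXTRACTOR_LOG_DIAGNOSTICS and anything under the ARTICLE_EXTRACTOR_METRICS_ prefix. It makes no assumption about accepted values or about which individual metrics variables exist; the Operations Runbook remains the reference for those. The helper name diagnostics_settings is hypothetical.

```python
import os

# Names taken from the bullet above; the second entry is a prefix.
TOGGLE_PREFIXES = (
    "ARTICLE_EXTRACTOR_LOG_DIAGNOSTICS",
    "ARTICLE_EXTRACTOR_METRICS_",
)


def diagnostics_settings(environ=None):
    """Return the diagnostics and metrics variables present in the environment."""
    env = os.environ if environ is None else environ
    return {
        key: value
        for key, value in sorted(env.items())
        if key.startswith(TOGGLE_PREFIXES)
    }


if __name__ == "__main__":
    for key, value in diagnostics_settings().items():
        print(f"{key}={value}")
```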

Documentation

The MkDocs site (Overview, Tutorials, Operations, Reference, Explanations) lives at https://pankaj28843.github.io/article-extractor/. If the site is unavailable, read the Markdown sources in docs/ including style-guide.md, operations.md, and content-inventory.md.

Contributing

We welcome pull requests paired with docs. Follow the Operations Runbook for validation and include real command output in the PR description when you update documentation.

License

MIT — see LICENSE.

Download files

Source Distribution

article_extractor-0.4.1.tar.gz (178.0 kB view details)

Built Distribution

article_extractor-0.4.1-py3-none-any.whl (44.6 kB view details)

File details

Details for the file article_extractor-0.4.1.tar.gz.

File metadata

  • Download URL: article_extractor-0.4.1.tar.gz
  • Size: 178.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.9.21 (publish subcommand, run in CI on Ubuntu 24.04 "noble")

File hashes

Hashes for article_extractor-0.4.1.tar.gz
Algorithm Hash digest
SHA256 f9effe98cc5b1e3322f2d94a206f2a4d1ef316b6729335f73c28e5d906ed53eb
MD5 44e635c6768bb106eb9cbbacc064ce50
BLAKE2b-256 eb9c99110ede995c23b539d13c2ffd0eb77b35917a182e73632d7065ca0a2587

File details

Details for the file article_extractor-0.4.1-py3-none-any.whl.

File metadata

  • Download URL: article_extractor-0.4.1-py3-none-any.whl
  • Size: 44.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.9.21 (publish subcommand, run in CI on Ubuntu 24.04 "noble")

File hashes

Hashes for article_extractor-0.4.1-py3-none-any.whl
Algorithm Hash digest
SHA256 53a4da07c72497b90076e6205baff1c7a7a4487125829165491ae4085d3a8d23
MD5 d07d302a66455c0605277ae0e4d9261c
BLAKE2b-256 32ff55939f7e0b7b242dfd70627b6c5d0b7820e314a1e9c44acf4bbe50e76d77
