Pure-Python article extraction library and HTTP service - Drop-in replacement for readability-js-server
Article Extractor
Article Extractor turns arbitrary HTML into deterministic Markdown ready for ingestion pipelines.
Problem: brittle scrapers collapse when paywalls or inline scripts mutate markup.
Why now: one fetcher abstraction keeps Playwright and httpx output identical across the CLI, FastAPI server, and Python API.
Outcome: verified tutorials, a single operations runbook, and concise reference tables keep teams unblocked in production.
Value At a Glance
- Deterministic Readability-style scoring tuned for long-form docs, blogs, and knowledge bases.
- GFM-compatible Markdown and sanitized HTML identical across the CLI, FastAPI server, and Python API.
- Runtime knobs for Playwright storage, cache sizing, proxies, diagnostics, and StatsD metrics.
- Test suite coverage above 93% plus documentation that records the exact commands and outputs.
See the Docs Home for the consolidated Tutorials, Operations, Reference, and Style sections.
Choose Your Surface
| Goal | Start Here | Time | Verified Commands |
|---|---|---|---|
| Run the CLI once | CLI Fast Path | < 2 min | uv pip install article-extractor, uv run article-extractor …, head ./tmp/article-extractor-cli.md |
| Ship the FastAPI server in Docker | Docker Service | ~5 min | docker run ghcr.io/pankaj28843/article-extractor:latest, curl http://localhost:3000/health, curl -XPOST … |
| Embed the library | Python Embedding | ~5 min | uv run python - <<'PY' …, asyncio.run(fetch_remote()) |
| Tune caches, networking, diagnostics, releases | Operations Runbook | task-specific | Env vars, Docker overrides, StatsD flags, gh CLI |
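After starting the Docker service listed in the table, a readiness probe against the /health endpoint can gate the first extraction request. The following is a minimal sketch using only the standard library. The URL matches the curl command in the table; the helper name `wait_for_health`, the retry policy, and the assumption that a healthy server answers with HTTP 200 are illustrative and not taken from the project's documentation.

```python
import time
import urllib.error
import urllib.request


def wait_for_health(url="http://localhost:3000/health", attempts=5, delay=1.0,
                    opener=urllib.request.urlopen, sleep=time.sleep):
    """Return True once `url` answers with HTTP 200, False after `attempts` tries.

    Hypothetical helper: treating a 200 response as "healthy" is an assumption.
    `opener` and `sleep` are injectable so the function can be exercised offline.
    """
    for attempt in range(attempts):
        try:
            with opener(url, timeout=5) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not reachable yet (connection refused, timeout, HTTP error)
        if attempt < attempts - 1:
            sleep(delay)  # wait before the next attempt
    return False
```

Calling `wait_for_health()` with no arguments polls the local service up to five times, one second apart, and returns a boolean that a deployment script can branch on.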
Install (Any Environment)
pip install article-extractor # CLI + library
pip install "article-extractor[server]" # FastAPI server extras
pip install "article-extractor[all]" # Playwright, httpx, FastAPI, fake-useragent
Quote the extras so that shells such as zsh do not interpret the square brackets as a glob pattern.
Prefer uv? Run uv pip install article-extractor or add it to pyproject.toml via uv add "article-extractor[all]".
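To confirm that one of the install commands above took effect in the active environment, the installed distribution can be queried through the standard library. This is a small sketch rather than a project-provided check; `installed_version` is a hypothetical helper name, and the distribution name is the one used in the pip commands.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution="article-extractor"):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    print(found if found is not None else "article-extractor is not installed")
```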
Observability & Operations
- All runtimes honor diagnostics toggles (ARTICLE_EXTRACTOR_LOG_DIAGNOSTICS, ARTICLE_EXTRACTOR_METRICS_*).
- Docker image ships Chromium + Playwright state persistence; the Operations Runbook shows how to mount storage, warm caches, and inspect the queue.
- Networking, diagnostics, StatsD, validation loops, and release automation live in a single Operations Runbook.
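As an illustration of passing a diagnostics toggle to the container, the sketch below assembles a `docker run` argument list with environment overrides. Several details are assumptions, not documented behavior: the value `1` for ARTICLE_EXTRACTOR_LOG_DIAGNOSTICS, the container listening on port 3000, and the helper name `docker_run_command`. The Operations Runbook remains the authority on accepted values.

```python
import shlex

# Image name as published in the "Choose Your Surface" table.
IMAGE = "ghcr.io/pankaj28843/article-extractor:latest"


def docker_run_command(env=None, host_port=3000, container_port=3000):
    """Build a `docker run` argument list with optional environment overrides.

    Hypothetical helper: the port mapping assumes the server listens on 3000
    inside the container.
    """
    args = ["docker", "run", "--rm", "-p", f"{host_port}:{container_port}"]
    # Sort the variables so the generated command is stable across runs.
    for name, value in sorted((env or {}).items()):
        args += ["-e", f"{name}={value}"]
    args.append(IMAGE)
    return args


if __name__ == "__main__":
    # The value "1" is an assumed way to enable the toggle.
    command = docker_run_command({"ARTICLE_EXTRACTOR_LOG_DIAGNOSTICS": "1"})
    print(shlex.join(command))
```

Returning a list instead of a single string lets the result be passed directly to `subprocess.run` without shell quoting concerns; `shlex.join` is used only for display.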
Documentation
The MkDocs site (Overview, Tutorials, Operations, Reference, Explanations) lives at https://pankaj28843.github.io/article-extractor/. If the site is unavailable, read the Markdown sources in docs/ including style-guide.md, operations.md, and content-inventory.md.
Contributing
We welcome pull requests paired with docs. Follow the Operations Runbook for validation and include real command output in the PR description when you update documentation.
License
MIT — see LICENSE.