Web scraping engine, HTML parsing, and search integration for the Matrx ecosystem
matrx-scraper
Web scraping + HTML parsing + site crawling + search client for Python. An 8-stage parser pipeline turns raw HTML into clean, AI-ready content plus structured extractions (tables, code blocks, links by category, metadata). Designed to work standalone with just httpx, with optional extras for headless browsing, PDF extraction, OCR, and a FastAPI server front-end.
Install
pip install matrx-scraper # core: HTTP fetch + parse + crawl + Brave Search
pip install "matrx-scraper[browser]" # + Playwright / curl_cffi for JS-rendered pages
pip install "matrx-scraper[pdf]" # + PyMuPDF for PDF extraction
pip install "matrx-scraper[ocr]" # + Tesseract OCR
pip install "matrx-scraper[connect]" # + matrx-connect (stream events to a Matrx app)
pip install "matrx-scraper[server]" # + FastAPI server + uvicorn + asyncpg
pip install "matrx-scraper[all]" # everything
Python 3.12+ required. Depends on matrx-utils; matrx-connect is optional.
What's in the box
- Scraping (matrx_scraper.scraper, matrx_scraper.orchestrator): scrape(url, **opts), scrape_many(urls), scrape_many_stream(urls), ScrapeResult, ScrapeOptions, ScrapeService.
- Parser pipeline (matrx_scraper.parser): 8-stage HTML pipeline: normalize → NoiseRemover → ScrapeFilter → ElementExtractor → LinkExtractor → metadata (extruct) → hashing (MinHash/SimHash) → markdown conversion. Entry points: parse_html(html, **opts) and ParserOrchestrator.
- Crawling (matrx_scraper.crawler): crawl_site(base_url), SiteCrawler. Async BFS site traversal that respects robots.txt.
- Search (matrx_scraper.search): BraveSearchClient.
- Caching (matrx_scraper.cache): CacheBackend with MemoryCache and TwoTierCache (memory + Postgres, via the optional server extras).
- Per-URL / per-domain config (matrx_scraper.domain_config): DomainConfigBackend. The default is static; a Postgres-backed variant is available via the optional extras.
- Browser automation (optional): PlaywrightBrowserPool.
- FastAPI server (optional): matrx-scraper CLI at server/__main__.py; routers under api/.
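The hashing stage in the pipeline above names SimHash. As a rough illustration of the idea only (this is not the package's implementation, and the helper names `simhash` and `hamming_distance` are made up for this sketch), a word-level 64-bit SimHash can be written with the standard library. Near-duplicate texts should end up with fingerprints that differ in only a few bits.

```python
import hashlib

BITS = 64  # fingerprint width; matches the 8-byte digest used below


def _token_hash(token: str) -> int:
    """Hash one token to a 64-bit integer (blake2b chosen for this sketch)."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def simhash(text: str) -> int:
    """Word-level SimHash: each token votes +1/-1 on every bit position."""
    votes = [0] * BITS
    for token in text.lower().split():
        h = _token_hash(token)
        for i in range(BITS):
            votes[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, vote in enumerate(votes):
        if vote > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions where two fingerprints differ."""
    return bin(a ^ b).count("1")


original = (
    "The parser pipeline normalizes raw HTML, removes navigation and "
    "advertising noise, extracts tables and code blocks, categorizes every "
    "link, reads page metadata, and finally converts the cleaned document "
    "into markdown that language models can consume directly"
)
edited = original.replace("markdown", "plaintext")  # one-word edit
unrelated = (
    "Quarterly rainfall totals across northern valleys exceeded forecasts "
    "while orchard growers postponed harvest schedules because flooded roads "
    "blocked trucks carrying fruit toward coastal markets where buyers waited "
    "several days before shipments eventually arrived damaged"
)

near = hamming_distance(simhash(original), simhash(edited))
far = hamming_distance(simhash(original), simhash(unrelated))
print(f"near-duplicate distance: {near}, unrelated distance: {far}")
```

A deduplication step would typically treat two pages as near-duplicates when the distance falls under some threshold; the right threshold depends on the tokenization and is not specified here.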
Usage
One-off scrape
# Uses top-level await: run inside an async function or an asyncio REPL (python -m asyncio).
from matrx_scraper import scrape
result = await scrape("https://example.com/article")
print(result.title)
print(result.ai_content) # clean, AI-ready markdown
print(result.links) # categorized links
print(result.tables) # parsed tables
print(result.organized_data) # structured JSON of the page
ScrapeResult is a rich dataclass with ~20 fields: url, success, content_type, title, ai_content, ai_research_content, markdown_renderable, organized_data, tables, code_blocks, links, hashes, and more.
Parse raw HTML (no HTTP)
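from pathlib import Path
from matrx_scraper import parse_html
# read_text closes the file for us and makes the encoding explicit
parsed = parse_html(Path("page.html").read_text(encoding="utf-8"))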
print(parsed.main_content)
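To give a feel for what the early pipeline stages do conceptually (dropping noise, then keeping readable text), here is a stdlib-only sketch. It is not how matrx_scraper.parser is implemented; the `NOISE_TAGS` set, the `TextExtractor` class, and `extract_text` are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser

# Assumption for this sketch: these tags never hold main content.
NOISE_TAGS = {"script", "style", "nav", "footer", "aside"}


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping anything nested inside a noise tag."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html: str) -> str:
    """Return the visible, non-noise text of an HTML document, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


sample = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<nav><a href='/'>Home</a></nav>"
    "<h1>Title</h1><p>Hello world.</p>"
    "<script>var x = 1;</script><footer>Copyright</footer>"
    "</body></html>"
)
print(extract_text(sample))
```

The real pipeline goes much further (structured tables, code blocks, link categories, markdown output); this only shows the filtering idea.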
Crawl a full site
from matrx_scraper import crawl_site
# Run inside an async function or an asyncio REPL (python -m asyncio).
async for page in crawl_site("https://example.com", max_pages=100):
    print(page.url, page.title)
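The crawler is described as a BFS traversal that respects robots.txt. The sketch below shows that general technique with the standard library only; it is synchronous, takes the link source as a callable, and is not SiteCrawler's actual code. `bfs_crawl`, `get_links`, and the in-memory `SITE` map are assumptions made for the example.

```python
from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser


def bfs_crawl(start_url, get_links, robots_txt, max_pages=100, user_agent="*"):
    """Breadth-first walk of same-host links, skipping robots.txt-disallowed URLs.

    get_links(url) returns the hrefs found on a page (hypothetical hook; a real
    crawler would fetch and parse the page here).
    """
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())  # parse rules without a network call
    host = urlparse(start_url).netloc

    queue = deque([start_url])
    seen = {start_url}
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if not robots.can_fetch(user_agent, url):
            continue  # disallowed by robots.txt
        visited.append(url)
        for href in get_links(url):
            link = urljoin(url, href)
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# A tiny in-memory "site" standing in for real HTTP fetches.
SITE = {
    "https://example.com/": ["/a", "/b", "/private/x"],
    "https://example.com/a": ["/c"],
    "https://example.com/b": ["/c", "https://other.example/"],
    "https://example.com/c": [],
}
ROBOTS = "User-agent: *\nDisallow: /private/\n"

pages = bfs_crawl("https://example.com/", lambda u: SITE.get(u, []), ROBOTS)
print(pages)
```

The `seen` set is what keeps the traversal from revisiting `/c`, and the queue (rather than a stack) is what makes pages closer to the start URL come out first.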
Brave Search
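import os
from matrx_scraper.search import BraveSearchClient
# Assumes the key is in the BRAVE_API_KEY environment variable; pass your own settings value if you keep it elsewhere.
client = BraveSearchClient(api_key=os.environ["BRAVE_API_KEY"])
# Run inside an async function or an asyncio REPL (python -m asyncio).
results = await client.search("matrx-scraper python")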
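For orientation, this is roughly what a request to the Brave Web Search API looks like on the wire. The sketch only builds the request with the standard library and does not send it; how BraveSearchClient constructs its calls internally is not documented here, so treat `build_brave_request` and its `count` default as illustrative assumptions.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Public Brave Web Search endpoint.
BRAVE_WEB_SEARCH_URL = "https://api.search.brave.com/res/v1/web/search"


def build_brave_request(query: str, api_key: str, count: int = 10) -> Request:
    """Build (but do not send) a GET request for a Brave web search."""
    params = urlencode({"q": query, "count": count})
    return Request(
        f"{BRAVE_WEB_SEARCH_URL}?{params}",
        headers={
            "Accept": "application/json",
            "X-Subscription-Token": api_key,  # Brave's API-key header
        },
    )


request = build_brave_request("matrx-scraper python", api_key="YOUR_KEY")
print(request.full_url)
```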
Integration with a Matrx host
When used inside a host that has matrx-connect available, you can stream scrape progress as typed events:
import matrx_scraper
matrx_scraper.configure_ext(
    info_payload_cls=InfoPayload,
    warning_payload_cls=WarningPayload,
    # … other Matrx event types
)
After this, scrape_many_stream and ScrapeService will emit matrx-connect event payloads. If configure_ext is not called, the package still works — it just doesn't emit stream events.
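The "works without configuration, emits events once configured" behavior is a common optional-integration pattern. A minimal sketch of that pattern follows, under the assumption that the host supplies a payload class and somewhere to send it. None of these names (`configure_events`, `emit_info`, `scrape_step`, the stand-in `InfoPayload`) come from matrx-scraper or matrx-connect.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class InfoPayload:
    """Hypothetical stand-in for a host-provided event payload type."""
    message: str


# Module-level registration slots; empty until the host configures them.
_info_cls: Optional[type] = None
_sink: Optional[Callable[[object], None]] = None


def configure_events(info_payload_cls: type, sink: Callable[[object], None]) -> None:
    """Register the payload class and the callable that receives events."""
    global _info_cls, _sink
    _info_cls, _sink = info_payload_cls, sink


def emit_info(message: str) -> bool:
    """Emit an info event if configured; otherwise do nothing."""
    if _info_cls is None or _sink is None:
        return False
    _sink(_info_cls(message=message))
    return True


def scrape_step(url: str) -> dict:
    """Placeholder for real work: the result is the same with or without events."""
    emit_info(f"fetching {url}")
    return {"url": url, "success": True}


first = scrape_step("https://example.com/a")   # not configured: no event emitted
events: list = []
configure_events(InfoPayload, events.append)
second = scrape_step("https://example.com/b")  # configured: one event collected
print(first, second, events)
```

Keeping the emit call a silent no-op when nothing is registered is what lets the core package avoid a hard dependency on the event library.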
Dependency posture
Core dependencies are a small set of well-known libraries (httpx, beautifulsoup4, selectolax, markdownify, tldextract, tabulate, python-dotenv) plus matrx-utils. All heavier dependencies (Playwright, PyMuPDF, Tesseract, FastAPI) live behind optional extras so lean installs stay lean.
Migration notes
This package replaces the legacy root-level scraper/ folder in the aidream monorepo and parts of research/. Internal docs (MIGRATION_STATUS.md, GAPS_TO_FIX.md, LEGACY_AUDIT.md, MIGRATION_GUIDE.md) track what has been ported and what hasn't.
Contributing
See CLAUDE.md for package-specific rules. This package lives in the aidream monorepo at github.com/AI-Matrix-Engine/aidream-current.
License
MIT.