Web scraping engine, HTML parsing, and search integration for the Matrx ecosystem

matrx-scraper

Web scraping + HTML parsing + site crawling + search client for Python. An 8-stage parser pipeline turns raw HTML into clean, AI-ready content plus structured extractions (tables, code blocks, links by category, metadata). Designed to work standalone with just httpx, with optional extras for headless browsing, PDF extraction, OCR, and a FastAPI server front-end.

Install

pip install matrx-scraper                  # core: HTTP fetch + parse + crawl + Brave Search
pip install "matrx-scraper[browser]"       # + Playwright / curl_cffi for JS-rendered pages
pip install "matrx-scraper[pdf]"           # + PyMuPDF for PDF extraction
pip install "matrx-scraper[ocr]"           # + Tesseract OCR
pip install "matrx-scraper[connect]"       # + matrx-connect (stream events to a Matrx app)
pip install "matrx-scraper[server]"        # + FastAPI server + uvicorn + asyncpg
pip install "matrx-scraper[all]"           # everything

Python 3.12+ required. Depends on matrx-utils; matrx-connect is optional.

What's in the box

Scraping (matrx_scraper.scraper, matrx_scraper.orchestrator): scrape(url, **opts), scrape_many(urls), scrape_many_stream(urls), ScrapeResult, ScrapeOptions, ScrapeService.
Parser pipeline (matrx_scraper.parser): 8-stage HTML pipeline — normalize → NoiseRemover → ScrapeFilter → ElementExtractor → LinkExtractor → metadata (extruct) → hashing (MinHash/SimHash) → markdown conversion. Entry points: parse_html(html, **opts) and ParserOrchestrator.
Crawling (matrx_scraper.crawler): crawl_site(base_url), SiteCrawler — async BFS site traversal, respects robots.txt.
Search (matrx_scraper.search): BraveSearchClient.
Caching (matrx_scraper.cache): CacheBackend with MemoryCache and TwoTierCache (memory + Postgres, via the optional server extras).
Per-URL / per-domain config (matrx_scraper.domain_config): DomainConfigBackend. The default backend is static; a Postgres-backed variant is available via the optional extras.
Browser automation (optional): PlaywrightBrowserPool.
FastAPI server (optional): matrx-scraper CLI at server/__main__.py; routers under api/.
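
The hashing stage in the parser pipeline above names SimHash. As a rough illustration of what such a fingerprint does, here is a minimal stdlib-only sketch. It is not the package's implementation: the tokenizer, the 64-bit width, and the function names `simhash` and `hamming_distance` are all assumptions made for this example.

```python
import hashlib
import re

BITS = 64  # fingerprint width chosen for this sketch


def simhash(text: str) -> int:
    """Return a 64-bit SimHash fingerprint built from lowercase word tokens."""
    weights = [0] * BITS
    for token in re.findall(r"\w+", text.lower()):
        # Hash each token to a stable 64-bit integer.
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=BITS // 8).digest()
        value = int.from_bytes(digest, "big")
        # Each set bit votes +1 for its position, each clear bit votes -1.
        for i in range(BITS):
            weights[i] += 1 if (value >> i) & 1 else -1
    # An output bit is set where the votes for that position are positive.
    return sum(1 << i for i in range(BITS) if weights[i] > 0)


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions at which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two pages whose fingerprints differ in only a few bits are likely near-duplicates; the threshold to use depends on the corpus and is left open here.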

Usage

One-off scrape

import asyncio

from matrx_scraper import scrape

async def main() -> None:
    result = await scrape("https://example.com/article")
    print(result.title)
    print(result.ai_content)           # clean, AI-ready markdown
    print(result.links)                # categorized links
    print(result.tables)               # parsed tables
    print(result.organized_data)       # structured JSON of the page

asyncio.run(main())

ScrapeResult is a rich dataclass with ~20 fields: url, success, content_type, title, ai_content, ai_research_content, markdown_renderable, organized_data, tables, code_blocks, links, hashes, and more.
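
Since the result is described as a dataclass with named fields, one way to persist a subset of it is to copy those attributes into a plain dict. The sketch below is an assumption-laden illustration: `to_record` and `FIELDS` are hypothetical names, the field types are not documented here, and a `SimpleNamespace` stands in for a real `ScrapeResult` so the example runs without the package.

```python
import json
from types import SimpleNamespace
from typing import Any

# Field names taken from the list above; values are passed through unchanged
# because their types are not specified in this README.
FIELDS = ("url", "success", "content_type", "title", "ai_content", "tables", "code_blocks", "links")


def to_record(result: Any) -> dict:
    """Copy selected attributes into a dict, using None for any that are missing."""
    return {name: getattr(result, name, None) for name in FIELDS}


# Stand-in object for demonstration; a real ScrapeResult would come from scrape().
fake = SimpleNamespace(
    url="https://example.com/article",
    success=True,
    title="Example",
    ai_content="# Example",
)
record = to_record(fake)
print(json.dumps(record, indent=2))
```

Checking `record["success"]` before using the content fields is a sensible guard, though how failures are reported by the real object is not covered here.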

Parse raw HTML (no HTTP)

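from pathlib import Path

from matrx_scraper import parse_html

# read_text closes the file after reading (UTF-8 assumed for the saved page).
parsed = parse_html(Path("page.html").read_text(encoding="utf-8"))
print(parsed.main_content)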

Crawl a full site

from matrx_scraper import crawl_site

# Run inside an async function (for example via asyncio.run(main())).
async for page in crawl_site("https://example.com", max_pages=100):
    print(page.url, page.title)
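
For readers curious what an async breadth-first crawl that honors robots.txt looks like in outline, here is a small sketch using only the standard library. It is not how `SiteCrawler` is implemented. The in-memory `SITE` map, the `ROBOTS_TXT` string, and the functions `fetch_links` and `bfs_crawl` are hypothetical stand-ins for real HTTP fetching and link extraction.

```python
import asyncio
from collections import deque
from urllib.robotparser import RobotFileParser

# Hypothetical site: URL -> links found on that page (replaces fetch + parse).
SITE = {
    "https://example.com/": ["https://example.com/a", "https://example.com/private/x"],
    "https://example.com/a": ["https://example.com/b", "https://example.com/"],
    "https://example.com/b": [],
    "https://example.com/private/x": [],
}
ROBOTS_TXT = "User-agent: *\nDisallow: /private/\n"


async def fetch_links(url: str) -> list:
    """Placeholder for a real fetch; returns the outgoing links of a page."""
    await asyncio.sleep(0)
    return SITE.get(url, [])


async def bfs_crawl(start: str, max_pages: int = 100) -> list:
    """Visit pages in breadth-first order, skipping URLs robots.txt disallows."""
    robots = RobotFileParser()
    robots.parse(ROBOTS_TXT.splitlines())
    queue, seen, visited = deque([start]), {start}, []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if not robots.can_fetch("example-crawler", url):
            continue  # disallowed path: neither visited nor expanded
        visited.append(url)
        for link in await fetch_links(url):
            if link not in seen:  # enqueue each URL at most once
                seen.add(link)
                queue.append(link)
    return visited


pages = asyncio.run(bfs_crawl("https://example.com/"))
print(pages)
```

The `seen` set is what keeps the traversal from looping on pages that link back to each other; `max_pages` bounds the total work, mirroring the `max_pages` argument shown in the usage above.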

Brave Search

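import os

from matrx_scraper.search import BraveSearchClient

# Read the API key from the environment (the variable name is an assumption; use your own config).
client = BraveSearchClient(api_key=os.environ["BRAVE_API_KEY"])
# Run inside an async function: search() is awaited.
results = await client.search("matrx-scraper python")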

Integration with a Matrx host

When used inside a host that has matrx-connect available, you can stream scrape progress as typed events:

import matrx_scraper

# InfoPayload, WarningPayload, etc. are the event payload classes supplied by the host.
matrx_scraper.configure_ext(
    info_payload_cls=InfoPayload,
    warning_payload_cls=WarningPayload,
    # … other Matrx event types
)

After this, scrape_many_stream and ScrapeService will emit matrx-connect event payloads. If configure_ext is not called, the package still works — it just doesn't emit stream events.
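
The "works without configuration, emits events once configured" behavior described above is a common optional-hook pattern. The sketch below shows the general idea only. Every name in it (`configure`, `report_progress`, `DemoInfo`, the module-level registry) is hypothetical and not taken from matrx-scraper or matrx-connect.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

# Hypothetical module-level registry; stays empty until configure() is called.
_emit: Optional[Callable[[Any], None]] = None
_info_cls: Optional[type] = None


def configure(info_payload_cls: type, emit: Callable[[Any], None]) -> None:
    """Register a payload class and a sink that receives event instances."""
    global _emit, _info_cls
    _info_cls, _emit = info_payload_cls, emit


def report_progress(message: str) -> bool:
    """Emit an info event if configured; otherwise do nothing.

    Returns True when an event was sent, False when it was skipped.
    """
    if _emit is None or _info_cls is None:
        return False
    _emit(_info_cls(message=message))
    return True


@dataclass
class DemoInfo:
    """Stand-in payload type for this example."""
    message: str


sent_before = report_progress("fetching")  # not configured yet: silently skipped
events: list = []
configure(DemoInfo, events.append)
sent_after = report_progress("fetching")   # now delivered to the sink
```

Keeping the check inside the emitting function means callers never need to know whether a host has wired up events.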

Dependency posture

Core dependencies are a small set of well-known libraries (httpx, beautifulsoup4, selectolax, markdownify, tldextract, tabulate, python-dotenv) plus matrx-utils. All heavier dependencies (Playwright, PyMuPDF, Tesseract, FastAPI) live behind optional extras so lean installs stay lean.
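
Packages that keep heavy dependencies behind extras often guard the import and point the user at the right extra when it is missing. The following is a generic sketch of that guard, not code from matrx-scraper. `require_extra` is a hypothetical helper, and which modules the package actually gates this way is an assumption.

```python
import importlib
import importlib.util


def require_extra(module_name: str, extra: str):
    """Import a top-level optional module, or raise an error naming the extra to install."""
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f'{module_name} is not installed; try: pip install "matrx-scraper[{extra}]"'
        )
    return importlib.import_module(module_name)
```

A call such as `require_extra("playwright", "browser")` would then either return the module or fail with a message that tells the user exactly which install command to run.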

Migration notes

This package replaces the legacy root-level scraper/ folder in the aidream monorepo and parts of research/. Internal docs (MIGRATION_STATUS.md, GAPS_TO_FIX.md, LEGACY_AUDIT.md, MIGRATION_GUIDE.md) track what has been ported and what hasn't.

Contributing

See CLAUDE.md for package-specific rules. This package lives in the aidream monorepo at github.com/AI-Matrix-Engine/aidream-current.

License

MIT.

Project details

This version

0.1.0

Jun 10, 2026

Download files

Source Distribution

matrx_scraper-0.1.0.tar.gz (179.9 kB view details)

Uploaded Jun 10, 2026 Source

Built Distribution

matrx_scraper-0.1.0-py3-none-any.whl (167.2 kB view details)

Uploaded Jun 10, 2026 Python 3

File details

Details for the file matrx_scraper-0.1.0.tar.gz.

File metadata

Download URL: matrx_scraper-0.1.0.tar.gz
Upload date: Jun 10, 2026
Size: 179.9 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for matrx_scraper-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`b564ccbc3a4e3d4b9db5395efef4861c5f9930cc29c252dc58a0976e0330afd8`
MD5	`f1d1d8ba03f1b818f59106e8e5bca069`
BLAKE2b-256	`ee660414d26011c3078c4ebf32d51326d6e40f9552440a170d0733abc316a1b4`
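
A downloaded file can be checked against the SHA256 digest listed above with a few lines of standard-library Python. This is a minimal sketch; the helper names `sha256_of` and `matches` and the chunk size are choices made for this example.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: Path, expected_hex: str) -> bool:
    """Compare the file's digest with the published one, ignoring case and whitespace."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())
```

For the source distribution this would be called as `matches(Path("matrx_scraper-0.1.0.tar.gz"), "b564ccbc3a4e3d4b9db5395efef4861c5f9930cc29c252dc58a0976e0330afd8")`, assuming the file sits in the current directory.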

Provenance

The following attestation bundles were made for matrx_scraper-0.1.0.tar.gz:

Publisher: publish-package.yml on AI-Matrix-Engine/aidream

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: matrx_scraper-0.1.0.tar.gz
- Subject digest: b564ccbc3a4e3d4b9db5395efef4861c5f9930cc29c252dc58a0976e0330afd8
- Sigstore transparency entry: 1781474329
- Sigstore integration time: Jun 10, 2026
Source repository:
- Permalink: AI-Matrix-Engine/aidream@d2e7b5fdf4a4d03309bd9f67b793b3c490aef7d4
- Branch / Tag: refs/tags/matrx-scraper/v0.1.0
- Owner: https://github.com/AI-Matrix-Engine
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-package.yml@d2e7b5fdf4a4d03309bd9f67b793b3c490aef7d4
- Trigger Event: push

File details

Details for the file matrx_scraper-0.1.0-py3-none-any.whl.

File metadata

Download URL: matrx_scraper-0.1.0-py3-none-any.whl
Upload date: Jun 10, 2026
Size: 167.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for matrx_scraper-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`80fecaaa6631990f787dec4ef8b20fa47a2615975595db669ef63d8e7fd673a1`
MD5	`603e1840821f0d15022963f930741b9b`
BLAKE2b-256	`8845f46c3450999c7af037bd2e5f4a42aabe87d6dce7b78b9cd4a49c36e6c33b`

Provenance

The following attestation bundles were made for matrx_scraper-0.1.0-py3-none-any.whl:

Publisher: publish-package.yml on AI-Matrix-Engine/aidream

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: matrx_scraper-0.1.0-py3-none-any.whl
- Subject digest: 80fecaaa6631990f787dec4ef8b20fa47a2615975595db669ef63d8e7fd673a1
- Sigstore transparency entry: 1781474441
- Sigstore integration time: Jun 10, 2026
Source repository:
- Permalink: AI-Matrix-Engine/aidream@d2e7b5fdf4a4d03309bd9f67b793b3c490aef7d4
- Branch / Tag: refs/tags/matrx-scraper/v0.1.0
- Owner: https://github.com/AI-Matrix-Engine
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-package.yml@d2e7b5fdf4a4d03309bd9f67b793b3c490aef7d4
- Trigger Event: push
