Web scraping engine, HTML parsing, and search integration for the Matrx ecosystem

matrx-scraper

Web scraping, HTML parsing, site crawling, and a search client for Python. An 8-stage parser pipeline turns raw HTML into clean, AI-ready content plus structured extractions (tables, code blocks, links by category, metadata). The core install fetches pages with httpx alone; optional extras add headless browsing, PDF extraction, OCR, and a FastAPI server front end.

Install

pip install matrx-scraper                  # core: HTTP fetch + parse + crawl + Brave Search
pip install "matrx-scraper[browser]"       # + Playwright / curl_cffi for JS-rendered pages
pip install "matrx-scraper[pdf]"           # + PyMuPDF for PDF extraction
pip install "matrx-scraper[ocr]"           # + Tesseract OCR
pip install "matrx-scraper[connect]"       # + matrx-connect (stream events to a Matrx app)
pip install "matrx-scraper[server]"        # + FastAPI server + uvicorn + asyncpg
pip install "matrx-scraper[all]"           # everything

Python 3.12+ required. Depends on matrx-utils; matrx-connect is optional.

What's in the box

  • Scraping (matrx_scraper.scraper, matrx_scraper.orchestrator): scrape(url, **opts), scrape_many(urls), scrape_many_stream(urls), ScrapeResult, ScrapeOptions, ScrapeService.
  • Parser pipeline (matrx_scraper.parser): 8-stage HTML pipeline: normalize → NoiseRemover → ScrapeFilter → ElementExtractor → LinkExtractor → metadata (extruct) → hashing (MinHash/SimHash) → markdown conversion. Entry points: parse_html(html, **opts) and ParserOrchestrator.
  • Crawling (matrx_scraper.crawler): crawl_site(base_url), SiteCrawler — async BFS site traversal, respects robots.txt.
  • Search (matrx_scraper.search): BraveSearchClient.
  • Caching (matrx_scraper.cache): CacheBackend with MemoryCache and TwoTierCache (memory + Postgres, via the optional server extras).
  • Per-URL / per-domain config (matrx_scraper.domain_config): DomainConfigBackend — default is static, Postgres-backed variant available via the optional extras.
  • Browser automation (optional): PlaywrightBrowserPool.
  • FastAPI server (optional): matrx-scraper CLI at server/__main__.py; routers under api/.
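
The staged design of the parser pipeline can be illustrated with a small stand-alone sketch. It is not the package's implementation: the Page container, the four toy stages, and run_pipeline are hypothetical names, and the regex-based cleanup stands in for the real NoiseRemover, LinkExtractor, and markdown stages. It only shows the pattern of passing one object through an ordered list of stage functions.

```python
import re
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Page:
    """Hypothetical container handed from stage to stage."""
    html: str
    text: str = ""
    links: list[str] = field(default_factory=list)


Stage = Callable[[Page], Page]


def normalize(page: Page) -> Page:
    # Collapse whitespace runs so later stages see predictable input.
    page.html = re.sub(r"\s+", " ", page.html).strip()
    return page


def remove_noise(page: Page) -> Page:
    # Drop <script> and <style> elements together with their contents.
    page.html = re.sub(r"<(script|style)\b.*?</\1>", "", page.html, flags=re.I | re.S)
    return page


def extract_links(page: Page) -> Page:
    # Collect double-quoted href values; a real extractor would use an HTML parser.
    page.links = re.findall(r'href="([^"]+)"', page.html)
    return page


def to_text(page: Page) -> Page:
    # Strip remaining tags; a real final stage would emit markdown instead.
    page.text = re.sub(r"<[^>]+>", " ", page.html)
    page.text = re.sub(r"\s+", " ", page.text).strip()
    return page


STAGES: list[Stage] = [normalize, remove_noise, extract_links, to_text]


def run_pipeline(html: str, stages: list[Stage] = STAGES) -> Page:
    # Each stage receives the output of the previous one, in order.
    page = Page(html=html)
    for stage in stages:
        page = stage(page)
    return page
```

Because every stage shares one signature, stages can be reordered, removed, or replaced without touching the runner, which is the main appeal of this layout.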

Usage

One-off scrape

import asyncio

from matrx_scraper import scrape


async def main() -> None:
    # scrape() is a coroutine, so it must be awaited inside an event loop.
    result = await scrape("https://example.com/article")
    print(result.title)
    print(result.ai_content)           # clean, AI-ready markdown
    print(result.links)                # categorized links
    print(result.tables)               # parsed tables
    print(result.organized_data)       # structured JSON of the page


asyncio.run(main())

ScrapeResult is a rich dataclass with ~20 fields: url, success, content_type, title, ai_content, ai_research_content, markdown_renderable, organized_data, tables, code_blocks, links, hashes, and more.

Parse raw HTML (no HTTP)

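from pathlib import Path

from matrx_scraper import parse_html

# read_text() opens and closes the file for us.
html = Path("page.html").read_text(encoding="utf-8")
parsed = parse_html(html)
print(parsed.main_content)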

Crawl a full site

import asyncio
from matrx_scraper import crawl_site

async def main() -> None:
    async for page in crawl_site("https://example.com", max_pages=100):
        print(page.url, page.title)

asyncio.run(main())
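
To make the crawling behaviour concrete, here is a minimal, network-free sketch of a breadth-first traversal that honours robots.txt. It is an illustration under stated assumptions, not SiteCrawler's code: the SITE link graph, ROBOTS_TXT, the "sketch-bot" agent name, and bfs_crawl are all made up for the example. The only library calls are the standard library's urllib.robotparser and urllib.parse.

```python
from collections import deque
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

# Toy in-memory "site": each URL maps to the hrefs found on that page.
SITE = {
    "https://example.com/": ["/docs", "/private/admin", "/blog"],
    "https://example.com/docs": ["/docs/install", "/"],
    "https://example.com/blog": ["/docs"],
    "https://example.com/docs/install": [],
    "https://example.com/private/admin": ["/secret"],
}

ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def bfs_crawl(start: str, max_pages: int = 100, agent: str = "sketch-bot") -> list[str]:
    robots = RobotFileParser()
    robots.parse(ROBOTS_TXT.splitlines())  # parse rules without any network access

    visited: list[str] = []
    seen = {start}
    queue = deque([start])
    while queue and len(visited) < max_pages:
        url = queue.popleft()  # FIFO order gives breadth-first traversal
        if not robots.can_fetch(agent, url):
            continue  # disallowed pages are neither visited nor expanded
        visited.append(url)
        for href in SITE.get(url, []):
            link = urljoin(url, href)
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

The seen set prevents revisiting pages that link back to each other, and max_pages bounds the crawl in the same spirit as the max_pages argument shown above.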

Brave Search

import os
from matrx_scraper.search import BraveSearchClient

# Call from inside a coroutine; here the API key is read from the environment.
client = BraveSearchClient(api_key=os.environ["BRAVE_API_KEY"])
results = await client.search("matrx-scraper python")

Integration with a Matrx host

When used inside a host that has matrx-connect available, you can stream scrape progress as typed events:

import matrx_scraper

matrx_scraper.configure_ext(
    info_payload_cls=InfoPayload,
    warning_payload_cls=WarningPayload,
    # … other Matrx event types
)

After this, scrape_many_stream and ScrapeService will emit matrx-connect event payloads. If configure_ext is not called, the package still works — it just doesn't emit stream events.
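
The "works with or without configuration" behaviour described above follows a common optional-hook pattern, sketched below. Everything in it is hypothetical (InfoEvent, configure, emit_info, process, and the sink callback); it does not reproduce configure_ext or any matrx-connect type. It only shows how a library can emit events when a host registers payload classes and stay silent otherwise.

```python
from dataclasses import dataclass
from typing import Any, Callable


# Hypothetical stand-in for a host-provided payload class.
@dataclass
class InfoEvent:
    message: str


# Module-level registry; empty until the host calls configure().
_config: dict[str, Any] = {"info_cls": None, "sink": None}


def configure(info_cls: type, sink: Callable[[Any], None]) -> None:
    """Register the payload class and the callable that receives events."""
    _config["info_cls"] = info_cls
    _config["sink"] = sink


def emit_info(message: str) -> bool:
    """Emit an event if configured; return whether anything was emitted."""
    info_cls, sink = _config["info_cls"], _config["sink"]
    if info_cls is None or sink is None:
        return False  # unconfigured: do nothing, never raise
    sink(info_cls(message))
    return True


def process(urls: list[str]) -> list[str]:
    # The work itself never depends on whether events are being emitted.
    done = []
    for url in urls:
        emit_info(f"scraping {url}")
        done.append(url)
    return done
```

Keeping the emit call a no-op when nothing is registered means the core code path needs no conditional imports of the host's event types.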

Dependency posture

Core dependencies are a small set of well-known libraries (httpx, beautifulsoup4, selectolax, markdownify, tldextract, tabulate, python-dotenv) plus matrx-utils. All heavier dependencies (Playwright, PyMuPDF, Tesseract, FastAPI) live behind optional extras so lean installs stay lean.
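
A caller who wants to check at runtime which optional extras are usable can probe for their modules without importing them. The sketch below uses only importlib.util.find_spec from the standard library. The EXTRA_MODULES mapping is an assumption made for illustration (for instance, that PyMuPDF is importable as fitz); it is not read from the package's metadata, and has_modules and available_extras are hypothetical helper names.

```python
from importlib.util import find_spec

# Assumed mapping from extra name to the top-level modules it would provide.
EXTRA_MODULES = {
    "browser": ["playwright"],
    "pdf": ["fitz"],
    "server": ["fastapi", "uvicorn"],
}


def has_modules(modules: list[str]) -> bool:
    # find_spec returns None for a missing top-level module and does not import it.
    return all(find_spec(name) is not None for name in modules)


def available_extras() -> list[str]:
    """Return the extras whose modules can all be located in this environment."""
    return [extra for extra, mods in EXTRA_MODULES.items() if has_modules(mods)]


if __name__ == "__main__":
    print("usable extras:", available_extras())
```

Probing with find_spec keeps startup cheap, since heavy packages are only imported once a feature that needs them is actually used.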

Migration notes

This package replaces the legacy root-level scraper/ folder in the aidream monorepo and parts of research/. Internal docs (MIGRATION_STATUS.md, GAPS_TO_FIX.md, LEGACY_AUDIT.md, MIGRATION_GUIDE.md) track what has been ported and what hasn't.

Contributing

See CLAUDE.md for package-specific rules. This package lives in the aidream monorepo at github.com/AI-Matrix-Engine/aidream-current.

License

MIT.
