Opinionated CLI wrapper around trafilatura for LLM-oriented webpage/site dumps.

sitemix

sitemix is a small, opinionated CLI that turns a webpage or a small website into a single LLM-oriented dump file.

Core extraction is powered by trafilatura.

License and dependency note

  • sitemix code: MIT (see LICENSE)
  • trafilatura: Apache 2.0
  • sitemix is an opinionated wrapper around trafilatura; MIT applies only to the sitemix codebase.

What it does

  • sitemix page URL: extract one page into Markdown (default), JSON, or XML.
  • sitemix site URL: crawl a small site politely and produce one unified dump file.
  • Outputs are deterministic and structured for RAG/LLM ingestion.

Quickstart

python -m venv .venv
source .venv/bin/activate
pip install -e .
sitemix --help

pipx usage:

pipx install sitemix

CLI examples

Single page:

sitemix page "https://example.com/post" \
  --format md \
  --min-text-chars 400

Site crawl:

sitemix site "https://example.com" \
  --max-pages 200 \
  --delay 1.0 \
  --concurrency 2 \
  --format md

URLs from stdin:

printf '%s\n' "https://example.com/a" "https://example.com/b" | sitemix site "https://example.com" --no-sitemap

Sitemap path or URL:

sitemix site "https://example.com" --sitemap ./urls.txt
sitemix site "https://example.com" --sitemap https://example.com/sitemap.xml

Write to stdout:

sitemix page "https://example.com" --stdout

Politeness defaults

  • robots.txt respected by default (--ignore-robots to override)
  • Delay + jitter between requests
  • Same-host crawl by default (--include-external to override)
  • Query-heavy URLs skipped by default (--allow-query-heavy to override)
  • Basic binary URL skipping

See docs/politeness.md.
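
The defaults listed above can be pictured with a small standard-library sketch. This is not sitemix's implementation: the helper names `jittered_delay` and `should_fetch`, the suffix list, and the query-parameter threshold are all hypothetical, chosen only to show how such checks might fit together. The actual rules are described in docs/politeness.md.

```python
import random
from urllib.parse import parse_qsl, urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical values for illustration only; not sitemix's real defaults.
BINARY_SUFFIXES = (".pdf", ".zip", ".png", ".jpg", ".gif", ".mp4")
MAX_QUERY_PARAMS = 3


def jittered_delay(base: float, jitter: float, rng: random.Random) -> float:
    """Return the base delay plus a random extra in [0, jitter] seconds."""
    return base + rng.uniform(0.0, jitter)


def should_fetch(url: str, start_url: str, robots: RobotFileParser,
                 agent: str = "*") -> bool:
    """Apply same-host, query-heavy, binary, and robots.txt checks to a URL."""
    target, start = urlsplit(url), urlsplit(start_url)
    if target.hostname != start.hostname:
        return False  # stay on the starting host
    if len(parse_qsl(target.query)) > MAX_QUERY_PARAMS:
        return False  # skip query-heavy URLs
    if target.path.lower().endswith(BINARY_SUFFIXES):
        return False  # skip obvious binary downloads
    return robots.can_fetch(agent, url)


# Parse robots rules from memory so the sketch needs no network access.
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])

print(should_fetch("https://example.com/post", "https://example.com", robots))
print(should_fetch("https://example.com/private/x", "https://example.com", robots))
```

A crawler built this way would sleep for `jittered_delay(...)` seconds between requests and call `should_fetch` before queueing each discovered link.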

Output formats

  • Markdown: strict delimiters for each page (--- SITEMIX_PAGE ---, --- SITEMIX_TEXT ---, --- END_SITEMIX_PAGE ---)
  • JSON: single object with tool/run metadata and pages
  • XML: single <sitemixDump> root

See docs/format.md.
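
Because the Markdown delimiters are strict, a dump can be split back into pages with a few lines of code. The sketch below assumes that each delimiter sits on its own line, that page metadata appears between `--- SITEMIX_PAGE ---` and `--- SITEMIX_TEXT ---`, and that the extracted text runs until `--- END_SITEMIX_PAGE ---`. The `url:` header line in the sample is a made-up placeholder; docs/format.md is the authority on the real layout.

```python
PAGE_START = "--- SITEMIX_PAGE ---"
TEXT_START = "--- SITEMIX_TEXT ---"
PAGE_END = "--- END_SITEMIX_PAGE ---"


def split_pages(dump: str) -> list[dict[str, str]]:
    """Split a Markdown dump into {'header', 'text'} dicts, one per page."""
    pages: list[dict[str, str]] = []
    header: list[str] = []
    body: list[str] = []
    state = "outside"
    for line in dump.splitlines():
        stripped = line.strip()
        if stripped == PAGE_START:
            header, body, state = [], [], "header"
        elif stripped == TEXT_START and state == "header":
            state = "text"
        elif stripped == PAGE_END and state == "text":
            pages.append({
                "header": "\n".join(header).strip(),
                "text": "\n".join(body).strip(),
            })
            state = "outside"
        elif state == "header":
            header.append(line)
        elif state == "text":
            body.append(line)
    return pages


# Hypothetical sample; the 'url:' header field is an assumption.
sample = """--- SITEMIX_PAGE ---
url: https://example.com/a
--- SITEMIX_TEXT ---
First page text.
--- END_SITEMIX_PAGE ---
--- SITEMIX_PAGE ---
url: https://example.com/b
--- SITEMIX_TEXT ---
Second page text.
--- END_SITEMIX_PAGE ---
"""

for page in split_pages(sample):
    print(page["header"], "->", page["text"])
```

Splitting on fixed delimiter lines rather than on Markdown headings keeps the parser independent of whatever headings the extracted page content contains.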

Development commands

pip install -e ".[dev]"
ruff check .
pytest

Homebrew distribution

A formula file lives at packaging/homebrew/sitemix.rb and is updated by release automation.

Automated release flow:

  1. Bump project.version in pyproject.toml and merge to main.
  2. .github/workflows/release-pypi.yml publishes to PyPI (trusted publishing).
  3. .github/workflows/release-homebrew.yml renders a formula from PyPI metadata and updates shaypal5/homebrew-tap.

Required setup:

  • GitHub Environment pypi configured for trusted publishing.
  • Repository secret HOMEBREW_TAP_GITHUB_TOKEN with push access to shaypal5/homebrew-tap.

Manual maintenance command:

python scripts/render_homebrew_formula.py --version X.Y.Z

Install with:

brew tap shaypal5/tap
brew install sitemix

Limitations

  • Not a massive web crawler; intended for small sites (hundreds of pages, not millions).
  • Extraction quality depends on source HTML and trafilatura behavior.
  • JS-heavy pages without server-rendered content may produce short output.

Ethics

Use sitemix only where you have the legal and ethical right to fetch and process content. Respect robots.txt, rate limits, and site terms.
