Opinionated CLI wrapper around trafilatura for LLM-oriented webpage/site dumps.
sitemix
sitemix is a small, opinionated CLI that turns a webpage or a small website into a single LLM-oriented dump file.
Core extraction is powered by trafilatura.
License and dependency note
- `sitemix` code: MIT (see `LICENSE`)
- trafilatura: Apache 2.0
- `sitemix` is an opinionated wrapper around trafilatura; MIT applies only to the `sitemix` codebase.
What it does
- `sitemix page URL`: extract one page into Markdown (default), JSON, or XML.
- `sitemix site URL`: crawl a small site politely and produce one unified dump file.
- Outputs are deterministic and structured for RAG/LLM ingestion.
Quickstart
python -m venv .venv
source .venv/bin/activate
pip install -e .
sitemix --help
pipx usage:
pipx install sitemix
CLI examples
Single page:
sitemix page "https://example.com/post" \
--format md \
--min-text-chars 400
Site crawl:
sitemix site "https://example.com" \
--max-pages 200 \
--delay 1.0 \
--concurrency 2 \
--format md
URLs from stdin:
printf '%s\n' "https://example.com/a" "https://example.com/b" | sitemix site "https://example.com" --no-sitemap
Sitemap path or URL:
sitemix site "https://example.com" --sitemap ./urls.txt
sitemix site "https://example.com" --sitemap https://example.com/sitemap.xml
Write to stdout:
sitemix page "https://example.com" --stdout
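When the dump is needed inside a Python program rather than a shell pipeline, the `--stdout` mode can be driven through `subprocess`. The following is a minimal sketch, not part of sitemix itself: the helper names `build_page_command` and `dump_page` are hypothetical, and it assumes `sitemix` is on `PATH` and that `--format`, `--min-text-chars`, and `--stdout` can be combined in one call.

```python
import subprocess


def build_page_command(url, fmt="md", min_text_chars=None):
    """Build the argv list for a `sitemix page ... --stdout` call.

    Only flags shown in the CLI examples above are used; combining them
    in a single invocation is an assumption of this sketch.
    """
    cmd = ["sitemix", "page", url, "--format", fmt, "--stdout"]
    if min_text_chars is not None:
        cmd += ["--min-text-chars", str(min_text_chars)]
    return cmd


def dump_page(url, **options):
    """Run sitemix and return whatever it writes to stdout as text."""
    result = subprocess.run(
        build_page_command(url, **options),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

Passing the command as a list avoids shell quoting problems with URLs that contain `&` or `?`.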
Politeness defaults
- Robots respected by default (`--ignore-robots` to override)
- Delay + jitter between requests
- Same-host crawl by default (`--include-external` to override)
- Query-heavy URLs skipped by default (`--allow-query-heavy` to override)
- Basic binary URL skipping
See docs/politeness.md.
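To make these defaults concrete, here is a rough sketch of how such rules could be expressed. It is an illustration only and not sitemix's actual implementation: the function names, the suffix list, the query-parameter threshold, and the jitter range are all assumptions chosen for the example, and robots handling is left out.

```python
import random
from urllib.parse import parse_qsl, urlsplit

# Hypothetical values for illustration; not sitemix's actual defaults.
BINARY_SUFFIXES = (".pdf", ".zip", ".png", ".jpg", ".gif", ".mp4", ".exe")
MAX_QUERY_PARAMS = 3


def should_fetch(url, start_url, include_external=False, allow_query_heavy=False):
    """Return True if `url` passes the same-host, query, and binary filters."""
    target, start = urlsplit(url), urlsplit(start_url)
    if target.scheme not in ("http", "https"):
        return False
    # Same-host crawl unless external links are explicitly allowed.
    if not include_external and target.hostname != start.hostname:
        return False
    # Skip URLs carrying many query parameters unless explicitly allowed.
    params = parse_qsl(target.query, keep_blank_values=True)
    if not allow_query_heavy and len(params) > MAX_QUERY_PARAMS:
        return False
    # Basic binary skipping based on the path suffix only.
    return not target.path.lower().endswith(BINARY_SUFFIXES)


def next_delay(base_delay=1.0, jitter=0.5, rng=random):
    """Return the base delay plus a random extra in [0, jitter] seconds."""
    return base_delay + rng.uniform(0.0, jitter)
```

A crawler loop would call `time.sleep(next_delay())` between requests, so that requests are spaced out but not perfectly periodic.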
Output formats
- Markdown: strict delimiters for each page (`--- SITEMIX_PAGE ---`, `--- SITEMIX_TEXT ---`, `--- END_SITEMIX_PAGE ---`)
- JSON: single object with tool/run metadata and pages
- XML: single `<sitemixDump>` root
See docs/format.md.
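Because the Markdown dump uses fixed delimiter lines, it can be split back into pages with a few lines of code. The sketch below assumes that each delimiter sits on its own line, that per-page metadata sits between `--- SITEMIX_PAGE ---` and `--- SITEMIX_TEXT ---`, and that the extracted text runs up to `--- END_SITEMIX_PAGE ---`. The exact layout is defined in docs/format.md, and `split_pages` is a hypothetical helper, not part of sitemix.

```python
PAGE_START = "--- SITEMIX_PAGE ---"
TEXT_START = "--- SITEMIX_TEXT ---"
PAGE_END = "--- END_SITEMIX_PAGE ---"


def split_pages(dump):
    """Split a Markdown dump into (header, text) pairs, one per page."""
    pages = []
    header, text, section = [], [], None
    for line in dump.splitlines():
        marker = line.strip()
        if marker == PAGE_START:
            # A new page begins: reset the buffers and collect header lines.
            header, text, section = [], [], "header"
        elif marker == TEXT_START and section == "header":
            section = "text"
        elif marker == PAGE_END and section is not None:
            pages.append(("\n".join(header).strip(), "\n".join(text).strip()))
            section = None
        elif section == "header":
            header.append(line)
        elif section == "text":
            text.append(line)
    return pages
```

Lines outside any page block are ignored, so a dump-level preamble, if present, does not end up in a page.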
Development commands
pip install -e .[dev]
ruff check .
pytest
Homebrew distribution
A formula file lives at packaging/homebrew/sitemix.rb and is updated by release automation.
Automated release flow:
- Bump `project.version` in `pyproject.toml` and merge to `main`.
- `.github/workflows/release-pypi.yml` publishes to PyPI (trusted publishing).
- `.github/workflows/release-homebrew.yml` renders a formula from PyPI metadata and updates `shaypal5/homebrew-tap`.
Required setup:
- GitHub Environment `pypi` configured for trusted publishing.
- Repository secret `HOMEBREW_TAP_GITHUB_TOKEN` with push access to `shaypal5/homebrew-tap`.
Manual maintenance command:
python scripts/render_homebrew_formula.py --version X.Y.Z
Install with:
brew tap shaypal5/tap
brew install sitemix
Limitations
- Not a massive web crawler; intended for small sites (hundreds of pages, not millions).
- Extraction quality depends on source HTML and trafilatura behavior.
- JS-heavy pages without server-rendered content may produce short output.
Ethics
Use only where you have the legal and ethical right to fetch and process content. Respect robots rules, rate limits, and site terms.
File details
Details for the file sitemix-0.2.6.tar.gz.
File metadata
- Download URL: sitemix-0.2.6.tar.gz
- Upload date:
- Size: 22.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 423a42164995afa11ce5ec334a711a734d51d93b3695bf37e99367c2664d3c1d |
| MD5 | 3774b5c25e2c59ef4323394a369d0fdc |
| BLAKE2b-256 | 7fc34736b6489df23585d92a8c2f2b0438d7a771ac7f184f242db7920badcc61 |
File details
Details for the file sitemix-0.2.6-py3-none-any.whl.
File metadata
- Download URL: sitemix-0.2.6-py3-none-any.whl
- Upload date:
- Size: 20.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9d9d4005749e9a36997d472665e3a4085e4c6fcb1c029bdfe25d78cc2de1c638 |
| MD5 | 0342e66614266f0214c86345cb534e9c |
| BLAKE2b-256 | 06c52478584c908b3f709dada94c280016c91609eb3099ccfa182351b495bf6c |
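A downloaded file can be checked against the digests listed above with Python's standard `hashlib`. This is a minimal sketch: the helper names are hypothetical, and it assumes that the BLAKE2b-256 value corresponds to BLAKE2b with a 32-byte digest.

```python
import hashlib


def file_digests(path, chunk_size=65536):
    """Return SHA256, MD5, and BLAKE2b-256 hex digests of the file at `path`."""
    hashers = {
        "SHA256": hashlib.sha256(),
        "MD5": hashlib.md5(),
        # Assumption: "BLAKE2b-256" means BLAKE2b with a 32-byte digest.
        "BLAKE2b-256": hashlib.blake2b(digest_size=32),
    }
    with open(path, "rb") as handle:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


def matches_sha256(path, expected_sha256):
    """Compare the file's SHA256 digest with an expected hex string."""
    return file_digests(path)["SHA256"] == expected_sha256.strip().lower()
```

For example, `matches_sha256("sitemix-0.2.6.tar.gz", "<digest from the table>")` returns `True` only when the local file's SHA256 equals the listed value.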