Tiny, zero-dependency crawler — fetch, parse, crawl, store, GUI. Works with existing apps.

These details have not been verified by PyPI

Project description

trawcsy

Tiny, zero-dependency crawler. Fetch, parse, crawl, sitemap, store, GUI. All stdlib.

pip install trawcsy

trawcsy https://example.com              # shorthand — JSONL to stdout
trawcsy page https://x.com -o data       # to file
trawcsy page https://x.com -f text       # plain text
trawcsy page https://x.com -f json       # JSON array
trawcsy crawl https://site.com --depth 2 # recursive (concurrent)
trawcsy crawl @urls.txt                  # URLs from file
trawcsy sitemap https://site.com/xml     # from sitemap
trawcsy gui                              # graphical interface
trawcsy cat < data.jsonl                 # read + print text

Library

from trawcsy import (
    fetch, parse, parse_html,           # single page
    save, load, dumps, loads,           # JSONL
    dumps_json_array, save_json_array,  # JSON array
    crawl, crawl_urls,                  # crawling
    parse_sitemap,                      # sitemaps
)

# single page
page = parse("https://example.com")
print(page.title, len(page.text))

# recursive crawl (concurrent, configurable workers)
pages = crawl("https://docs.python.org/3/", depth=2, max_pages=10, workers=8)

# fetch URL list concurrently
pages = crawl_urls(["https://a.com", "https://b.com"], workers=5)

# from sitemap
urls = parse_sitemap("https://site.com/sitemap.xml")
for url in urls[:10]:
    save(parse(url), path="crawl.jsonl")

# JSON array output
save_json_array(pages, path="output.json")

# pipe-friendly
save(page)                                  # → stdout JSONL
for p in load("data.jsonl"):
    print(p.text[:200])
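
The JSONL format used by `save`/`load` above stores one JSON object per line. How trawcsy serializes pages internally is not shown in this description; the sketch below is only a stdlib illustration of the format. The names `dumps_jsonl` and `loads_jsonl` and the sample records are hypothetical and are not part of the trawcsy API.

```python
import json


def dumps_jsonl(records):
    """Serialize an iterable of dicts to JSONL: one JSON object per line."""
    # json.dumps escapes embedded newlines, so each record stays on one line.
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


def loads_jsonl(text):
    """Parse JSONL text back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]


# Hypothetical records that mimic a few of the Page fields.
records = [
    {"url": "https://example.com", "title": "Example", "text": "Hello"},
    {"url": "https://example.org", "title": "Other", "text": "Line one\nLine two"},
]

blob = dumps_jsonl(records)
restored = loads_jsonl(blob)
print(restored == records)
```

Because every record sits on its own line, the output can be appended to, streamed, or filtered with line-oriented tools without loading the whole file.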

Page fields

Field	Type	Content
`.url`	`str`	Source URL
`.title`	`str`	`<title>` text
`.text`	`str`	Visible page text (block-separated by newlines)
`.links`	`list[dict]`	`{"href": str, "text": str}`
`.tables`	`list[dict]`	`{"caption": str, "headers": list, "rows": list[list]}`
`.lists`	`list[dict]`	`{"tag": "ul"
`.code`	`list[dict]`	`{"lang": str, "body": str}`
`.meta`	`dict`	Meta name/OG property → content
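
As an illustration of how fields such as `.title` and `.links` can be filled using only `html.parser`, here is a minimal sketch. It is not trawcsy's parser: `PageSketch`, `TitleLinkParser`, and `parse_html_sketch` are hypothetical names, and only two of the fields from the table are covered.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class PageSketch:
    """Hypothetical stand-in for a page object; only a subset of fields."""
    url: str
    title: str = ""
    links: list = field(default_factory=list)


class TitleLinkParser(HTMLParser):
    """Collects the <title> text and every <a href> with its visible text."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False
        self._href = None        # href of the <a> currently open, if any
        self._link_text = []     # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._link_text = []

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        if self._href is not None:
            self._link_text.append(data)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a" and self._href is not None:
            text = "".join(self._link_text).strip()
            self.links.append({"href": self._href, "text": text})
            self._href = None


def parse_html_sketch(html, url):
    parser = TitleLinkParser()
    parser.feed(html)
    parser.close()
    return PageSketch(url=url, title=parser.title.strip(), links=parser.links)


sample = (
    "<html><head><title> Demo </title></head>"
    '<body><p>See <a href="/docs">the <b>docs</b></a>.</p></body></html>'
)
page = parse_html_sketch(sample, "https://example.com")
print(page.title, page.links)
```

The link dictionaries follow the `{"href": str, "text": str}` shape from the table; relative hrefs are left unresolved here.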

CLI

trawcsy https://example.com              # shorthand JSONL to stdout
trawcsy page https://x.com -o data       # to file
trawcsy page https://x.com -f text       # plain text
trawcsy page https://x.com -f json       # JSON array
trawcsy crawl https://site.com --depth 2 --workers 10
trawcsy crawl @urls.txt                  # read URLs from file
trawcsy sitemap https://site.com/xml
trawcsy gui                              # tkinter GUI
trawcsy cat < data.jsonl                 # read + print
trawcsy completion bash|zsh              # shell completion
trawcsy --version                        # → trawcsy 0.3.0

Options: --rate SEC, --timeout SEC, --workers N
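
For the `sitemap` command and `parse_sitemap`, this description does not show how the sitemap XML is read. A stdlib-only sketch of pulling `<loc>` entries out of a standard sitemap could look like the following; `extract_sitemap_locs` and the sample document are hypothetical, and the namespace is the one defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace used by sitemaps.org <urlset> and <sitemapindex> documents.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def extract_sitemap_locs(xml_text):
    """Return the text of every <loc> element, in document order."""
    root = ET.fromstring(xml_text)
    return [
        el.text.strip()
        for el in root.iter(SITEMAP_NS + "loc")
        if el.text and el.text.strip()
    ]


sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc> https://example.com/about </loc></url>
</urlset>"""

urls = extract_sitemap_locs(sample)
print(urls)
```

The same function also returns the child sitemap URLs of a `<sitemapindex>` document, since those are stored in `<loc>` elements too.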

Why trawcsy

Zero dependencies: stdlib only (urllib, html.parser, json, xml, tkinter, concurrent)
~19 kB wheel (18.9 kB for 0.3.0): installs in <1 second
Concurrent: a thread pool speeds up multi-page crawls
80 tests: CLI, parser edge cases, concurrent crawl, URL normalization, JSON array, sitemap
Modular: import only what you need
Composable: JSONL/JSON over stdin/stdout, works with any pipeline
GUI included: run trawcsy gui for interactive browsing
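
The thread-pool approach mentioned in the list above can be sketched with `concurrent.futures`. This is an illustration of the general technique, not trawcsy's crawler: `fetch_all` and `fake_fetch` are hypothetical, and the fetch function is injected so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch_one, workers=8):
    """Apply fetch_one to every URL on a thread pool.

    pool.map yields results in input order, even though the calls
    themselves may finish in any order.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_one, urls))


def fake_fetch(url):
    # Stand-in for a real network call, so the sketch is runnable offline.
    return {"url": url, "length": len(url)}


results = fetch_all(
    ["https://a.com", "https://b.com", "https://c.com"],
    fake_fetch,
    workers=2,
)
print([r["url"] for r in results])
```

In real use, `fetch_one` might wrap `urllib.request.urlopen` with a timeout. Threads suit this workload because the time is spent waiting on network I/O rather than on CPU.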

License

MIT

Project details

This version

0.3.0

May 26, 2026

Download files

Source Distributions

No source distribution files available for this release.

Built Distribution

trawcsy-0.3.0-py3-none-any.whl (18.9 kB view details)

Uploaded May 26, 2026 Python 3

File details

Details for the file trawcsy-0.3.0-py3-none-any.whl.

File metadata

Download URL: trawcsy-0.3.0-py3-none-any.whl
Upload date: May 26, 2026
Size: 18.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.6

File hashes

Hashes for trawcsy-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`35e3492ae55f4f9e4f688bc1e0322a5e9af3fd47104a85fb14814543c85343f5`
MD5	`cfe195b96909abad657abcd6b29cc8ec`
BLAKE2b-256	`4e120f21367c6f6aedf71fa8188555201205b93a55a8c131a8320f2de47a8900`
