Tiny, zero-dependency crawler — fetch, parse, crawl, store, GUI. Works with existing apps.

Project description

trawcsy

Tiny, zero-dependency crawler. Fetch, parse, crawl, sitemap, store, GUI. All stdlib.

pip install trawcsy

trawcsy https://example.com              # shorthand — JSONL to stdout
trawcsy page https://x.com -o data       # to file
trawcsy page https://x.com -f text       # plain text
trawcsy page https://x.com -f json       # JSON array
trawcsy crawl https://site.com --depth 2 # recursive (concurrent)
trawcsy crawl @urls.txt                  # URLs from file
trawcsy sitemap https://site.com/xml     # from sitemap
trawcsy gui                              # graphical interface
trawcsy cat < data.jsonl                 # read + print text

Library

from trawcsy import (
    fetch, parse, parse_html,           # single page
    save, load, dumps, loads,           # JSONL
    dumps_json_array, save_json_array,  # JSON array
    crawl, crawl_urls,                  # crawling
    parse_sitemap,                      # sitemaps
)

# single page
page = parse("https://example.com")
print(page.title, len(page.text))

# recursive crawl (concurrent, configurable workers)
pages = crawl("https://docs.python.org/3/", depth=2, max_pages=10, workers=8)

# fetch URL list concurrently
pages = crawl_urls(["https://a.com", "https://b.com"], workers=5)

# from sitemap
urls = parse_sitemap("https://site.com/sitemap.xml")
for url in urls[:10]:
    save(parse(url), path="crawl.jsonl")

# JSON array output
save_json_array(pages, path="output.json")

# pipe-friendly
save(page)                                  # → stdout JSONL
for p in load("data.jsonl"):
    print(p.text[:200])

Page fields

Field    Type        Content
.url     str         Source URL
.title   str         <title> text
.text    str         Visible page text (blocks separated by newlines)
.links   list[dict]  {"href": str, "text": str}
.tables  list[dict]  {"caption": str, "headers": list, "rows": list[list]}
.lists   list[dict]  {"tag": "ul", ...}
.code    list[dict]  {"lang": str, "body": str}
.meta    dict        Meta name/OG property → content
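
The field layout above lends itself to simple post-processing with the standard library. The sketch below flattens the tables of stored pages into CSV. It assumes, without confirmation from this README, that each JSONL line is a JSON object whose keys match the field names without the leading dot (url, tables). The helper name tables_to_csv and the sample record are hypothetical and not part of trawcsy.

```python
import csv
import io
import json


def tables_to_csv(jsonl_text: str) -> str:
    """Flatten every table found in JSONL page records into one CSV string.

    Assumption: each line is a JSON object with "url" and "tables" keys,
    and each table has "headers" and "rows" as in the Page fields listing.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        url = record.get("url", "")
        for table in record.get("tables", []):
            headers = table.get("headers") or []
            if headers:
                writer.writerow(["url", *headers])
            for row in table.get("rows", []):
                writer.writerow([url, *row])
    return out.getvalue()


# Hypothetical record, shaped like the Page fields listing.
sample = json.dumps({
    "url": "https://example.com",
    "tables": [{
        "caption": "Prices",
        "headers": ["item", "price"],
        "rows": [["tea", "3"], ["coffee", "4"]],
    }],
})
csv_text = tables_to_csv(sample)
print(csv_text)
```

If the stored key names differ from this assumption, only the record.get(...) lookups need to change.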

CLI

trawcsy https://example.com              # shorthand JSONL to stdout
trawcsy page https://x.com -o data       # to file
trawcsy page https://x.com -f text       # plain text
trawcsy page https://x.com -f json       # JSON array
trawcsy crawl https://site.com --depth 2 --workers 10
trawcsy crawl @urls.txt                  # read URLs from file
trawcsy sitemap https://site.com/xml
trawcsy gui                              # tkinter GUI
trawcsy cat < data.jsonl                 # read + print
trawcsy completion bash|zsh              # shell completion
trawcsy --version                        # → trawcsy 0.3.0

Options: --rate SEC, --timeout SEC, --workers N
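
The README does not spell out what --rate does. A common reading is a minimum delay in seconds between requests. Under that assumption, the sketch below shows one way such throttling can be written with the standard library. The RateLimiter class is hypothetical and is not taken from trawcsy. The clock and sleep functions are injectable so the demo runs instantly.

```python
import time


class RateLimiter:
    """Enforce a minimum interval (in seconds) between successive calls."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Demo with a fake clock, so nothing actually sleeps.
fake_now = [0.0]
slept = []


def fake_clock():
    return fake_now[0]


def fake_sleep(seconds):
    slept.append(seconds)
    fake_now[0] += seconds


limiter = RateLimiter(0.5, clock=fake_clock, sleep=fake_sleep)
for _ in range(3):
    limiter.wait()  # a real crawler would fetch one URL after each wait
print(slept)
```

With the default arguments the same class uses time.monotonic and time.sleep, so a call to wait() before each request spaces requests by at least min_interval seconds.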

Why trawcsy

  • Zero dependencies — stdlib only (urllib, html.parser, json, xml, tkinter, concurrent.futures)
  • ~19 kB wheel — installs in <1 second
  • Concurrent — thread pool crawl speeds up multi-page fetches
  • 80 tests — CLI, parser edge cases, concurrent crawl, URL normalization, JSON array, sitemap
  • Modular — import only what you need
  • Composable — stdin/stdout JSONL/JSON, works with any pipeline
  • GUI included — trawcsy gui for interactive browsing
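
To illustrate the thread-pool point in the list above, here is a minimal stdlib sketch of fetching several URLs concurrently. It is not trawcsy's implementation. The names fetch_all and fake_fetch are hypothetical, and the fetcher is passed in as a plain callable so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(urls, fetch_one, workers=5):
    """Run fetch_one(url) for every URL on a thread pool.

    Returns (results, errors): two dicts keyed by URL, so one failing
    page does not abort the whole batch.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fetch_one, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # record the failure and keep going
                errors[url] = exc
    return results, errors


def fake_fetch(url):
    """Stand-in fetcher for the demo; a real one would perform an HTTP GET."""
    if "bad" in url:
        raise ValueError(f"cannot fetch {url}")
    return f"<html>{url}</html>"


results, errors = fetch_all(
    ["https://a.com", "https://b.com", "https://bad.example"],
    fake_fetch,
    workers=2,
)
print(sorted(results), sorted(errors))
```

Threads suit this workload because page fetches spend most of their time waiting on the network, so several can overlap even in a single Python process.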

License

MIT

Download files

Source Distributions

No source distribution files available for this release.

Built Distribution

trawcsy-0.3.0-py3-none-any.whl (18.9 kB)

Uploaded Python 3

File details

Details for the file trawcsy-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: trawcsy-0.3.0-py3-none-any.whl
  • Upload date:
  • Size: 18.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.6

File hashes

Hashes for trawcsy-0.3.0-py3-none-any.whl
Algorithm    Hash digest
SHA256       35e3492ae55f4f9e4f688bc1e0322a5e9af3fd47104a85fb14814543c85343f5
MD5          cfe195b96909abad657abcd6b29cc8ec
BLAKE2b-256  4e120f21367c6f6aedf71fa8188555201205b93a55a8c131a8320f2de47a8900
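
A downloaded wheel can be checked against the SHA256 digest listed above with the standard library. The sketch below is one way to do that. The helper names sha256_of and matches_expected are hypothetical, and the file path in the usage comment assumes the wheel sits in the current directory.

```python
import hashlib

# SHA256 digest copied from the hash listing for trawcsy-0.3.0-py3-none-any.whl.
EXPECTED_SHA256 = "35e3492ae55f4f9e4f688bc1e0322a5e9af3fd47104a85fb14814543c85343f5"


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """True if the file's SHA256 digest equals the expected hex digest."""
    return sha256_of(path) == expected.lower()


# Usage, assuming the wheel was downloaded to the current directory:
#     matches_expected("trawcsy-0.3.0-py3-none-any.whl")
```

pip can perform the same check at install time when a requirements file pins the digest with --hash=sha256:... and pip is run with --require-hashes.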
