Tiny, zero-dependency crawler — fetch, parse, crawl, store, GUI. Works with existing apps.
trawcsy
Tiny, zero-dependency crawler. Fetch, parse, crawl, sitemap, store, GUI. All stdlib.
```bash
pip install trawcsy
trawcsy https://example.com                  # shorthand: JSONL to stdout
trawcsy page https://x.com -o data           # to file
trawcsy page https://x.com -f text           # plain text
trawcsy page https://x.com -f json           # JSON array
trawcsy crawl https://site.com --depth 2     # recursive (concurrent)
trawcsy crawl @urls.txt                      # URLs from file
trawcsy sitemap https://site.com/xml         # from sitemap
trawcsy gui                                  # graphical interface
trawcsy cat < data.jsonl                     # read + print text
```
Library
```python
from trawcsy import (
    fetch, parse, parse_html,            # single page
    save, load, dumps, loads,            # JSONL
    dumps_json_array, save_json_array,   # JSON array
    crawl, crawl_urls,                   # crawling
    parse_sitemap,                       # sitemaps
)

# single page
page = parse("https://example.com")
print(page.title, len(page.text))

# recursive crawl (concurrent, configurable workers)
pages = crawl("https://docs.python.org/3/", depth=2, max_pages=10, workers=8)

# fetch URL list concurrently
pages = crawl_urls(["https://a.com", "https://b.com"], workers=5)

# from sitemap
urls = parse_sitemap("https://site.com/sitemap.xml")
for url in urls[:10]:
    save(parse(url), path="crawl.jsonl")

# JSON array output
save_json_array(pages, path="output.json")

# pipe-friendly
save(page)  # → stdout JSONL
for p in load("data.jsonl"):
    print(p.text[:200])
```
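To illustrate what a stdlib-only sitemap step can look like, here is a minimal sketch that pulls `<loc>` entries out of sitemap XML with `xml.etree.ElementTree`. It is not trawcsy's implementation: `sitemap_urls` and the sample document are hypothetical, and a real sitemap would first be downloaded (for example with `urllib.request`).

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(xml_text: str) -> list[str]:
    """Return every non-empty <loc> value in a sitemap or sitemap index."""
    root = ET.fromstring(xml_text)
    return [
        el.text.strip()
        for el in root.iter(f"{SITEMAP_NS}loc")
        if el.text and el.text.strip()
    ]


# Hypothetical sample input, used instead of a network request.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://site.com/</loc></url>
  <url><loc> https://site.com/docs </loc></url>
</urlset>"""

print(sitemap_urls(SAMPLE))
```

Because `root.iter` walks the whole tree, the same helper also collects `<loc>` entries from a sitemap index file.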
Page fields
| Field | Type | Content |
|---|---|---|
| `.url` | `str` | Source URL |
| `.title` | `str` | `<title>` text |
| `.text` | `str` | Visible page text (block-separated by newlines) |
| `.links` | `list[dict]` | `{"href": str, "text": str}` |
| `.tables` | `list[dict]` | `{"caption": str, "headers": list, "rows": list[list]}` |
| `.lists` | `list[dict]` | `{"tag": "ul" …` |
| `.code` | `list[dict]` | `{"lang": str, "body": str}` |
| `.meta` | `dict` | Meta name/OG property → content |
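As a rough picture of how a record with these fields maps onto one JSONL line, the sketch below models the table as a dataclass and round-trips it through `json`. `PageSketch`, `to_jsonl_line`, and `from_jsonl_line` are hypothetical names for illustration, and the sample values are made up; the real `Page` object and its serialized key names may differ.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class PageSketch:
    """Illustrative stand-in mirroring the field table (not trawcsy's class)."""
    url: str
    title: str = ""
    text: str = ""
    links: list = field(default_factory=list)
    tables: list = field(default_factory=list)
    lists: list = field(default_factory=list)
    code: list = field(default_factory=list)
    meta: dict = field(default_factory=dict)


def to_jsonl_line(page: PageSketch) -> str:
    # json.dumps escapes newlines in .text, so each record stays on one line.
    return json.dumps(asdict(page), ensure_ascii=False)


def from_jsonl_line(line: str) -> PageSketch:
    return PageSketch(**json.loads(line))


# Made-up sample record.
page = PageSketch(
    url="https://example.com",
    title="Example Domain",
    text="Example Domain\nMore information...",
    links=[{"href": "https://www.iana.org/domains/example", "text": "More information..."}],
    meta={"viewport": "width=device-width"},
)
line = to_jsonl_line(page)
restored = from_jsonl_line(line)
print(restored == page)
```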
CLI
```bash
trawcsy https://example.com                  # shorthand JSONL to stdout
trawcsy page https://x.com -o data           # to file
trawcsy page https://x.com -f text           # plain text
trawcsy page https://x.com -f json           # JSON array
trawcsy crawl https://site.com --depth 2 --workers 10
trawcsy crawl @urls.txt                      # read URLs from file
trawcsy sitemap https://site.com/xml
trawcsy gui                                  # tkinter GUI
trawcsy cat < data.jsonl                     # read + print
trawcsy completion bash|zsh                  # shell completion
trawcsy --version                            # → trawcsy 0.3.0
```
Options: --rate SEC, --timeout SEC, --workers N
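Since the CLI writes JSONL to stdout, its output can be consumed by any script that reads one JSON object per line. The following is a minimal sketch of such a consumer. It assumes each record carries `title` and `text` keys matching the Page field names, which is an assumption about the serialized format; `iter_records` and `titles_matching` are hypothetical helpers.

```python
import io
import json


def iter_records(stream):
    """Yield one dict per non-blank JSONL line."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def titles_matching(stream, keyword):
    """Return titles of records whose text contains keyword (case-insensitive)."""
    keyword = keyword.lower()
    return [
        rec.get("title", "")
        for rec in iter_records(stream)
        if keyword in rec.get("text", "").lower()
    ]


# Made-up records standing in for crawler output.
sample = io.StringIO(
    '{"url": "https://a.example", "title": "A", "text": "Python crawler notes"}\n'
    "\n"
    '{"url": "https://b.example", "title": "B", "text": "Unrelated page"}\n'
)
print(titles_matching(sample, "python"))
```

In a real pipeline, `sample` would be replaced by `sys.stdin` and the script placed after a `trawcsy` command and a pipe.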
Why trawcsy
- Zero dependencies — stdlib only (urllib, html.parser, json, xml, tkinter, concurrent)
- ~19 kB wheel: installs in <1 second
- Concurrent — thread pool crawl speeds up multi-page fetches
- 80 tests — CLI, parser edge cases, concurrent crawl, URL normalization, JSON array, sitemap
- Modular — import only what you need
- Composable — stdin/stdout JSONL/JSON, works with any pipeline
- GUI included: `trawcsy gui` for interactive browsing
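The concurrency point can be pictured with a small thread-pool sketch built on `concurrent.futures`. This is not trawcsy's crawler: `fake_fetch` is a hypothetical stand-in that sleeps instead of making a network request, and `fetch_all` is an illustrative helper. Threads help here because page fetches spend most of their time waiting on I/O.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url: str) -> dict:
    """Hypothetical fetcher: simulates network latency instead of real I/O."""
    time.sleep(0.05)
    return {"url": url, "title": url.rsplit("/", 1)[-1] or "index"}


def fetch_all(urls, workers=5, fetch=fake_fetch):
    """Fetch all URLs with a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))


urls = [f"https://site.example/page{i}" for i in range(10)]
start = time.perf_counter()
pages = fetch_all(urls, workers=5)
elapsed = time.perf_counter() - start
print(len(pages), f"{elapsed:.2f}s")
```

With five workers, the ten simulated fetches overlap, so the wall-clock time should come out well below the sum of the individual delays.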
License
MIT
Download files
Source Distributions
No source distribution files available for this release.
Built Distribution
trawcsy-0.3.0-py3-none-any.whl (18.9 kB)
File details
Details for the file trawcsy-0.3.0-py3-none-any.whl.
File metadata
- Download URL: trawcsy-0.3.0-py3-none-any.whl
- Upload date:
- Size: 18.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `35e3492ae55f4f9e4f688bc1e0322a5e9af3fd47104a85fb14814543c85343f5` |
| MD5 | `cfe195b96909abad657abcd6b29cc8ec` |
| BLAKE2b-256 | `4e120f21367c6f6aedf71fa8188555201205b93a55a8c131a8320f2de47a8900` |
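A downloaded wheel can be checked against the SHA256 digest listed above using only `hashlib`. The sketch below shows one way to do that; `sha256_of` is a hypothetical helper, and the demo hashes a temporary file rather than the actual wheel.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a throwaway file so the example runs without the wheel present.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    demo_path = tmp.name
demo_digest = sha256_of(demo_path)
os.remove(demo_path)
print(demo_digest == hashlib.sha256(b"hello").hexdigest())
```

To verify the release file, compare `sha256_of("trawcsy-0.3.0-py3-none-any.whl")` with the SHA256 value in the table.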