Tiny, zero-dependency web crawler — fetch, parse, crawl, store, GUI.
bawl
Tiny, zero-dependency crawler. Fetch, parse, crawl, sitemap, store, GUI. All stdlib.
pip install bawl
bawl https://example.com # shorthand — JSONL to stdout
bawl page https://x.com -o data # to file
bawl page https://x.com -f text # plain text
bawl page https://x.com -f json # JSON array
bawl crawl https://site.com --depth 2 # recursive (concurrent)
bawl crawl @urls.txt # URLs from file
bawl crawl --dedup # skip duplicate text content
bawl crawl --include '*.html' # only crawl matching URLs
bawl crawl --exclude '*print*' # skip matching URLs
bawl crawl --progress # live terminal status
bawl sitemap https://site.com/xml # from sitemap
bawl gui # graphical interface
bawl cat < data.jsonl # read + print text
Library
from bawl import (
    fetch, parse, parse_html,            # single page
    save, load, dumps, loads,            # JSONL
    dumps_json_array, save_json_array,   # JSON array
    crawl, crawl_urls,                   # crawling
    parse_sitemap,                       # sitemaps
)
# single page
page = parse("https://example.com")
print(page.title, len(page.text))
# recursive crawl (concurrent, configurable workers)
pages = crawl("https://docs.python.org/3/", depth=2, max_pages=10, workers=8)
# content dedup, include/exclude filters, live progress
pages = crawl("https://site.com", depth=2, dedup=True,
              include=["*.html"], exclude=["*print*"],
              on_page=lambda p: ...)
# fetch URL list concurrently
pages = crawl_urls(["https://a.com", "https://b.com"], workers=5)
# from sitemap
urls = parse_sitemap("https://site.com/sitemap.xml")
for url in urls[:10]:
    save(parse(url), path="crawl.jsonl")
# JSON array output
save_json_array(pages, path="output.json")
# pipe-friendly
save(page) # → stdout JSONL
for p in load("data.jsonl"):
    print(p.text[:200])
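Because the CLI writes JSONL to stdout, a downstream script needs only the standard library to consume it. The sketch below is not part of bawl. It assumes each line is a JSON object whose keys match the Page fields (`url`, `title`, `text`), which the source does not confirm, and the helper names `iter_records` and `summarize` are hypothetical.

```python
import io
import json


def iter_records(stream):
    """Yield one dict per non-empty JSONL line."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def summarize(stream, width=60):
    """Return (url, title, text preview) tuples, one per record."""
    out = []
    for rec in iter_records(stream):
        text = rec.get("text", "")
        out.append((rec.get("url", ""), rec.get("title", ""), text[:width]))
    return out


# Sample data standing in for crawler output; the key names are assumed.
sample = io.StringIO(
    '{"url": "https://example.com", "title": "Example", "text": "Hello world"}\n'
    "\n"
    '{"url": "https://example.com/a", "title": "A", "text": "Second page"}\n'
)
rows = summarize(sample)
for url, title, preview in rows:
    print(f"{url}\t{title}\t{preview}")
```

In a real pipeline, `sys.stdin` would take the place of `sample`, so the script can sit after a command that emits JSONL.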
Page fields
| Field | Type | Content |
|---|---|---|
| `.url` | `str` | Source URL |
| `.title` | `str` | `<title>` text |
| `.text` | `str` | Visible page text (block-separated by newlines) |
| `.links` | `list[dict]` | `{"href": str, "text": str}` |
| `.tables` | `list[dict]` | `{"caption": str, "headers": list, "rows": list[list]}` |
| `.lists` | `list[dict]` | `{"tag": "ul"` … |
| `.code` | `list[dict]` | `{"lang": str, "body": str}` |
| `.meta` | `dict` | Meta name/OG property → content |
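The `.tables` entries pair a header list with row lists, so a small helper can turn each table into one dict per row. This is a minimal sketch that assumes only the dict shape shown in the table above. `table_to_records` and the `colN` fallback names are hypothetical and not part of bawl.

```python
def table_to_records(table):
    """Convert {"headers": [...], "rows": [[...], ...]} into a list of dicts."""
    headers = table.get("headers") or []
    records = []
    for row in table.get("rows", []):
        # Use positional names when a row is wider than the header list.
        names = list(headers) + [f"col{i}" for i in range(len(headers), len(row))]
        records.append(dict(zip(names, row)))
    return records


# Hand-written sample in the documented shape, not real crawler output.
table = {
    "caption": "Ports",
    "headers": ["service", "port"],
    "rows": [["http", "80"], ["https", "443", "tls"]],
}
records = table_to_records(table)
for rec in records:
    print(rec)
```

Rows shorter than the header list simply produce dicts with fewer keys, since `zip` stops at the shorter input.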
CLI
bawl https://example.com # shorthand JSONL to stdout
bawl page https://x.com -o data # to file
bawl page https://x.com -f text # plain text
bawl page https://x.com -f json # JSON array
bawl crawl https://site.com --depth 2 --workers 10
bawl crawl @urls.txt # read URLs from file
bawl sitemap https://site.com/xml
bawl gui # tkinter GUI
bawl cat < data.jsonl # read + print
bawl completion bash|zsh # shell completion
bawl --version # → bawl 0.4.0
Options: --rate SEC, --timeout SEC, --workers N, --dedup,
--include PATTERN, --exclude PATTERN, --progress
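To make the filter and dedup options concrete, here is a rough sketch of how `--include`, `--exclude`, and `--dedup` could behave. It assumes that patterns are shell-style globs matched against the full URL and that dedup compares a hash of the page text. Neither assumption is confirmed by the source, and `url_allowed` and `dedup_by_text` are hypothetical helpers, not bawl functions.

```python
import hashlib
from fnmatch import fnmatchcase


def url_allowed(url, include=(), exclude=()):
    """Apply exclude patterns first, then require an include match if any are given."""
    if any(fnmatchcase(url, pat) for pat in exclude):
        return False
    return not include or any(fnmatchcase(url, pat) for pat in include)


def dedup_by_text(pages):
    """Keep the first page for each distinct text body, preserving order."""
    seen = set()
    unique = []
    for page in pages:
        digest = hashlib.sha256(page["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(page)
    return unique


urls = [
    "https://site.com/a.html",
    "https://site.com/print/a.html",
    "https://site.com/a.pdf",
]
kept = [u for u in urls if url_allowed(u, include=["*.html"], exclude=["*print*"])]
print(kept)
```

`fnmatchcase` is used rather than `fnmatch` so that matching stays case-sensitive and does not depend on the operating system.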
Why bawl
- Zero dependencies — stdlib only (urllib, html.parser, json, xml, tkinter, concurrent)
- ~18KB wheel — installs in <1 second
- Concurrent — thread pool crawl speeds up multi-page fetches
- 97 tests — CLI, parser edge cases, concurrent crawl, URL normalization, dedup, filters, progress
- Modular — import only what you need
- Composable — stdin/stdout JSONL/JSON, works with any pipeline
- GUI included: `bawl gui` for interactive browsing
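The concurrency point can be illustrated with the standard library's thread pool. This is a sketch of the general technique, not bawl's actual implementation. `fetch_all` and `fake_fetch` are hypothetical names, and `fake_fetch` sleeps instead of touching the network so the example runs offline.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch_one, workers=4):
    """Run fetch_one over urls in a thread pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_one, urls))


def fake_fetch(url):
    """Stand-in for a network fetch: wait briefly, then return a page-like dict."""
    time.sleep(0.05)
    return {"url": url, "text": f"body of {url}"}


urls = [f"https://example.com/{i}" for i in range(8)]
start = time.perf_counter()
pages = fetch_all(urls, fake_fetch, workers=8)
elapsed = time.perf_counter() - start
print(f"fetched {len(pages)} pages in {elapsed:.2f}s")
```

Threads suit this workload because each fetch spends most of its time waiting on I/O, so the waits overlap instead of adding up.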
License
MIT
File details
Details for the file bawl-0.4.0-py3-none-any.whl.
File metadata
- Download URL: bawl-0.4.0-py3-none-any.whl
- Upload date:
- Size: 19.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `8e6994d63a322faf13aca01c0f1b1cb95b755b113bc8fb2c88f16c3e91aae24b` |
| MD5 | `14347104c627b84386fa94f5d48d2f4c` |
| BLAKE2b-256 | `f1e2ce7b57117b7e5e5685ade67c991125a18ec509c6c6d0321aa7240d0a6eb4` |