Tiny, zero-dependency web crawler — fetch, parse, crawl, store, GUI.

These details have not been verified by PyPI

Project description

bawl

Tiny, zero-dependency crawler. Fetch, parse, crawl, sitemap, store, GUI. All stdlib.

pip install bawl

bawl https://example.com              # shorthand — JSONL to stdout
bawl page https://x.com -o data       # to file
bawl page https://x.com -f text       # plain text
bawl page https://x.com -f json       # JSON array
bawl crawl https://site.com --depth 2 # recursive (concurrent)
bawl crawl @urls.txt                  # URLs from file
bawl crawl --dedup                    # skip duplicate text content
bawl crawl --include '*.html'         # only crawl matching URLs
bawl crawl --exclude '*print*'        # skip matching URLs
bawl crawl --progress                 # live terminal status
bawl sitemap https://site.com/sitemap.xml  # from sitemap
bawl gui                              # graphical interface
bawl cat < data.jsonl                 # read + print text
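
When a pipeline needs more than `bawl cat`, the JSONL stream can also be consumed from plain Python. The following is a minimal sketch, assuming each line is one JSON object with a `text` key (as listed under Page fields). `iter_text` is a hypothetical helper for illustration, not part of bawl.

```python
import io
import json


def iter_text(stream):
    """Yield the "text" value of every JSON object in a JSONL stream."""
    for line in stream:
        line = line.strip()
        if not line:  # tolerate blank lines between records
            continue
        record = json.loads(line)
        yield record.get("text", "")


# Demo with an in-memory stream; pass sys.stdin to read piped output instead.
sample = io.StringIO(
    '{"url": "https://example.com", "text": "Hello"}\n'
    "\n"
    '{"url": "https://example.org", "text": "World"}\n'
)
texts = list(iter_text(sample))
print(texts)
```

Because the function only needs an iterable of lines, the same code works on an open file, on `sys.stdin`, or on a list of strings.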

Library

from bawl import (
    fetch, parse, parse_html,           # single page
    save, load, dumps, loads,           # JSONL
    dumps_json_array, save_json_array,  # JSON array
    crawl, crawl_urls,                  # crawling
    parse_sitemap,                      # sitemaps
)

# single page
page = parse("https://example.com")
print(page.title, len(page.text))

# recursive crawl (concurrent, configurable workers)
pages = crawl("https://docs.python.org/3/", depth=2, max_pages=10, workers=8)

# content dedup, include/exclude filters, live progress
pages = crawl("https://site.com", depth=2, dedup=True,
              include=["*.html"], exclude=["*print*"],
              on_page=lambda p: ...)

# fetch URL list concurrently
pages = crawl_urls(["https://a.com", "https://b.com"], workers=5)

# from sitemap
urls = parse_sitemap("https://site.com/sitemap.xml")
for url in urls[:10]:
    save(parse(url), path="crawl.jsonl")

# JSON array output
save_json_array(pages, path="output.json")

# pipe-friendly
save(page)                                  # → stdout JSONL
for p in load("data.jsonl"):
    print(p.text[:200])
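
To make the sitemap step concrete, here is a rough stdlib-only sketch of pulling `<loc>` entries out of sitemap XML. It illustrates the general technique and is not bawl's actual `parse_sitemap` implementation. `sitemap_urls` and the sample document are hypothetical, and the sketch works on XML text rather than fetching a URL.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return every <loc> value found in a sitemap document, in order."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://site.com/</loc></url>
  <url><loc>https://site.com/docs.html</loc></url>
</urlset>"""

urls_found = sitemap_urls(SAMPLE)
print(urls_found)
```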

Page fields

Field	Type	Content
`.url`	`str`	Source URL
`.title`	`str`	`<title>` text
`.text`	`str`	Visible page text (block-separated by newlines)
`.links`	`list[dict]`	`{"href": str, "text": str}`
`.tables`	`list[dict]`	`{"caption": str, "headers": list, "rows": list[list]}`
`.lists`	`list[dict]`	`{"tag": "ul"
`.code`	`list[dict]`	`{"lang": str, "body": str}`
`.meta`	`dict`	Meta name/OG property → content
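
The field list above can be mirrored in a small dataclass when downstream code wants a typed view of each JSONL record. This is a sketch under assumptions: `PageSketch` is a hypothetical stand-in, not bawl's own page class, and the inner shape of `lists` entries is left open because it is not fully specified above.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class PageSketch:
    """Hypothetical mirror of the documented page fields."""
    url: str
    title: str = ""
    text: str = ""
    links: list = field(default_factory=list)   # [{"href": str, "text": str}]
    tables: list = field(default_factory=list)  # [{"caption", "headers", "rows"}]
    lists: list = field(default_factory=list)   # entry shape not assumed here
    code: list = field(default_factory=list)    # [{"lang": str, "body": str}]
    meta: dict = field(default_factory=dict)    # meta name / OG property -> content


page_sketch = PageSketch(
    url="https://example.com",
    title="Example Domain",
    text="Example Domain\nMore information...",
    links=[{"href": "/about", "text": "About"}],
)

# One page per line is what makes the format JSONL.
line = json.dumps(asdict(page_sketch))
restored = PageSketch(**json.loads(line))
print(restored.title, len(restored.links))
```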

CLI

bawl https://example.com              # shorthand JSONL to stdout
bawl page https://x.com -o data       # to file
bawl page https://x.com -f text       # plain text
bawl page https://x.com -f json       # JSON array
bawl crawl https://site.com --depth 2 --workers 10
bawl crawl @urls.txt                  # read URLs from file
bawl sitemap https://site.com/sitemap.xml
bawl gui                              # tkinter GUI
bawl cat < data.jsonl                 # read + print
bawl completion bash|zsh              # shell completion
bawl --version                        # → bawl 0.4.0

Options: --rate SEC, --timeout SEC, --workers N, --dedup,
         --include PATTERN, --exclude PATTERN, --progress
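
The `--include`, `--exclude`, and `--dedup` options can be pictured with a short stdlib sketch. It assumes shell-style glob matching against the full URL and dedup by hashing page text; bawl's actual matching and dedup rules may differ. `url_allowed` and `dedup_by_text` are hypothetical names, and pages are represented as plain dicts.

```python
import hashlib
from fnmatch import fnmatchcase


def url_allowed(url, include=(), exclude=()):
    """Apply exclude patterns first, then require an include match if any are given."""
    if any(fnmatchcase(url, pattern) for pattern in exclude):
        return False
    if include:
        return any(fnmatchcase(url, pattern) for pattern in include)
    return True


def dedup_by_text(pages):
    """Yield only the first page seen for each distinct text body."""
    seen = set()
    for page in pages:
        digest = hashlib.sha256(page["text"].encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield page


candidates = [
    "https://site.com/index.html",
    "https://site.com/print/index.html",
    "https://site.com/logo.png",
]
kept = [u for u in candidates if url_allowed(u, ["*.html"], ["*print*"])]

sample_pages = [
    {"url": "https://site.com/a", "text": "same body"},
    {"url": "https://site.com/b", "text": "same body"},
    {"url": "https://site.com/c", "text": "other body"},
]
unique = list(dedup_by_text(sample_pages))
print(kept, [p["url"] for p in unique])
```

Checking exclusions before inclusions means a URL that matches both is skipped, which is the conservative choice for a crawler.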

Why bawl

Zero dependencies — stdlib only (urllib, html.parser, json, xml, tkinter, concurrent)
~18KB wheel — installs in <1 second
Concurrent — thread pool crawl speeds up multi-page fetches
97 tests — CLI, parser edge cases, concurrent crawl, URL normalization, dedup, filters, progress
Modular — import only what you need
Composable — stdin/stdout JSONL/JSON, works with any pipeline
GUI included — bawl gui for interactive browsing
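
As an illustration of the thread-pool approach mentioned in the list above, the sketch below fetches a list of URLs concurrently using only `concurrent.futures`. It is not bawl's code: `fetch_all` and `fake_fetch` are hypothetical, and the fetcher is a stub so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, workers=5):
    """Call fetch(url) for every URL on a thread pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in the order of the inputs,
        # even when individual calls finish out of order.
        return list(pool.map(fetch, urls))


def fake_fetch(url):
    """Stand-in for a real network fetch."""
    return {"url": url, "text": f"body of {url}"}


results = fetch_all(["https://a.com", "https://b.com"], fake_fetch, workers=2)
print([r["url"] for r in results])
```

Threads suit this workload because page fetches spend most of their time waiting on the network rather than on the CPU.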

License

MIT

Project details

This version: 0.4.0 (May 31, 2026)

Download files

Source Distributions

No source distribution files available for this release.

Built Distribution

bawl-0.4.0-py3-none-any.whl (19.0 kB)

Uploaded May 31, 2026 Python 3

File details

Details for the file bawl-0.4.0-py3-none-any.whl.

File metadata

Download URL: bawl-0.4.0-py3-none-any.whl
Upload date: May 31, 2026
Size: 19.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.6

File hashes

Hashes for bawl-0.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`8e6994d63a322faf13aca01c0f1b1cb95b755b113bc8fb2c88f16c3e91aae24b`
MD5	`14347104c627b84386fa94f5d48d2f4c`
BLAKE2b-256	`f1e2ce7b57117b7e5e5685ade67c991125a18ec509c6c6d0321aa7240d0a6eb4`
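
A downloaded wheel can be checked against the SHA256 digest listed above with a few lines of stdlib Python. This is a minimal sketch; `sha256_of` and `verify` are hypothetical helper names, and the file path passed in is whatever location the wheel was saved to.

```python
import hashlib

# SHA256 digest published for bawl-0.4.0-py3-none-any.whl (from the table above).
EXPECTED_SHA256 = "8e6994d63a322faf13aca01c0f1b1cb95b755b113bc8fb2c88f16c3e91aae24b"


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_SHA256):
    """True when the file's digest matches the expected value."""
    return sha256_of(path) == expected.lower()
```

For example, `verify("bawl-0.4.0-py3-none-any.whl")` should return `True` for an intact download and `False` if the file was altered or truncated.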
