Tiny, zero-dependency web crawler — fetch, parse, crawl, store, GUI.

bawl

Tiny, zero-dependency crawler. Fetch, parse, crawl, sitemap, store, GUI. All stdlib.

pip install bawl

bawl https://example.com              # shorthand — JSONL to stdout
bawl page https://x.com -o data       # to file
bawl page https://x.com -f text       # plain text
bawl page https://x.com -f json       # JSON array
bawl crawl https://site.com --depth 2 # recursive (concurrent)
bawl crawl @urls.txt                  # URLs from file
bawl crawl --dedup                    # skip duplicate text content
bawl crawl --include '*.html'         # only crawl matching URLs
bawl crawl --exclude '*print*'        # skip matching URLs
bawl crawl --progress                 # live terminal status
bawl sitemap https://site.com/xml     # from sitemap
bawl gui                              # graphical interface
bawl cat < data.jsonl                 # read + print text

Library

from bawl import (
    fetch, parse, parse_html,           # single page
    save, load, dumps, loads,           # JSONL
    dumps_json_array, save_json_array,  # JSON array
    crawl, crawl_urls,                  # crawling
    parse_sitemap,                      # sitemaps
)

# single page
page = parse("https://example.com")
print(page.title, len(page.text))

# recursive crawl (concurrent, configurable workers)
pages = crawl("https://docs.python.org/3/", depth=2, max_pages=10, workers=8)

# content dedup, include/exclude filters, live progress
pages = crawl("https://site.com", depth=2, dedup=True,
              include=["*.html"], exclude=["*print*"],
              on_page=lambda p: ...)

# fetch URL list concurrently
pages = crawl_urls(["https://a.com", "https://b.com"], workers=5)

# from sitemap
urls = parse_sitemap("https://site.com/sitemap.xml")
for url in urls[:10]:
    save(parse(url), path="crawl.jsonl")

# JSON array output
save_json_array(pages, path="output.json")

# pipe-friendly
save(page)                                  # → stdout JSONL
for p in load("data.jsonl"):
    print(p.text[:200])
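
The JSONL and JSON array outputs above differ only in framing: one JSON object per line versus a single array. The following is a minimal stdlib sketch of that difference, not bawl's own code. The helper names iter_jsonl and to_json_array are hypothetical, and the record keys (url, title, text) are assumed to mirror the Page fields.

```python
import io
import json


def iter_jsonl(stream):
    """Yield one dict per non-empty line of a JSONL text stream."""
    for line in stream:
        line = line.strip()
        if line:  # tolerate blank lines between records
            yield json.loads(line)


def to_json_array(records):
    """Serialize the same records as a single JSON array."""
    return json.dumps(list(records), ensure_ascii=False)


# Two hypothetical records; the key names are an assumption, not bawl's schema.
sample = io.StringIO(
    '{"url": "https://a.com", "title": "A", "text": "alpha"}\n'
    "\n"
    '{"url": "https://b.com", "title": "B", "text": "beta"}\n'
)

records = list(iter_jsonl(sample))
for record in records:
    print(record["text"][:200])  # same idea as the load() loop above
```

Because each JSONL line is a complete object, a stream can be processed record by record, whereas a JSON array has to be read in full before it can be parsed with json.loads.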

Page fields

Field    Type        Content
.url     str         Source URL
.title   str         <title> text
.text    str         Visible page text (block-separated by newlines)
.links   list[dict]  {"href": str, "text": str}
.tables  list[dict]  {"caption": str, "headers": list, "rows": list[list]}
.lists   list[dict]  {"tag": "ul" …
.code    list[dict]  {"lang": str, "body": str}
.meta    dict        Meta name/OG property → content
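
As an illustration of the .tables shape listed above, here is a small sketch that turns one table dict into CSV text using only the stdlib. The function name table_to_csv and the sample data are hypothetical. It assumes only the documented keys (caption, headers, rows) and is not part of bawl.

```python
import csv
import io


def table_to_csv(table):
    """Render a {"caption", "headers", "rows"} dict as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    if table.get("headers"):  # header row is optional
        writer.writerow(table["headers"])
    writer.writerows(table.get("rows", []))
    return buf.getvalue()


# Hypothetical table in the documented shape.
table = {
    "caption": "Ports",
    "headers": ["service", "port"],
    "rows": [["http", "80"], ["https", "443"]],
}

csv_text = table_to_csv(table)
print(csv_text)
```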

CLI

bawl https://example.com              # shorthand JSONL to stdout
bawl page https://x.com -o data       # to file
bawl page https://x.com -f text       # plain text
bawl page https://x.com -f json       # JSON array
bawl crawl https://site.com --depth 2 --workers 10
bawl crawl @urls.txt                  # read URLs from file
bawl sitemap https://site.com/xml
bawl gui                              # tkinter GUI
bawl cat < data.jsonl                 # read + print
bawl completion bash|zsh              # shell completion
bawl --version                        # → bawl 0.3.0

Options: --rate SEC, --timeout SEC, --workers N, --dedup,
         --include PATTERN, --exclude PATTERN, --progress
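
The exact matching rules behind --include, --exclude and --dedup are not documented here. The sketch below shows one plausible reading, assuming shell-style glob patterns matched against the whole URL and dedup keyed on a hash of whitespace-normalized text. The names url_allowed and dedup_by_text are hypothetical, and bawl's actual behavior may differ.

```python
import fnmatch
import hashlib


def url_allowed(url, include=(), exclude=()):
    """Keep a URL if it matches some include pattern (when any are given)
    and matches no exclude pattern. Patterns are shell-style globs."""
    if include and not any(fnmatch.fnmatchcase(url, p) for p in include):
        return False
    return not any(fnmatch.fnmatchcase(url, p) for p in exclude)


def dedup_by_text(pages):
    """Drop pages whose whitespace-normalized text was already seen."""
    seen = set()
    unique = []
    for page in pages:
        normalized = " ".join(page["text"].split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(page)
    return unique


# Hypothetical pages as plain dicts; only the "text" key matters for dedup.
pages = [
    {"url": "https://site.com/a.html", "text": "hello  world"},
    {"url": "https://site.com/b.html", "text": "hello world\n"},
    {"url": "https://site.com/c.html", "text": "something else"},
]
kept = [p["url"] for p in dedup_by_text(pages)]
print(kept)
```

Hashing the text rather than storing it keeps the seen-set small on large crawls, at the cost of treating only exact matches (after whitespace normalization) as duplicates.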

Why bawl

  • Zero dependencies — stdlib only (urllib, html.parser, json, xml, tkinter, concurrent)
  • ~18KB wheel — installs in <1 second
  • Concurrent — thread pool crawl speeds up multi-page fetches
  • 97 tests — CLI, parser edge cases, concurrent crawl, URL normalization, dedup, filters, progress
  • Modular — import only what you need
  • Composable — stdin/stdout JSONL/JSON, works with any pipeline
  • GUI included: bawl gui for interactive browsing
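
To make the thread-pool point above concrete, here is a generic sketch of fetching a URL list concurrently with concurrent.futures. It is not bawl's implementation. fetch_all is a hypothetical helper, and fake_fetch stands in for a real network call so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch_one, workers=5):
    """Call fetch_one(url) for every URL on a thread pool.
    Results come back in the same order as the input URLs."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_one, urls))


def fake_fetch(url):
    # Stand-in for a blocking network request (assumption for the sketch).
    return {"url": url, "text": f"body of {url}"}


pages = fetch_all(["https://a.com", "https://b.com"], fake_fetch, workers=2)
print([p["url"] for p in pages])
```

Threads suit this workload because fetching is dominated by waiting on the network, so the interpreter lock is released for most of each call.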

License

MIT

Built Distribution

bawl-0.4.0-py3-none-any.whl (19.0 kB)
