WebSweep
A user-friendly and high-speed web scraping library.
WebSweep is a Python library for high-throughput web scraping, aimed at researchers. It is designed to stay simple for beginners while still handling large URL lists. The primary objective is to run effectively on a single computer (laptop or workstation), without requiring cloud infrastructure or distributed orchestration.
It is built for projects that start with a list of websites and need a workflow that is easy to rerun, inspect, and extend:
- crawl websites from a list of base URLs
- follow only within-domain links up to a bounded depth
- extract page-level text and metadata
- consolidate results back to one record per domain
- (if desired) repeat the same sweep monthly or quarterly with the same configured instance
The goal is research infrastructure, not cloud orchestration. WebSweep is meant to work well on a laptop or workstation, with intermediate outputs that are easy to archive, validate, and analyse later.
What WebSweep Is Good For
WebSweep fits best when you want to study many websites in a comparable, repeatable way.
Typical research uses:
- track how organisations discuss a topic over time across many domains
- build corpora from university, company, NGO, or government websites
- monitor recurring updates on the same website lists every few months
- extract page text plus a few structured fields for downstream analysis
It is especially useful when the unit of analysis is the domain or organisation, not one single large site.
WebSweep is probably not the right tool when:
- you need to interact with JavaScript-heavy websites
- you want to scrape one very complex website with highly custom logic
- you need browser automation rather than HTML crawling
For those cases, tools such as Scrapy or Selenium may be a better fit.
Install
pip install websweep
What You Need
WebSweep needs a list of URLs.
- CLI mode: CSV or TSV file with a header (`url`, optional `identifier`)
- Library mode: Python list of URLs or `(url, identifier)` tuples
Example CSV:
url,identifier
https://example.com,example
https://example2.org,example_org
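In library mode the same file can be loaded with the standard library. A minimal sketch; `read_url_list` is an illustrative helper, not part of the WebSweep API:

```python
import csv
import io

def read_url_list(fh):
    """Read a CSV with a header (url, optional identifier) into (url, identifier) tuples."""
    return [(row["url"], row.get("identifier")) for row in csv.DictReader(fh)]

# Inline stand-in for the example CSV above; normally you would pass an open file.
sample = "url,identifier\nhttps://example.com,example\nhttps://example2.org,example_org\n"
pairs = read_url_list(io.StringIO(sample))
print(pairs)  # [('https://example.com', 'example'), ('https://example2.org', 'example_org')]
```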
Choose Your Mode
- CLI mode (`websweep ...`): easiest way to run repeatable, instance-based crawls from a CSV/TSV source file.
- Library mode (`from websweep import ...`): best when you need custom Python logic (custom extractors, analysis loops, notebooks).
Workflow
Input URLs
-> Crawler
In: URL list + crawl settings
Out:
crawled_data/*.zip (zipped pages per domain)
overview_urls.{duckdb|db|tsv} (per-page crawl status overview)
-> Extractor
In: overview file + crawled_data/*.zip
Out: extracted_data/*.ndjson (extracted data per web page)
-> Consolidator
In: extracted_data/*.ndjson
Out: consolidated_data/*.ndjson (consolidated to domain level)
One-pass mode (lower disk usage):
Input URLs -> Crawler(extract=True, save_html=False) -> extracted_data/*.ndjson
What Each Step Does
- Crawler: starts from base URLs (one domain per row), downloads pages, follows only within-domain links, applies exclusion rules (for example blocked extensions), and stops at depth `max_level` (default: 3).
- Extractor: reads crawled pages and extracts structured page-level fields such as cleaned text (`text`), metadata (`meta_*`), and location fields (`zipcode`, `address`).
- Consolidator: merges page-level records back to one record per domain, keeping concatenated domain text and aggregated information (e.g., `zipcode` frequencies, where the most frequent value can be treated as the main postcode and the others as additional postcodes).
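Both the extracted and consolidated outputs are newline-delimited JSON, so intermediate results can be inspected with the standard library alone. A minimal sketch with made-up records (the exact field set depends on your extractors; `url`, `text`, and `zipcode` are among the fields named above):

```python
import json

# Two illustrative page-level records, as they might appear in extracted_data/*.ndjson.
ndjson = (
    '{"url": "https://example.com/", "text": "Welcome", "zipcode": ["1234 AB"]}\n'
    '{"url": "https://example.com/contact", "text": "Contact us", "zipcode": ["1234 AB", "5678 CD"]}\n'
)

# One JSON object per line: parse each non-empty line independently.
records = [json.loads(line) for line in ndjson.splitlines() if line.strip()]
print(len(records), records[0]["url"])  # 2 https://example.com/
```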
Quickstart (Python)
from pathlib import Path
from websweep import Crawler, Extractor, Consolidator
urls = [
    "https://www.dggrootverbruik.nl/",
    "https://www.gosliga.nl/",
    "https://www.heeren2.nl/",
]
out = Path("./research_output")
# 1) Crawl
Crawler(target_folder_path=out).crawl_base_urls(urls)
# 2) Extract
Extractor(target_folder_path=out).extract_urls()
# 3) Consolidate
Consolidator(target_folder_path=out).consolidate()
Quickstart (CLI)
websweep init --headless
websweep crawl
websweep extract
websweep consolidate
For lower disk usage:
websweep crawl --extract
websweep consolidate
Optional extractor add-on file (CLI):
set it during websweep init when prompted for a custom extractor add-on path.
Leave it empty/No for the default (None).
When provided, WebSweep copies the add-on into the instance folder (next to
settings.ini) so extraction does not depend on the original source location.
Using target_temp_folder_path (CLI and library):
- Use this when you want in-progress crawl files on a fast local disk.
- Raw page files are staged under `target_temp_folder_path/crawled_data/...` while crawling.
- Final domain zip files are written to `target_folder_path/crawled_data/*.zip`.
- The overview file (`overview_urls.duckdb/.db/.tsv`) is always kept in `target_folder_path`.
- After each domain is archived, staged raw files are removed from the temp path.
Core Options (Library)
Most users only need these options:
Crawler(...)
- `max_level`: depth of within-domain link following (default: 3)
- `max_pages_per_domain`: cap on pages per domain
- `extract=True` and `save_html=False`: one-pass crawl+extract mode
- `allow_extensions` / `block_extensions`: file type filtering
- `target_temp_folder_path`: optional temp folder for in-progress raw crawl files

Extractor(...)
- `workers`: extraction process count
- `start_date`, `end_date`: session-date window for extraction
- `file_extractor`: custom extractor subclass for add-on fields

Consolidator(...)
- `target_folder_path`: use the default extracted input and standard consolidated output
- `chunk_size`: consolidation chunk size for large extracted files
Advanced parameters are available in the API docs and User Guide.
Advanced example (explicit files):
Consolidator(
    input_file=out / "extracted_data" / "extracted_data_2026-02-23_0-1000000.ndjson",
    output_file=out / "consolidated_data" / "custom_consolidated.ndjson",
).consolidate()
Custom Extraction Add-ons
By default, the core FileExtractor keeps extraction conservative:
- metadata (`meta_*`)
- cleaned text (`text`)
- zipcode/address (`zipcode`, `address`)

It does not extract phone, email, or fax unless you add custom methods.
Create a custom add-on by subclassing FileExtractor and adding methods named
_extract_<fieldname>:
from pathlib import Path
import re

from websweep import Extractor
from websweep.extractor.extractor import FileExtractor

class ResearchFileExtractor(FileExtractor):
    def _extract_fax(self) -> list:
        pattern = re.compile(
            r"(?is)\b(?:faxnumber|fax|f)\b[^0-9\+]{0,12}"
            r"([\+]?[0-9][0-9\-\s\(\)]{7,20})\b"
        )
        return sorted({m.strip() for m in re.findall(pattern, str(self.soup))})

Extractor(
    target_folder_path=Path("./research_output"),
    file_extractor=ResearchFileExtractor,
).extract_urls()
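Before wiring an add-on into an instance, you can sanity-check its pattern against a raw HTML snippet; this mirrors what `_extract_fax` does with `str(self.soup)`:

```python
import re

# Same pattern as in the add-on example above.
pattern = re.compile(
    r"(?is)\b(?:faxnumber|fax|f)\b[^0-9\+]{0,12}"
    r"([\+]?[0-9][0-9\-\s\(\)]{7,20})\b"
)

# Made-up HTML fragment standing in for a crawled page.
html = "<p>Tel: 030 111 2222</p><p>Fax: +31 30 123 4567</p>"
faxes = sorted({m.strip() for m in re.findall(pattern, html)})
print(faxes)  # ['+31 30 123 4567']
```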
Repository add-on example:
addons/firmbackbone_extractor.py
CLI usage with the same add-on:
websweep init --headless
# answer the add-on question with:
# addons/firmbackbone_extractor.py
websweep extract
The add-on path is optional and defaults to None (no add-on extractor).
Once configured in the instance, websweep extract and one-pass
websweep crawl --extract use it automatically.
Choosing Which Files to Block/Allow
Rules are defined in:
- `src/websweep/utils/default_regex.json`
- `classify_url(...)` in `src/websweep/utils/utils.py`
CLI overrides:
websweep crawl --allow-extensions pdf,png
websweep crawl --block-extensions pdf,png,zip
websweep crawl --classification-file /path/to/rules.json
Notebook Example
Featured end-to-end notebook:
Backend Selection
Overview storage backends:
- DuckDB (preferred for larger runs)
- SQLite
- TSV
Choose backend mode during websweep init:
- `use_database = True` (database mode in `settings.ini`) uses DuckDB when available and falls back to SQLite if needed.
- `use_database = False` uses TSV.
WebSweep also reuses any existing overview file in the instance
(overview_urls.duckdb, overview_urls.db, or overview_urls.tsv), so backend
selection is instance-level setup, not a required websweep crawl argument.
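Because backend selection is per instance, it can be handy to check which overview file an instance already has. `detect_overview_backend` below is an illustrative helper (not part of the WebSweep API) that checks the three file names listed above:

```python
from pathlib import Path
import tempfile

def detect_overview_backend(instance_dir: Path):
    """Return the backend of an existing overview file, or None if the instance has none."""
    for suffix, backend in ((".duckdb", "duckdb"), (".db", "sqlite"), (".tsv", "tsv")):
        if (instance_dir / f"overview_urls{suffix}").exists():
            return backend
    return None

# Demonstrate with a throwaway instance folder containing a TSV overview.
with tempfile.TemporaryDirectory() as tmp:
    instance = Path(tmp)
    (instance / "overview_urls.tsv").write_text("url\tstatus\n")
    backend = detect_overview_backend(instance)
    print(backend)  # tsv
```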
Recurring CLI Runs (Every X Months)
WebSweep CLI does not schedule recurring crawls by itself. You run the commands manually, or schedule them with your own tool (for example cron, systemd timers, Windows Task Scheduler, or GitHub Actions).
For periodic updates, keep one configured instance and run:
END_DATE=$(uv run python -c "from datetime import date; print(date.today().isoformat())")
START_DATE=$(uv run python -c "from datetime import date, timedelta; print((date.today()-timedelta(days=90)).isoformat())")
websweep crawl
websweep extract --start-date "$START_DATE" --end-date "$END_DATE"
websweep consolidate
This keeps crawling simple while extracting only a rolling recent window
(last 90 days in this example). Adjust timedelta(days=90) as needed.
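The two shell substitutions above only compute an ISO-formatted rolling window; the same in plain Python, for readers who prefer to generate the dates in a wrapper script:

```python
from datetime import date, timedelta

WINDOW_DAYS = 90  # adjust to match your crawl cadence

end_date = date.today()
start_date = end_date - timedelta(days=WINDOW_DAYS)

# These ISO strings are what gets passed to --start-date / --end-date.
print(start_date.isoformat(), end_date.isoformat())
```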
Example (Linux cron, first day of every 3rd month at 02:00):
0 2 1 */3 * cd /path/to/websweep && END_DATE=$(HOME=/path/to/home uv run python -c "from datetime import date; print(date.today().isoformat())") && START_DATE=$(HOME=/path/to/home uv run python -c "from datetime import date, timedelta; print((date.today()-timedelta(days=90)).isoformat())") && HOME=/path/to/home uv run websweep crawl && HOME=/path/to/home uv run websweep extract --start-date "$START_DATE" --end-date "$END_DATE" && HOME=/path/to/home uv run websweep consolidate
To retry failed base URLs from a specific crawl session date:
websweep crawl --complement 2026-04-01
Extractor Date Windows (CLI)
Use date filters when extracting to limit processing to specific crawl sessions:
websweep extract --start-date 2026-02-01 --end-date 2026-02-28
These filters apply to session_date in overview_urls.* and include only
successful crawl rows (status == 200). The date flags are per-run CLI
arguments and are not persisted automatically in settings.ini.
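Conceptually, the filter keeps rows whose `session_date` falls inside the window and whose `status` is 200. A minimal sketch with made-up rows (real rows live in `overview_urls.*` and carry more columns):

```python
from datetime import date

# Illustrative overview rows.
rows = [
    {"url": "https://example.com/", "session_date": "2026-01-15", "status": 200},
    {"url": "https://example.com/a", "session_date": "2026-02-10", "status": 200},
    {"url": "https://example.com/b", "session_date": "2026-02-12", "status": 404},
]

start, end = date(2026, 2, 1), date(2026, 2, 28)

# Keep only successful rows whose session date lies in [start, end].
selected = [
    r for r in rows
    if r["status"] == 200 and start <= date.fromisoformat(r["session_date"]) <= end
]
print([r["url"] for r in selected])  # ['https://example.com/a']
```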
Documentation
- Docs source: `docs/`
- Build locally: `make docs`
- Read the Docs config: `.readthedocs.yml`
Troubleshooting Crawl Statuses
Common non-200 statuses in overview_urls.*:
- DNS lookup failed: the domain cannot be resolved from the current network/DNS.
- Connection failed: the host resolved, but the TCP/SSL connection failed.
- Request timeout: the remote host did not respond in time.
- Robots unavailable in `__test_domain_robots`: the `robots.txt` check failed; the crawler falls back to an allow-all robots policy for that base URL.
For historical domain lists, many failures can be expected because domains may have expired or moved.
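A quick way to see how a run went is to tally the status column of the overview file. A sketch against an inline stand-in for `overview_urls.tsv` (the real file may use a different column layout):

```python
import csv
import io
from collections import Counter

# Tiny stand-in for overview_urls.tsv; the status wording mirrors the list above.
tsv = (
    "url\tstatus\n"
    "https://old-domain.example\tDNS lookup failed\n"
    "https://example.com/\t200\n"
    "https://slow.example\tRequest timeout\n"
    "https://old-domain.example/about\tDNS lookup failed\n"
)

# Count how often each crawl status occurs.
counts = Counter(row["status"] for row in csv.DictReader(io.StringIO(tsv), delimiter="\t"))
print(counts.most_common())  # [('DNS lookup failed', 2), ('200', 1), ('Request timeout', 1)]
```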
Development
uv sync --group test --group docs --group dev
uv run pytest -q
uv run make docs
Contributing
Contributions are what make the open source community an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
Please refer to the DEVELOPMENT file for more information on how to run the library without installing and how to install it from source.
Please refer to the CONTRIBUTING file for more information on issues and pull requests.
License and citation
The package websweep is published under an MIT license. When using websweep for academic work, please cite:
XXX
Contact
This project is developed and maintained by the ODISSEI Social Data Science (SoDa) team and the FIRMBACKBONE Project.
FIRMBACKBONE is an organically growing longitudinal data-infrastructure with information on Dutch companies for scientific research. Once it is ready, it will become available for researchers and students affiliated with member universities in the Netherlands through ODISSEI, the Open Data Infrastructure for Social Science and Economic Innovations.
FIRMBACKBONE is an initiative of Utrecht University and the Vrije Universiteit Amsterdam funded by PDI-SSH, the Platform Digital Infrastructure-Social Sciences and Humanities, for the period 2020-2025.
Do you have questions, suggestions, or remarks? File an issue in the issue tracker or feel free to contact the team via https://odissei-data.nl/en/using-soda/.