nscraper

A small importable Python module.
nscraper is a small Python package scaffolded for two use cases:

- import it from other projects
- run it directly with `python -m nscraper`
License
MIT. You can fork, modify, and reuse it with minimal restrictions as long as the license notice is kept with the software.
Install

```bash
pip install nscraper
```

To use the SeleniumBase engine, install SeleniumBase alongside nscraper:

```bash
pip install seleniumbase
```
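You can then select the SeleniumBase engine at run time with the `-e/--engine` flag (documented under Module Flags below):

```bash
python -m nscraper -u https://example.com -H default -e seleniumbase
```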
For development:

```bash
uv sync --dev
```
Use as a module

```python
from nscraper import HttpScraper, ScrapeOptions

options = ScrapeOptions(
    url="https://example.com",
    headers={"Accept": "text/html"},
)
content = HttpScraper(options).scrape()
print(content)
```
Run the Module

Fetch a URL:

```bash
# minimal fetch using the built-in default headers
python -m nscraper -u https://example.com -H default
# custom headers as an inline JSON object
python -m nscraper -u https://example.com -H '{"Accept": "text/html"}'
# send cookies from a JSON file
python -m nscraper -u https://example.com -H default -c cookies.json
# apply the fast transform and write to an explicit path
python -m nscraper -u https://example.com -H default -t fast -o ~/scraped_data/example.html
# bare -o: write to an automatically generated path under .nscraper/
python -m nscraper -u https://example.com -H default -o
# pretty-print the result and echo it to stdout
python -m nscraper -u https://example.com -H default --pretty --print
# JSON endpoint: automatic output path plus pretty-printed stdout
python -m nscraper -u https://httpbin.org/get -H default -o --pretty --print
# apply the heavier basic transform
python -m nscraper -u https://example.com -H default -t basic
# print the fetched content to stdout
python -m nscraper -u https://example.com -H default --print
# write to a file and also print the written content
python -m nscraper -u https://example.com -H default -o ~/scraped_data/example.html --print
```
Current API

- `nscraper.ScrapeOptions`
- `nscraper.BaseScraper`
- `nscraper.HttpScraper`
- `nscraper.SeleniumBaseScraper`
- `nscraper.get_scraper(options: ScrapeOptions) -> BaseScraper`
- `nscraper.validate_url(url: str) -> str`
- `nscraper.parse_headers(raw_headers: str | None) -> dict[str, str]`
- `nscraper.load_cookies_file(path: Path | str | None) -> dict[str, str] | None`
- `nscraper.fast_html_transform(content: str) -> str`
- `nscraper.basic_html_transform(content: str) -> str`

Dependencies:

- runtime dependency: `niquests==3.18.4`
- runtime dependency: `justhtml==1.14.0`
- development dependency: `pytest`
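These helpers compose naturally. A minimal sketch, assuming `parse_headers` accepts the same JSON string the CLI's `-H` flag takes, and that the `BaseScraper` returned by `get_scraper` exposes the same `scrape()` method shown above:

```python
from nscraper import (
    ScrapeOptions,
    fast_html_transform,
    get_scraper,
    parse_headers,
    validate_url,
)

# Validate and normalize the inputs first.
url = validate_url("https://example.com")
headers = parse_headers('{"Accept": "text/html"}')

# get_scraper picks a BaseScraper implementation for the options.
options = ScrapeOptions(url=url, headers=headers)
content = get_scraper(options).scrape()

# Strip noisy elements (script, style, ...) from the fetched HTML.
print(fast_html_transform(content))
```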
Module Flags

- `-u/--url`: required
- `-H/--headers`: required; a JSON object or `default`
- `-e/--engine`: `http` or `seleniumbase`
- `-p/--proxy`
- `--timeout`: default `3`
- `-o/--output`: writes to a file; bare `-o` uses automatic output; explicit paths must be absolute
- `--print`: prints the result to stdout
- `--pretty`: pretty-prints the final HTML output
- `-c/--cookies-file`: optional JSON file
- `-t/--transform`: `raw`, `basic`, or `fast`; optional
- `-d/--debug`: compatibility flag; runtime status lines are printed by default
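For example, to route a request through a proxy with a longer timeout (the proxy URL below is a placeholder):

```bash
python -m nscraper -u https://example.com -H default -p http://127.0.0.1:8080 --timeout 10
```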
Behavior:

- invalid or malformed URLs raise `InvalidUrlError`
- missing or malformed headers raise `InvalidHeadersError`
- missing or malformed cookie files raise `InvalidCookiesError`
- use `-H default` to apply the built-in `Accept` and `User-Agent` header dict
- use `-c` only when you want to send cookies; omit it to keep current behavior
- no transform runs unless `-t/--transform` is explicitly provided
- no HTML is printed unless `--print` is provided
- when `--output` and `--print` are both provided, stdout prints the written file content
- output files are always overwritten
- missing parent directories for output files are created automatically
- bare `-o` writes to `.nscraper/<netloc>/<path>.<ext>` (see the sketch after this list)
- bare `-o` uses `index` for root URLs such as `/`
- bare `-o` preserves nested URL path segments as directories
- bare `-o` appends a short query hash when the URL contains a query string
- explicit output paths must be absolute; relative paths fail immediately
- auto-generated output extensions are content-aware: HTML-like responses use `.html`, JSON responses use `.json`
- `--pretty` formats the final response after the selected transform mode is applied; JSON responses are pretty-printed as JSON
- `raw` returns the fetched supported response with no cleanup
- `fast` removes a small set of noisy elements such as `script`, `style`, `noscript`, `iframe`, and `template` for HTML responses
- `basic` performs heavier cleanup, including hidden elements, head cleanup, and ad-like selectors for HTML responses
- response handling is classified by content type; only HTML and JSON responses are supported
- unsupported content types fail immediately, before any transform runs or output is written
- runtime status lines include per-step timings for request, transform, pretty-formatting, and file write operations
- the `seleniumbase` engine loads pages in a browser session and returns the final page source
- the `seleniumbase` engine is optional; if SeleniumBase is not installed, that engine fails with a clear runtime error
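For intuition, the bare `-o` rules above map a URL to a file path roughly like this sketch (the helper name, hash choice, and separator are illustrative, not the package's actual implementation):

```python
import hashlib
from pathlib import Path
from urllib.parse import urlsplit

def auto_output_path(url: str, ext: str = ".html") -> Path:
    """Illustrative reimplementation of the bare -o path rules."""
    parts = urlsplit(url)
    path = parts.path.strip("/") or "index"  # root URLs become "index"
    if parts.query:
        # distinct query strings get distinct files via a short hash
        digest = hashlib.sha256(parts.query.encode()).hexdigest()[:8]
        path = f"{path}-{digest}"
    return Path(".nscraper") / parts.netloc / f"{path}{ext}"

print(auto_output_path("https://example.com/"))
# .nscraper/example.com/index.html
print(auto_output_path("https://example.com/a/b?q=1"))
# .nscraper/example.com/a/b-<hash>.html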
Default User-Agent:

```
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/146.0.0.0 Safari/537.36
```
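To send a different User-Agent, pass your own headers as JSON instead of `default` (the UA string below is a placeholder):

```bash
python -m nscraper -u https://example.com -H '{"User-Agent": "my-bot/1.0", "Accept": "text/html"}'
```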
The package is intentionally minimal so you can extend it into a reusable library and publish it to PyPI.
Project details
Download files
File details
Details for the file nscraper-0.1.5.tar.gz.
File metadata
- Download URL: nscraper-0.1.5.tar.gz
- Upload date:
- Size: 56.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `569114170a2b8e1386bec59d8f929d41f0e458728922f0d2660df90daef017d7` |
| MD5 | `742c4057329335b7818db36fa94a7ea8` |
| BLAKE2b-256 | `fed8b9aa89b497598c744a63faf8ccdca00499aaf2de1fa9cc578889e93885a1` |
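To verify a downloaded archive against the SHA256 digest above:

```bash
sha256sum nscraper-0.1.5.tar.gz
# expected: 569114170a2b8e1386bec59d8f929d41f0e458728922f0d2660df90daef017d7
```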
Provenance

The following attestation bundles were made for nscraper-0.1.5.tar.gz:

Publisher: release.yml on mikerr1/nscraper

Statement:

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: nscraper-0.1.5.tar.gz
- Subject digest: 569114170a2b8e1386bec59d8f929d41f0e458728922f0d2660df90daef017d7
- Sigstore transparency entry: 1296809208
- Sigstore integration time:
- Permalink: mikerr1/nscraper@5678c982334e6f01ad9bbc2b36f2fb8fe7d14fdf
- Branch / Tag: refs/tags/v0.1.5
- Owner: https://github.com/mikerr1
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@5678c982334e6f01ad9bbc2b36f2fb8fe7d14fdf
- Trigger Event: release
File details
Details for the file nscraper-0.1.5-py3-none-any.whl.
File metadata
- Download URL: nscraper-0.1.5-py3-none-any.whl
- Upload date:
- Size: 17.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `29b0587e2577021fe4926b5fabd6f95028e531cbd9bdbb8c5c03c7a575e149aa` |
| MD5 | `df05c3b1a6c2f13dfa90f405c7bffd43` |
| BLAKE2b-256 | `319875df7413bce6318092612e26911677d84bc10e19daa8165b6e5241255710` |
Provenance

The following attestation bundles were made for nscraper-0.1.5-py3-none-any.whl:

Publisher: release.yml on mikerr1/nscraper

Statement:

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: nscraper-0.1.5-py3-none-any.whl
- Subject digest: 29b0587e2577021fe4926b5fabd6f95028e531cbd9bdbb8c5c03c7a575e149aa
- Sigstore transparency entry: 1296809290
- Sigstore integration time:
- Permalink: mikerr1/nscraper@5678c982334e6f01ad9bbc2b36f2fb8fe7d14fdf
- Branch / Tag: refs/tags/v0.1.5
- Owner: https://github.com/mikerr1
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@5678c982334e6f01ad9bbc2b36f2fb8fe7d14fdf
- Trigger Event: release