A small importable Python module.

nscraper

nscraper is a small Python package scaffolded for two use cases:

  • import it from other projects
  • run it directly with python -m nscraper

License

MIT. You can fork, modify, and reuse it with minimal restrictions as long as the license notice is kept with the software.

Install

pip install nscraper

To use the SeleniumBase engine, install SeleniumBase alongside nscraper:

pip install seleniumbase

For development:

uv sync --dev

Use as a module

from nscraper import HttpScraper, ScrapeOptions

options = ScrapeOptions(
    url="https://example.com",
    headers={"Accept": "text/html"},
)

content = HttpScraper(options).scrape()
print(content)

Run as a module

Fetch a URL:

python -m nscraper -u https://example.com -H default
python -m nscraper -u https://example.com -H '{"Accept": "text/html"}'
python -m nscraper -u https://example.com -H default -c cookies.json
python -m nscraper -u https://example.com -H default -t fast -o ~/scraped_data/example.html
python -m nscraper -u https://example.com -H default -o
python -m nscraper -u https://example.com -H default --pretty --print
python -m nscraper -u https://httpbin.org/get -H default -o --pretty --print
python -m nscraper -u https://example.com -H default -t basic
python -m nscraper -u https://example.com -H default --print
python -m nscraper -u https://example.com -H default -o ~/scraped_data/example.html --print
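The -c flag points at a JSON file of cookie name-to-value pairs (the API types it as dict[str, str]). A minimal cookies.json might look like this; the names and values are placeholders, not anything nscraper requires:

```json
{
  "session_id": "abc123",
  "theme": "dark"
}
```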

Current API

  • nscraper.ScrapeOptions
  • nscraper.BaseScraper
  • nscraper.HttpScraper
  • nscraper.SeleniumBaseScraper
  • nscraper.get_scraper(options: ScrapeOptions) -> BaseScraper
  • nscraper.validate_url(url: str) -> str
  • nscraper.parse_headers(raw_headers: str | None) -> dict[str, str]
  • nscraper.load_cookies_file(path: Path | str | None) -> dict[str, str] | None
  • nscraper.fast_html_transform(content: str) -> str
  • nscraper.basic_html_transform(content: str) -> str

Dependencies:

  • runtime: niquests==3.18.4
  • runtime: justhtml==1.14.0
  • development: pytest

Module Flags

  • -u / --url required
  • -H / --headers required; pass a JSON object string or the literal default
  • -e / --engine with http or seleniumbase
  • -p / --proxy
  • --timeout default 3
  • -o / --output writes to a file; bare -o uses automatic output, explicit paths must be absolute
  • --print prints the result to stdout
  • --pretty pretty-prints the final HTML output
  • -c / --cookies-file optional JSON file
  • -t / --transform with raw, basic, or fast; optional
  • -d / --debug compatibility flag; runtime status lines are printed by default
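The flag surface above can be mirrored with a standard argparse parser. This is a hypothetical reconstruction for orientation, not nscraper's actual CLI code; in particular, the http default for --engine and the handling of bare -o via a sentinel const are assumptions:

```python
import argparse

parser = argparse.ArgumentParser(prog="nscraper")
parser.add_argument("-u", "--url", required=True)
parser.add_argument("-H", "--headers", required=True,
                    help='"default" or a JSON object of headers')
parser.add_argument("-e", "--engine", choices=["http", "seleniumbase"],
                    default="http")  # default engine is an assumption
parser.add_argument("-p", "--proxy")
parser.add_argument("--timeout", type=float, default=3)
# nargs="?" lets a bare -o select automatic output via the const sentinel
parser.add_argument("-o", "--output", nargs="?", const="auto")
parser.add_argument("--print", action="store_true", dest="print_result")
parser.add_argument("--pretty", action="store_true")
parser.add_argument("-c", "--cookies-file")
parser.add_argument("-t", "--transform", choices=["raw", "basic", "fast"])
parser.add_argument("-d", "--debug", action="store_true")

args = parser.parse_args(["-u", "https://example.com", "-H", "default", "-o"])
```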

Behavior:

  • invalid or malformed URLs raise InvalidUrlError
  • missing or malformed headers raise InvalidHeadersError
  • missing or malformed cookie files raise InvalidCookiesError
  • use -H default to apply the built-in Accept and User-Agent header dict
  • use -c only when you want to send cookies; omit it to keep current behavior
  • no transform runs unless -t / --transform is explicitly provided
  • no HTML is printed unless --print is provided
  • when --output and --print are both provided, stdout prints the written file content
  • output files are always overwritten
  • missing parent directories for output files are created automatically
  • bare -o writes to .nscraper/<netloc>/<path>.<ext>
  • bare -o uses index for root URLs such as /
  • bare -o preserves nested URL path segments as directories
  • bare -o appends a short query hash when the URL contains a query string
  • explicit output paths must be absolute; relative paths fail immediately
  • auto-generated output extensions are content-aware: HTML-like responses use .html, JSON responses use .json
  • --pretty formats the final response after the selected transform mode is applied; JSON responses are pretty-printed as JSON
  • raw returns the fetched response as-is, with no cleanup
  • fast removes a small set of noisy elements such as script, style, noscript, iframe, and template for HTML responses
  • basic performs heavier cleanup, including hidden elements, head cleanup, and ad-like selectors for HTML responses
  • response handling is classified by content type; only HTML and JSON responses are supported
  • unsupported content types fail immediately before transform or output is written
  • runtime status lines include per-step timings for request, transform, pretty-formatting, and file write operations
  • the seleniumbase engine loads pages in a browser session and returns the final page source
  • the seleniumbase engine is optional; if SeleniumBase is not installed, that engine fails with a clear runtime error

Default User-Agent:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/146.0.0.0 Safari/537.36

The package is intentionally minimal so you can extend it into a reusable library and publish it to PyPI.

Download files

Download the file for your platform.

Source Distribution

nscraper-0.1.5.tar.gz (56.0 kB)

Built Distribution

nscraper-0.1.5-py3-none-any.whl (17.2 kB)

File details

Details for the file nscraper-0.1.5.tar.gz.

File metadata

  • Download URL: nscraper-0.1.5.tar.gz
  • Upload date:
  • Size: 56.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for nscraper-0.1.5.tar.gz:

  • SHA256: 569114170a2b8e1386bec59d8f929d41f0e458728922f0d2660df90daef017d7
  • MD5: 742c4057329335b7818db36fa94a7ea8
  • BLAKE2b-256: fed8b9aa89b497598c744a63faf8ccdca00499aaf2de1fa9cc578889e93885a1

Provenance

The following attestation bundles were made for nscraper-0.1.5.tar.gz:

Publisher: release.yml on mikerr1/nscraper

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file nscraper-0.1.5-py3-none-any.whl.

File metadata

  • Download URL: nscraper-0.1.5-py3-none-any.whl
  • Upload date:
  • Size: 17.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for nscraper-0.1.5-py3-none-any.whl:

  • SHA256: 29b0587e2577021fe4926b5fabd6f95028e531cbd9bdbb8c5c03c7a575e149aa
  • MD5: df05c3b1a6c2f13dfa90f405c7bffd43
  • BLAKE2b-256: 319875df7413bce6318092612e26911677d84bc10e19daa8165b6e5241255710

Provenance

The following attestation bundles were made for nscraper-0.1.5-py3-none-any.whl:

Publisher: release.yml on mikerr1/nscraper

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
