A small importable Python module.

nscraper

nscraper is a small Python package scaffolded for two use cases:

  • import it from other projects
  • run it directly with python -m nscraper

License

MIT. You may fork, modify, and reuse it with minimal restrictions, provided the license notice stays with the software.

Install

pip install nscraper

For development:

uv sync --dev

Use as a module

from nscraper import HttpScraper, ScrapeOptions

options = ScrapeOptions(
    url="https://example.com",
    headers={"Accept": "text/html"},
)

content = HttpScraper(options).scrape()
print(content)
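
The public API also includes get_scraper, which appears to pick a scraper class based on the requested engine. Below is a self-contained sketch of that dispatch pattern; the class names mirror nscraper's API list, but the engine field and the dispatch logic are assumptions for illustration, not the library's actual code:

```python
# Hypothetical sketch of an engine -> scraper dispatch, modeled on the
# nscraper API list. Names and behavior are assumptions, not library code.
from dataclasses import dataclass, field


@dataclass
class ScrapeOptions:
    url: str
    headers: dict[str, str] = field(default_factory=dict)
    engine: str = "http"  # assumed default; matches the -e flag values


class BaseScraper:
    def __init__(self, options: ScrapeOptions) -> None:
        self.options = options


class HttpScraper(BaseScraper):
    pass


class SeleniumBaseScraper(BaseScraper):
    pass


def get_scraper(options: ScrapeOptions) -> BaseScraper:
    # Map the engine name to a scraper class; unknown engines are an error.
    engines = {"http": HttpScraper, "seleniumbase": SeleniumBaseScraper}
    try:
        return engines[options.engine](options)
    except KeyError:
        raise ValueError(f"unknown engine: {options.engine!r}") from None


scraper = get_scraper(ScrapeOptions(url="https://example.com"))
print(type(scraper).__name__)  # HttpScraper
```

Keeping the dispatch table in one function makes adding a new engine a one-line change.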

Run the Module

Fetch a URL:

python -m nscraper -u https://example.com -H default
python -m nscraper -u https://example.com -H '{"Accept": "text/html"}'
python -m nscraper -u https://example.com -H default -c cookies.json

Current API

  • nscraper.ScrapeOptions
  • nscraper.BaseScraper
  • nscraper.HttpScraper
  • nscraper.SeleniumBaseScraper
  • nscraper.get_scraper(options: ScrapeOptions) -> BaseScraper
  • nscraper.validate_url(url: str) -> str
  • nscraper.parse_headers(raw_headers: str | None) -> dict[str, str]
  • nscraper.load_cookies_file(path: Path | str | None) -> dict[str, str] | None
  • nscraper.basic_html_transform(content: str) -> str
  • runtime dependency: niquests==3.18.4
  • runtime dependency: justhtml==1.14.0
  • development dependency: pytest

Module Flags

  • -u / --url: required
  • -H / --headers: required; a JSON object string, or the literal default for the built-in headers
  • -e / --engine: http or seleniumbase
  • -p / --proxy: optional proxy
  • --timeout: default 3
  • -o / --output: optional output file
  • -c / --cookies-file: optional JSON cookies file
  • -t / --transform: default raw

Behavior:

  • invalid or malformed URLs raise InvalidUrlError
  • missing or malformed headers raise InvalidHeadersError
  • missing or malformed cookie files raise InvalidCookiesError
  • use -H default to apply the built-in Accept and User-Agent header dict
  • use -c only when you want to send cookies; omit it to send none
  • output files are always overwritten
  • basic_html removes non-content elements and writes cleaned HTML output

Default User-Agent:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/146.0.0.0 Safari/537.36

The package is intentionally minimal so you can extend it into a reusable library and publish it to PyPI.

GitHub and PyPI Release Flow

  • pull requests to master run tests in GitHub Actions
  • published GitHub releases run tests, build sdist and wheel, then publish to PyPI
  • the publish workflow is in .github/workflows/release.yml

Before the release workflow can publish, configure Trusted Publishing on PyPI:

  1. create the project on PyPI if it does not exist yet
  2. in PyPI, open the project publishing settings
  3. add a trusted publisher for this GitHub repository
  4. use the release workflow on the master branch

After that, the normal flow is:

  1. push code to GitHub
  2. merge to master
  3. create a GitHub release for the version tag
  4. let GitHub Actions test, build, and publish the package

Download files

Download the file for your platform.

Source Distribution

nscraper-0.1.0.tar.gz (11.0 kB)

Built Distribution

nscraper-0.1.0-py3-none-any.whl (8.6 kB)

File details

Details for the file nscraper-0.1.0.tar.gz.

File metadata

  • File: nscraper-0.1.0.tar.gz
  • Size: 11.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for nscraper-0.1.0.tar.gz:

  • SHA256: bd16ae699118f1c115fdd41ae3aa7465a4fb1bb2bc5f270b6ecd2b11da772cfc
  • MD5: 21a9a63ae834f01c9c9f8545351b1988
  • BLAKE2b-256: 4637c350dcc902bab07483e0dc1e342b3b961423574e1cbe1e17b434248673bc

Provenance

The following attestation bundles were made for nscraper-0.1.0.tar.gz:

Publisher: release.yml on mikerr1/nscraper

File details

Details for the file nscraper-0.1.0-py3-none-any.whl.

File metadata

  • File: nscraper-0.1.0-py3-none-any.whl
  • Size: 8.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for nscraper-0.1.0-py3-none-any.whl:

  • SHA256: d83169d9a805dbeac8f609bf34aa837cad12da22ae02aeebcbfdd605f1721a88
  • MD5: 6c6f79180a6d1fc6d2beef66abe934d7
  • BLAKE2b-256: 47b16e5dd9a5aec002207bbabc7e0864463ccab735157136029d7a2fb756335c

Provenance

The following attestation bundles were made for nscraper-0.1.0-py3-none-any.whl:

Publisher: release.yml on mikerr1/nscraper
