Web scraping (URLs, images), PubMed search, URL summarization helpers — standalone module from the SciTeX ecosystem

scitex-web

Web scraping + PubMed search + URL summarization helpers, extracted from the SciTeX ecosystem as a standalone package.

Install

pip install scitex-web
pip install "scitex-web[readability]"   # readability-lxml for cleaner extraction

API

import scitex_web as web

url = "https://example.com"  # placeholder: the page to scrape

# Scraping
web.get_urls(url, pattern=r"\.pdf$")
web.get_image_urls(url, min_size=128)
web.download_images(url, out_dir="imgs", same_domain=True)

# PubMed
web.search_pubmed("CRISPR Cas9 review", retmax=50)

# URL summarization (requires scitex.ai)
web.summarize_url("https://example.com/article")
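
The `pattern` argument passed to `get_urls` above reads as a regular expression. As a rough illustration of that kind of filtering, the sketch below applies `re.search` to links that have already been collected. It is not scitex-web's implementation: `filter_urls` and the sample links are hypothetical, and whether the library uses `re.search` or another matching mode is an assumption.

```python
import re
from typing import Iterable, List, Optional


def filter_urls(urls: Iterable[str], pattern: Optional[str] = None) -> List[str]:
    """Keep only URLs matching `pattern` (hypothetical helper, not part of scitex-web)."""
    if pattern is None:
        return list(urls)
    regex = re.compile(pattern)
    # re.search matches anywhere in the string, so "$" anchors to the end of the URL.
    return [u for u in urls if regex.search(u)]


# Sample links, made up for illustration.
links = [
    "https://example.com/paper.pdf",
    "https://example.com/index.html",
    "https://example.com/pdf-guide",
]
pdf_links = filter_urls(links, pattern=r"\.pdf$")
print(pdf_links)
```

With the pattern `\.pdf$`, only links whose final characters are `.pdf` survive, so a path such as `/pdf-guide` is dropped.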

Status

Standalone fork of scitex.web. Dependencies: requests, aiohttp, bs4 (BeautifulSoup), and tqdm. The umbrella package's scitex.web import path is preserved via a sys.modules alias bridge.
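
For readers unfamiliar with the technique, a sys.modules alias works by registering one module object under a second dotted name, so that legacy imports resolve to the new package. The sketch below shows the general idea with stand-in modules. The names `standalone_web` and `umbrella` are hypothetical, and the actual bridge code in scitex may differ.

```python
import sys
import types

# Stand-in for the standalone package (hypothetical name).
standalone = types.ModuleType("standalone_web")
standalone.get_urls = lambda url, pattern=None: []

# Stand-in for the umbrella package (hypothetical name).
umbrella = types.ModuleType("umbrella")
umbrella.__path__ = []  # an (empty) __path__ marks the module as a package

sys.modules["standalone_web"] = standalone
sys.modules["umbrella"] = umbrella

# The bridge: register the same module object under the legacy dotted name
# and expose it as an attribute of the parent package.
sys.modules["umbrella.web"] = standalone
umbrella.web = standalone

# The legacy import path now resolves to the standalone module.
import umbrella.web as legacy_web

print(legacy_web is standalone)
```

Because both names point at the same object, state such as module-level caches or patched attributes is shared between the old and new import paths.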

Decoupling notes:

  • scitex.logging.getLogger → stdlib logging.getLogger.
  • scitex.str.printc (colored print) → tiny inline ANSI helper.
  • scitex.ai.GenAI (used by summarize_url) → deferred import that raises a clear ImportError if the umbrella scitex package isn't installed.
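
The three substitutions above could look roughly like the following. This is a minimal sketch, not the package's code: `colorize`, `print_colored`, and `require_optional` are hypothetical names, and the color table and error message are assumptions chosen for illustration.

```python
import importlib
import logging

# 1. Stdlib logger in place of a package-specific getLogger.
logger = logging.getLogger(__name__)

# 2. Tiny inline ANSI helper in place of a colored-print utility.
_ANSI = {"red": "\033[31m", "green": "\033[32m", "yellow": "\033[33m", "blue": "\033[34m"}
_RESET = "\033[0m"


def colorize(text: str, color: str = "blue") -> str:
    """Wrap text in an ANSI color code; unknown colors return the text unchanged."""
    code = _ANSI.get(color)
    return f"{code}{text}{_RESET}" if code else text


def print_colored(text: str, color: str = "blue") -> None:
    print(colorize(text, color))


# 3. Deferred import: the optional dependency is imported only when the
#    feature is used, and a missing package produces an explicit message.
def require_optional(module_name: str, attr: str, feature: str):
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{feature} requires '{module_name}', which is not installed."
        ) from exc
    return getattr(module, attr)
```

Under these assumptions, a function like summarize_url would call `require_optional("scitex.ai", "GenAI", "summarize_url")` at the start of its body, so importing the package itself never fails when the umbrella scitex package is absent.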

Test status: 14 of 23 tests pass, 7 fail, and 2 are skipped. The 7 failures involve bs4 mocking, already fail upstream in scitex-python, and are unrelated to the extraction.

License

AGPL-3.0-only (see LICENSE).

Source Distribution

scitex_web-0.1.0.tar.gz (25.5 kB view details)

Uploaded Source

Built Distribution

scitex_web-0.1.0-py3-none-any.whl (26.5 kB view details)

Uploaded Python 3

File details

Details for the file scitex_web-0.1.0.tar.gz.

File metadata

  • Download URL: scitex_web-0.1.0.tar.gz
  • Size: 25.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for scitex_web-0.1.0.tar.gz
Algorithm Hash digest
SHA256 887fa8cd81c7beb305d0eda6b928dc1c9fa3316dff62da94f80adff334e966a3
MD5 7c176d5eb865388f5273ca8ef24e5634
BLAKE2b-256 6f46b736db31b2b983c0cbba7f252f1f98ec337172357103e57c3d41b96acc75
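
A downloaded file can be checked against a published SHA256 digest using only the standard library. The sketch below is one way to do it; `sha256_of` and `matches_published_hash` are hypothetical helper names, and the local file path is whatever location the archive was saved to.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with a published hex string, ignoring case and whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, calling `matches_published_hash(Path("scitex_web-0.1.0.tar.gz"), "887fa8cd81c7beb305d0eda6b928dc1c9fa3316dff62da94f80adff334e966a3")` on a local copy of the source distribution should return True if the download is intact.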

File details

Details for the file scitex_web-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: scitex_web-0.1.0-py3-none-any.whl
  • Size: 26.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for scitex_web-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 844dc3866f4809b7115c9e62161dff0a00496e701e22b39fb0ef19d218110dee
MD5 c5aa125f64e6eacdc067b2b8d7733794
BLAKE2b-256 da574dc4f549765e2de5fff55e2e5a76a172d878b73b565ea3cb135d71b6ef55
