A package to fetch web resources and cache HTML content.

fetch_web_resource

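fetch_web_resource is a Python package for fetching web resources and caching HTML content.

Features

  • Fetches web page content
  • Extracts image and file links from web pages
  • Caches the fetched HTML content
  • Sends requests with a randomly chosen User-Agent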
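
The random User-Agent feature can be pictured with a minimal standard-library sketch. The `USER_AGENTS` pool and the `build_request` helper are hypothetical names introduced here for illustration; the package's real list and internals are not documented in this README.

```python
import random
import urllib.request

# Hypothetical pool of User-Agent strings; the package's actual list is not documented.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


def build_request(url: str) -> urllib.request.Request:
    """Return a Request whose User-Agent header is picked at random from the pool."""
    return urllib.request.Request(url, headers={"User-Agent": random.choice(USER_AGENTS)})


req = build_request("https://example.com/page1")
# urllib normalizes header names with str.capitalize(), hence "User-agent".
print(req.get_header("User-agent"))
```

Picking a new header per request, rather than once per process, is what makes successive requests look like they come from different browsers.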

Installation

From the root of the source tree, install the package with:

pip install .

Usage

Initialization

First, create an HTMLFetcher instance:

from web_fetch.web_fetch import HTMLFetcher
import diskcache

cache = diskcache.Cache('./html_cache')
fetcher = HTMLFetcher(cache=cache, max_concurrent_per_domain=5)
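
The `max_concurrent_per_domain` argument suggests that requests to the same host are throttled. How HTMLFetcher enforces this is not shown in the README; the sketch below is one common way to do it, assuming one `asyncio.Semaphore` per domain. `fetch_with_limit` and `fake_fetch` are hypothetical stand-ins, and no real network request is made.

```python
import asyncio
from collections import defaultdict
from urllib.parse import urlparse

MAX_PER_DOMAIN = 5  # mirrors max_concurrent_per_domain=5 above


async def fetch_with_limit(url, semaphores, fake_fetch):
    """Run fake_fetch(url) while holding the semaphore for the URL's domain."""
    domain = urlparse(url).netloc
    async with semaphores[domain]:
        return await fake_fetch(url)


async def demo():
    # One semaphore per domain, created lazily on first use.
    semaphores = defaultdict(lambda: asyncio.Semaphore(MAX_PER_DOMAIN))
    active = 0
    peak = 0

    async def fake_fetch(url):
        # Stand-in for a real HTTP request; tracks how many calls overlap.
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)
        active -= 1
        return url

    urls = [f"https://example.com/page{i}" for i in range(12)]
    results = await asyncio.gather(
        *(fetch_with_limit(u, semaphores, fake_fetch) for u in urls)
    )
    return results, peak


results, peak = asyncio.run(demo())
print(len(results), peak)
```

All twelve URLs are fetched, but the number of overlapping calls to the same domain never exceeds the limit.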

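Fetching web pages

Use the fetch_html_batch method to fetch a batch of pages:

import asyncio

# SearchResult comes from this package; the README does not show its import path.
results = [
    SearchResult(url="https://example.com/page1"),
    SearchResult(url="https://example.com/page2"),
]

async def main():
    # `async for` is only valid inside a coroutine, so wrap the loop in one.
    async for result in fetcher.fetch_html_batch(results, timeout=5):
        print(result)

asyncio.run(main())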
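
To make the streaming behaviour concrete, here is a self-contained sketch of an async generator that yields results as they finish and applies a per-request timeout. It is not the package's implementation: `fake_fetch` and `fetch_batch` are hypothetical, and the delays simulate network latency.

```python
import asyncio


async def fake_fetch(url: str, delay: float) -> str:
    """Stand-in for an HTTP request that takes `delay` seconds."""
    await asyncio.sleep(delay)
    return f"<html>{url}</html>"


async def fetch_batch(jobs, timeout):
    """Yield (url, html_or_None) pairs as each job finishes or times out."""

    async def one(url, delay):
        try:
            return url, await asyncio.wait_for(fake_fetch(url, delay), timeout)
        except asyncio.TimeoutError:
            return url, None  # timed out: report the URL without content

    tasks = [asyncio.create_task(one(url, delay)) for url, delay in jobs]
    for finished in asyncio.as_completed(tasks):
        yield await finished


async def main():
    jobs = [("https://example.com/slow", 0.5), ("https://example.com/fast", 0.01)]
    return [pair async for pair in fetch_batch(jobs, timeout=0.1)]


pairs = asyncio.run(main())
print(pairs)
```

The fast page is yielded first even though it was submitted second, and the slow page is reported with `None` once its timeout expires.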

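Extracting URL resources

Use the _extract_urls method to extract image and file links from HTML content. The leading underscore marks it as a private helper by Python convention, so its signature may change between releases: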

html_content = "<html>...</html>"
base_url = "https://example.com"
url_resource = fetcher._extract_urls(html_content, base_url)
print(url_resource)
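
The return format of `_extract_urls` is not documented here. As a rough illustration of what link extraction involves, the sketch below uses the standard-library `html.parser` instead of beautifulsoup4 (which the package lists as a dependency). `LinkExtractor` and `FILE_EXTENSIONS` are hypothetical names, and the set of extensions is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical set of extensions treated as downloadable files.
FILE_EXTENSIONS = (".pdf", ".zip", ".docx", ".xlsx")


class LinkExtractor(HTMLParser):
    """Collect absolute image URLs and file links from an HTML document."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.images = []
        self.files = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            # Resolve relative paths against the page URL.
            self.images.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "a" and attrs.get("href"):
            url = urljoin(self.base_url, attrs["href"])
            if url.lower().endswith(FILE_EXTENSIONS):
                self.files.append(url)


html_content = (
    '<html><body><img src="/logo.png">'
    '<a href="docs/report.pdf">Report</a>'
    '<a href="/about">About</a></body></html>'
)
extractor = LinkExtractor("https://example.com")
extractor.feed(html_content)
print(extractor.images)
print(extractor.files)
```

Resolving every link with `urljoin` is why a base URL is needed alongside the HTML: relative `src` and `href` values are meaningless without it.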

Environment variables

Create a .env file in the project root and add your API key:

API_KEY=your_api_key_here
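
Since python-dotenv is listed as a dependency, the package presumably loads this file with `load_dotenv()`, though the README does not say so. The sketch below shows the same idea with only the standard library; `load_env_file` is a hypothetical helper that handles plain KEY=VALUE lines and nothing more.

```python
import os
import tempfile
from pathlib import Path


def load_env_file(path: Path) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


# Demo with a temporary file so the sketch does not touch a real .env.
with tempfile.TemporaryDirectory() as tmp:
    env_path = Path(tmp) / ".env"
    env_path.write_text("# credentials\nAPI_KEY=your_api_key_here\n", encoding="utf-8")
    parsed = load_env_file(env_path)
    # setdefault: values already present in the environment take precedence.
    for key, value in parsed.items():
        os.environ.setdefault(key, value)

api_key = os.environ.get("API_KEY")
print(api_key is not None)
```

Keeping the key in .env rather than in source code means it stays out of version control, provided the file is excluded from commits.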

Dependencies

The project depends on the following Python packages:

  • requests
  • beautifulsoup4
  • diskcache
  • python-dotenv
  • htmldate
  • pydantic

Contributing

Contributions are welcome! Please fork the repository and submit a pull request.

License

This project is released under the MIT License.

Download files

Source Distribution

web_fetch-0.1.1.2.tar.gz (5.7 kB)

Uploaded Source

Built Distribution

web_fetch-0.1.1.2-py3-none-any.whl (5.8 kB)

Uploaded Python 3

File details

Details for the file web_fetch-0.1.1.2.tar.gz.

File metadata

  • Download URL: web_fetch-0.1.1.2.tar.gz
  • Size: 5.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.10.15

File hashes

Hashes for web_fetch-0.1.1.2.tar.gz
Algorithm Hash digest
SHA256 15273e9a26eb320b1d5d63b1dfbf552181dc5da486d27669747be9eeccd3de4f
MD5 65028f416c7b3cbd598df45198b8469a
BLAKE2b-256 59824460a193e838d7d4bc4117d5c73506f38b8cf0df32ff9d4070c25411ae3e

File details

Details for the file web_fetch-0.1.1.2-py3-none-any.whl.

File metadata

  • Download URL: web_fetch-0.1.1.2-py3-none-any.whl
  • Size: 5.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.10.15

File hashes

Hashes for web_fetch-0.1.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 48c7d40abed76b2cb9b7224b467dcd12bc2acddb5656bc676068cbacf2644f68
MD5 7a20c2f3de5c9b87652cddfabff63d47
BLAKE2b-256 63bd246d3bbaf2bb81c8403b7102f5ddacb718d1f6ae5bf1cc3c04c94ad79901
