A package to fetch web resources and cache HTML content.
Project description
fetch_web_resource
fetch_web_resource is a Python package for fetching web resources and caching HTML content.
Features
- Fetch web page content
- Extract image and file links from web pages
- Cache fetched HTML content
- Send requests with a randomly chosen User-Agent
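The random User-Agent feature can be illustrated with a minimal sketch. This is not the package's own implementation: the `USER_AGENTS` pool and the `build_request` helper are hypothetical names, and the sketch uses the standard library's `urllib.request` so that it runs without extra dependencies.

```python
import random
import urllib.request

# Hypothetical pool of User-Agent strings; the package's real list is not documented.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 13_0) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
]


def build_request(url: str) -> urllib.request.Request:
    """Return a request object whose User-Agent header is picked at random."""
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return urllib.request.Request(url, headers=headers)


# Build (but do not send) a request and inspect the chosen header.
request = build_request("https://example.com/page1")
print(request.get_header("User-agent"))
```

Choosing the header per request, rather than once per process, means consecutive requests to the same site need not share a User-Agent.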
Installation
You can install the package by running the following command from the project root:
pip install .
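Usage
Initialization
First, create an instance of the HTMLFetcher class:
from web_fetch.web_fetch import HTMLFetcher
import diskcache
# On-disk cache stored in the ./html_cache directory
cache = diskcache.Cache('./html_cache')
# max_concurrent_per_domain is set to 5 here
fetcher = HTMLFetcher(cache=cache, max_concurrent_per_domain=5)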
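How a cache like this is typically consulted can be sketched as follows. This only illustrates the cache-then-fetch pattern, not HTMLFetcher's actual logic: `fetch_with_cache` and `fake_download` are hypothetical names, and a plain dict stands in for `diskcache.Cache`, which supports the same `in` and `[]` operations.

```python
from typing import Callable, MutableMapping


def fetch_with_cache(
    url: str,
    cache: MutableMapping[str, str],
    download: Callable[[str], str],
) -> str:
    """Return the cached HTML for url, downloading and storing it on a miss."""
    if url in cache:
        return cache[url]
    html = download(url)
    cache[url] = html
    return html


calls = []


def fake_download(url: str) -> str:
    # Stand-in for a real HTTP request; records each call so misses can be counted.
    calls.append(url)
    return f"<html>{url}</html>"


cache = {}  # a dict stands in for diskcache.Cache in this sketch
first = fetch_with_cache("https://example.com/page1", cache, fake_download)
second = fetch_with_cache("https://example.com/page1", cache, fake_download)
print(len(calls))  # the second call is served from the cache
```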
Fetching web page content
You can use the fetch_html_batch method to fetch a batch of pages:
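import asyncio

# SearchResult is used as in the original example, which does not show where it is imported from.
results = [
    SearchResult(url="https://example.com/page1"),
    SearchResult(url="https://example.com/page2"),
]

async def main():
    # "async for" is only valid inside a coroutine, so the loop is wrapped in one.
    async for result in fetcher.fetch_html_batch(results, timeout=5):
        print(result)

asyncio.run(main())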
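The max_concurrent_per_domain setting suggests a per-domain limit on in-flight requests. The sketch below shows one common way to build such a limit with one `asyncio.Semaphore` per domain. It is an assumption about the general technique, not the package's code: `fetch_all`, `fake_fetch`, and `MAX_CONCURRENT_PER_DOMAIN` are hypothetical names, and the network call is simulated with a short sleep.

```python
import asyncio
from collections import defaultdict
from urllib.parse import urlparse

MAX_CONCURRENT_PER_DOMAIN = 2  # illustrative value

stats = {"active": 0, "peak": 0}


async def fake_fetch(url: str) -> str:
    # Simulated download that tracks how many fetches overlap.
    stats["active"] += 1
    stats["peak"] = max(stats["peak"], stats["active"])
    await asyncio.sleep(0.01)
    stats["active"] -= 1
    return url


async def fetch_all(urls):
    """Yield results as they complete, limiting concurrency per domain."""
    semaphores = defaultdict(lambda: asyncio.Semaphore(MAX_CONCURRENT_PER_DOMAIN))

    async def limited(url: str) -> str:
        domain = urlparse(url).netloc
        async with semaphores[domain]:
            return await fake_fetch(url)

    tasks = [asyncio.create_task(limited(url)) for url in urls]
    for finished in asyncio.as_completed(tasks):
        yield await finished


async def main():
    urls = [f"https://example.com/page{i}" for i in range(5)]
    return [result async for result in fetch_all(urls)]


results = asyncio.run(main())
print(len(results), stats["peak"])
```

Because all five URLs share one domain, the overlap never exceeds the configured limit, while URLs on other domains would get their own semaphore.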
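Extracting URL resources
You can use the _extract_urls method to extract image and file links from HTML content. The leading underscore marks it as private by Python convention, so it may change between releases: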
html_content = "<html>...</html>"
base_url = "https://example.com"
url_resource = fetcher._extract_urls(html_content, base_url)
print(url_resource)
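The kind of extraction described above can be sketched with the standard library alone. The package lists beautifulsoup4 as a dependency, so its real implementation likely differs. Here `html.parser` is swapped in to keep the example dependency-free, and `LinkCollector` and `FILE_EXTENSIONS` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical set of extensions treated as downloadable files.
FILE_EXTENSIONS = (".pdf", ".zip", ".docx")


class LinkCollector(HTMLParser):
    """Collect absolute image URLs and file links from an HTML document."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.images = []
        self.files = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "img" and attributes.get("src"):
            # Resolve relative paths against the page URL.
            self.images.append(urljoin(self.base_url, attributes["src"]))
        elif tag == "a" and attributes.get("href"):
            href = urljoin(self.base_url, attributes["href"])
            if href.lower().endswith(FILE_EXTENSIONS):
                self.files.append(href)


html_content = (
    '<html><body><img src="/logo.png">'
    '<a href="docs/report.pdf">Report</a>'
    '<a href="/about">About</a></body></html>'
)
collector = LinkCollector("https://example.com")
collector.feed(html_content)
print(collector.images)
print(collector.files)
```

The link to /about is skipped because it does not end in one of the listed file extensions.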
Environment variables
Create a .env file in the project root and add your API key:
API_KEY=your_api_key_here
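How a .env file of this shape can be read is sketched below. The package lists python-dotenv among its dependencies and presumably relies on it. This sketch instead uses a small standard-library parser as a stand-in, and `parse_env_file` is a hypothetical helper that handles only simple KEY=VALUE lines.

```python
import tempfile
from pathlib import Path


def parse_env_file(path) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


# Write a throwaway .env file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    env_path = Path(tmp) / ".env"
    env_path.write_text("# credentials\nAPI_KEY=your_api_key_here\n", encoding="utf-8")
    values = parse_env_file(env_path)

print(values["API_KEY"])
```

In a real project, keep the .env file out of version control so the key is not published with the code.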
Dependencies
This project depends on the following Python packages:
- requests
- beautifulsoup4
- diskcache
- python-dotenv
- htmldate
- pydantic
Contributing
Contributions are welcome! Please fork this repository and submit a pull request.
License
This project is licensed under the MIT License.
Download files
Source Distribution
web_fetch-0.1.1.3.tar.gz (5.7 kB)
Built Distribution
File details
Details for the file web_fetch-0.1.1.3.tar.gz.
File metadata
- Download URL: web_fetch-0.1.1.3.tar.gz
- Upload date:
- Size: 5.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.10.15
File hashes
Algorithm | Hash digest
---|---
SHA256 | 0b24325ef84d067665b6697a60a3e758e42c5b7c6a6d822345debdaf02b35893
MD5 | 09492eff575f97551d5d53579f0ac345
BLAKE2b-256 | a61aedb0918de8e5c2429fbd3ac252f73c899ce2be1ae1a913bf1165ae529557
File details
Details for the file web_fetch-0.1.1.3-py3-none-any.whl.
File metadata
- Download URL: web_fetch-0.1.1.3-py3-none-any.whl
- Upload date:
- Size: 5.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.10.15
File hashes
Algorithm | Hash digest
---|---
SHA256 | 03db5100301b327b6c0dbd589cf22363b6e1e24a869dc97bb409c727c91d49d8
MD5 | 5bcb613f9344bffeb4576873d30db883
BLAKE2b-256 | 6eca48eca3d682193c5e5e699967ef7c6c4348dafaef44c062ffe7ac967b4fad