A fast HTML content extractor based on Mozilla's Readability.js
Fast Readability
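A fast HTML content extractor based on Mozilla Readability.js that pulls clean article content out of web pages.
Features
- 🚀 Fast: high-performance content extraction running on a JavaScript engine
- 🧹 Clean: automatically removes ads, navigation bars, sidebars, and other unrelated content
- 🌐 Multilingual: extracts content from web pages in many languages
- 📱 Easy to use: a simple Python API that accepts HTML strings and URLs
- 🔧 Configurable: supports custom request headers, timeouts, and other parameters
Installation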
pip install fast-readability
Or install from source:
git clone https://github.com/jiankaiwang/fast-readability.git
cd fast-readability
pip install -e .
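Quick start
Extracting content from a URL
from fast_readability import Readability
import requests
# Create an extractor instance
reader = Readability()
# Download the page, then extract content from its HTML
url = "https://example.com/article"
response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early on HTTP errors
html = response.text
# The downloaded markup is an HTML string, so it goes to extract_from_html
result = reader.extract_from_html(html)
print("Title:", result["title"])
print("Text:", result["textContent"])
print("HTML content:", result["content"])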
Extracting content from an HTML string
from fast_readability import Readability
# HTML content
html = """
<html>
<head><title>Sample article</title></head>
<body>
<article>
<h1>This is the title</h1>
<p>This is the body text of the article...</p>
</article>
<aside>This is a sidebar and will be filtered out</aside>
</body>
</html>
"""
reader = Readability()
result = reader.extract_from_html(html)
print("Title:", result["title"])
print("Text:", result["textContent"])
Convenience function
from fast_readability import extract_content
# Extract directly from an HTML string (here, the html variable from the previous example)
result = extract_content(html)
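When several pages need processing, one failing page should not abort the rest. The following is a minimal sketch of that pattern, not part of the package: `extract_many` and `fake_extract` are hypothetical names, and the stand-in extractor exists only so the example runs without fast-readability installed.

```python
from typing import Any, Callable, Dict, Iterable, Tuple


def extract_many(
    pages: Iterable[Tuple[str, str]],
    extract: Callable[[str], Dict[str, Any]],
) -> Tuple[Dict[str, Dict[str, Any]], Dict[str, str]]:
    """Run `extract` over (name, html) pairs, collecting results and errors separately."""
    results: Dict[str, Dict[str, Any]] = {}
    errors: Dict[str, str] = {}
    for name, html in pages:
        try:
            results[name] = extract(html)
        except Exception as exc:  # keep going; record what went wrong
            errors[name] = str(exc)
    return results, errors


def fake_extract(html: str) -> Dict[str, Any]:
    """Hypothetical stand-in for an extractor such as extract_content."""
    if not html.strip():
        raise ValueError("empty document")
    return {"title": "stub", "textContent": html}


results, errors = extract_many(
    [("a", "<p>hello</p>"), ("b", "   ")],
    fake_extract,
)
```

With the package installed, `extract_content` could be passed in place of `fake_extract`. How it behaves on malformed input is not documented here, so the broad `except` is a defensive assumption.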
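API reference
Readability class
__init__(debug=False)
Creates a Readability instance.
- debug (bool): whether to enable debug mode
extract_from_html(html)
Extracts content from an HTML string.
- html (str): the HTML string
Returns a dictionary with the following fields:
- title: article title
- content: article content as HTML
- textContent: article content as plain text
- length: content length
- excerpt: article excerpt
- byline: author information
- dir: text direction
- siteName: site name
- lang: language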
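The result dictionary can be wrapped in a small typed container so downstream code does not depend on string keys. This is only a sketch: the `Article` class and its `from_result` method are hypothetical, and the field types (strings that may be missing, an integer `length`) are assumptions, since the field list above does not state them.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class Article:
    """Hypothetical typed view over the documented result fields."""
    title: Optional[str] = None
    content: Optional[str] = None
    text_content: Optional[str] = None
    length: Optional[int] = None
    excerpt: Optional[str] = None
    byline: Optional[str] = None
    direction: Optional[str] = None
    site_name: Optional[str] = None
    lang: Optional[str] = None

    @classmethod
    def from_result(cls, result: Dict[str, Any]) -> "Article":
        # .get() tolerates absent keys; missing fields simply stay None
        return cls(
            title=result.get("title"),
            content=result.get("content"),
            text_content=result.get("textContent"),
            length=result.get("length"),
            excerpt=result.get("excerpt"),
            byline=result.get("byline"),
            direction=result.get("dir"),
            site_name=result.get("siteName"),
            lang=result.get("lang"),
        )


# Hand-written sample data, not real library output
sample = {"title": "Example", "textContent": "Body text", "length": 9, "lang": "en"}
article = Article.from_result(sample)
```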
get_text_content(html)
Returns the plain-text content.
get_title(html)
Returns the article title.
is_probably_readable(html, min_content_length=140)
Checks whether the HTML contains readable content.
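To give a feel for what a minimum content length threshold means, here is a deliberately simplified stand-in built on the standard library. It is not how the package or Readability.js performs the check; `visible_text` and `looks_readable` are hypothetical helpers that only measure the total visible text against the same default of 140 characters.

```python
from html.parser import HTMLParser
from typing import List


class _TextCollector(HTMLParser):
    """Collects text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def visible_text(html: str) -> str:
    """Return the document's visible text with whitespace collapsed."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def looks_readable(html: str, min_content_length: int = 140) -> bool:
    """Simplified assumption: readable means enough visible text overall."""
    return len(visible_text(html)) >= min_content_length
```

A cheap pre-check of this kind can be useful for skipping pages that are obviously empty before running a full extraction.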
Convenience functions
extract_content(html, debug=False)
Convenience function that extracts content from HTML.
extract_from_url(url, debug=False, **kwargs)
Convenience function that extracts content from a URL.
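The keyword arguments accepted by `extract_from_url` are not listed here, so the sketch below does not guess at them. It shows one way to download a page with custom headers and a timeout using only the standard library, producing an HTML string that can then be passed to `extract_content`. The names `build_request`, `fetch_html`, and the `DEFAULT_HEADERS` value are hypothetical.

```python
import urllib.request
from typing import Dict, Optional

# Hypothetical default; replace with a User-Agent suited to your application
DEFAULT_HEADERS = {"User-Agent": "fast-readability-example/0.1"}


def build_request(url: str, headers: Optional[Dict[str, str]] = None) -> urllib.request.Request:
    """Create a request whose headers are the defaults overridden by `headers`."""
    merged = {**DEFAULT_HEADERS, **(headers or {})}
    return urllib.request.Request(url, headers=merged)


def fetch_html(url: str, headers: Optional[Dict[str, str]] = None, timeout: float = 10.0) -> str:
    """Download `url` and decode the body using the declared charset (UTF-8 fallback)."""
    request = build_request(url, headers)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

A call such as `extract_content(fetch_html(url, headers={"Accept-Language": "en"}))` would then combine the two steps.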
Dependencies
- Python 3.7+
- quickjs
- beautifulsoup4
- requests
- urllib3
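A quick way to confirm that an environment meets these requirements is to check the interpreter version and look up each dependency without importing it. This is a sketch with hypothetical helper names. The import names are assumptions based on how these distributions are normally installed (beautifulsoup4 is imported as `bs4`).

```python
import importlib.util
import sys
from typing import Iterable, List, Sequence

# Assumed import names for the distributions listed above
REQUIRED_MODULES = ["quickjs", "bs4", "requests", "urllib3"]


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_is_supported(version_info: Sequence[int] = sys.version_info) -> bool:
    """True when the interpreter is Python 3.7 or newer."""
    return tuple(version_info[:2]) >= (3, 7)


missing = missing_modules(REQUIRED_MODULES)
```

An empty `missing` list together with `python_is_supported()` returning `True` suggests the environment is ready.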
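License
This project is licensed under the Mozilla Public License 2.0.
Contributing
Issues and pull requests are welcome!
Acknowledgements
This project builds on the following open-source projects:
- Mozilla Readability.js - core content extraction algorithm
- JSDOMParser - JavaScript DOM parser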
File details
Details for the file fast_readability-0.0.1.tar.gz.
File metadata
- Download URL: fast_readability-0.0.1.tar.gz
- Upload date:
- Size: 50.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 548a2941ec1a225b68e58bcb73528e40dc4d4ad10d8cdbd20b4b2b360c50b782 |
| MD5 | 2eb1de12eecd4ed084819ac0478a24c6 |
| BLAKE2b-256 | feac40209205aa52c3ffc167b26cfc6ca0eea0bc0acfdcd7e1156fc5c6848ad3 |
File details
Details for the file fast_readability-0.0.1-py3-none-any.whl.
File metadata
- Download URL: fast_readability-0.0.1-py3-none-any.whl
- Upload date:
- Size: 46.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0a2938de29523f100907f6cf867aade7a50f6505c3a028a4036227f29fb3d0f6 |
| MD5 | c5bd4e995352f2d50457591478e1a75b |
| BLAKE2b-256 | 83df4ecf9a0a9bad6e5e4bcc2d1959647d23d3fc3d30a517c91162d0f4edb7cb |