A fast HTML content extractor based on Mozilla's Readability.js

Fast Readability

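A fast HTML content extractor based on Mozilla Readability.js, for extracting clean article content from web pages.

Features

🚀 Fast: high-performance content extraction backed by a JavaScript engine
🧹 Clean: automatically removes ads, navigation bars, sidebars, and other irrelevant content
🌐 Multilingual: extracts content from web pages in many languages
📱 Easy to use: a simple Python API that accepts HTML strings and URLs
🔧 Configurable: supports custom request headers, timeouts, and other parameters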

Installation

pip install fast-readability

Or install from source:

git clone https://github.com/jiankaiwang/fast-readability.git
cd fast-readability
pip install -e .

Quick start

Extract content from a URL

from fast_readability import Readability
import requests

# Create an extractor instance
reader = Readability()

# Download the page, then extract content from the fetched HTML
url = "https://example.com/article"
html = requests.get(url, timeout=10).text
result = reader.extract_from_html(html)

print("Title:", result["title"])
print("Text:", result["textContent"])
print("HTML content:", result["content"])

Extract content from an HTML string

from fast_readability import Readability

# HTML content
html = """
<html>
<head><title>Sample article</title></head>
<body>
    <article>
        <h1>This is the heading</h1>
        <p>This is the body text of the article...</p>
    </article>
    <aside>This is a sidebar and will be filtered out</aside>
</body>
</html>
"""

reader = Readability()
result = reader.extract_from_html(html)

print("Title:", result["title"])
print("Text:", result["textContent"])

Convenience functions

from fast_readability import extract_content

# Extract directly from HTML
result = extract_content(html)

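API Reference

Readability class

`__init__(debug=False)`

Creates a Readability instance.

debug (bool): whether to enable debug mode

`extract_from_html(html)`

Extracts content from an HTML string.

html (str): the HTML string

Returns a dictionary with the following fields:

title: article title
content: article content as HTML
textContent: article content as plain text
length: content length
excerpt: article excerpt
byline: author information
dir: text direction
siteName: site name
lang: language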
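
As an illustration of working with the returned dictionary, the sketch below formats a one-line summary from a result. The `summarize` helper and its `preview_chars` parameter are hypothetical and not part of fast-readability; the sketch also assumes that some fields (for example `byline`) may be missing or `None`, which the documentation above does not state either way.

```python
from typing import Any, Dict


def summarize(result: Dict[str, Any], preview_chars: int = 80) -> str:
    """Build a one-line summary from an extraction result (hypothetical helper)."""
    # Assumption: any field may be absent or None, so fall back to placeholders.
    title = result.get("title") or "(untitled)"
    byline = result.get("byline") or "unknown author"
    text = (result.get("textContent") or "").strip()
    # Truncate long text and mark the cut with an ellipsis.
    preview = text[:preview_chars] + ("..." if len(text) > preview_chars else "")
    return f"{title} | {byline} | {preview}"
```

With the library installed, this could be called as `summarize(Readability().extract_from_html(html))`.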

`get_text_content(html)`

Returns the plain-text content.

`get_title(html)`

Returns the article title.

`is_probably_readable(html, min_content_length=140)`

Checks whether the HTML contains readable content.
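
One way the readability check might be combined with extraction is to skip pages that fail the check. This is a minimal sketch: `extract_if_readable` is a hypothetical helper, and it assumes `is_probably_readable` returns a truthy or falsy value. It takes the reader as a parameter, so it works with any object exposing the two documented methods.

```python
from typing import Any, Dict, Optional


def extract_if_readable(
    reader: Any, html: str, min_content_length: int = 140
) -> Optional[Dict[str, Any]]:
    """Run extraction only when the page looks readable (hypothetical helper)."""
    # Assumption: is_probably_readable returns a bool-like value.
    if not reader.is_probably_readable(html, min_content_length=min_content_length):
        return None
    return reader.extract_from_html(html)
```

With the library installed, usage would look like `extract_if_readable(Readability(), html)`, which returns `None` for pages that fail the check.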

Convenience functions

`extract_content(html, debug=False)`

Convenience function for extracting content from HTML.

`extract_from_url(url, debug=False, **kwargs)`

Convenience function for extracting content from a URL.
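
The keyword arguments accepted by `extract_from_url` are not listed here, so the sketch below shows an alternative that keeps request headers and the timeout under your control: download the page yourself, then pass the text to `extract_content` or `extract_from_html`. It uses the standard library's `urllib.request` rather than `requests` so that it is self-contained. `fetch_html` and its `opener` parameter are hypothetical names, not part of fast-readability.

```python
import urllib.request
from typing import Callable, Dict, Optional


def fetch_html(
    url: str,
    timeout: float = 10.0,
    headers: Optional[Dict[str, str]] = None,
    opener: Callable = urllib.request.urlopen,
) -> str:
    """Download a page and decode it to text (hypothetical helper)."""
    request = urllib.request.Request(url, headers=headers or {})
    with opener(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

The downloaded text can then be passed on, for example `extract_content(fetch_html(url, headers={"User-Agent": "demo/1.0"}))`. The `opener` parameter exists only so the network call can be swapped out in tests.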

Dependencies

Python 3.7+
quickjs
beautifulsoup4
requests
urllib3

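License

This project is licensed under the Mozilla Public License 2.0.

Contributing

Issues and pull requests are welcome!

Acknowledgements

This project builds on the following open-source projects:

Mozilla Readability.js - core content extraction algorithm
JSDOMParser - JavaScript DOM parser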

Release history

This version

0.0.1

Jun 3, 2025

Download files

Source Distribution

fast_readability-0.0.1.tar.gz (50.1 kB)

Uploaded Jun 3, 2025 Source

Built Distribution

fast_readability-0.0.1-py3-none-any.whl (46.3 kB)

Uploaded Jun 3, 2025 Python 3

File details

Details for the file fast_readability-0.0.1.tar.gz.

File metadata

Download URL: fast_readability-0.0.1.tar.gz
Upload date: Jun 3, 2025
Size: 50.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for fast_readability-0.0.1.tar.gz
Algorithm	Hash digest
SHA256	`548a2941ec1a225b68e58bcb73528e40dc4d4ad10d8cdbd20b4b2b360c50b782`
MD5	`2eb1de12eecd4ed084819ac0478a24c6`
BLAKE2b-256	`feac40209205aa52c3ffc167b26cfc6ca0eea0bc0acfdcd7e1156fc5c6848ad3`
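
A downloaded archive can be checked against the SHA256 digest listed above using Python's standard `hashlib` module. This is a minimal sketch: the helper names and the local file path in the usage note are assumptions.

```python
import hashlib

# SHA256 digest of fast_readability-0.0.1.tar.gz, copied from the table above.
EXPECTED_SDIST_SHA256 = "548a2941ec1a225b68e58bcb73528e40dc4d4ad10d8cdbd20b4b2b360c50b782"


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected_hex: str = EXPECTED_SDIST_SHA256) -> bool:
    """Compare a file's digest with the expected hex string (hypothetical helper)."""
    return sha256_of_file(path) == expected_hex.lower()
```

For example, `matches_expected("fast_readability-0.0.1.tar.gz")` should return `True` for an intact copy of the source distribution saved under that name in the current directory.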

File details

Details for the file fast_readability-0.0.1-py3-none-any.whl.

File metadata

Download URL: fast_readability-0.0.1-py3-none-any.whl
Upload date: Jun 3, 2025
Size: 46.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for fast_readability-0.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0a2938de29523f100907f6cf867aade7a50f6505c3a028a4036227f29fb3d0f6`
MD5	`c5bd4e995352f2d50457591478e1a75b`
BLAKE2b-256	`83df4ecf9a0a9bad6e5e4bcc2d1959647d23d3fc3d30a517c91162d0f4edb7cb`
