data press for python

Project description

pressor

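Introduction

pressor extracts text data from files in a variety of source formats. The name "pressor" is meant to suggest a compressor; the hope is that this project can serve as a compressor for data.

Supported formats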

  • html
  • epub
  • mobi
  • azw
  • azw3
  • docx

Quick start

from pressor import pressor

file_path = 'your_file_path.epub'
result = pressor(file_path)
print(result)
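
When processing many files, it can help to filter out unsupported formats before calling `pressor`. The sketch below does this using only the extensions listed under "Supported formats". The names `SUPPORTED_SUFFIXES`, `is_supported`, and `filter_supported` are hypothetical helpers, not part of pressor, and the case-insensitive suffix match is an assumption about what is sensible rather than documented behavior.

```python
from pathlib import Path

# Extensions taken from the "Supported formats" list; the set itself is a
# hypothetical helper and is not exported by pressor.
SUPPORTED_SUFFIXES = {".html", ".epub", ".mobi", ".azw", ".azw3", ".docx"}


def is_supported(path):
    """Return True if the file extension appears in the supported list."""
    # Lower-casing the suffix is an assumption made for convenience.
    return Path(path).suffix.lower() in SUPPORTED_SUFFIXES


def filter_supported(paths):
    """Keep only the paths whose extension is in the supported list."""
    return [p for p in paths if is_supported(p)]
```

Each path returned by `filter_supported` can then be passed to `pressor(file_path)` as in the quick-start snippet.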

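Extracting the main text from website HTML

Because page structures differ from site to site, there is currently no perfect scheme for extracting the main body text of a page. The approach used in this project does not fit every website either, but in practical testing it extracts the main text from most sites.

Approach

  • A batch of popular websites has been adapted and is parsed with XPath expressions prepared in advance; more sites will be adapted over time
  • Other websites are parsed with a general-purpose algorithm that extracts the main text

Adapted websites

Website Alias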
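
The text above does not say which general-purpose algorithm pressor uses. Purely as an illustration of what a site-agnostic extractor can look like, the sketch below applies a simple link-density heuristic: it scores each `<div>` by its amount of text, penalizes text inside links (typical of navigation bars), and returns the best-scoring block. `DensityParser`, `guess_main_text`, and `SAMPLE_HTML` are hypothetical names, and this is not pressor's implementation.

```python
from html.parser import HTMLParser

# Small illustrative page: a navigation bar, an article, and a footer.
SAMPLE_HTML = """<html><body>
<div class="nav"><a href="/">Home</a> <a href="/news">News</a> <a href="/about">About us</a></div>
<div class="article"><p>The first paragraph of the story has plenty of words.</p><p>So does the second one.</p></div>
<div class="footer"><a href="/c">Contact</a></div>
</body></html>"""


class DensityParser(HTMLParser):
    """Collect text per <div> and score it by non-link text length."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.stack = []       # one frame per currently open <div>
        self.blocks = []      # finished blocks as (score, text)
        self.skip_depth = 0   # > 0 while inside <script> or <style>
        self.link_depth = 0   # > 0 while inside <a>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "a":
            self.link_depth += 1
        elif tag == "div":
            self.stack.append({"parts": [], "link_chars": 0})

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag == "a" and self.link_depth:
            self.link_depth -= 1
        elif tag == "div" and self.stack:
            frame = self.stack.pop()
            text = " ".join(frame["parts"])
            # Link text counts negatively: score = non-link chars - link chars.
            score = len(text) - 2 * frame["link_chars"]
            self.blocks.append((score, text))

    def handle_data(self, data):
        if self.skip_depth or not self.stack:
            return
        chunk = " ".join(data.split())
        if not chunk:
            return
        frame = self.stack[-1]  # attribute text to the innermost open <div>
        frame["parts"].append(chunk)
        if self.link_depth:
            frame["link_chars"] += len(chunk)


def guess_main_text(html_text):
    """Return the text of the highest-scoring <div>, or '' if none is found."""
    parser = DensityParser()
    parser.feed(html_text)
    parser.close()
    if not parser.blocks:
        return ""
    return max(parser.blocks, key=lambda block: block[0])[1]
```

Calling `guess_main_text(SAMPLE_HTML)` selects the article block, because the navigation and footer blocks consist almost entirely of link text. Real pages need far more robust scoring, which is presumably why hand-written XPath rules are used for the adapted sites.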
https://www.ifeng.com/ ifeng
https://www.sohu.com/ sohu
https://www.163.com/ 163
https://www.sina.com.cn/ sina
https://www.qq.com/ new_qq
https://www.huxiu.com/ huxiu
https://baijiahao.baidu.com/ baijiahao_baidu
https://baike.baidu.com/ baike_baidu
https://zhuanlan.zhihu.com/ zhuanlan_zhihu
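
The table above can be read as a lookup from hostname to alias. How pressor matches a URL against its whitelist internally is not documented here, so the following is only a sketch under the assumption of an exact hostname match; `WHITELIST_ALIASES` and `resolve_alias` are hypothetical names, not pressor APIs.

```python
from urllib.parse import urlparse

# Hostname -> alias, copied from the table of adapted websites.
WHITELIST_ALIASES = {
    "www.ifeng.com": "ifeng",
    "www.sohu.com": "sohu",
    "www.163.com": "163",
    "www.sina.com.cn": "sina",
    "www.qq.com": "new_qq",
    "www.huxiu.com": "huxiu",
    "baijiahao.baidu.com": "baijiahao_baidu",
    "baike.baidu.com": "baike_baidu",
    "zhuanlan.zhihu.com": "zhuanlan_zhihu",
}


def resolve_alias(url_or_alias):
    """Return the whitelist alias for a URL or alias, or None if unknown."""
    # An alias passed in directly is returned unchanged.
    if url_or_alias in WHITELIST_ALIASES.values():
        return url_or_alias
    # Assumption: match on the exact hostname of the URL.
    host = urlparse(url_or_alias).hostname
    return WHITELIST_ALIASES.get(host)
```

A result of `None` would indicate a site outside the adapted list, for which the general-purpose algorithm is the fallback.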

Usage guide

  • For an HTML file, extract with pressor
from pressor import pressor
file_path = 'your_html_path.html'
# Extract the main text with the general-purpose algorithm
result = pressor(file_path)
print(result)

# Extract the main text using the whitelist (adapted websites)
# url may also be the alias of a whitelisted site, e.g. the alias of https://www.sina.com.cn/ is sina, so url="sina"
url = 'https://you_web.com'
result = pressor(file_path, url=url)
print(result)
  • For HTML data already loaded into memory, extract with pressor
from pressor import html_data_to_text
import requests

url = '' 
headers = {}
resp = requests.get(url, headers=headers)
html_text = resp.text
result = html_data_to_text(html_text, url=url)
print(result)
  • Extract the main text with custom XPath expressions
from pressor import get_text_from_whitelist
with open('xxx.html') as f:
    html_text = f.read()
true_xpath = {'title': 'xxxxx ', 'main_body': 'yyyyy'}
result = get_text_from_whitelist(html_text, true_xpath=true_xpath)
print(result)
  • Extract the main text with the general-purpose algorithm
from pressor import get_text_from_main_body
with open('xxx.html') as f:
    html_text = f.read()
result = get_text_from_main_body(html_text)
print(result)
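
The `true_xpath` dictionary in the custom-XPath item above maps the field names `title` and `main_body` to XPath expressions. To show what such rules select, here is a minimal sketch using only the standard library. It assumes well-formed markup and the limited XPath subset that `xml.etree.ElementTree` supports; real-world HTML generally needs a lenient parser. `extract_with_rules`, `SAMPLE_MARKUP`, and `RULES` are hypothetical, and this is not how `get_text_from_whitelist` is implemented.

```python
import xml.etree.ElementTree as ET

# Well-formed sample markup; ElementTree cannot parse sloppy real-world HTML.
SAMPLE_MARKUP = """<html><body>
<h1 class="headline">Sample title</h1>
<div class="nav">Home | News</div>
<div class="article"><p>First paragraph.</p><p>Second paragraph.</p></div>
</body></html>"""

# Same shape as the true_xpath dictionary: field name -> XPath expression.
RULES = {
    "title": ".//h1[@class='headline']",
    "main_body": ".//div[@class='article']",
}


def extract_with_rules(markup, rules):
    """Apply one XPath per field and return the whitespace-normalized text."""
    root = ET.fromstring(markup)
    result = {}
    for field, path in rules.items():
        node = root.find(path)
        if node is None:
            result[field] = ""  # the rule matched nothing
        else:
            # Join all nested text fragments, then collapse runs of whitespace.
            result[field] = " ".join(" ".join(node.itertext()).split())
    return result
```

With these rules, the navigation block is ignored because neither expression selects it, which is the benefit of per-site XPath rules over a generic heuristic.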
