sense_text_extractor
sense-text-extractor
sense-text-extractor is a client library for extracting the main body text of web pages.
Installation (current version: 0.0.1)
pip install sense-text-extractor
Usage
Calling with a label configured in sense-core's settings.ini:
from sense_text_extractor import SenseTextExtractor
extractor = SenseTextExtractor(label='text_extractor')
text = extractor.extract_text("http://sports.sina.com.cn/g/pl/2019-01-11/doc-ihqhqcis5048507.shtml", "穆里尼奥在等待复出")
print(text)
Calling with a host and port:
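from sense_text_extractor import SenseTextExtractor
# Connect directly to an extractor server by host and port
extractor = SenseTextExtractor('52.83.143.61', '6681')
text = extractor.extract_text("http://sports.sina.com.cn/g/pl/2019-01-11/doc-ihqhqcis5048507.shtml", "穆里尼奥在等待复出")
print(text)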
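Notes
The extract_text method may raise exceptions, which the caller must catch. The return value is a string; an empty string ('') means that the body text may not have been extracted. When used in a crawler, pass the downloaded HTML source to extract_text as the third argument; otherwise the extractor's server side may time out fetching the page and raise an exception, and it is also easily blocked by anti-crawler measures.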
Hashes for sense-text-extractor-0.0.4.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 2111d5fdd20325deefbd2f84fe5d2e5b79a97e0adcd77c54d7f490aa0f5e66af
MD5 | 6244c986d4a84a99c968dadd7969e93a
BLAKE2b-256 | af3c9ae1a792e19ed95550a0a0bc58643dc2e5f3b32d9f4c20ee1e627e0077e4
Hashes for sense_text_extractor-0.0.4-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | c1d12d02e4f3e172068d44b334657d65d9e0c4b34174d6f4604d93bf3efb7874
MD5 | f235e088b844f087395f2d48ba4b1197
BLAKE2b-256 | 117003c889f80a2b91dc454884475c7da8bca34b75936a152cd94b9007ec8262