Extract the main content from HTML documents.

# webfocus: HTML main-content extraction

## Installation
```bash
$ pip install webfocus
```

## Usage
### Command line
```bash
Usage: webfocus [OPTIONS] COMMAND [ARGS]...

  webfocus system. ---- Powered by qiulimao@2017.03

Options:
  --help  Show this message and exit.

Commands:
  extract  Extract the main content from the given URL
```
Currently, `extract` is the only available command.

```bash
Usage: webfocus extract [OPTIONS]

  Extract the main content from the given URL

Options:
  -u, --url TEXT   the target url
  -n, --shownoise  Output only the noise; defaults to False
  -t, --textonly   Output the main content without tags; defaults to False
  --help           Show this message and exit.
```
### Examples
```bash
$ webfocus extract
INPUT TARGET URL: <enter your URL>

>>>> the extracted content is printed with tags
```

```bash
$ webfocus extract -t
INPUT TARGET URL: <enter your URL>

>>>> the extracted content is printed without tags
```
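When the command needs to be driven from a script instead of typed interactively, the options listed in the help text can be assembled into an argument list. The sketch below is only an illustration: `build_extract_command` and `run_extract` are hypothetical helper names, not part of webfocus, and it assumes that passing `--url` suppresses the interactive `INPUT TARGET URL` prompt.

```python
import subprocess


def build_extract_command(url, text_only=False, show_noise=False):
    """Build the argv list for `webfocus extract` (hypothetical helper)."""
    # Only flags that appear in the `webfocus extract --help` output are used.
    cmd = ["webfocus", "extract", "--url", url]
    if text_only:
        cmd.append("--textonly")
    if show_noise:
        cmd.append("--shownoise")
    return cmd


def run_extract(url, **flags):
    """Run the CLI and return its raw output as bytes.

    Requires the `webfocus` executable to be on PATH.
    """
    return subprocess.check_output(build_extract_command(url, **flags))
```

Keeping the argument construction separate from the process call makes it easy to inspect the exact command before running it.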

### Usage in a program
```python
>>> from webfocus.extractor import extract_from_url, extract_from_htmlstring
>>> info, noise = extract_from_url(YOUR_URL, text_only=False)  # given a URL, return the main content with tags

>>> info, noise = extract_from_htmlstring(YOUR_HTML_STRING, text_only=True)  # given an HTML string, return the main content as plain text
```
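For processing several pages at once, the call above can be wrapped in a small loop. This is a minimal sketch, assuming only what the snippet above shows: `extract_from_htmlstring` accepts `text_only` and returns a `(content, noise)` pair. The helper name `extract_text_batch` and its `extract` parameter are hypothetical and exist so the loop can be tried without webfocus installed.

```python
def extract_text_batch(pages, extract=None):
    """Return {name: main_text} for a dict of {name: html_string}.

    `extract` is injectable (an assumption made for this sketch);
    by default it is webfocus's extract_from_htmlstring.
    """
    if extract is None:
        # Imported lazily: this line requires webfocus to be installed.
        from webfocus.extractor import extract_from_htmlstring as extract
    results = {}
    for name, html in pages.items():
        # The README shows a two-value return: (content, noise).
        info, _noise = extract(html, text_only=True)
        results[name] = info
    return results
```

Calling `extract_text_batch({"home": html})` with no second argument uses the real extractor; passing a stand-in function is handy for trying out the surrounding code.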
### Development log
* 2017.03.02 Works reliably on news pages and similar sites, but extraction from blog-style sites still needs improvement.

### FAQ

#### `Unicode strings with encoding declaration are not supported.`
Check whether the URL you are requesting blocks crawlers; the server may have returned an XML file instead of the page.
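To my knowledge this message comes from lxml, which refuses a text (unicode) string that begins with an XML declaration such as `<?xml version="1.0" encoding="UTF-8"?>`; whether webfocus relies on lxml internally is an assumption here. The sketch below shows one way to detect such a response before extraction. Both helper names are hypothetical, and stripping the declaration only avoids the error: if the server really returned an XML block page, there is still no article to extract.

```python
import re

# Matches an XML declaration at the start of the text,
# allowing a leading byte-order mark and whitespace.
_XML_DECL = re.compile(r"^[\ufeff\s]*<\?xml[^>]*\?>", re.IGNORECASE)


def has_xml_declaration(text):
    """Return True if the text starts with an XML declaration."""
    return _XML_DECL.match(text) is not None


def strip_xml_declaration(text):
    """Remove a leading XML declaration, if present."""
    return _XML_DECL.sub("", text, count=1)
```

A response for which `has_xml_declaration` returns True is worth inspecting by hand before handing it to `extract_from_htmlstring`.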

#### The extracted text seems to be all noise rather than the main content
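Check whether the main content of the page is generated by JavaScript at load time. If it is, it cannot be extracted from the raw HTML; OSChina (开源中国) blogs, for example, load their content this way.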
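A rough way to test for this before extraction is to compare how much visible text the raw HTML holds against how much script it carries. The following is a heuristic sketch, not something webfocus provides: the name `looks_js_rendered` and the 200-character threshold are assumptions, and the code uses the Python 3 standard library.

```python
from html.parser import HTMLParser


class _TextCounter(HTMLParser):
    """Count characters of visible text versus script/style text."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.visible_chars = 0
        self.script_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        if self._skip_depth:
            self.script_chars += n
        else:
            self.visible_chars += n


def looks_js_rendered(html, min_visible_chars=200):
    """Guess whether a page's text is filled in by JavaScript.

    True when the raw HTML holds little visible text and more
    script/style text than visible text. The threshold is arbitrary.
    """
    counter = _TextCounter()
    counter.feed(html)
    counter.close()
    return (counter.visible_chars < min_visible_chars
            and counter.script_chars > counter.visible_chars)
```

If this returns True for a page, the raw HTML likely does not contain the article, so the page would need to be rendered in a browser first.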

### Bug reports
Email: qiu_limao@163.com

Download files

Source Distribution

webfocus-0.13.tar.gz (20.0 kB), source

Built Distribution

webfocus-0.13-py2-none-any.whl (22.3 kB), Python 2
