A web parser wrapper on top of lxml and selectolax
A web content parser using Python lxml
Compatibility
-------------
The library is compatible with Python 3. Python 2 is currently not supported.
Usage
-----
Install the package using pip.
```
pip install webparser-py
```
**Convert to Document**
Takes HTML content and converts it to a document element. To convert relative links to absolute links,
pass the domain URL as the second argument.
**convert_to_doc()**
```
from webparser.parser import convert_to_doc
doc = convert_to_doc('HTML content', 'http://yourwebsite.com')
```
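To illustrate what converting relative links to absolute links means, here is a minimal sketch that uses only the Python standard library. It is not the library's implementation (which builds on lxml); the names `LinkCollector` and `absolute_links` are hypothetical and exist only for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Hypothetical helper: collect href/src values resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # urljoin resolves relative URLs and leaves absolute ones unchanged.
                self.links.append(urljoin(self.base_url, value))


def absolute_links(html, base_url):
    """Return every link found in `html` as an absolute URL."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links
```

With a base URL of `http://yourwebsite.com`, a link such as `/about` resolves to `http://yourwebsite.com/about`, while a link that is already absolute is returned as-is.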
**class FeedParser()**
The FeedParser class parses a feed either from response content or from a URL.
```
from webparser.parser import FeedParser
feed = FeedParser() # optional feed URL can be provided.
parsed_links = feed.parse(url='http://viralnova.com/feed') # url will override constructor feed URL.
```
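The return value of `FeedParser.parse()` is not documented here, though the variable name above suggests a collection of links. As a rough idea of what parsing a feed from response content can look like, the sketch below extracts item links from an RSS 2.0 string with the standard library. The function `extract_feed_links` is hypothetical and is not part of webparser-py.

```python
import xml.etree.ElementTree as ET


def extract_feed_links(feed_xml):
    """Hypothetical helper: return the <link> text of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(feed_xml)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link:
            links.append(link.strip())
    return links
```

The channel's own `<link>` element is skipped because only `<item>` children are inspected.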
**has_rss_feed()**
Check whether the website/URL has an RSS feed link.
- Check the MIME type of the document's links using XPath.
- Fuzzy URL search, e.g. /feed appended to the website URL (attempted only if no RSS links are found).
```
from webparser.parser import has_rss_feed
rss_links = has_rss_feed(doc=html_content, url=website_url)
```
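The two steps listed above can be sketched as follows. This is an assumption-laden illustration, not the library's code: it uses the standard library's `html.parser` instead of XPath, checks only the `application/rss+xml` MIME type, and builds the fallback `/feed` URL without requesting it. The names `FeedLinkFinder` and `find_feed_candidates` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

RSS_MIME_TYPE = "application/rss+xml"


class FeedLinkFinder(HTMLParser):
    """Hypothetical helper: collect <link> tags whose type attribute is the RSS MIME type."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == RSS_MIME_TYPE and attr.get("href"):
            self.feeds.append(urljoin(self.base_url, attr["href"]))


def find_feed_candidates(html, url):
    """Return declared RSS links, or a guessed /feed URL if none are declared."""
    finder = FeedLinkFinder(url)
    finder.feed(html)
    finder.close()
    if finder.feeds:
        return finder.feeds
    # Fallback guess only; a real check would still need to request this URL.
    return [url.rstrip("/") + "/feed"]
```

The fallback returns a candidate rather than a confirmed feed, so a caller would still have to fetch it and verify that the response is a feed.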