A web parser wrapper on top of lxml and selectolax
A web content parser using Python lxml
Compatibility
-------------
The library is compatible with Python 3. Python 2 is not currently supported.
Usage
-----
Install the package using pip.
```
pip install webparser-py
```
**Convert to Document**
Takes HTML content and converts it to a document element. To convert relative links to absolute links,
pass the site's base URL as the second argument.
**convert_to_doc()**
```
from webparser.parser import convert_to_doc
doc = convert_to_doc('HTML content', 'http://yourwebsite.com')
```
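To illustrate what converting relative links to absolute links means, here is a minimal standard-library sketch. It is not the implementation used by convert_to_doc(); the names `LinkCollector` and `absolute_links` are hypothetical and exist only for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects <a href> values resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # urljoin resolves relative paths and leaves absolute URLs unchanged.
                self.links.append(urljoin(self.base_url, value))


def absolute_links(html, base_url):
    """Return every anchor link in `html` as an absolute URL."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


html = '<a href="/about">About</a> <a href="http://other.com/x">Other</a>'
print(absolute_links(html, "http://yourwebsite.com"))
```

A relative link such as /about is resolved against the base URL, while a link that is already absolute is passed through as-is.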
**class FeedParser()**
The FeedParser class parses a feed, either from response content or from a URL.
```
from webparser.parser import FeedParser
feed = FeedParser() # optional feed URL can be provided.
parsed_links = feed.parse(url='http://viralnova.com/feed') # url will override constructor feed URL.
```
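The variable name parsed_links suggests that parsing yields the links of the feed entries. As a rough sketch of that idea, assuming an RSS 2.0 feed, the following standard-library code extracts item links from feed content. It does not reflect FeedParser's internals; `extract_item_links` is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def extract_item_links(feed_xml):
    """Hypothetical helper: return the <link> text of every <item> in an RSS 2.0 feed."""
    root = ET.fromstring(feed_xml)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link:
            links.append(link.strip())
    return links


sample_feed = """<rss version="2.0">
  <channel>
    <title>Example</title>
    <item><title>First</title><link>http://example.com/first</link></item>
    <item><title>Second</title><link>http://example.com/second</link></item>
  </channel>
</rss>"""

print(extract_item_links(sample_feed))
```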
**has_rss_feed()**
Check whether a website or URL exposes an RSS feed link.
- Check the MIME type of the document's link elements using XPath.
- Fuzzy URL search, e.g. appending /feed to the website URL. (Attempted only if no RSS links are found in the document.)
```
from webparser.parser import has_rss_feed
rss_links = has_rss_feed(doc=html_content, url=website_url)
```
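The two steps above can be sketched with the standard library as follows. This is an illustration under stated assumptions, not the code behind has_rss_feed(): it matches only the `application/rss+xml` MIME type, uses `html.parser` rather than XPath, and returns the guessed /feed URL without fetching it. `FeedLinkFinder` and `find_feed_candidates` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

RSS_MIME_TYPE = "application/rss+xml"  # assumption: only this MIME type is matched


class FeedLinkFinder(HTMLParser):
    """Hypothetical helper: collects href values of <link> tags typed as RSS."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        mime_type = (attributes.get("type") or "").lower()
        if mime_type == RSS_MIME_TYPE and attributes.get("href"):
            self.hrefs.append(attributes["href"])


def find_feed_candidates(html, url):
    """Return declared RSS links, or a guessed /feed URL if none are declared."""
    finder = FeedLinkFinder()
    finder.feed(html)
    if finder.hrefs:
        # Step 1: links declared in the document, resolved against the page URL.
        return [urljoin(url, href) for href in finder.hrefs]
    # Step 2: fuzzy fallback. The guess is not verified here; a real check
    # would request the URL and confirm that it serves a feed.
    return [url.rstrip("/") + "/feed"]


page = '<head><link rel="alternate" type="application/rss+xml" href="/feed.xml"></head>'
print(find_feed_candidates(page, "http://example.com"))
print(find_feed_candidates("<head></head>", "http://example.com/"))
```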