Collect data from various sources
Rollet
Rollet collects, standardizes and completes data from various sources.
Installation
PyPI
The safest way to install rollet is through pip:
python -m pip install rollet
How to use?
Command script
rollet {extract-txt,extract-csv,extract-json} path
       [-h] [-o [OUTFILE]] [-l [LINK]] [-f [FIELDS]] [--start [START]]
       [--size [SIZE]] [-t [TIMESLEEP]] [--timeout [TIMEOUT]]
       [--blacklist [BLACKLIST]]

positional arguments:
  {extract-txt,extract-csv,extract-json}  Choose file type option extraction
  path                                    file path

optional arguments:
  -h, --help                   show this help message and exit
  -o [OUTFILE], --outfile      output file path
  -l [LINK], --link            link field if csv or json
  -f [FIELDS], --fields        fields to keep separated by comma
  --start [START]              number of rows to skip
  --size [SIZE]                max number of rows to keep
  -t [TIMESLEEP], --timesleep  sleep time in seconds between two pulling
  --timeout [TIMEOUT]          Max GET request timeout in second
  --blacklist [BLACKLIST]      0 (do not use), 1 (use), path (one column domain blacklist file)
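The options above can also be assembled from Python before handing them to the command line. The following is a minimal sketch, not part of rollet: the helper name build_rollet_command and its keyword arguments are hypothetical, and only the subcommands and flags listed in the help text are used. It builds the argument list without running anything.

```python
from typing import List, Optional, Sequence

# Subcommands taken from the help text above.
MODES = ("extract-txt", "extract-csv", "extract-json")


def build_rollet_command(
    mode: str,
    path: str,
    outfile: Optional[str] = None,
    link: Optional[str] = None,
    fields: Optional[Sequence[str]] = None,
    start: Optional[int] = None,
    size: Optional[int] = None,
    timesleep: Optional[float] = None,
    timeout: Optional[float] = None,
    blacklist: Optional[str] = None,
) -> List[str]:
    """Hypothetical helper: build an argument list for the rollet CLI."""
    if mode not in MODES:
        raise ValueError(f"mode must be one of {MODES}, got {mode!r}")
    cmd = ["rollet", mode, path]
    if outfile is not None:
        cmd += ["-o", outfile]
    if link is not None:
        cmd += ["-l", link]
    if fields:
        # The help text asks for fields separated by commas.
        cmd += ["-f", ",".join(fields)]
    if start is not None:
        cmd += ["--start", str(start)]
    if size is not None:
        cmd += ["--size", str(size)]
    if timesleep is not None:
        cmd += ["-t", str(timesleep)]
    if timeout is not None:
        cmd += ["--timeout", str(timeout)]
    if blacklist is not None:
        # "0", "1", or a path to a one-column domain blacklist file.
        cmd += ["--blacklist", blacklist]
    return cmd


command = build_rollet_command(
    "extract-csv", "links.csv", outfile="out.csv", link="url",
    fields=["title", "abstract"], timesleep=1,
)
print(command)
```

Once rollet is installed, the resulting list could be passed to subprocess.run(command, check=True). The file names links.csv and out.csv and the column name url are placeholders.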
Python
Basic usage
from rollet import get_content
from rollet.extractor import BaseExtractor
url = 'https://example.url.com/content-id'
content_dict = get_content(url)
content_object = BaseExtractor(url)
content_object.title # Title
content_object.abstract # Abstract
content_object.lang # Language
content_object.content_type # Type (pdf, json, html, ...)
content_object.to_dict() # Same as get_content
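To process several URLs from Python, in the spirit of the command script's --fields and --timesleep options, a small loop is enough. This is a minimal sketch under stated assumptions: the helper collect and its parameters are hypothetical, the fetch function is passed in so that get_content can be supplied by the caller, and the dictionary keys are assumed to match the attribute names shown above (title, abstract, lang, ...).

```python
import time
from typing import Callable, Dict, Iterable, List, Optional, Sequence


def collect(
    urls: Iterable[str],
    fetch: Callable[[str], Dict],
    fields: Optional[Sequence[str]] = None,
    timesleep: float = 0.0,
) -> List[Dict]:
    """Hypothetical helper: fetch each URL and keep only selected fields."""
    rows = []
    for index, url in enumerate(urls):
        # Pause between two pulls, but not before the first one.
        if index > 0 and timesleep > 0:
            time.sleep(timesleep)
        record = fetch(url)
        if fields is not None:
            # Missing keys become None instead of raising KeyError.
            record = {name: record.get(name) for name in fields}
        rows.append(record)
    return rows


# Stand-in for rollet.get_content so the sketch runs without network access.
def fake_fetch(url: str) -> Dict:
    return {"title": url.upper(), "abstract": "", "lang": "en"}


rows = collect(["a", "b"], fake_fetch, fields=["title", "lang"])
print(rows)
```

With rollet installed, collect(urls, get_content, fields=["title"], timesleep=1) would replace the stand-in fetch function.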
Custom extractors
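from rollet.extractor import BaseExtractor

class CustomExtractor(BaseExtractor):
    # Override the title property with a custom lookup on self._page
    @property
    def title(self):
        return self._page.find('title')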
PDF extractors
PDF extraction requires a running Grobid service.
The example below assumes the Grobid API runs on http://localhost:8070:
from rollet import grobid_service, get_content
from rollet.extractor import PDFExtractor
grobid_service('localhost', '8070')
url = 'https://example.url.com/pdf-content-id'
content_dict = get_content(url)
pdf_content_object = PDFExtractor(url)
Reading a PDF with BaseExtractor will instantiate a PDFExtractor object.
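Since PDF extraction depends on Grobid, it can help to check that the service is reachable before configuring it. The sketch below is not part of rollet: the function names are hypothetical, and it assumes the Grobid server exposes its /api/isalive health endpoint on the same host and port used above.

```python
import http.client
import urllib.error
import urllib.request


def grobid_isalive_url(host: str, port: str) -> str:
    """Build the health-check URL (assumes Grobid's /api/isalive endpoint)."""
    return f"http://{host}:{port}/api/isalive"


def grobid_is_alive(host: str = "localhost", port: str = "8070",
                    timeout: float = 2.0) -> bool:
    """Return True if the service answers the health check with HTTP 200."""
    try:
        with urllib.request.urlopen(grobid_isalive_url(host, port),
                                    timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, DNS failure, or malformed response.
        return False


print(grobid_isalive_url("localhost", "8070"))
```

A caller could then run grobid_service('localhost', '8070') only when grobid_is_alive() returns True, and report a clear error otherwise.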
And More!
Download files
Source Distribution: rollet-0.1.5.tar.gz (66.7 kB)
Built Distribution: rollet-0.1.5-py3-none-any.whl (66.4 kB)