Collect data from various sources
Rollet
Rollet collects, standardizes, and completes data from various sources.
Installation
PyPI
The safest way to install rollet is with pip:
python -m pip install rollet
How to use?
Command script
rollet {extract-txt,extract-csv,extract-json} path
       [-h] [-o [OUTFILE]] [-l [LINK]] [-f [FIELDS]] [--start [START]]
       [--size [SIZE]] [-t [TIMESLEEP]] [--timeout [TIMEOUT]]
       [--blacklist [BLACKLIST]]

positional arguments:
  {extract-txt,extract-csv,extract-json}
                                Choose file type option extraction
  path                          file path

optional arguments:
  -h, --help                    show this help message and exit
  -o [OUTFILE], --outfile       output file path
  -l [LINK], --link             link field if csv or json
  -f [FIELDS], --fields         fields to keep separated by comma
  --start [START]               number of rows to skip
  --size [SIZE]                 max number of rows to keep
  -t [TIMESLEEP], --timesleep   sleep time in seconds between two pulling
  --timeout [TIMEOUT]           Max GET request timeout in second
  --blacklist [BLACKLIST]       0 (do not use), 1 (use), path (one column domain blacklist file)
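The options above can be assembled programmatically. The following is a minimal sketch, not part of rollet itself: `build_rollet_command` is a hypothetical helper, and the file names (`articles.csv`, `articles_out.csv`), the link column (`url`), and the field names are assumptions chosen for illustration. Only the flags listed in the help text are used.

```python
import shlex
from typing import List, Optional, Sequence

# Sub-commands listed in the help text above.
MODES = ("extract-txt", "extract-csv", "extract-json")


def build_rollet_command(
    mode: str,
    path: str,
    outfile: Optional[str] = None,
    link: Optional[str] = None,
    fields: Optional[Sequence[str]] = None,
    start: Optional[int] = None,
    size: Optional[int] = None,
    timesleep: Optional[float] = None,
    timeout: Optional[float] = None,
) -> List[str]:
    """Hypothetical helper: build an argument list for the rollet CLI."""
    if mode not in MODES:
        raise ValueError(f"mode must be one of {MODES}, got {mode!r}")
    cmd = ["rollet", mode, path]
    if outfile is not None:
        cmd += ["-o", outfile]
    if link is not None:
        cmd += ["-l", link]
    if fields:
        # The help text says fields are separated by commas.
        cmd += ["-f", ",".join(fields)]
    if start is not None:
        cmd += ["--start", str(start)]
    if size is not None:
        cmd += ["--size", str(size)]
    if timesleep is not None:
        cmd += ["-t", str(timesleep)]
    if timeout is not None:
        cmd += ["--timeout", str(timeout)]
    return cmd


command = build_rollet_command(
    "extract-csv",
    "articles.csv",              # assumed input file
    outfile="articles_out.csv",  # assumed output file
    link="url",                  # assumed name of the link column
    fields=["title", "abstract"],
    size=100,
    timesleep=1,
)
print(shlex.join(command))
# rollet extract-csv articles.csv -o articles_out.csv -l url -f title,abstract --size 100 -t 1
```

Passing the resulting list to `subprocess.run(command, check=True)` would execute it, provided rollet is installed and the input file exists.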
Python
Basic usage
from rollet import get_content
from rollet.extractor import BaseExtractor
url = 'https://example.url.com/content-id'
content_dict = get_content(url)
content_object = BaseExtractor(url)
content_object.title # Title
content_object.abstract # Abstract
content_object.lang # Language
content_object.content_type # Type (pdf, json, html, ...)
content_object.to_dict() # Same as get_content
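When several URLs need to be processed, the basic usage above can be wrapped in a small loop. This is a sketch under assumptions rather than a rollet feature: `collect_contents` is a hypothetical helper, it assumes `get_content` returns a dictionary (as the variable name `content_dict` suggests), and it catches any exception because the errors rollet may raise are not documented here. The pause between calls mirrors the `--timesleep` option of the command script.

```python
import time
from typing import Any, Callable, Dict, Iterable, List, Optional


def collect_contents(
    urls: Iterable[str],
    fetch: Optional[Callable[[str], Any]] = None,
    timesleep: float = 1.0,
) -> List[Dict[str, Any]]:
    """Hypothetical helper: fetch each URL in turn and keep going on failure."""
    if fetch is None:
        # Imported lazily so the helper can be exercised with a stub fetcher.
        from rollet import get_content
        fetch = get_content

    results: List[Dict[str, Any]] = []
    for index, url in enumerate(urls):
        if index > 0 and timesleep > 0:
            time.sleep(timesleep)  # be polite between two requests
        try:
            results.append({"url": url, "content": fetch(url), "error": None})
        except Exception as exc:  # assumption: error types are not documented
            results.append({"url": url, "content": None, "error": repr(exc)})
    return results
```

A call such as `collect_contents(['https://example.url.com/content-id'])` would then return one record per URL, with either the extracted content or the error that occurred.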
Custom extractors
from rollet.extractor import BaseExtractor

class CustomExtractor(BaseExtractor):
    @property
    def title(self):
        # Override how the title is read from the parsed page
        return self._page.find('title')
PDF extractors
PDF extraction requires a running Grobid service.
Assuming the Grobid API runs on http://localhost:8070:
from rollet import grobid_service, get_content
from rollet.extractor import PDFExtractor
grobid_service('localhost', '8070')
url = 'https://example.url.com/pdf-content-id'
content_dict = get_content(url)
pdf_content_object = PDFExtractor(url)
Reading a PDF with BaseExtractor will instantiate a PDFExtractor object.
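Because PDF extraction depends on Grobid being reachable, it can help to check the service before configuring rollet. The sketch below is not part of rollet: `grobid_is_alive` and `grobid_alive_url` are hypothetical helpers that query Grobid's `/api/isalive` health endpoint using only the standard library, and the host and port defaults simply repeat the values assumed above.

```python
import http.client
import urllib.request


def grobid_alive_url(host: str, port: str) -> str:
    """Build the URL of Grobid's health-check endpoint."""
    return f"http://{host}:{port}/api/isalive"


def grobid_is_alive(host: str = "localhost", port: str = "8070",
                    timeout: float = 5.0) -> bool:
    """Hypothetical helper: return True if the Grobid service answers."""
    try:
        with urllib.request.urlopen(grobid_alive_url(host, port),
                                    timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, DNS failure, or HTTP error status.
        return False
```

A typical use would be to call `grobid_service('localhost', '8070')` only when `grobid_is_alive('localhost', '8070')` returns `True`, and to report a clear message otherwise.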
And More!
Download files
File details
Details for the file rollet-0.1.5.tar.gz.
File metadata
- Download URL: rollet-0.1.5.tar.gz
- Upload date:
- Size: 66.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.1 importlib_metadata/3.10.0 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.0 CPython/3.7.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 322270401955942af3d36c62c54f6e41f5902269c6e64886345c46c3d40ff8e4 |
| MD5 | 51dd90a081dd2aee86013255583a52ad |
| BLAKE2b-256 | a5305bcdcb4e1397b213508c592fda693603ffa24e187dca88d5d230e7f4c5bd |
File details
Details for the file rollet-0.1.5-py3-none-any.whl.
File metadata
- Download URL: rollet-0.1.5-py3-none-any.whl
- Upload date:
- Size: 66.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.1 importlib_metadata/3.10.0 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.0 CPython/3.7.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a983c5b4f359ac8bdbdee7953b13a5b966e3ef22c5d9c9b65bc0b1f19b826920 |
| MD5 | 1e3ef0e8d0aeae28fd66a3022c698531 |
| BLAKE2b-256 | ccdc80cf0e514393a13042dbbcd7892243e85572675539f58f893ed8132b5464 |