Collect data from various sources
Rollet
Rollet collects, standardizes and completes data from various sources.
Installation
PyPI
The safest way to install rollet is through pip:
python -m pip install rollet
How to use?
Command script
rollet {extract-txt,extract-csv,extract-json} path
[-h] [-o [OUTFILE]] [-l [LINK]] [-f [FIELDS]] [--start [START]]
[--size [SIZE]] [-t [TIMESLEEP]] [--timeout [TIMEOUT]]
[--blacklist [BLACKLIST]]
positional arguments:
{extract-txt,extract-csv,extract-json} Choose file type option extraction
path file path
optional arguments:
-h, --help show this help message and exit
-o [OUTFILE], --outfile output file path
-l [LINK], --link link field if csv or json
-f [FIELDS], --fields fields to keep separated by comma
--start [START] number of rows to skip
--size [SIZE] max number of rows to keep
-t [TIMESLEEP], --timesleep sleep time in seconds between two pulling
--timeout [TIMEOUT] Max GET request timeout in second
--blacklist [BLACKLIST] 0 (do not use), 1 (use), path (one column domain blacklist file)
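When the command is launched from another Python program, the options listed above can be assembled into an argument list. The following is a minimal sketch, not part of rollet: `build_rollet_command` is a hypothetical helper, the file names are placeholders, and using `title` and `abstract` as `--fields` values is an assumption.

```python
import shlex


def build_rollet_command(mode, path, outfile=None, link=None, fields=None,
                         start=None, size=None, timesleep=None,
                         timeout=None, blacklist=None):
    """Build an argv list for the rollet CLI from the documented options."""
    modes = ("extract-txt", "extract-csv", "extract-json")
    if mode not in modes:
        raise ValueError(f"mode must be one of {modes}")
    cmd = ["rollet", mode, str(path)]
    options = [
        ("--outfile", outfile),
        ("--link", link),
        # --fields expects a single comma-separated value
        ("--fields", ",".join(fields) if fields else None),
        ("--start", start),
        ("--size", size),
        ("--timesleep", timesleep),
        ("--timeout", timeout),
        ("--blacklist", blacklist),
    ]
    for flag, value in options:
        if value is not None:  # skip options the caller did not set
            cmd += [flag, str(value)]
    return cmd


# Placeholder file names and field names (assumptions, not rollet defaults).
cmd = build_rollet_command(
    "extract-csv", "links.csv",
    outfile="collected.out", link="url",
    fields=["title", "abstract"], size=100, timesleep=1,
)
print(shlex.join(cmd))
# To execute it once rollet is installed:
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing the list to `subprocess.run` rather than a single shell string avoids quoting problems with paths that contain spaces.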
Python
Basic usage
from rollet import get_content
from rollet.extractor import BaseExtractor
url = 'https://example.url.com/content-id'
content_dict = get_content(url)
content_object = BaseExtractor(url)
content_object.title # Title
content_object.abstract # Abstract
content_object.lang # Language
content_object.content_type # Type (pdf, json, html, ...)
content_object.to_dict() # Same as get_content
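To process several URLs from Python, a small loop with a pause between requests mirrors what the --timesleep option does on the command line. The sketch below is an illustration under assumptions: `collect` and `fake_fetch` are hypothetical names, the fetch function is passed in so the example runs offline, and whether `get_content` raises an exception on failure is not documented here.

```python
import time


def collect(urls, fetch, delay=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing `delay` seconds between calls.

    Returns (results, errors): results maps url -> fetched value,
    errors maps url -> the exception raised for that URL.
    """
    results, errors = {}, {}
    for i, url in enumerate(urls):
        if i > 0 and delay > 0:
            sleep(delay)  # pause between two consecutive requests
        try:
            results[url] = fetch(url)
        except Exception as exc:  # record the failure and keep going
            errors[url] = exc
    return results, errors


# Stand-in fetcher so the sketch runs without network access.
def fake_fetch(url):
    if "bad" in url:
        raise ValueError("unreachable")
    return {"title": f"Title of {url}"}


results, errors = collect(
    ["https://example.url.com/a", "https://example.url.com/bad"],
    fake_fetch,
    delay=0,
)
```

With rollet installed, passing `get_content` in place of `fake_fetch` would give one content dictionary per URL.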
Custom extractors
class CustomExtractor(BaseExtractor):
    @property
    def title(self):
        return self._page.find('title')
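The pattern in CustomExtractor is a property override: the subclass redefines one field and inherits everything else. The sketch below shows that pattern in isolation. `StandInExtractor` is a hypothetical stand-in, not rollet's BaseExtractor; it parses a string with `xml.etree.ElementTree` instead of fetching a URL, and the real type of `_page` in rollet is not stated here.

```python
import xml.etree.ElementTree as ET


class StandInExtractor:
    """Hypothetical stand-in for rollet's BaseExtractor (not the real class)."""

    def __init__(self, markup):
        # The real extractor fetches a URL; here we parse a string instead.
        self._page = ET.fromstring(markup)

    @property
    def title(self):
        return None  # default: no title extracted

    def to_dict(self):
        # Reads the property, so subclass overrides are picked up.
        return {"title": self.title}


class CustomExtractor(StandInExtractor):
    @property
    def title(self):
        node = self._page.find(".//title")
        # find() returns an element (or None), so extract and clean its text.
        return node.text.strip() if node is not None and node.text else None


doc = "<html><head><title> Example page </title></head><body/></html>"
extractor = CustomExtractor(doc)
print(extractor.to_dict())
```

Because `to_dict` reads `self.title`, overriding the property is enough to change the exported value without touching the export logic.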
PDF extractors
PDF extraction requires a Grobid service.
The following assumes the Grobid API runs on http://localhost:8070:
from rollet import grobid_service, get_content
from rollet.extractor import PDFExtractor
grobid_service('localhost', '8070')
url = 'https://example.url.com/pdf-content-id'
content_dict = get_content(url)
pdf_content_object = PDFExtractor(url)
Reading a PDF with BaseExtractor will instantiate a PDFExtractor object.
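Since PDF extraction depends on a running Grobid server, it can help to check that the service responds before extracting. The sketch below uses only the standard library and Grobid's `/api/isalive` health endpoint. `grobid_base_url` and `grobid_is_alive` are hypothetical helpers, not part of rollet, and the 5-second timeout is an arbitrary choice.

```python
import urllib.error
import urllib.request


def grobid_base_url(host, port, scheme="http"):
    """Build the service base URL, e.g. http://localhost:8070."""
    return f"{scheme}://{host}:{port}"


def grobid_is_alive(base_url, timeout=5.0):
    """Return True if GET <base_url>/api/isalive answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/isalive",
                                    timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure or HTTP error status.
        return False


base_url = grobid_base_url("localhost", "8070")
# Example check before calling grobid_service('localhost', '8070'):
#   if not grobid_is_alive(base_url): raise RuntimeError("Grobid is not reachable")
```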
And More!