Collect data from various sources
Project description
Rollet
Rollet collects, standardizes and completes data from various sources.
Installation
Pypi
The safest way to install rollet is through pip:
python -m pip install rollet
How to use?
Command script
rollet {extract-txt,extract-csv,extract-json} path
       [-h] [-o [OUTFILE]] [-l [LINK]] [-f [FIELDS]] [--start [START]]
       [--size [SIZE]] [-t [TIMESLEEP]] [--timeout [TIMEOUT]]
       [--blacklist [BLACKLIST]]

positional arguments:
  {extract-txt,extract-csv,extract-json}
                              Choose file type option extraction
  path                        file path

optional arguments:
  -h, --help                  show this help message and exit
  -o [OUTFILE], --outfile     output file path
  -l [LINK], --link           link field if csv or json
  -f [FIELDS], --fields       fields to keep separated by comma
  --start [START]             number of rows to skip
  --size [SIZE]               max number of rows to keep
  -t [TIMESLEEP], --timesleep sleep time in seconds between two pulling
  --timeout [TIMEOUT]         Max GET request timeout in second
  --blacklist [BLACKLIST]     0 (do not use), 1 (use), path (one column domain blacklist file)
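To make the option list above more concrete, here is a minimal sketch that assembles a rollet command line from Python. It only uses the flags shown in the help text; the helper name `build_rollet_command`, the file names (`links.csv`, `out.json`), the link column `url` and the field names are hypothetical placeholders chosen for illustration.

```python
import shlex


def build_rollet_command(mode, path, outfile=None, link=None, fields=None,
                         start=None, size=None, timesleep=None, timeout=None,
                         blacklist=None):
    """Return an argv list for rollet, using only flags from the help text."""
    modes = ("extract-txt", "extract-csv", "extract-json")
    if mode not in modes:
        raise ValueError(f"mode must be one of {modes}")
    cmd = ["rollet", mode, path]
    options = [
        ("-o", outfile),
        ("-l", link),
        # -f expects the fields separated by commas
        ("-f", ",".join(fields) if fields else None),
        ("--start", start),
        ("--size", size),
        ("-t", timesleep),
        ("--timeout", timeout),
        ("--blacklist", blacklist),
    ]
    for flag, value in options:
        if value is not None:  # skip options that were not set
            cmd += [flag, str(value)]
    return cmd


# Hypothetical input: a CSV file whose "url" column holds the links.
cmd = build_rollet_command(
    "extract-csv", "links.csv",
    outfile="out.json", link="url",
    fields=["title", "abstract"], size=100, timesleep=1,
)
print(shlex.join(cmd))
# rollet extract-csv links.csv -o out.json -l url -f title,abstract --size 100 -t 1
```

Passing the resulting list to `subprocess.run(cmd, check=True)` would execute it, assuming rollet is installed and available on the PATH.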
Python
Basic usage
from rollet import get_content
from rollet.extractor import BaseExtractor
url = 'https://example.url.com/content-id'
content_dict = get_content(url)
content_object = BaseExtractor(url)
content_object.title # Title
content_object.abstract # Abstract
content_object.lang # Language
content_object.content_type # Type (pdf, json, html, ...)
content_object.to_dict() # Same as get_content
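For more than one URL, the same call can be wrapped in a small loop. The sketch below is not part of rollet: `collect` and `fake_fetch` are hypothetical names, and the record keys (`title`, `abstract`, `lang`) are assumed from the attributes listed above. The fetch function is passed in as a parameter, so the example runs without network access; in real use it would be `get_content`.

```python
import time


def collect(urls, fetch, fields=None, timesleep=0.0):
    """Call fetch(url) for each URL and return one dict per URL.

    fetch is assumed to behave like rollet.get_content: URL in, dict out.
    """
    rows = []
    for i, url in enumerate(urls):
        if i and timesleep:
            time.sleep(timesleep)  # pause between two consecutive requests
        try:
            record = dict(fetch(url))
        except Exception as exc:
            # Keep going on failure and record the error for this URL
            rows.append({"url": url, "error": str(exc)})
            continue
        if fields is not None:
            # Keep only the requested fields (missing ones become None)
            record = {key: record.get(key) for key in fields}
        record["url"] = url
        rows.append(record)
    return rows


def fake_fetch(url):
    """Stand-in for get_content so that the sketch runs offline."""
    if "broken" in url:
        raise ValueError("unreachable")
    return {"title": f"Title of {url}", "abstract": "...", "lang": "en"}


rows = collect(
    ["https://example.url.com/a", "https://example.url.com/broken"],
    fake_fetch,
    fields=["title", "lang"],
)
for row in rows:
    print(row)
```

With rollet installed, `collect(urls, get_content, timesleep=1)` would be the equivalent call, under the assumption that `get_content` returns a dictionary.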
Custom extractors
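from rollet.extractor import BaseExtractor

class CustomExtractor(BaseExtractor):
    # Override a property to change how that field is extracted
    @property
    def title(self):
        return self._page.find('title')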
PDF extractors
PDF extraction requires a Grobid service.
Assuming the Grobid API runs on http://localhost:8070:
from rollet import grobid_service, get_content
from rollet.extractor import PDFExtractor
grobid_service('localhost', '8070')
url = 'https://example.url.com/pdf-content-id'
content_dict = get_content(url)
pdf_content_object = PDFExtractor(url)
Reading a PDF with BaseExtractor will instantiate a PDFExtractor object.
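Since PDF extraction depends on a running Grobid service, it can help to check that the service answers before registering it. The sketch below is not part of rollet; it assumes that Grobid exposes its usual `/api/isalive` health endpoint on the given host and port, and the helper names `grobid_alive_url` and `grobid_is_alive` are hypothetical.

```python
from urllib.error import URLError
from urllib.request import urlopen


def grobid_alive_url(host="localhost", port="8070"):
    """Build the health-check URL (assumes Grobid's /api/isalive endpoint)."""
    return f"http://{host}:{port}/api/isalive"


def grobid_is_alive(host="localhost", port="8070", timeout=3.0):
    """Return True if the service answers with HTTP 200, False otherwise."""
    try:
        with urlopen(grobid_alive_url(host, port), timeout=timeout) as response:
            return response.status == 200
    except (URLError, OSError):
        # Connection refused, DNS failure, timeout, HTTP error, ...
        return False


print(grobid_alive_url())

# Possible use before the rollet calls shown above:
# if grobid_is_alive("localhost", "8070"):
#     grobid_service("localhost", "8070")
```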
And More!
Download files
Source Distribution
rollet-0.2.0a2.tar.gz (68.1 kB)
Built Distribution
rollet-0.2.0a2-py3-none-any.whl (67.3 kB)
File details
Details for the file rollet-0.2.0a2.tar.gz.
File metadata
- Download URL: rollet-0.2.0a2.tar.gz
- Upload date:
- Size: 68.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.9
File hashes
Algorithm | Hash digest
---|---
SHA256 | bd8754b63378baad023489f38652bad8ae3ec8395630b00bb4b239f193ea5f88
MD5 | 70417f503dac9fb76a7ce452842a2bae
BLAKE2b-256 | 7a026263e01cb551a2517fed8d9ad30b392eb708e08926423e97dcbc35eac4f3
File details
Details for the file rollet-0.2.0a2-py3-none-any.whl.
File metadata
- Download URL: rollet-0.2.0a2-py3-none-any.whl
- Upload date:
- Size: 67.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.9
File hashes
Algorithm | Hash digest
---|---
SHA256 | 3285f3b8b2811b4324de76cf0a940b14b372e20f96e2a58742366b569aa16c76
MD5 | 450a60df233c3c91a83c8d19b105a7df
BLAKE2b-256 | 6f7a555396010303072ec9c97ea0a0f9f0bc1ffcb0fb48d176a9507cecbaa9db