Collect data from various sources

Rollet

Rollet collects, standardizes and completes data from various sources.

Installation

PyPI

The safest way to install rollet is through pip:

python -m pip install rollet

How to use?

Command script

rollet {extract-txt,extract-csv,extract-json} path
       [-h] [-o [OUTFILE]] [-l [LINK]] [-f [FIELDS]] [--start [START]]
       [--size [SIZE]] [-t [TIMESLEEP]] [--timeout [TIMEOUT]]
       [--blacklist [BLACKLIST]]

positional arguments:
  {extract-txt,extract-csv,extract-json} Choose file type option extraction
  path                                   file path

optional arguments:
  -h, --help                    show this help message and exit
  -o [OUTFILE], --outfile       output file path
  -l [LINK], --link             link field if csv or json
  -f [FIELDS], --fields         fields to keep separated by comma
  --start [START]               number of rows to skip
  --size [SIZE]                 max number of rows to keep
  -t [TIMESLEEP], --timesleep   sleep time in seconds between two pulling
  --timeout [TIMEOUT]           Max GET request timeout in second
  --blacklist [BLACKLIST]       0 (do not use), 1 (use), path (one column domain blacklist file)
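
When the command script is driven from Python, the options listed above can be assembled into an argument list. The following is a minimal sketch, assuming the `rollet` script is on the PATH; `build_rollet_command` is a hypothetical helper (not part of rollet), the file and field names are illustrative, and only flags shown in the help text are used.

```python
import shlex

# Extraction modes listed in the help text above.
ALLOWED_MODES = {"extract-txt", "extract-csv", "extract-json"}


def build_rollet_command(mode, path, outfile=None, link=None, fields=None,
                         start=None, size=None, timesleep=None, timeout=None):
    """Hypothetical helper: build an argument list for the rollet script."""
    if mode not in ALLOWED_MODES:
        raise ValueError(f"unknown extraction mode: {mode!r}")
    cmd = ["rollet", mode, path]
    options = [
        ("-o", outfile),                                  # output file path
        ("-l", link),                                     # link field (csv or json)
        ("-f", ",".join(fields) if fields else None),     # comma-separated fields
        ("--start", start),                               # rows to skip
        ("--size", size),                                 # max rows to keep
        ("-t", timesleep),                                # seconds between requests
        ("--timeout", timeout),                           # GET timeout in seconds
    ]
    for flag, value in options:
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


# Illustrative values only: "links.csv", "output_file" and "url" are assumptions.
cmd = build_rollet_command(
    "extract-csv", "links.csv",
    outfile="output_file", link="url",
    fields=["title", "abstract"], size=100, timesleep=1,
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; building a list rather than a shell string avoids quoting problems with paths that contain spaces.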

Python

Basic usage

from rollet import get_content
from rollet.extractor import BaseExtractor

url = 'https://example.url.com/content-id'

content_dict = get_content(url)

content_object = BaseExtractor(url)
content_object.title            # Title
content_object.abstract         # Abstract
content_object.lang             # Language
content_object.content_type     # Type (pdf, json, html, ...)
content_object.to_dict()        # Same as get_content

Custom extractors

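from rollet.extractor import BaseExtractor

class CustomExtractor(BaseExtractor):

    @property
    def title(self):
        # Override the default title lookup with the page's <title> element
        return self._page.find('title')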
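
The property-override idea can be tried without network access. The sketch below uses a stand-in base class (`FakeBaseExtractor`, hypothetical and not part of rollet) whose `to_dict` reads its values through properties, so a subclass that overrides one property changes the exported dictionary. How rollet's own `BaseExtractor` builds its dictionary is not described here, so this only illustrates the pattern.

```python
class FakeBaseExtractor:
    """Hypothetical stand-in for rollet's BaseExtractor (NOT the real class)."""

    def __init__(self, page):
        # Assumption: the parsed page is kept in a private `_page` attribute.
        self._page = page

    @property
    def title(self):
        return self._page.get("title")

    @property
    def lang(self):
        return self._page.get("lang")

    def to_dict(self):
        # Read through the properties so subclass overrides are picked up.
        return {"title": self.title, "lang": self.lang}


class UpperTitleExtractor(FakeBaseExtractor):
    """Overrides a single field and inherits the rest."""

    @property
    def title(self):
        raw = super().title
        return raw.upper() if raw else raw


page = {"title": "An example page", "lang": "en"}
result = UpperTitleExtractor(page).to_dict()
print(result)
```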

PDF extractors

PDF extraction requires a running Grobid service.
The example below assumes the Grobid API runs on http://localhost:8070:

from rollet import grobid_service, get_content
from rollet.extractor import PDFExtractor

grobid_service('localhost', '8070')

url = 'https://example.url.com/pdf-content-id'

content_dict = get_content(url)

pdf_content_object = PDFExtractor(url)

Reading a PDF with BaseExtractor will instantiate a PDFExtractor object.
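
Because PDF extraction depends on Grobid, it can help to check that the service is reachable before registering it. This is a minimal sketch using only the standard library; it assumes Grobid exposes its usual `/api/isalive` health endpoint on the configured host and port, and the helper names `grobid_alive_url` and `grobid_is_alive` are hypothetical, not part of rollet.

```python
import urllib.error
import urllib.request


def grobid_alive_url(host, port):
    """Build the health-check URL (assumes Grobid's /api/isalive endpoint)."""
    return f"http://{host}:{port}/api/isalive"


def grobid_is_alive(host="localhost", port="8070", timeout=2.0):
    """Return True if the Grobid service answers its health check with HTTP 200."""
    try:
        with urllib.request.urlopen(grobid_alive_url(host, port),
                                    timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout or HTTP error.
        return False
```

A typical use would be to call `grobid_is_alive()` first and only then call `grobid_service('localhost', '8070')`, so that a missing service is reported before any PDF is fetched.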

And More!

Release history

0.2.0a2 (pre-release): Jul 31, 2024
0.2.0a1 (pre-release): Jul 29, 2024
0.2.0a0 (pre-release): Jul 28, 2024
0.1.5 (this version): Apr 15, 2022
0.1.4: Mar 8, 2022
0.1.3: Dec 20, 2021
0.1.2: Dec 15, 2021
0.1.1: Nov 16, 2021
0.1.0: Oct 4, 2021
0.0.4a2 (pre-release): Sep 30, 2021
0.0.3a1 (pre-release): Sep 21, 2021
0.0.2a4 (pre-release): Jun 23, 2021
0.0.2a0 (pre-release): Jun 4, 2021
0.0.1a9 (pre-release): Jun 1, 2021
0.0.1a8 (pre-release): May 31, 2021
0.0.1a7 (pre-release): May 29, 2021

Download files

Source Distribution

rollet-0.1.5.tar.gz (66.7 kB)

Uploaded Apr 15, 2022 Source

Built Distribution

rollet-0.1.5-py3-none-any.whl (66.4 kB)

Uploaded Apr 15, 2022 Python 3

File details

Details for the file rollet-0.1.5.tar.gz.

File metadata

Download URL: rollet-0.1.5.tar.gz
Upload date: Apr 15, 2022
Size: 66.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.4.1 importlib_metadata/3.10.0 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.0 CPython/3.7.3

File hashes

Hashes for rollet-0.1.5.tar.gz
Algorithm	Hash digest
SHA256	`322270401955942af3d36c62c54f6e41f5902269c6e64886345c46c3d40ff8e4`
MD5	`51dd90a081dd2aee86013255583a52ad`
BLAKE2b-256	`a5305bcdcb4e1397b213508c592fda693603ffa24e187dca88d5d230e7f4c5bd`

File details

Details for the file rollet-0.1.5-py3-none-any.whl.

File metadata

Download URL: rollet-0.1.5-py3-none-any.whl
Upload date: Apr 15, 2022
Size: 66.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.4.1 importlib_metadata/3.10.0 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.0 CPython/3.7.3

File hashes

Hashes for rollet-0.1.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a983c5b4f359ac8bdbdee7953b13a5b966e3ef22c5d9c9b65bc0b1f19b826920`
MD5	`1e3ef0e8d0aeae28fd66a3022c698531`
BLAKE2b-256	`ccdc80cf0e514393a13042dbbcd7892243e85572675539f58f893ed8132b5464`
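
The SHA256 digests listed above can be used to check a downloaded file before installing it. The sketch below is one way to do this with the standard library; the helper names (`sha256_of_file`, `matches_expected`) are hypothetical, and the small self-check at the end uses a temporary file rather than a real rollet download.

```python
import hashlib
import os
import tempfile

# SHA256 digests copied from the file details above.
EXPECTED_SHA256 = {
    "rollet-0.1.5.tar.gz":
        "322270401955942af3d36c62c54f6e41f5902269c6e64886345c46c3d40ff8e4",
    "rollet-0.1.5-py3-none-any.whl":
        "a983c5b4f359ac8bdbdee7953b13a5b966e3ef22c5d9c9b65bc0b1f19b826920",
}


def sha256_of_file(path, chunk_size=65536):
    """Hash a file in chunks so large downloads need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Compare a file's SHA256 digest with an expected hex string."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Self-check on a temporary file containing the bytes b"abc".
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abc")
demo_digest = sha256_of_file(tmp.name)
os.remove(tmp.name)
print(demo_digest)
```

For a real download, `matches_expected("rollet-0.1.5.tar.gz", EXPECTED_SHA256["rollet-0.1.5.tar.gz"])` should return True if the file is intact.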
