Skip to main content

Collect data from various sources

Project description

Rollet

Rollet collects, standardizes and completes from various sources.

PyPI PyPI - Status PyPI - Python Version

Installation

Pypi

The safest way to install rollet is to go through pip

python -m pip install rollet

How to use?

Command script

rollet {extract-txt,extract-csv,extract-json} path
       [-h] [-o [OUTFILE]] [-l [LINK]] [-f [FIELDS]] [--start [START]]
       [--size [SIZE]] [-t [TIMESLEEP]] [--timeout [TIMEOUT]]
       [--blacklist [BLACKLIST]]
positional arguments:
  {extract-txt,extract-csv,extract-json} Choose file type option extraction
  path                                   file path

optional arguments:
  -h, --help                    show this help message and exit
  -o [OUTFILE], --outfile       output file path
  -l [LINK], --link             link field if csv or json
  -f [FIELDS], --fields         fields to keep separated by comma
  --start [START]               number of rows to skip
  --size [SIZE]                 max number of rows to keep
  -t [TIMESLEEP], --timesleep   sleep time in seconds between two pulling
  --timeout [TIMEOUT]           Max GET request timeout in second
  --blacklist [BLACKLIST]       0 (do not use), 1 (use), path (one column domain blacklist file)

Python

Basic usage

from rollet import get_content
from rollet.extractor import BaseExtractor

url = 'https://example.url.com/content-id'

content_dict = get_content(url)

content_object = BaseExtractor(url)
content_object.title            # Title
content_object.abstract         # Abstract
content_object.lang             # Language
content_object.content_type     # Type (pdf, json, html, ...)
content_object.to_dict()        # Same as get_content

Custom extractors

class CustomExtractor(BaseExtractor):

    @property
    def title(self):
        return self._page.find('title')

PDF extractors

PDF extraction require Grobid service.
Assuming Grobid API runs on http://localhost:8070

from rollet import grobid_service, get_content
from rollet.extractor import PDFExtractor

grobid_service('localhost', '8070')

url = 'https://example.url.com/pdf-content-id'

content_dict = get_content(url)

pdf_content_object = PDFExtractor(url)

Reading PDF with BaseExtractor will instanciate PDFExtractor object.

And More!

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

rollet-0.1.5.tar.gz (66.7 kB view details)

Uploaded Source

Built Distribution

rollet-0.1.5-py3-none-any.whl (66.4 kB view details)

Uploaded Python 3

File details

Details for the file rollet-0.1.5.tar.gz.

File metadata

  • Download URL: rollet-0.1.5.tar.gz
  • Upload date:
  • Size: 66.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/3.10.0 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.0 CPython/3.7.3

File hashes

Hashes for rollet-0.1.5.tar.gz
Algorithm Hash digest
SHA256 322270401955942af3d36c62c54f6e41f5902269c6e64886345c46c3d40ff8e4
MD5 51dd90a081dd2aee86013255583a52ad
BLAKE2b-256 a5305bcdcb4e1397b213508c592fda693603ffa24e187dca88d5d230e7f4c5bd

See more details on using hashes here.

File details

Details for the file rollet-0.1.5-py3-none-any.whl.

File metadata

  • Download URL: rollet-0.1.5-py3-none-any.whl
  • Upload date:
  • Size: 66.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/3.10.0 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.0 CPython/3.7.3

File hashes

Hashes for rollet-0.1.5-py3-none-any.whl
Algorithm Hash digest
SHA256 a983c5b4f359ac8bdbdee7953b13a5b966e3ef22c5d9c9b65bc0b1f19b826920
MD5 1e3ef0e8d0aeae28fd66a3022c698531
BLAKE2b-256 ccdc80cf0e514393a13042dbbcd7892243e85572675539f58f893ed8132b5464

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page