
Python interface to Scrapinghub Automatic Extraction API

Project description


Python client libraries for the Scrapinghub AutoExtract API. They let you extract product and article information from any website.

Both synchronous and asyncio wrappers are provided by this package.

License is BSD 3-clause.

Installation

pip install scrapinghub-autoextract

scrapinghub-autoextract requires Python 3.6+ for the CLI tool and for the asyncio API; the basic, synchronous API works with Python 3.5.

Usage

First, make sure you have an API key. To avoid passing it in the api_key argument with every call, you can set the SCRAPINGHUB_AUTOEXTRACT_KEY environment variable to the key.
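For example, here is a minimal sketch of providing the key from Python before any calls are made (the environment variable and the api_key argument are the ones described above; the key value is a placeholder):

import os

# Option 1: set the environment variable (it can also be exported in your shell)
os.environ['SCRAPINGHUB_AUTOEXTRACT_KEY'] = '<your AutoExtract API key>'

# Option 2: pass the key explicitly with every call, e.g.
# request_raw(query, api_key='<your AutoExtract API key>')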

Command-line interface

The most basic way to use the client is from the command line. First, create a file with URLs, one URL per line (e.g. urls.txt). Second, set the SCRAPINGHUB_AUTOEXTRACT_KEY environment variable to your AutoExtract API key (you can also pass the API key as the --api-key script argument).
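For example, urls.txt might look like this (the URLs below are just placeholders):

http://example.com/foo
http://example.com/bar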

Then run the script to get the results:

python -m autoextract urls.txt --page-type article > res.jl

Run python -m autoextract --help to get a description of all supported options.

Synchronous API

The synchronous API provides an easy way to try AutoExtract in a script. For production usage the asyncio API is strongly recommended.

You can send requests as described in the API docs:

from autoextract.sync import request_raw
query = [{'url': 'http://example.com/foo', 'pageType': 'article'}]
results = request_raw(query)
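The same call also works for product pages; a sketch assuming only the pageType value changes (the URL is a placeholder):

query = [{'url': 'http://example.com/some-product', 'pageType': 'product'}]
results = request_raw(query)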

Note that if there are several URLs in the query, results can be returned in arbitrary order.

There is also an autoextract.sync.request_batch helper, which accepts URLs and a page type, and ensures results are in the same order as the requested URLs:

from autoextract.sync import request_batch
urls = ['http://example.com/foo', 'http://example.com/bar']
results = request_batch(urls, page_type='article')
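Since results come back in the same order as urls, they can be paired up directly; a small usage sketch (what each result contains is defined by the API and not shown here):

for url, result in zip(urls, results):
    print(url, result)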

asyncio API

Basic usage is similar to the sync API (request_raw), but an asyncio event loop is used:

from autoextract.aio import request_raw

query = [{'url': 'http://example.com/foo', 'pageType': 'article'}]

async def foo():
    results1 = await request_raw(query)
    # ...
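To execute the coroutine you need an event loop; a minimal sketch using the standard library (asyncio.run requires Python 3.7+; on Python 3.6 use an explicit event loop):

import asyncio

asyncio.run(foo())
# On Python 3.6: asyncio.get_event_loop().run_until_complete(foo())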

There is also a request_parallel function, which allows processing many URLs in parallel, using both batching and multiple connections:

import sys
from autoextract.aio import request_parallel, create_session, ApiError

async def foo():
    async with create_session() as session:
        res_iter = request_parallel(urls, page_type='article',
                                    n_conn=10, batch_size=3,
                                    session=session)
        for f in res_iter:
            try:
                batch_result = await f
                for res in batch_result:
                    ...  # do something with a result
            except ApiError as e:
                print(e, file=sys.stderr)
                raise

The request_parallel and request_raw functions handle throttling (HTTP 429 errors) and network errors, retrying the request in these cases.

The CLI interface implementation (autoextract/__main__.py) can serve as a usage example.

Contributing

Use tox to run tests with different Python versions:

tox

The command above also runs type checks; we use mypy.
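To run a single environment during development, you can pass -e to tox (py38 here is an assumed environment name; check tox.ini for the actual list):

tox -e py38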

Changes

0.1.1 (2020-03-12)

  • allow up to 100 elements in a batch, not up to 99

  • custom User-Agent header is added

  • Python 3.8 support is declared & tested

0.1 (2019-10-09)

Initial release.
