scrapinghub

Client interface for Scrapinghub API

scrapinghub is a Python library for communicating with the Scrapinghub API.

Installation

The quick way:

pip install scrapinghub

You can also install the library with MessagePack support, which provides better response time and improved bandwidth usage:

pip install scrapinghub[msgpack]

Usage

First, you connect to Scrapinghub:

>>> from scrapinghub import Connection
>>> conn = Connection('APIKEY')
>>> conn
Connection('APIKEY')

You can list the projects available to your account:

>>> conn.project_ids()
[123, 456]

And select a particular project to work with:

>>> project = conn[123]
>>> project
Project(Connection('APIKEY'), 123)
>>> project.id
123

To schedule a spider run (it returns the job id):

>>> project.schedule('myspider', arg1='val1')
u'123/1/1'

To get the list of spiders in the project:

>>> project.spiders()
[
  {u'id': u'spider1', u'tags': [], u'type': u'manual', u'version': u'123'},
  {u'id': u'spider2', u'tags': [], u'type': u'manual', u'version': u'123'}
]

To get all finished jobs:

>>> jobs = project.jobs(state='finished')

jobs is a JobSet. JobSet objects are iterable and yield Job objects when iterated, so you typically use them like this:

>>> for job in jobs:
...     pass  # do something with job

Or, if you just want to get the job ids:

>>> [x.id for x in jobs]
[u'123/1/1', u'123/1/2', u'123/1/3']

To select a specific job:

>>> job = project.job(u'123/1/2')
>>> job.id
u'123/1/2'

To retrieve all scraped items from a job:

>>> for item in job.items():
...     pass  # do something with item (it's just a dict)

To retrieve all log entries from a job:

>>> for logitem in job.log():
...     pass  # logitem is a dict with logLevel, message, time

To get job info:

>>> job.info['spider']
'myspider'
>>> job.info['started_time']
'2010-09-28T15:09:57.629000'
>>> job.info['tags']
[]
>>> job.info['fields_count']['description']
1253
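
The started_time value shown above looks like an ISO 8601 timestamp with microseconds. A minimal sketch for turning such a string into a datetime, assuming the format always matches the sample output (parse_started_time is a hypothetical helper, not part of the library):

```python
from datetime import datetime

# Format inferred from the sample value '2010-09-28T15:09:57.629000'.
# This is an assumption; the API may return other variants.
STARTED_TIME_FORMAT = '%Y-%m-%dT%H:%M:%S.%f'


def parse_started_time(value):
    """Parse a job.info['started_time'] string into a naive datetime."""
    return datetime.strptime(value, STARTED_TIME_FORMAT)


started = parse_started_time('2010-09-28T15:09:57.629000')
print(started.isoformat())
```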

To mark a job with the tag consumed:

>>> job.update(add_tag='consumed')

To mark several jobs with the tag consumed (JobSet also supports the update() method):

>>> project.jobs(state='finished').update(add_tag='consumed')

To delete a job:

>>> job.delete()

To delete several jobs (JobSet also supports the delete() method):

>>> project.jobs(state='finished').delete()

HubstorageClient

The library can also be used for interaction with spiders, jobs and scraped data through storage.scrapinghub.com endpoints.

First, use your API key for authorization:

>>> from scrapinghub import HubstorageClient
>>> hc = HubstorageClient(auth='apikey')
>>> hc.server_timestamp()
1446222762611
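
The value returned by server_timestamp() above has the magnitude of a Unix timestamp in milliseconds. Assuming that reading is correct (the source does not state the unit), a small sketch of converting it to a timezone-aware datetime; timestamp_to_datetime is a hypothetical helper, not a library function:

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def timestamp_to_datetime(milliseconds):
    """Convert milliseconds since the Unix epoch (assumed unit) to UTC."""
    # timedelta keeps integer milliseconds exact, unlike float division.
    return EPOCH + timedelta(milliseconds=milliseconds)


server_time = timestamp_to_datetime(1446222762611)
print(server_time.isoformat())
```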

Project

To get the project settings or a summary of its jobs:

>>> project = hc.get_project('1111111')
>>> project.settings['botgroups']
[u'botgroup1', ]
>>> project.jobsummary()
{u'finished': 6,
 u'has_capacity': True,
 u'pending': 0,
 u'project': 1111111,
 u'running': 0}

Spider

To get the id of a spider from its name:

>>> project.ids.spider('foo')
1

To see the summaries of the last jobs:

>>> summaries = project.spiders.lastjobsummary(count=3)

To get the last job summary for a given spider:

>>> summary = project.spiders.lastjobsummary(spiderid='1')

Job

A job can be retrieved directly by its id (project_id/spider_id/job_id):

>>> job = hc.get_job('1111111/1/1')
>>> job.key
'1111111/1/1'
>>> job.metadata['state']
u'finished'
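
Since a job key has the project_id/spider_id/job_id shape, it can be split into its parts locally. The following is a sketch under the assumption that all three parts are decimal integers, as in the examples here; JobKey and parse_job_key are hypothetical names, not part of the library:

```python
from collections import namedtuple

JobKey = namedtuple('JobKey', ['project_id', 'spider_id', 'job_id'])


def parse_job_key(key):
    """Split 'project_id/spider_id/job_id' into a JobKey of integers."""
    parts = key.split('/')
    if len(parts) != 3 or not all(part.isdigit() for part in parts):
        raise ValueError(
            'expected project_id/spider_id/job_id, got %r' % (key,))
    return JobKey(*(int(part) for part in parts))


parsed = parse_job_key('1111111/1/1')
print(parsed.project_id, parsed.spider_id, parsed.job_id)
```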

Creating a new job requires a spider name:

>>> job = hc.push_job(projectid='1111111', spidername='foo')
>>> job.key
'1111111/1/1'

Priority can be between 0 and 4 (from lowest to highest); the default is 2.

To push a job at the project level with the highest priority:

>>> job = project.push_job(spidername='foo', priority=4)
>>> job.metadata['priority']
4
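
The priority range described above can be checked before a job is pushed. This is only a client-side sketch; checked_priority and the constants are hypothetical names, and how the server treats out-of-range values is not covered by this document:

```python
MIN_PRIORITY = 0      # lowest priority, per the text above
MAX_PRIORITY = 4      # highest priority
DEFAULT_PRIORITY = 2  # used when no priority is given


def checked_priority(priority=None):
    """Return a valid priority, or raise if the value is out of range."""
    if priority is None:
        return DEFAULT_PRIORITY
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(priority, bool) or not isinstance(priority, int):
        raise TypeError('priority must be an integer, got %r' % (priority,))
    if not MIN_PRIORITY <= priority <= MAX_PRIORITY:
        raise ValueError('priority must be between %d and %d, got %d'
                         % (MIN_PRIORITY, MAX_PRIORITY, priority))
    return priority


print(checked_priority(), checked_priority(4))
```

The result could then be passed along, for example as project.push_job(spidername='foo', priority=checked_priority(4)).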

Pushing a job with spider arguments:

>>> project.push_job(spidername='foo', spider_args={'arg1': 'foo', 'arg2': 'bar'})

A running job can be cancelled by calling request_cancel():

>>> job.request_cancel()
>>> job.metadata['cancelled_by']
u'John'

To delete a job:

>>> job.purged()
>>> job.metadata['state']
u'deleted'

Job details

Job details can be found in the job's metadata and its scrapystats:

>>> job = hc.get_job('1111111/1/1')
>>> job.metadata['version']
u'5123a86-master'
>>> job.metadata['scrapystats']
...
u'downloader/response_count': 104,
u'downloader/response_status_count/200': 104,
u'finish_reason': u'finished',
u'finish_time': 1447160494937,
u'item_scraped_count': 50,
u'log_count/DEBUG': 157,
u'log_count/INFO': 1365,
u'log_count/WARNING': 3,
u'memusage/max': 182988800,
u'memusage/startup': 62439424,
...
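
Stats like the ones above lend themselves to a few derived figures. A minimal sketch, assuming scrapystats behaves like a plain dict with the keys shown in the sample output; summarize_stats is a hypothetical helper and the sample dict simply copies values from that output:

```python
def summarize_stats(stats):
    """Derive a few ratios from a scrapystats-like dict."""
    responses = stats.get('downloader/response_count', 0)
    ok = stats.get('downloader/response_status_count/200', 0)
    items = stats.get('item_scraped_count', 0)
    return {
        # Guard against division by zero for jobs without responses.
        'items_per_response': items / float(responses) if responses else 0.0,
        'ok_ratio': ok / float(responses) if responses else 0.0,
        'warnings': stats.get('log_count/WARNING', 0),
    }


sample_stats = {
    'downloader/response_count': 104,
    'downloader/response_status_count/200': 104,
    'item_scraped_count': 50,
    'log_count/WARNING': 3,
}
summary = summarize_stats(sample_stats)
print(summary)
```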

Anything can be stored in metadata. Here is an example of how to add tags:

>>> job.update_metadata({'tags': 'obsolete'})

Jobs

To iterate through the metadata of all jobs in a project (in descending order):

>>> jobs_metadata = project.jobq.list()
>>> [j['key'] for j in jobs_metadata]
['1111111/1/3', '1111111/1/2', '1111111/1/1']

The jobq metadata fieldset is less detailed than job.metadata, but it contains a few new fields as well. Additional fields can be requested using the jobmeta parameter. If it is used, it is up to the user to list all the required fields: apart from the requested ones, only a few default fields are added:

>>> metadata = next(project.jobq.list())
>>> metadata.get('spider', 'missing')
u'foo'
>>> jobs_metadata = project.jobq.list(jobmeta=['scheduled_by', ])
>>> metadata = next(jobs_metadata)
>>> metadata.get('scheduled_by', 'missing')
u'John'
>>> metadata.get('spider', 'missing')
'missing'

By default, jobq.list() returns at most the last 1000 results. Pagination is available using the start parameter:

>>> jobs_metadata = project.jobq.list(start=1000)
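
A paging loop over such results might look like the sketch below. It assumes that start is an offset into the result list and that a page shorter than the page size marks the end; neither is confirmed here. iter_all_jobs and fake_fetch_page are hypothetical names, and the fake data stands in for the real API so the example runs offline:

```python
def iter_all_jobs(fetch_page, page_size=1000):
    """Yield entries page by page until a short page signals the end.

    fetch_page is any callable that takes a start offset and returns an
    iterable of at most page_size entries.
    """
    start = 0
    while True:
        page = list(fetch_page(start))
        for entry in page:
            yield entry
        if len(page) < page_size:
            break
        start += page_size


# Stand-in data in descending order, mimicking the listing order above.
FAKE_JOBS = [{'key': '1111111/1/%d' % n} for n in range(2500, 0, -1)]


def fake_fetch_page(start):
    return FAKE_JOBS[start:start + 1000]


all_jobs = list(iter_all_jobs(fake_fetch_page))
print(len(all_jobs))
```

With the real client, the callable could be something like lambda start: project.jobq.list(start=start), provided start behaves as assumed.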

There are several filters like spider, state, has_tag, lacks_tag, startts and endts. To get jobs filtered by tags:

>>> jobs_metadata = project.jobq.list(has_tag=['new', 'verified'], lacks_tag='obsolete')

A list of tags is combined with OR, so in the case above jobs with either the 'new' or the 'verified' tag are expected.
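
The OR behaviour of has_tag can be modelled locally. This sketch is not the server's implementation: matches_tag_filters is a hypothetical helper, and the handling of lacks_tag (a job is excluded if it carries any of the listed tags) is an assumption:

```python
def _as_set(tags):
    """Accept None, a single tag string, or an iterable of tags."""
    if tags is None:
        return set()
    if isinstance(tags, str):
        return {tags}
    return set(tags)


def matches_tag_filters(job_tags, has_tag=None, lacks_tag=None):
    """Return True if a job with job_tags passes both filters."""
    job_tags = set(job_tags)
    wanted = _as_set(has_tag)
    unwanted = _as_set(lacks_tag)
    # has_tag: at least one of the wanted tags must be present (OR).
    if wanted and not (job_tags & wanted):
        return False
    # lacks_tag: assumed to exclude jobs carrying any unwanted tag.
    if job_tags & unwanted:
        return False
    return True


print(matches_tag_filters(['new'], has_tag=['new', 'verified'],
                          lacks_tag='obsolete'))
```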

To get a certain number of the last finished jobs for a given spider:

>>> jobs_metadata = project.jobq.list(spider='foo', state='finished', count=3)

There are 4 possible job states, which can be used as values for filtering by state:

pending
running
finished
deleted

Items

To iterate through items:

>>> items = job.items.iter_values()
>>> for item in items:
...     pass  # do something, item is just a dict

Logs

To iterate through the first 10 log entries, for example:

>>> logs = job.logs.iter_values(count=10)
>>> for log in logs:
...     pass  # do something, log is a dict with log level, message and time keys

Collections

Let's store a hash and timestamp pair for the foo spider. The usual workflow with Collections would be:

>>> collections = project.collections
>>> foo_store = collections.new_store('foo_store')
>>> foo_store.set({'_key': '002d050ee3ff6192dcbecc4e4b4457d7', 'value': '1447221694537'})
>>> foo_store.count()
1
>>> foo_store.get('002d050ee3ff6192dcbecc4e4b4457d7')
'1447221694537'
>>> for result in foo_store.iter_values():
...     pass  # do something with the _key and value pair
...
>>> foo_store.delete('002d050ee3ff6192dcbecc4e4b4457d7')
>>> foo_store.count()
0
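
A record in the shape stored above can be built locally before calling set(). In this sketch the key is an MD5 hex digest of some text and the value is a millisecond timestamp as a string; both choices are assumptions inferred from the look of the sample values, and make_record is a hypothetical helper:

```python
import hashlib
import time


def make_record(text, now_ms=None):
    """Build a {'_key': ..., 'value': ...} dict like the one shown above."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    # MD5 is used only to get a 32-character key, not for security.
    key = hashlib.md5(text.encode('utf-8')).hexdigest()
    return {'_key': key, 'value': str(now_ms)}


record = make_record('http://example.com/some/path.html',
                     now_ms=1447221694537)
print(record['value'], len(record['_key']))
```

The resulting dict could then be passed to foo_store.set(record).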

Frontier

Typical workflow with Frontier:

>>> frontier = project.frontier

Add a request to the frontier:

>>> frontier.add('test', 'example.com', [{'fp': '/some/path.html'}])
>>> frontier.flush()
>>> frontier.newcount
1

Add requests with additional parameters:

>>> frontier.add('test', 'example.com', [{'fp': '/'}, {'fp': 'page1.html', 'p': 1, 'qdata': {'depth': 1}}])
>>> frontier.flush()
>>> frontier.newcount
2
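
The request dicts passed to frontier.add() above use the keys fp, p and qdata. A small sketch of a builder for them; make_frontier_request is a hypothetical helper, and reading p as a priority is an assumption not stated in this document:

```python
def make_frontier_request(fp, priority=None, qdata=None):
    """Build a request dict with the keys used in the examples above."""
    request = {'fp': fp}
    if priority is not None:
        request['p'] = priority  # 'p' assumed to mean priority
    if qdata is not None:
        request['qdata'] = dict(qdata)  # copy to avoid sharing state
    return request


batch = [
    make_frontier_request('/'),
    make_frontier_request('page1.html', priority=1, qdata={'depth': 1}),
]
print(batch)
```

Such a batch could be handed to frontier.add('test', 'example.com', batch) in place of the literal list.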

To delete the slot example.com from the frontier:

>>> frontier.delete_slot('test', 'example.com')

To retrieve requests for a given slot:

>>> reqs = frontier.read('test', 'example.com')

To delete a batch of requests:

>>> frontier.delete('test', 'example.com', '00013967d8af7b0001')

To retrieve fingerprints for a given slot:

>>> fps = [req['requests'] for req in frontier.read('test', 'example.com')]




scrapinghub 1.9.0 (uploaded Nov 2, 2016)
