
scrapy-sqlitem

scrapy-sqlitem allows you to define Scrapy items using SQLAlchemy models or tables. It also provides an easy way to save scraped items to the database in chunks.

This project is in beta. Pull requests and feedback are welcome. The usual caveats of using a SQL database backend for a write-heavy application still apply.

Inspired by scrapy-redis and scrapy-djangoitem.

Quickstart

pip install scrapy_sqlitem

Define items using the SQLAlchemy ORM

from sqlalchemy import Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base

from scrapy_sqlitem import SqlItem

Base = declarative_base()

class MyModel(Base):
    __tablename__ = 'mytable'
    id = Column(Integer, primary_key=True)
    name = Column(String)

class MyItem(SqlItem):
    sqlmodel = MyModel

Or define items using SQLAlchemy Core

from sqlalchemy import Table, Column, Integer, String, MetaData

from scrapy_sqlitem import SqlItem

metadata = MetaData()

class MyItem(SqlItem):
    sqlmodel = Table('mytable', metadata,
        Column('id', Integer, primary_key=True),
        Column('name', String, nullable=False))

If the tables have not been created yet, make sure to create them first. See the SQLAlchemy docs and the example spider.
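For example, with the definitions above, something like the following (a minimal sketch, run once before crawling; the engine URL is only illustrative) creates any missing tables:

from sqlalchemy import create_engine

engine = create_engine("sqlite:///scrapy.db")

# Core tables registered on `metadata`
metadata.create_all(engine)

# or, for ORM models declared against `Base`
Base.metadata.create_all(engine)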

Use SqlSpider to easily save scraped items to the database

settings.py

DATABASE_URI = "sqlite:///"
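Presumably any SQLAlchemy connection URL works here, since it is handed to the engine; for example (illustrative credentials, assuming a PostgreSQL backend):

DATABASE_URI = "postgresql://user:password@localhost:5432/mydb"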

Define your spider

from scrapy.selector import Selector
from scrapy_sqlitem import SqlSpider

from myproject.items import MyItem  # assuming MyItem lives in your project's items module

class MySpider(SqlSpider):
    name = 'myspider'

    start_urls = ('http://dmoz.org',)

    def parse(self, response):
        selector = Selector(response)
        item = MyItem()
        item['name'] = selector.xpath('//title[1]/text()').extract_first()
        yield item

Run the spider

scrapy crawl myspider

Query the database

SELECT * FROM mytable;

 id |               name                |
----+-----------------------------------+
  1 | DMOZ - the Open Directory Project |

Other Information

Do not want to use SqlSpider? Write a pipeline instead.

from sqlalchemy import create_engine

class CommitSqlPipeline(object):

    def __init__(self):
        self.engine = create_engine("sqlite:///")

    def process_item(self, item, spider):
        item.commit_item(engine=self.engine)
        return item

Drop items missing required data (such as primary keys) before saving to the db

from scrapy.exceptions import DropItem

class DropMissingDataPipeline(object):

    def process_item(self, item, spider):
        if item.null_required_fields:
            raise DropItem("Missing required field data")
        return item

# Watch out for serial (auto-increment) primary keys, which are considered null here.
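Either pipeline is enabled the usual Scrapy way in settings.py (the module path below is a placeholder for wherever the classes actually live):

ITEM_PIPELINES = {
    'myproject.pipelines.DropMissingDataPipeline': 100,
    'myproject.pipelines.CommitSqlPipeline': 200,
}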

Save to the database in chunks rather than item by item

Inherit from SqlSpider and configure chunk sizes in settings.py:

DEFAULT_CHUNKSIZE = 500

CHUNKSIZE_BY_TABLE = {'mytable': 1000, 'othertable': 250}

If an error occurs while saving a chunk to the db, it will fall back to saving each item one at a time.

Access the underlying SQLAlchemy table to query the database

# Suppose the table already contains this row:
#   INSERT INTO mytable (id, name) VALUES (1, 'ryan');
myitem = MyItem()

# Bind the table's metadata to an engine (this could also be done when the table was created)
myitem.table.metadata.bind = self.engine
myitem.table.select().where(myitem.table.c.id == 1).execute().fetchone()

(1, 'ryan')

What row in the database matches the data in my item?

myitem = MyItem()
myitem['id'] = 1
myitem.get_matching_dbrow(bind=self.engine)

(1, 'ryan')

This is the same query as the one above!

Gotchas

If you override either item_scraped or spider_closed, make sure to call super!

class MySpider(SqlSpider):

    def parse(self, response):
        pass

    def spider_closed(self, spider, reason):
        super(MySpider, self).spider_closed(spider, reason)
        self.log("Log some really important custom stats")

Be careful with other mixins. The inheritance structure can get a little messy: if a class early in the MRO overrides item_scraped and does not call super, the item_scraped method of SqlSpider will never be called.

Other Methods of sqlitem

sqlitem.table

  • Returns the SQLAlchemy Core table that corresponds to the item.

sqlitem.null_required_fields

  • Returns a set of the column names that are marked not nullable and whose corresponding data in the item is null.

sqlitem.null_primary_key_fields

  • Returns a set of the primary key names whose corresponding data in the item is null.

sqlitem.primary_keys

sqlitem.required_keys

sqlitem.get_matching_dbrow(bind=None, use_cache=True)

  • Finds the row in the database that matches the primary key data in the item.
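For illustration, a quick sketch of how these helpers might be used (assuming the MyItem defined in the Quickstart; the checks simply reflect the documented behavior above):

myitem = MyItem()
myitem['name'] = 'ryan'

# 'id' is the primary key and has not been set, so it should show up here
if myitem.null_primary_key_fields:
    print("missing primary key data:", myitem.null_primary_key_fields)

# required (not nullable) columns with no data in the item
print(myitem.null_required_fields)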

ToDo

  • Continuous integration tests
Release History

0.1.2 (this version)

0.1.1

0.1.0

Download Files

scrapy-sqlitem-0.1.2.tar.gz (6.0 kB, Source, uploaded Aug 15, 2015)
