Project Description

Library to extract data from semi-structured text documents.

It is best suited for processing data in plain-text files that have no formal structure (or in files that are easy to convert to plain text). Structured formats like XML, CSV and HTML are not a good use case for Raspador; they already have excellent extraction tools, such as lxml, html5lib, BeautifulSoup, and PyQuery.

Extractors are defined as model-like classes, similar to the Django ORM. Each field searches for a pattern specified by a regular expression, and captured groups are automatically converted to primitives.
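The idea of a field that searches for a pattern and converts the captured group can be sketched with the standard library alone. This is only an illustration of the concept, not Raspador's implementation: the `Field` class, its `convert` argument, and the `SIZE` token in the sample line are hypothetical names introduced for this example.

```python
import re


class Field(object):
    """Hypothetical field: find a pattern in a line and convert group 1."""

    def __init__(self, pattern, convert=str):
        self.regex = re.compile(pattern)
        self.convert = convert

    def parse(self, line):
        match = self.regex.search(line)
        if match is None:
            # The pattern does not occur in this line.
            return None
        # Convert the captured text to the requested primitive type.
        return self.convert(match.group(1))


name = Field(r'PART:(\S+)')
size = Field(r'SIZE:(\d+)', convert=int)

line = 'PART:/dev/sda1 SIZE:512'
print(name.parse(line))
print(size.parse(line))
```

Here `size.parse(line)` returns the integer 512 rather than the string "512", which is the kind of conversion the fields perform on captured groups.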

The parser is implemented as a generator, so each item found can be consumed before the analysis finishes, which allows pipeline-style processing.

The analysis is forward-only, which makes it extremely quick. As a result, any iterator that yields strings can be analyzed, including infinite streams.
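A minimal sketch of why a forward-only generator can handle an endless input, using only the standard library. It does not use Raspador itself: `endless_lines` and `parse` are hypothetical helpers written for this illustration, and the line format is borrowed from the log example in this document.

```python
import itertools
import re

LINE_RE = re.compile(r'PART:(\S+) TYPE:(\S+)')


def endless_lines():
    """Hypothetical infinite source, standing in for a socket or tailed log."""
    for n in itertools.count(1):
        yield 'PART:/dev/sda%d TYPE:ext4' % n


def parse(lines):
    """Forward-only pass: read each line once, yield an item as soon as it matches."""
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            yield {'PART': match.group(1), 'TYPE': match.group(2)}


# Consume only the first three items; the source itself never ends.
first_three = list(itertools.islice(parse(endless_lines()), 3))
print(first_three)
```

Because nothing is buffered or revisited, the consumer decides how much of the stream is read, and each item is available to the next stage of a pipeline as soon as it is found.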

Install

Raspador works on CPython 2.6+, CPython 3.2+ and PyPy. To install it, use:

pip install raspador

or with easy_install:

easy_install raspador

From source

Download and install from source:

git clone https://github.com/fgmacedo/raspador.git
cd raspador
python setup.py install

Dependencies

There are no external dependencies, except on Python 2.6 (see the note below).

Note

Python 2.6

With Python 2.6, you must install ordereddict.

You can install it with pip:

pip install ordereddict

Tests

To automate tests with all supported Python versions at once, we use tox.

Run all tests with:

$ tox

The tests depend on several third-party libraries, which tox installs into the virtualenv of each Python version:

nose==1.3.0
coverage==3.6
flake8==2.0

Examples

Extract data from logs

from __future__ import print_function
import json
from raspador import Parser, StringField

out = """
PART:/dev/sda1 UUID:423k34-3423lk423-sdfsd-43 TYPE:ext4
PART:/dev/sda2 UUID:74928389-852893-sdfdf-g8 TYPE:ext4
PART:/dev/sda3 UUID:sdkj9d93-sdf9df-3kr3l-d8 TYPE:swap
"""


class LogParser(Parser):
    begin = r'^PART.*'
    end = r'^PART.*'
    PART = StringField(r'PART:([^\s]+)')
    UUID = StringField(r'UUID:([^\s]+)')
    TYPE = StringField(r'TYPE:([^\s]+)')


a = LogParser()

# res is a generator
res = a.parse(iter(out.splitlines()))

out_as_json = json.dumps(list(res), indent=2)
print(out_as_json)

# Output:
"""
[
  {
    "PART": "/dev/sda1",
    "TYPE": "ext4",
    "UUID": "423k34-3423lk423-sdfsd-43"
  },
  {
    "PART": "/dev/sda2",
    "TYPE": "ext4",
    "UUID": "74928389-852893-sdfdf-g8"
  },
  {
    "PART": "/dev/sda3",
    "TYPE": "swap",
    "UUID": "sdkj9d93-sdf9df-3kr3l-d8"
  }
]
"""
Release History

0.2.2

This version

0.2.1

0.2.0

0.1.3

0.1.2

0.1.1

0.1.0

Download Files

File Name                      File Type   Upload Date
raspador-0.2.2.zip (11.8 kB)   Source      Oct 30, 2013
