Library to extract data from semi-structured text documents
It is best suited to processing plain-text files that have no formal structure (or files that are easy to convert to plain text). Structured formats such as XML, CSV and HTML are not a good use case for Raspador; they already have excellent extraction tools, such as lxml, html5lib, BeautifulSoup, and PyQuery.
The extractors are defined as classes that act as models, similar to the Django ORM. Each field searches for a pattern specified by a regular expression, and captured groups are automatically converted to primitives.
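The idea of a field that searches for a pattern and converts the captured group can be sketched with the standard `re` module alone. This is only an illustration of the concept, not Raspador's implementation: the `extract` helper and the `SIZE` value in the sample line are hypothetical.

```python
import re


def extract(pattern, line, convert=str):
    """Return the first captured group of `pattern`, converted, or None.

    Hypothetical helper for illustration; not part of Raspador.
    """
    match = re.search(pattern, line)
    if match is None:
        return None
    return convert(match.group(1))


# Sample line; the SIZE value is made up for this sketch.
line = "PART:/dev/sda1 SIZE:512 TYPE:ext4"

part = extract(r"PART:(\S+)", line)               # kept as a string
size = extract(r"SIZE:(\d+)", line, convert=int)  # converted to an int
missing = extract(r"UUID:(\S+)", line)            # no match, so None
```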
The parser is implemented as a generator, so each item found can be consumed before the analysis finishes, which allows it to be used as a stage in a pipeline.
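A minimal sketch of this generator-based pipeline, written with the standard library only. The function names `parse_lines` and `only_ext4` are assumptions made for this example and do not come from Raspador; the point is that the first item is available before the rest of the input has been read.

```python
import re

PATTERN = re.compile(r"PART:(\S+)\s+TYPE:(\S+)")


def parse_lines(lines):
    """Yield one dict per matching line, as soon as the line is read."""
    for line in lines:
        match = PATTERN.search(line)
        if match:
            yield {"PART": match.group(1), "TYPE": match.group(2)}


def only_ext4(items):
    """Second pipeline stage: pass through ext4 partitions only."""
    for item in items:
        if item["TYPE"] == "ext4":
            yield item


lines = iter([
    "PART:/dev/sda1 TYPE:ext4",
    "PART:/dev/sda2 TYPE:swap",
    "PART:/dev/sda3 TYPE:ext4",
])

pipeline = only_ext4(parse_lines(lines))
first = next(pipeline)   # produced after reading only the first line
unread = list(lines)     # the two lines the parser had not touched yet
```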
The analysis is forward-only: the input is read once, in order, which keeps it fast. As a result, any iterator that yields strings can be analyzed, including infinite streams.
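To illustrate why a forward-only parser can handle an infinite stream, here is a small standard-library sketch. The `endless_log` source and the `parse_ids` generator are hypothetical stand-ins, not Raspador code; `itertools.islice` stops the consumer after three items, so the endless source is never exhausted.

```python
import itertools
import re


def endless_log():
    """Hypothetical infinite source: yields one log line per counter value."""
    for n in itertools.count(1):
        yield "event id:%d" % n


def parse_ids(lines):
    """Read lines strictly in order and yield each id as an int."""
    for line in lines:
        match = re.search(r"id:(\d+)", line)
        if match:
            yield int(match.group(1))


# Take only the first three parsed items from the endless stream.
first_three = list(itertools.islice(parse_ids(endless_log()), 3))
```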
Install
Raspador works on CPython 2.6+, CPython 3.2+ and PyPy. To install it, use:
pip install raspador
or with easy_install:
easy_install raspador
From source
Download and install from source:
git clone https://github.com/fgmacedo/raspador.git
cd raspador
python setup.py install
Dependencies
There are no external dependencies.
Tests
To automate tests with all supported Python versions at once, we use tox.
Run all tests with:
$ tox
Tests depend on several third-party libraries, but tox installs them in each Python version's virtualenv:
nose==1.3.0
coverage==3.6
flake8==2.0
Examples
Extract data from logs
from __future__ import print_function
import json
from raspador import Parser, StringField
out = """
PART:/dev/sda1 UUID:423k34-3423lk423-sdfsd-43 TYPE:ext4
PART:/dev/sda2 UUID:74928389-852893-sdfdf-g8 TYPE:ext4
PART:/dev/sda3 UUID:sdkj9d93-sdf9df-3kr3l-d8 TYPE:swap
"""

class LogParser(Parser):
    begin = r'^PART.*'
    end = r'^PART.*'
    # Each field captures the first group of its regular expression
    PART = StringField(r'PART:([^\s]+)')
    UUID = StringField(r'UUID:([^\s]+)')
    TYPE = StringField(r'TYPE:([^\s]+)')

a = LogParser()
# res is a generator
res = a.parse(iter(out.splitlines()))
out_as_json = json.dumps(list(res), indent=2)
print(out_as_json)
# Output:
"""
[
  {
    "PART": "/dev/sda1",
    "TYPE": "ext4",
    "UUID": "423k34-3423lk423-sdfsd-43"
  },
  {
    "PART": "/dev/sda2",
    "TYPE": "ext4",
    "UUID": "74928389-852893-sdfdf-g8"
  },
  {
    "PART": "/dev/sda3",
    "TYPE": "swap",
    "UUID": "sdkj9d93-sdf9df-3kr3l-d8"
  }
]
"""