Search pymarc.Record using a string expression
pymarcspec
Summary
An implementation of MarcSpec on top of pymarc for searching MARC records.
Usage
The idea is to easily use strings to search over MARC without writing complicated code to handle data.
import sys
from pymarcspec import MarcSearchParser
from pymarc import MARCReader

# Parse the expression once, then reuse the parsed spec for every record.
parser = MarcSearchParser()
spec = parser.parse('650$a$0')

# MARC files are binary, so open in 'rb' mode.
with open(sys.argv[1], 'rb') as f:
    for record in MARCReader(f):
        subjects = spec.search(record)
        print(subjects)
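To make the expression concrete: in 650$a$0 the first three characters name the field tag, and each $ introduces a subfield code. The toy below handles only that simple form and works on plain tuples rather than pymarc records. The names split_simple_spec and search_plain are hypothetical, made up for illustration; they are not part of pymarcspec, and the real grammar covers much more (indexes, ranges, and so on).

```python
def split_simple_spec(expr):
    """Split 'TTT$a$b' into ('TTT', ['a', 'b']).

    Toy helper for illustration only; not pymarcspec's parser.
    """
    tag, *codes = expr.split('$')
    if len(tag) != 3 or any(len(code) != 1 for code in codes):
        raise ValueError(f'unsupported expression: {expr!r}')
    return tag, codes


def search_plain(expr, fields):
    """Collect matching subfield values from (tag, [(code, value), ...]) tuples."""
    tag, codes = split_simple_spec(expr)
    return [
        value
        for field_tag, subfields in fields
        if field_tag == tag
        for code, value in subfields
        if not codes or code in codes  # no codes given: take every subfield
    ]


# Hypothetical stand-in for a record: two fields with (code, value) subfields.
fields = [
    ('245', [('a', 'A book about cats')]),
    ('650', [('a', 'Cats'), ('x', 'Fiction'), ('0', 'http://example.org/cats')]),
]
print(search_plain('650$a$0', fields))  # ['Cats', 'http://example.org/cats']
```

Anything beyond the plain tag-plus-subfields form, such as 245[0]$a-c, makes the toy raise ValueError, which is exactly the part the real parser takes care of.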
The TextStyle class governs how results are combined into strings (or not).
You can subclass TextStyle or BaseTextStyle to do anything you want with combining
the results, or you can handle it yourself.
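If you do handle it yourself, combining results is mostly a matter of nested joins. The function below is a minimal standalone sketch and not the TextStyle implementation: join_results is a hypothetical name, the nested-list input shape is an assumption, and subfield_delimiter is an assumed parameter (only field_delimiter appears in the usage in this README).

```python
def join_results(fields, field_delimiter=':', subfield_delimiter=' '):
    """Join each field's subfield values, then join the fields.

    Standalone illustration; not the TextStyle implementation.
    """
    return field_delimiter.join(
        subfield_delimiter.join(values) for values in fields
    )


# One inner list per matching field, one string per matching subfield.
matches = [['Cats', 'Fiction'], ['Dogs']]
print(join_results(matches))                          # Cats Fiction:Dogs
print(join_results(matches, field_delimiter=' | '))   # Cats Fiction | Dogs
```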
There is also a MarcSearch object that memoizes each search expression, so that
you can conveniently run a number of different searches without creating several
parsed specs. For example:
import csv
import sys
from pymarcspec import MarcSearch, TextStyle
from pymarc import MARCReader

writer = csv.writer(sys.stdout, dialect='unix', quoting=csv.QUOTE_MINIMAL)
writer.writerow(['id', 'title', 'subjects'])

style = TextStyle(field_delimiter=':')
marcsearch = MarcSearch(style)
with open(sys.argv[1], 'rb') as f:
    for record in MARCReader(f):
        # The control number lives in field 001 (100 is the main-entry name).
        control_id = marcsearch.search('001', record)
        title = marcsearch.search('245[0]$a-c', record)
        subjects = marcsearch.search('650$a', record)
        writer.writerow([control_id, title, subjects])
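The memoization idea itself is small enough to sketch without the library. This is an assumption about the general technique, not a copy of how MarcSearch is written: MemoizingSearch and toy_parse are hypothetical names, and the stand-in parser just looks the expression up in a dict so the example runs on its own.

```python
class MemoizingSearch:
    """Parse each expression once and reuse the parsed result afterwards."""

    def __init__(self, parse):
        self._parse = parse   # callable: expression string -> search callable
        self._specs = {}      # expression string -> parsed spec

    def search(self, expr, record):
        if expr not in self._specs:
            self._specs[expr] = self._parse(expr)
        return self._specs[expr](record)


parse_calls = []


def toy_parse(expr):
    """Stand-in parser: records the call, returns a dict lookup for expr."""
    parse_calls.append(expr)
    return lambda record: record.get(expr)


searcher = MemoizingSearch(toy_parse)
records = [{'001': '1', '245': 'First'}, {'001': '2', '245': 'Second'}]
ids = [searcher.search('001', r) for r in records]
titles = [searcher.search('245', r) for r in records]
print(ids, titles)    # ['1', '2'] ['First', 'Second']
print(parse_calls)    # ['001', '245']
```

Each expression is parsed once no matter how many records go through the loop, which is the convenience described above.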
Development
Building the Parser
To build the parser, run:
python -m tatsu -o marcparser/parser.py marcparser/marcparser.ebnf
Note that this builds a class MarcSpecParser, which implements the full MarcSpec
specification. MarcSearchParser is a subclass that builds an instance of MarcSpec;
building this structure has some restrictions, reflecting what I needed when I wrote it.
Testing for freshness
The test in test/test_ebnf.py compiles the parser from the EBNF into a temporary path, which makes sure
that coffee-driven programmers like me remember to compile the parser and check in the changes.
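The general shape of such a freshness check can be sketched as follows. This is not the contents of test/test_ebnf.py: is_fresh and fake_generate are hypothetical names, the byte-for-byte comparison is my assumption about how freshness is decided, and it only works if the generator output is deterministic.

```python
import tempfile
from pathlib import Path


def is_fresh(generate, checked_in):
    """Return True if regenerating reproduces the checked-in file byte for byte.

    `generate` is any callable that writes the generated source to the path
    it is given (for this project it would wrap the parser build command).
    """
    with tempfile.TemporaryDirectory() as tmp:
        candidate = Path(tmp) / 'generated.py'
        generate(candidate)
        return candidate.read_bytes() == Path(checked_in).read_bytes()


# Demo with a stand-in generator instead of the real parser build.
def fake_generate(path):
    Path(path).write_text('PARSER = "v2"\n')


with tempfile.TemporaryDirectory() as repo:
    committed = Path(repo) / 'parser.py'
    committed.write_text('PARSER = "v1"\n')   # stale copy in the repository
    stale = is_fresh(fake_generate, committed)
    committed.write_text('PARSER = "v2"\n')   # after recompiling and committing
    fresh = is_fresh(fake_generate, committed)

print(stale, fresh)  # False True
```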
Performance
It is not obvious that this library is needed; it may be fine, for instance, to use XPath expressions.
Suppose we are going to do a lot of these conversions: if XPath is fast enough, the work of converting
from a pymarc.Record to MARCXML will be amortized over many searches. Jupyter Notebooks have a %timeit
magic that allows us to check this.
Let us check the performance of the simplest such XPath expression:
In [34]: %timeit ''.join(doc.xpath('./controlfield[@tag="001"]/text()'))
19.4 µs ± 1.07 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
And compare it to parsing a spec and searching:
In [37]: from pymarcspec import MarcSearchParser
In [38]: parser = MarcSearchParser()
In [39]: spec = parser.parse('001')
In [40]: spec.search(record)
Out[40]: '1589530'
In [41]: %timeit spec.search(record)
7.89 µs ± 253 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
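So, for this simple lookup the parsed spec is roughly 2.5 times faster than XPath (7.89 µs versus 19.4 µs), before even counting the cost of converting the record to MARCXML, and the expression is much closer to the notation that library IT staff already use.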
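The amortization argument can be written out as plain arithmetic. The two per-search timings come from the %timeit runs above; the function names and the conversion_us parameter are hypothetical, since the conversion cost was not measured here, and the one-time cost of parsing the spec is left out as well.

```python
XPATH_US = 19.4   # per XPath search, from the %timeit run above
SPEC_US = 7.89    # per spec.search call, from the %timeit run above


def xpath_total_us(searches, conversion_us):
    """Total cost via XPath: one record-to-XML conversion plus the searches."""
    return conversion_us + searches * XPATH_US


def spec_total_us(searches):
    """Total cost via a parsed spec: no conversion step."""
    return searches * SPEC_US


print(round(XPATH_US / SPEC_US, 2))  # 2.46
for n in (1, 10, 1000):
    # conversion_us=0.0 is the most favourable case for XPath.
    print(n, xpath_total_us(n, 0.0) > spec_total_us(n))  # True for every n
```

Because the per-search XPath time is already higher, a free conversion still would not let XPath catch up on this lookup; any real conversion cost only widens the gap.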