pent Extracts Numerical Text
Mini-language driven parser for structured numerical (or other) data in free text
Do you have structured numerical data stored as text?
Does the idea of writing regex to parse it fill you with loathing?
pent can help!
Say you have data in a text file that looks like this:
    $vibrational_frequencies
    18
        0        0.000000
        1        0.000000
        2        0.000000
        3        0.000000
        4        0.000000
        5        0.000000
        6      194.490162
        7      198.587114
        8      389.931897
        9      402.713910
       10      538.244274
       11      542.017838
       12      548.246738
       13      800.613516
       14     1203.096114
       15     1342.200360
       16     1349.543713
       17     1885.157022
What’s the most efficient way to get that list of floats extracted into a numpy array? There’s clearly structure here, but how to exploit it?
It would work to import the text into a spreadsheet, split columns appropriately, re-export just the one column to CSV, and import to Python from there, but that’s just exhausting drudgery if there are dozens of files involved.
Automating the parsing via a line-by-line string search would work fine (this is how cclib implements its data imports), but a new line-by-line method is needed for every new kind of dataset, and any time the formatting of a given dataset changes.
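As a rough illustration of the line-by-line approach (not cclib's actual code), the sketch below walks a shortened four-row sample of the block above. The function name `parse_frequencies` and the sample text are made up for this example, and the layout "header, row count, then index/value rows" is assumed from the excerpt shown.

```python
def parse_frequencies(text):
    """Collect the float column that follows the $vibrational_frequencies header."""
    lines = iter(text.splitlines())

    # Advance to the header line; return nothing if the block is absent.
    for line in lines:
        if line.strip() == "$vibrational_frequencies":
            break
    else:
        return []

    # The line after the header is assumed to hold the number of data rows.
    count = int(next(lines).split()[0])

    # Each data row is "<index> <value>"; keep only the value.
    return [float(next(lines).split()[1]) for _ in range(count)]


# Shortened, illustrative sample (the real block has 18 rows).
sample = """\
$vibrational_frequencies
4
    0        0.000000
    1      194.490162
    2      198.587114
    3     1885.157022
"""

freqs = parse_frequencies(sample)
print(freqs)
```

Every assumption about the layout is hard-coded here, which is why a change in formatting or a new kind of dataset means writing another function like this one.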
It’s not too hard to write regex that will parse it, but because of the mechanics of regex group captures you have to write two patterns: one to capture the entire block, including the header (to ensure other, similarly-formatted data isn’t also captured); and then one to iterate line-by-line over just the data block to extract the individual values. And, of course, one has to actually write (and proofread, and maintain) the regex.
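A minimal sketch of that two-pattern regex approach follows, using only the standard `re` module. The sample text is shortened and partly invented: the `$other_values` block is a hypothetical, similarly formatted section added to show why the first pattern is needed. The patterns assume unsigned integer indices and plain decimal values, as in the excerpt above.

```python
import re

# Pattern 1: match the whole block (header, count line, data rows) and
# capture the run of data rows as a single group.
block_re = re.compile(
    r"^\$vibrational_frequencies[ \t]*\n"            # header line
    r"[ \t]*\d+[ \t]*\n"                              # row-count line
    r"((?:[ \t]*\d+[ \t]+-?\d+\.\d+[ \t]*\n?)+)",     # data rows
    re.MULTILINE,
)

# Pattern 2: pull the float out of each "<index> <value>" row.
row_re = re.compile(r"^[ \t]*\d+[ \t]+(-?\d+\.\d+)[ \t]*$", re.MULTILINE)

# Illustrative sample; $other_values is a hypothetical look-alike block.
sample = """\
$other_values
2
    0        1.500000
    1        2.500000

$vibrational_frequencies
4
    0        0.000000
    1      194.490162
    2      198.587114
    3     1885.157022
"""

# Restrict the search to the right block, then iterate over its rows.
block = block_re.search(sample)
values = [float(v) for v in row_re.findall(block.group(1))]
print(values)

# Applied to the whole text, pattern 2 alone also picks up the other block.
print(len(row_re.findall(sample)))
```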
pent provides a better way.
The data above comes from the file C2F4_01.hess. With pent, the data can be pulled into numpy in just a couple of lines, without writing any regex at all:
    >>> import pathlib
    >>> import numpy as np
    >>> import pent
    >>> data = pathlib.Path("pent", "test", "C2F4_01.hess").read_text()
    >>> prs = pent.Parser(
    ...     head=("@.$vibrational_frequencies", "#.+i"),
    ...     body=("#.+i #!..f")
    ... )
    >>> arr = np.array(prs.capture_body(data), dtype=float)
    >>> print(arr)
    [[[   0.      ]
      [   0.      ]
      [   0.      ]
      [   0.      ]
      [   0.      ]
      [   0.      ]
      [ 194.490162]
      [ 198.587114]
      [ 389.931897]
      [ 402.71391 ]
      [ 538.244274]
      [ 542.017838]
      [ 548.246738]
      [ 800.613516]
      [1203.096114]
      [1342.20036 ]
      [1349.543713]
      [1885.157022]]]
The result comes out as a length-one list of 2-D matrices, since the search pattern occurs only once in the data file. The single 2-D matrix is laid out as a column vector, because the data runs down the column in the file.
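If a flat 1-D vector is more convenient than a length-one stack of column vectors, plain numpy indexing can reshape it. The sketch below does not call pent. It uses a hand-written stand-in named `captured` with three rows instead of 18, and assumes the captured values arrive as strings, which the `dtype=float` conversion above suggests.

```python
import numpy as np

# Hand-written stand-in for the captured data: one match, holding three
# rows of one captured string each (the real block has 18 rows).
captured = [[["0.000000"], ["194.490162"], ["1885.157022"]]]

arr = np.array(captured, dtype=float)
print(arr.shape)  # axes: (matches, rows, captured columns)

# Select the only match, then flatten the column vector to 1-D.
freqs = arr[0].ravel()
print(freqs.shape)
```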
pent can handle larger, more deeply nested data as well. Take this 18x18 matrix within C2F4_01.hess, for example. Here, it’s necessary to pass a Parser as the body of another Parser:
    >>> prs_hess = pent.Parser(
    ...     head=("@.$hessian", "#.+i"),
    ...     body=pent.Parser(
    ...         head="#++i",
    ...         body="#.+i #!+.f"
    ...     )
    ... )
    >>> result = prs_hess.capture_body(data)
    >>> arr = np.column_stack([np.array(_, dtype=float) for _ in result[0]])
    >>> print(arr[:3, :7])
    [[ 0.468819 -0.006771  0.020586 -0.38269   0.017874 -0.05449  -0.044552]
     [-0.006719  0.022602 -0.016183  0.010997 -0.033397  0.014422 -0.01501 ]
     [ 0.020559 -0.016184  0.066859 -0.033601  0.014417 -0.072836  0.045825]]
The list comprehension, the [0] index into result, and the composition via np.column_stack are all needed because of the way pent returns data from a nested match like this. See the documentation, in particular this example, for more information.
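To make the reassembly step concrete without the data file, here is a small numpy-only sketch. The nested list `result` is a hypothetical stand-in for a nested capture: one outer match containing two inner blocks, each a list of rows of strings, where the blocks hold columns 0-2 and columns 3-4 of the same two-row matrix. The exact structure pent returns is described in its documentation; this only mirrors the shape implied by the snippet above.

```python
import numpy as np

# Hypothetical stand-in for a nested capture: one outer match holding two
# inner blocks that cover different column ranges of one two-row matrix.
result = [
    [
        [["1.0", "2.0", "3.0"], ["6.0", "7.0", "8.0"]],  # columns 0-2
        [["4.0", "5.0"], ["9.0", "10.0"]],               # columns 3-4
    ]
]

# result[0] selects the single outer match; each inner block becomes a
# 2-D array, and column_stack joins the blocks side by side.
blocks = [np.array(block, dtype=float) for block in result[0]]
arr = np.column_stack(blocks)
print(arr.shape)
```

Because the blocks can have different widths, they cannot be converted to one array in a single `np.array` call, so each block is converted separately before stacking.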
The grammar of the pent mini-language is designed to be flexible enough that it should handle essentially all well-formed structured data, and even some data that’s not especially well formed. Some datasets will require post-processing of the data structures generated by pent before they can be pulled into numpy (see, e.g., this test, parsing this data block).
Beta releases available on PyPI: pip install pent
Full documentation is hosted at Read The Docs.
Copyright (c) Brian Skinn 2018-2019
License: The MIT License. See LICENSE.txt for full license terms.