
A text shaping package.

Reason this release was yanked: Bad structure

Project description

textform

A data transformation pipeline module based on the seminal Potter's Wheel data wrangling formalism. The name is a portmanteau of "text" and "transform".

Overview

textform (abbreviated txf) is a text-oriented data transformation module. With it, you can create sequential record-processing pipelines that convert data from (say) lines of text into records and then route the final record stream to another use (e.g., writing the records to a csv file).

Pipelines are constructed from a sequence of transforms, each of which takes in a record and modifies it in some way. For example, the Split transform replaces an input field with several new fields derived from it by splitting on a pattern.

While inspired by the Potter's Wheel transform list, textform is designed for practical everyday use. This means it includes transforms for limiting the number of rows, writing intermediate results to files, and capturing fields via regular expressions.
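
As a minimal sketch of how a pipeline is assembled (the constructor signatures and import path here are assumed from the worked example below, not a definitive reference), a two-transform pipeline using Split might look like:

import sys
from textform import Text, Split, Write    # assumed import path for the transforms

p = Text(sys.stdin, 'Line')                                    # Read each input line into a 'Line' field
p = Split(p, 'Line', ('Date', 'Status',), r'\s+', ('', '',))   # Replace 'Line' with two fields split on whitespace
p = Write(p, sys.stdout)                                       # Write the resulting records to stdout as a csv
p.pump()                                                       # Pull records through the pipeline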

Audience

How do I know if textform is right for me? The simplest use case is when you want to use Python's DictReader but the file isn't a csv. With textform you can write a pipeline that produces the same records you would get from DictReader.
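
For instance, a file of colon-delimited lines such as alice:42 could be turned into DictReader-style records with a pipeline along these lines (a sketch that reuses the constructors from the example below; the import path, field names and defaults are illustrative assumptions):

import sys
from textform import Text, Split, Cast, Write    # assumed import path

p = Text(sys.stdin, 'Line')                                # One record per input line
p = Split(p, 'Line', ('Name', 'Age',), r':', ('', '0',))   # Split 'Line' into named fields
p = Cast(p, 'Age', int)                                    # Convert the age to an integer
p = Write(p, sys.stdout)                                   # Emit the records as a csv
p.pump()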

More complex use cases can be built on top of this kind of record stream. Reshaping, computing values, splitting, dividing, merging, filling in blanks, and other data cleaning and preparation tasks can all be implemented in a reusable fashion with textform. A pipeline is effectively an executable, reusable description of a text file's format.

Example

I created textform because I had worked on a similar research system in the past, and I had two text files produced by the DuckDB performance test suite that I needed to convert into csvs:

------------------
|| Q01_PARALLEL ||
------------------
Cold Run...Done!
Run 1/5...0.12345
Run 1/5...0.12345
Run 1/5...0.12345
Run 1/5...0.12345
Run 1/5...0.12345
------------------
|| Q02_PARALLEL ||
------------------
...

This file is essentially a sequence of records grouped under higher-level attributes. Instead of writing a one-off Python script, I decided to write some simple transforms and build a pipeline, which looked like this:

import sys

# The transforms below are assumed to be importable directly from the textform module
from textform import Add, Capture, Cast, Divide, Fill, Match, Split, Text, Write

p = Text(sys.stdin, 'Line')                         # Read a line
p = Add(p, 'Branch', sys.argv[1])                   # Tag the file with the branch name
p = Match(p, 'Line', r'------', invert=True)        # Remove horizontal lines
p = Divide(p, 'Line', 'Query', 'Run', r'Q')         # Separate the query names from the run data
p = Fill(p, 'Query', '00')                          # Fill down the blank query names
p = Capture(p, 'Query', ('Query',), r'\|\|\s+Q(\w+)\s+\|\|')  # Capture the query number
# Split the execution mode from the query name
p = Split(p, 'Query', ('Query', 'Mode',), r'_', ('00', 'SERIAL',))
p = Cast(p, 'Query', int)                           # Cast the query number to an integer
p = Match(p, 'Run', r'\d')                          # Filter to the runs with data
# Capture the run components
p = Capture(p, 'Run', ('Run #', 'Run Count', 'Time',), r'(\d+)/(\d+)...(\d+\.\d+)')
p = Cast(p, 'Run #', int)                           # Cast the run components
p = Cast(p, 'Run Count', int)
p = Cast(p, 'Time', float)
p = Write(p, sys.stdout)                            # Write the records to stdout as a csv
p.pump()

We can now invoke the pipeline script as:

$ python3 pipeline.py master < performance.txt > performance.csv

Contributing

You know the drill: Fork, branch, test, and submit a PR. This is a completely open source, free-as-in-beer project.

Project details


Download files


Source Distribution

textform-0.8.0.tar.gz (3.7 kB)

Uploaded Source

Built Distribution

textform-0.8.0-py3-none-any.whl (3.7 kB)

Uploaded Python 3

File details

Details for the file textform-0.8.0.tar.gz.

File metadata

  • Download URL: textform-0.8.0.tar.gz
  • Upload date:
  • Size: 3.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/3.10.1 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.60.0 CPython/3.7.9

File hashes

Hashes for textform-0.8.0.tar.gz

  • SHA256: f7f567400bdf56a34586ee8b87daa4a4b3dbf7aeaa391b3e3bc4da9e569028e3
  • MD5: 81302c398a6b799993a2333955995c4f
  • BLAKE2b-256: 1520c8601883ff53d099e826b4904b10657e8064ba87dbd3e1cd7708c66d9c69


File details

Details for the file textform-0.8.0-py3-none-any.whl.

File metadata

  • Download URL: textform-0.8.0-py3-none-any.whl
  • Upload date:
  • Size: 3.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/3.10.1 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.60.0 CPython/3.7.9

File hashes

Hashes for textform-0.8.0-py3-none-any.whl

  • SHA256: 6b10f47c124662a39eca22b7cec226a13a0d3c82e53255d7c2462b84e2106229
  • MD5: 5b43912b891767fc1fe6e28b73785c78
  • BLAKE2b-256: a5dba120ceca552bad58e58df6413d503f5863fa3375eaea8a7ad207cf8e3b7b

