Parse semi-structured text into Python dictionaries without well-defined schemas

🗂️ Tabularize



Purpose

Tabularize parses semi-structured, table-like text into Python dictionaries given minimal knowledge of the expected data format.

While packages such as csv, pandas, and TextFSM exist, they require the input data to be in a more structured form. For example, they may require clearly distinguishable delimiters, fixed column widths, or enough knowledge of the data types to deduce where a column starts and ends. Tabularize is designed for cases where the input does not follow these constraints and some guesswork is unavoidable.

This package's design is influenced by the Name/Finger protocol, whose non-standardized, human-readable status reports tend to be hard for machines to parse.

Tabularize is probably not the solution for you: modern protocols are often machine-readable or offer a way to produce machine-readable output. It shines when you need to parse semi-structured, tabular data whose schema is unknown (a situation you should avoid) or when you need tabular data parsed quickly.

Usage

Tabularize is offered as both an API for developers and a command-line tool. To install it:

python3 -m pip install tabularize

Command-Line Usage

The tabularize command is available upon installation. It takes a list of files as parameters. For each file, it uses the first non-blank line to determine the headers, then prints a JSON object for each subsequent parsed entry. For example:

tabularize path-to-file path-to-another-file

Automatic header detection can fail when the header line is ambiguous, because Tabularize derives column names from the single header line alone, not from the content below it. For example, given the following data:

    Line      User       Host(s)              Idle Location
   1 vty 0               idle                 00:00:05 192.168.1.1
*  2 vty 1               idle                 00:00:00 192.168.1.2

By default, Tabularize will misinterpret the headers and assume that a single Idle Location header exists rather than two separate Idle and Location headers. Since Tabularize works sequentially, specifying an Idle header resolves the error without also having to specify a Location header:

tabularize -H Idle path-to-finger-output
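To illustrate why a header line on its own can be ambiguous, here is a minimal sketch. It is not Tabularize's actual algorithm: `naive_header_spans` and `apply_hint` are hypothetical helpers that assume columns are separated by two or more spaces, and they only mimic the effect of supplying a known header name.

```python
import re


def naive_header_spans(header_line: str) -> list[tuple[str, int]]:
    """Return (name, offset) pairs, treating single spaces as part of a name.

    Hypothetical illustration only; Tabularize may use different rules.
    """
    return [(m.group(), m.start()) for m in re.finditer(r"\S+(?: \S+)*", header_line)]


def apply_hint(names: list[str], hint: str) -> list[str]:
    """Split any name that begins with a known header followed by a space."""
    result = []
    for name in names:
        if name.startswith(hint + " "):
            # The hint ends the first column; the remainder becomes the next one.
            result.extend([hint, name[len(hint):].strip()])
        else:
            result.append(name)
    return result


header = "    Line      User       Host(s)              Idle Location"
names = [name for name, _ in naive_header_spans(header)]
print(names)                      # "Idle Location" is kept as a single name
print(apply_hint(names, "Idle"))  # the hint separates "Idle" from "Location"
```

Only one space separates Idle from Location, so nothing in the header line distinguishes them from a two-word header such as Login Time. That is why a hint is needed.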

The tabularize command also supports piping. When piping is desired, use the file name -:

cat file-to-parse | tabularize -

Tabularize operates at the byte level; however, it prints out data as JSON, which does not support bytes. As a result, it decodes the data before printing it to the terminal. You can customize the encoding and error resolution strategy using the --encoding and --errors options:

tabularize --encoding utf-8 --errors backslashreplace path-to-file
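The sketch below shows what an error resolution strategy changes when bytes are decoded for JSON output. It uses only the Python standard library. `decode_row` is a hypothetical helper, and it assumes a parsed row maps bytes keys to bytes values, which the text above does not confirm.

```python
import json


def decode_row(row: dict[bytes, bytes], encoding: str = "utf-8",
               errors: str = "strict") -> dict[str, str]:
    """Decode every key and value so the row can be serialized as JSON."""
    return {
        key.decode(encoding, errors=errors): value.decode(encoding, errors=errors)
        for key, value in row.items()
    }


# b"\xe9" is "é" in Latin-1, but it is not valid UTF-8 on its own.
row = {b"Name": b"caf\xe9"}

for strategy in ("replace", "backslashreplace", "ignore"):
    print(strategy, json.dumps(decode_row(row, "utf-8", strategy)))
```

With the default strict handler, the same input raises UnicodeDecodeError, so a more lenient strategy such as backslashreplace is useful when the source encoding is uncertain.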

API Usage

Programs integrating Tabularize must determine for themselves which line holds the headers and which lines form the body. The parsed headers are then reused to parse each body line. For example:

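import tabularize


# Raw table as bytes: the first line holds the headers, the rest are body lines.
data = b"""Name    Ice Cream Preference
James   Mint Chocolate Chip
""".splitlines()

# Parse the header line once; the result is reused for every body line.
headers = tabularize.parse_headers(data[0])

for line in data[1:]:
    print(tabularize.parse_body(headers, line))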
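One way to pick the header line is to copy the command-line behavior and use the first non-blank line. The sketch below assumes that approach. `parse_table` is a hypothetical helper, not part of Tabularize. It takes the two parsing functions as arguments so that it runs without the package installed, and the whitespace-splitting stand-ins exist only for the demonstration. Skipping blank body lines is also an assumption.

```python
from typing import Any, Callable, Iterable, Iterator


def parse_table(
    lines: Iterable[bytes],
    parse_headers: Callable[[bytes], Any],
    parse_body: Callable[[Any, bytes], Any],
) -> Iterator[Any]:
    """Treat the first non-blank line as the header and parse the rest as rows."""
    headers = None
    for line in lines:
        if not line.strip():
            continue  # skip blank lines before the header and between rows
        if headers is None:
            headers = parse_headers(line)
        else:
            yield parse_body(headers, line)


# Stand-in parsers for demonstration only; they split on any whitespace.
def demo_headers(line: bytes) -> list[bytes]:
    return line.split()


def demo_body(headers: list[bytes], line: bytes) -> dict[bytes, bytes]:
    return dict(zip(headers, line.split()))


sample = [b"", b"Login Tty", b"alfred pts/0", b"   ", b"bert pts/1"]
rows = list(parse_table(sample, demo_headers, demo_body))
print(rows)
```

With Tabularize installed, the same helper could presumably be called as parse_table(data, tabularize.parse_headers, tabularize.parse_body), using the two functions shown in the example above.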

Samples

Tabularize is particularly useful for parsing Name/Finger protocol output when the fingerd server implementation is unknown, since the protocol does not standardize the output format. If the server implementation is known, consider a regular expression-based solution such as TextFSM instead, as the data types can help indicate where each column starts and ends.

🐧 Debian fingerd

Login     Name       Tty      Idle  Login Time   Office     Office Phone
alfred              *pts/0      1d  Oct 06 19:56 (192.168.1.1)
bert                 pts/1      2d  Oct 06 12:34 (:pts/0:S.0)
chase                pts/2      3d  Oct 06 05:43 (:pts/0:S.1)

[
  {"Login": "alfred", "Tty": "*pts/0", "Idle": "1d", "Login Time": "Oct 06 19:56", "Office": "(192.168.1.1)"},
  {"Login": "bert", "Tty": "pts/1", "Idle": "2d", "Login Time": "Oct 06 12:34", "Office": "(:pts/0:S.0)"},
  {"Login": "chase", "Tty": "pts/2", "Idle": "3d", "Login Time": "Oct 06 05:43", "Office": "(:pts/0:S.1)"}
]

📡 Cisco fingerd

    Line       User       Host(s)              Idle       Location
   1 vty 0                idle                 00:00:00 

[
  {"Line": "1 vty 0", "Host(s)": "idle", "Idle": "00:00:00"}
]
