High-performance parser and generator for PostgreSQL-compatible tab-separated values (TSV)

Parse and generate tab-separated values (TSV) data

Tab-separated values (TSV) is a simple and popular format for storing and transferring data, and for exporting data from and importing data into relational databases. For example, PostgreSQL COPY moves data between PostgreSQL tables and standard file-system files or in-memory stores, and its text format (a text file with one line per table row) is a generic version of TSV. Meanwhile, packages like asyncpg help insert, update or query data in bulk efficiently, using binary data transfer between Python and PostgreSQL.

This package offers a high-performance alternative for converting data between a TSV text file and Python objects. The parser reads a TSV record into a Python tuple of built-in Python types, one element for each field. The generator produces a TSV record from a tuple.

Installation

Even though tsv2py contains native code, the package is already pre-built for several target architectures. In most cases, you can install directly from a binary wheel, selected automatically by pip:

python3 -m pip install tsv2py

If a binary wheel is not available for the target platform, pip will attempt to install tsv2py from the source distribution. This builds the package on the fly as part of the installation process, which requires a C compiler such as gcc or clang. The following commands install a C compiler and the Python development headers on Amazon Linux:

sudo yum groupinstall -y "Development Tools"
sudo yum install -y python3-devel python3-pip

If you lack a C compiler or the Python development headers, you will get error messages similar to the following:

error: command 'gcc' failed: No such file or directory
lib/tsv_parser.c:2:10: fatal error: Python.h: No such file or directory

Quick start

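from datetime import date, datetime
from uuid import UUID

from tsv.helper import Parser

# path to the input file (placeholder; point this at your own data)
tsv_path = "data.tsv"

# specify the column structure
parser = Parser(fields=(bytes, date, datetime, float, int, str, UUID, bool))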

# read and parse an entire file
with open(tsv_path, "rb") as f:
    py_records = parser.parse_file(f)

# read and parse a file line by line
with open(tsv_path, "rb") as f:
    for line in f:
        py_record = parser.parse_line(line)

TSV format

Text format is a simple tabular format in which each record (table row) occupies a single line.

  • Output always begins with a header row, which lists data field names.
  • Fields (table columns) are delimited by tab characters.
  • Non-printable characters and special values are escaped with backslash (\), as shown below:
Escape  Interpretation
\N      NULL value
\0      NUL character (ASCII 0)
\b      Backspace (ASCII 8)
\f      Form feed (ASCII 12)
\n      Newline (ASCII 10)
\r      Carriage return (ASCII 13)
\t      Tab (ASCII 9)
\v      Vertical tab (ASCII 11)
\\      Backslash (single character)

This format allows data to be easily imported into a database engine, e.g. with PostgreSQL COPY.

Output in this format is transmitted as media type text/plain or text/tab-separated-values in UTF-8 encoding.
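The escape rules in the table above can be sketched in pure Python. This is an illustrative reference only, not the package's native implementation; the names unescape_field and split_record are hypothetical and are not part of tsv2py.

```python
from typing import List, Optional

# Maps the character that follows a backslash to the byte it stands for.
_ESCAPES = {
    ord("0"): b"\x00",
    ord("b"): b"\x08",
    ord("f"): b"\x0c",
    ord("n"): b"\n",
    ord("r"): b"\r",
    ord("t"): b"\t",
    ord("v"): b"\x0b",
    ord("\\"): b"\\",
}


def unescape_field(field: bytes) -> Optional[bytes]:
    """Reverse TSV escape sequences in one field; a lone \\N means NULL."""
    if field == b"\\N":
        return None
    out = bytearray()
    i = 0
    while i < len(field):
        if field[i] == 0x5C:  # backslash starts an escape sequence
            if i + 1 >= len(field):
                raise ValueError("dangling backslash at end of field")
            code = field[i + 1]
            if code not in _ESCAPES:
                raise ValueError(f"unknown escape sequence: \\{chr(code)}")
            out += _ESCAPES[code]
            i += 2
        else:
            out.append(field[i])
            i += 1
    return bytes(out)


def split_record(line: bytes) -> List[Optional[bytes]]:
    """Split one line on tab characters, then unescape each field."""
    return [unescape_field(f) for f in line.rstrip(b"\n").split(b"\t")]
```

Splitting on raw tab bytes before unescaping is safe because a tab inside a field is always written as the two-character sequence \t.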

Parser

The parser understands the following Python types:

  • None. This special value is returned for the TSV escape sequence \N.
  • bool. A literal true or false is converted into a boolean value.
  • bytes. TSV escape sequences are reversed before the data is passed to Python as a bytes object. NUL bytes are permitted.
  • datetime. The input has to comply with RFC 3339 and ISO 8601. The timezone must be UTC, indicated by the suffix Z.
  • date. The input has to conform to the format YYYY-MM-DD.
  • time. The input has to conform to the format hh:mm:ssZ with no fractional seconds, or hh:mm:ss.ffffffZ with fractional seconds. Fractional seconds allow up to 6 digits of precision.
  • float. Interpreted as double precision floating point numbers.
  • int. Arbitrary-length integers are allowed.
  • str. TSV escape sequences are reversed before the data is passed to Python as a str. NUL bytes are not allowed.
  • uuid.UUID. The input has to comply with RFC 4122, or be a string of 32 hexadecimal digits.
  • decimal.Decimal. Interpreted as arbitrary precision decimal numbers.
  • ipaddress.IPv4Address.
  • ipaddress.IPv6Address.
  • list and dict, which are understood as JSON, and invoke the equivalent of json.loads to parse a serialized JSON string.
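For several of the types listed above, the conversion has a direct standard-library counterpart. The following is a minimal sketch of that mapping, assuming the fields have already been split and unescaped; CONVERTERS, parse_bool and convert_record are hypothetical names used for illustration and do not describe the package's internals.

```python
from decimal import Decimal
from typing import Any, Callable, Dict, Optional, Sequence
from uuid import UUID


def parse_bool(text: str) -> bool:
    """Accept only the literals true and false."""
    if text == "true":
        return True
    if text == "false":
        return False
    raise ValueError(f"expected true or false, got {text!r}")


# One converter per field type (a subset of the supported types).
CONVERTERS: Dict[type, Callable[[str], Any]] = {
    bool: parse_bool,
    int: int,          # arbitrary-length integers
    float: float,      # double precision
    str: str,
    Decimal: Decimal,  # arbitrary precision
    UUID: UUID,        # accepts RFC 4122 text or 32 hexadecimal digits
}


def convert_record(fields: Sequence[Optional[str]], types: Sequence[type]) -> tuple:
    """Convert unescaped fields to Python objects; None stands for \\N."""
    if len(fields) != len(types):
        raise ValueError("field count does not match column count")
    return tuple(
        None if field is None else CONVERTERS[tp](field)
        for field, tp in zip(fields, types)
    )
```

Note that the built-in int and float constructors are more lenient than a strict parser (for example, they tolerate surrounding whitespace), so this sketch is not a validator.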

The backslash character \ initiates escape sequences in both TSV and JSON. When JSON data is written to TSV, backslashes therefore have to be escaped twice. For example, \\n inside a quoted JSON string in a TSV field translates to a single newline character: the TSV parser first reads \\ as an escape sequence that produces a single \ character, which is followed by the character n, and the JSON parser then reads the resulting \n as a newline embedded in the JSON string. Likewise, four consecutive backslash characters are needed in TSV to represent a single backslash in a JSON quoted string.
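The two layers of escaping can be reproduced with the standard library. In this sketch, tsv_escape and tsv_unescape are hypothetical helpers that handle only the backslash, tab, newline and carriage return sequences; they are not functions provided by tsv2py.

```python
import json
import re

_UNESCAPE = {"\\": "\\", "t": "\t", "n": "\n", "r": "\r"}


def tsv_escape(text: str) -> str:
    """Escape a string for use as a TSV field (backslash first)."""
    return (
        text.replace("\\", "\\\\")
        .replace("\t", "\\t")
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def tsv_unescape(field: str) -> str:
    """Reverse tsv_escape by scanning escape sequences left to right."""
    return re.sub(r"\\(.)", lambda m: _UNESCAPE[m.group(1)], field)


value = {"text": "line1\nline2", "path": "C:\\temp"}

# Layer 1 (JSON): the newline becomes \n and the backslash becomes \\.
json_text = json.dumps(value)

# Layer 2 (TSV): every backslash produced by JSON is doubled again,
# giving \\n for the newline and four backslashes for the path separator.
field = tsv_escape(json_text)

# Reading reverses the layers in the opposite order: TSV first, then JSON.
assert json.loads(tsv_unescape(field)) == value
```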

Internally, the implementation uses AVX2 instructions to

  • parse RFC 3339 date-time strings into Python datetime objects,
  • parse RFC 4122 UUID strings or 32-digit hexadecimal strings into Python UUID objects,
  • and find \t delimiters between fields in a line.

For parsing integers up to the range of the long type, the parser calls the C standard library function strtol.

For parsing IPv4 and IPv6 addresses, the parser calls the C function inet_pton in libc or Windows Sockets (WinSock2).
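Python exposes the same C function as socket.inet_pton, so the address handling can be approximated from Python as shown below. This is a sketch of equivalent behavior under that assumption, not the package's code; parse_ipv4 and parse_ipv6 are hypothetical names.

```python
import ipaddress
import socket


def parse_ipv4(text: str) -> ipaddress.IPv4Address:
    """Validate dotted-quad text with inet_pton; OSError signals bad input."""
    packed = socket.inet_pton(socket.AF_INET, text)  # 4 bytes
    return ipaddress.IPv4Address(packed)


def parse_ipv6(text: str) -> ipaddress.IPv6Address:
    """Validate IPv6 text with inet_pton; OSError signals bad input."""
    packed = socket.inet_pton(socket.AF_INET6, text)  # 16 bytes
    return ipaddress.IPv6Address(packed)
```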

If installed, the parser employs orjson to improve parsing speed of nested JSON structures. If not available, the library falls back to the built-in JSON decoder.
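An optional dependency of this kind is commonly handled with an import fallback. The sketch below shows the general pattern and assumes nothing about how tsv2py wires it up internally; loads_json is a hypothetical name.

```python
from typing import Any

try:
    import orjson  # optional third-party dependency

    def loads_json(data: bytes) -> Any:
        """Decode JSON with orjson when it is installed."""
        return orjson.loads(data)

except ImportError:
    import json

    def loads_json(data: bytes) -> Any:
        """Fall back to the built-in JSON decoder."""
        return json.loads(data)
```

Either branch accepts UTF-8 encoded bytes, so callers do not need to know which decoder is in use.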

Date-time format

YYYY-MM-DDThh:mm:ssZ
YYYY-MM-DDThh:mm:ss.fZ
YYYY-MM-DDThh:mm:ss.ffZ
YYYY-MM-DDThh:mm:ss.fffZ
YYYY-MM-DDThh:mm:ss.ffffZ
YYYY-MM-DDThh:mm:ss.fffffZ
YYYY-MM-DDThh:mm:ss.ffffffZ

Date format

YYYY-MM-DD

Time format

hh:mm:ssZ
hh:mm:ss.fZ
hh:mm:ss.ffZ
hh:mm:ss.fffZ
hh:mm:ss.ffffZ
hh:mm:ss.fffffZ
hh:mm:ss.ffffffZ
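The three formats above can be parsed with the standard library alone, which is roughly what a functionally equivalent pure-Python implementation would do. This is a minimal sketch and not the package's AVX2 code path; the function names are hypothetical, and attaching timezone.utc to the results is an assumption made here rather than documented behavior of tsv2py.

```python
from datetime import date, datetime, time, timezone


def parse_datetime(text: str) -> datetime:
    """Parse YYYY-MM-DDThh:mm:ss[.f{1,6}]Z as a UTC timestamp."""
    fmt = "%Y-%m-%dT%H:%M:%S.%fZ" if "." in text else "%Y-%m-%dT%H:%M:%SZ"
    return datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)


def parse_date(text: str) -> date:
    """Parse YYYY-MM-DD."""
    return date.fromisoformat(text)


def parse_time(text: str) -> time:
    """Parse hh:mm:ss[.f{1,6}]Z as a UTC time of day."""
    fmt = "%H:%M:%S.%fZ" if "." in text else "%H:%M:%SZ"
    return datetime.strptime(text, fmt).time().replace(tzinfo=timezone.utc)
```

The %f directive accepts one to six digits and pads on the right, so a fractional part of .5 is read as 500000 microseconds, matching the variable-length fractions listed above.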

Performance

Depending on the field types, tsv2py parses TSV records up to 7 times faster than a functionally equivalent Python implementation based on the Python standard library. Savings in execution time are more substantial for dates, UUIDs and longer strings with special characters (up to 90%), and more moderate for simple types like small integers (approx. 60%).
