High-performance parser and generator for PostgreSQL-compatible tab-separated values (TSV)

Parse and generate tab-separated values (TSV) data

Tab-separated values (TSV) is a simple and popular format for storing and transferring data, and for exporting data from and importing data into relational databases. For example, PostgreSQL COPY moves data between PostgreSQL tables and standard file-system files or in-memory stores, and its text format (a text file with one line per table row) is a generic version of TSV. Meanwhile, packages like asyncpg help efficiently insert, update or query data in bulk with binary data transfer between Python and PostgreSQL.

This package offers a high-performance alternative for converting data between a TSV text file and Python objects. The parser reads a TSV record into a Python tuple of built-in Python types, one item for each field. The generator produces a TSV record from a tuple.

Installation

Even though tsv2py contains native code, the package is pre-built for several target architectures. In most cases, you can install directly from a binary wheel, which pip selects automatically:

python3 -m pip install tsv2py

If a binary wheel is not available for the target platform, pip will attempt to install tsv2py from the source distribution. This builds the package on the fly as part of the installation process, which requires a C compiler such as gcc or clang. The following commands install a C compiler and the Python development headers on Amazon Linux:

sudo yum groupinstall -y "Development Tools"
sudo yum install -y python3-devel python3-pip

If you lack a C compiler or the Python development headers, you will get error messages similar to the following:

error: command 'gcc' failed: No such file or directory
lib/tsv_parser.c:2:10: fatal error: Python.h: No such file or directory

Quick start

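from datetime import date, datetime
from uuid import UUID

from tsv.helper import Parser

# path to the input file (placeholder; point this at your own TSV file)
tsv_path = "data.tsv"

# specify the column structure, one type per field
parser = Parser(fields=(bytes, date, datetime, float, int, str, UUID, bool))

# read and parse an entire file (opened in binary mode)
with open(tsv_path, "rb") as f:
    py_records = parser.parse_file(f)

# read and parse a file line by line
with open(tsv_path, "rb") as f:
    for line in f:
        py_record = parser.parse_line(line)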

TSV format

Text format is a simple tabular format in which each record (table row) occupies a single line.

  • Output always begins with a header row, which lists data field names.
  • Fields (table columns) are delimited by tab characters.
  • Non-printable characters and special values are escaped with backslash (\), as shown below:

Escape    Interpretation
\N        NULL value
\0        NUL character (ASCII 0)
\b        Backspace (ASCII 8)
\f        Form feed (ASCII 12)
\n        Newline (ASCII 10)
\r        Carriage return (ASCII 13)
\t        Tab (ASCII 9)
\v        Vertical tab (ASCII 11)
\\        Backslash (single character)
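
The escape table can be illustrated with a minimal pure-Python sketch that reverses and applies the escape sequences for a single field. This is not the package's native implementation. The function names unescape_field and escape_field are hypothetical, and rejecting unknown escape sequences with ValueError is an assumption of this sketch.

```python
from typing import Optional

# Escape sequences from the table, keyed by the byte that follows the backslash.
_UNESCAPE = {
    ord("0"): b"\0",
    ord("b"): b"\b",
    ord("f"): b"\f",
    ord("n"): b"\n",
    ord("r"): b"\r",
    ord("t"): b"\t",
    ord("v"): b"\v",
    ord("\\"): b"\\",
}
# Inverse mapping: raw byte value -> two-byte escape sequence.
_ESCAPE = {value[0]: b"\\" + bytes([key]) for key, value in _UNESCAPE.items()}


def unescape_field(field: bytes) -> Optional[bytes]:
    """Reverse TSV escape sequences in one field (hypothetical helper)."""
    if field == b"\\N":  # the whole field is the NULL marker
        return None
    out = bytearray()
    i = 0
    while i < len(field):
        byte = field[i]
        if byte == 0x5C:  # a backslash starts an escape sequence
            if i + 1 >= len(field) or field[i + 1] not in _UNESCAPE:
                raise ValueError(f"invalid escape sequence at offset {i}")
            out += _UNESCAPE[field[i + 1]]
            i += 2
        else:
            out.append(byte)
            i += 1
    return bytes(out)


def escape_field(value: Optional[bytes]) -> bytes:
    """Apply TSV escape sequences to one field (hypothetical helper)."""
    if value is None:
        return b"\\N"
    return b"".join(_ESCAPE.get(byte, bytes([byte])) for byte in value)
```

Because every special byte maps to exactly one escape sequence, escaping followed by unescaping returns the original bytes, including NUL bytes.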

This format allows data to be easily imported into a database engine, e.g. with PostgreSQL COPY.

Output in this format is transmitted as media type text/plain or text/tab-separated-values in UTF-8 encoding.

Parser

The parser understands the following Python types:

  • None. This special value is returned for the TSV escape sequence \N.
  • bool. A literal true or false is converted into a boolean value.
  • bytes. TSV escape sequences are reversed before the data is passed to Python as a bytes object. NUL bytes are permitted.
  • datetime. The input has to comply with RFC 3339 and ISO 8601. The timezone must be UTC, indicated by the suffix Z.
  • date. The input has to conform to the format YYYY-MM-DD.
  • time. The input has to conform to the format hh:mm:ssZ with no fractional seconds, or hh:mm:ss.ffffffZ with fractional seconds. Fractional seconds allow up to 6 digits of precision.
  • float. Interpreted as double precision floating point numbers.
  • int. Arbitrary-length integers are allowed.
  • str. TSV escape sequences are reversed before the data is passed to Python as a str. NUL bytes are not allowed.
  • uuid.UUID. The input has to comply with RFC 4122, or be a string of 32 hexadecimal digits.
  • decimal.Decimal. Interpreted as arbitrary precision decimal numbers.
  • ipaddress.IPv4Address.
  • ipaddress.IPv6Address.
  • list and dict, which are understood as JSON, and invoke the equivalent of json.loads to parse a serialized JSON string.
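
To make the type-driven conversion concrete, here is a rough standard-library sketch of how a record could be turned into a tuple, one converter per field type. It is an illustration only, not the package's native parser. CONVERTERS and parse_record are hypothetical names. The sketch covers only types that need no escape handling: str, bytes, list and dict would also require reversing TSV escape sequences first, and datetime and time are left out.

```python
from datetime import date
from decimal import Decimal
from ipaddress import IPv4Address, IPv6Address
from uuid import UUID


def _to_bool(text: str) -> bool:
    # Only the literals "true" and "false" are accepted.
    if text == "true":
        return True
    if text == "false":
        return False
    raise ValueError(f"expected true or false, got {text!r}")


# Maps a field type to a callable that converts the field text (hypothetical).
CONVERTERS = {
    bool: _to_bool,
    int: int,
    float: float,
    Decimal: Decimal,
    date: date.fromisoformat,
    UUID: UUID,
    IPv4Address: IPv4Address,
    IPv6Address: IPv6Address,
}


def parse_record(line: bytes, fields: tuple) -> tuple:
    """Split a line on tabs and convert each field by its declared type."""
    columns = line.rstrip(b"\r\n").split(b"\t")
    if len(columns) != len(fields):
        raise ValueError(f"expected {len(fields)} fields, got {len(columns)}")
    return tuple(
        None if raw == b"\\N" else CONVERTERS[field_type](raw.decode("utf-8"))
        for raw, field_type in zip(columns, fields)
    )
```

For example, parse_record(b"true\t42\t\\N\n", (bool, int, date)) yields a tuple of True, 42 and None.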

The backslash character \ initiates escape sequences in both TSV and JSON. When JSON data is written to TSV, backslashes therefore have to be doubled, e.g. \\n inside a quoted JSON string in a TSV field translates to a single newline character. First, the TSV parser interprets \\ in \\n as an escape sequence and produces a single \ character followed by an n character; in turn, the JSON parser interprets \n as a single newline embedded in a JSON string. By the same logic, you need four consecutive backslash characters in TSV to represent a single backslash in a JSON quoted string.
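
The two decoding layers can be traced with a short standard-library sketch. The helper tsv_unescape is a hypothetical stand-in for the TSV layer, not part of the package; json.loads plays the role of the JSON layer. Raw string literals are used so that the backslashes appear exactly as they would in the TSV file.

```python
import json
import re

# Character after the backslash -> decoded character (TSV layer only).
_TSV_ESCAPES = {
    "0": "\0", "b": "\b", "f": "\f", "n": "\n",
    "r": "\r", "t": "\t", "v": "\v", "\\": "\\",
}


def tsv_unescape(text: str) -> str:
    """Reverse TSV escape sequences in a field (hypothetical helper)."""
    return re.sub(r"\\(.)", lambda m: _TSV_ESCAPES[m.group(1)], text)


# Field as stored in the TSV file: two backslashes followed by n.
field = r'{"text": "line1\\nline2"}'
after_tsv = tsv_unescape(field)   # now holds one backslash followed by n
value = json.loads(after_tsv)     # JSON turns that pair into a real newline

# Four backslashes in TSV become two after the TSV layer and one after JSON.
path = json.loads(tsv_unescape(r'{"path": "C:\\\\temp"}'))
```

After these steps, value["text"] contains an actual newline character and path["path"] contains a single backslash.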

Internally, the implementation uses AVX2 instructions to

  • parse RFC 3339 date-time strings into Python datetime objects,
  • parse RFC 4122 UUID strings or 32-digit hexadecimal strings into Python UUID objects,
  • and find \t delimiters between fields in a line.

For parsing integers up to the range of the long type, the parser calls the C standard library function strtol.

For parsing IPv4 and IPv6 addresses, the parser calls the C function inet_pton in libc or Windows Sockets (WinSock2).
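
Python exposes the same C function as socket.inet_pton, so the approach can be sketched without native code. The function names parse_ipv4 and parse_ipv6 are hypothetical, and this is only an approximation of what the native parser does internally.

```python
import ipaddress
import socket


def parse_ipv4(text: str) -> ipaddress.IPv4Address:
    # inet_pton returns the address as 4 bytes in network byte order.
    packed = socket.inet_pton(socket.AF_INET, text)
    return ipaddress.IPv4Address(packed)


def parse_ipv6(text: str) -> ipaddress.IPv6Address:
    # For AF_INET6 the packed form is 16 bytes.
    packed = socket.inet_pton(socket.AF_INET6, text)
    return ipaddress.IPv6Address(packed)
```

In this sketch, socket.inet_pton raises OSError for a malformed address.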

If installed, the parser employs orjson to improve parsing speed of nested JSON structures. If not available, the library falls back to the built-in JSON decoder.
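
An optional dependency of this kind is commonly handled with a guarded import. The sketch below shows the general pattern under that assumption; it is not taken from the package's source, and loads_json is a hypothetical name.

```python
import json

try:
    import orjson  # optional third-party dependency
except ImportError:
    orjson = None


def loads_json(data: bytes):
    """Parse JSON with orjson when available, else with the built-in decoder."""
    if orjson is not None:
        return orjson.loads(data)
    return json.loads(data)
```

Both decoders accept UTF-8 encoded bytes, so callers do not need to know which one is in use.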

Date-time format

YYYY-MM-DDThh:mm:ssZ
YYYY-MM-DDThh:mm:ss.fZ
YYYY-MM-DDThh:mm:ss.ffZ
YYYY-MM-DDThh:mm:ss.fffZ
YYYY-MM-DDThh:mm:ss.ffffZ
YYYY-MM-DDThh:mm:ss.fffffZ
YYYY-MM-DDThh:mm:ss.ffffffZ

Date format

YYYY-MM-DD

Time format

hh:mm:ssZ
hh:mm:ss.fZ
hh:mm:ss.ffZ
hh:mm:ss.fffZ
hh:mm:ss.ffffZ
hh:mm:ss.fffffZ
hh:mm:ss.ffffffZ
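
For comparison, the date-time, date and time formats can be parsed with the standard library alone. This is a sketch, not the package's AVX2-based implementation. The function names are hypothetical, and attaching timezone.utc to the results is an assumption of this sketch that may differ from what the package returns. Note that strptime is somewhat more lenient than the fixed-width patterns above, e.g. it accepts a single-digit month.

```python
from datetime import date, datetime, time, timezone


def parse_datetime(text: str) -> datetime:
    # Pick the pattern with or without fractional seconds;
    # %f accepts one to six digits and pads with zeros on the right.
    fmt = "%Y-%m-%dT%H:%M:%S.%fZ" if "." in text else "%Y-%m-%dT%H:%M:%SZ"
    return datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)


def parse_date(text: str) -> date:
    return datetime.strptime(text, "%Y-%m-%d").date()


def parse_time(text: str) -> time:
    fmt = "%H:%M:%S.%fZ" if "." in text else "%H:%M:%SZ"
    return datetime.strptime(text, fmt).time().replace(tzinfo=timezone.utc)
```

A value without the Z suffix fails to match either pattern and raises ValueError.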

Performance

Depending on the field types, tsv2py parses TSV records up to 7 times faster than a functionally equivalent Python implementation based on the Python standard library. Savings in execution time are more substantial for dates, UUIDs and longer strings with special characters (up to 90%), and more moderate for simple types like small integers (approx. 60%).
