Skip to main content

High-performance parser and generator for PostgreSQL-compatible tab-separated values (TSV)

Project description

Parse and generate tab-separated values (TSV) data

Tab-separated values (TSV) is a simple and popular format for data storage, data transfer, exporting data from and importing data to relational databases. For example, PostgreSQL COPY moves data between PostgreSQL tables and standard file-system files or in-memory stores, and its text format (a text file with one line per table row) is a generic version of TSV. Meanwhile, packages like asyncpg help efficiently insert, update or query data in bulk with binary data transfer between Python and PostgreSQL.

This package offers a high-performance alternative to convert data between a TSV text file and Python objects. The parser can read a TSV record into a Python tuple consisting of built-in Python types, one for each field. The generator can produce a TSV record from a tuple.

Quick start

from tsv.helper import Parser

# specify the column structure
parser = Parser(fields=(bytes, datetime, float, int, str, UUID, bool))

# read and parse an entire file 
with open(tsv_path, "rb") as f:
    py_records = parser.parse_file(f)

# read and parse a file line by line
with open(tsv_path, "rb") as f:
    for line in f:
        py_record = parser.parse_line(line)

TSV format

Text format is a simple tabular format in which each record (table row) occupies a single line.

  • Output always begins with a header row, which lists data field names.
  • Fields (table columns) are delimited by tab characters.
  • Non-printable characters and special values are escaped with backslash (\), as shown below:
Escape Interpretation
\N NULL value
\0 NUL character (ASCII 0)
\b Backspace (ASCII 8)
\f Form feed (ASCII 12)
\n Newline (ASCII 10)
\r Carriage return (ASCII 13)
\t Tab (ASCII 9)
\v Vertical tab (ASCII 11)
\\ Backslash (single character)

This format allows data to be easily imported into a database engine, e.g. with PostgreSQL COPY.

Output in this format is transmitted as media type text/plain or text/tab-separated-values in UTF-8 encoding.

Parser

The parser understands the following Python types:

  • None. This special value is returned for the TSV escape sequence \N.
  • bool. A literal true or false is converted into a boolean value.
  • bytes. TSV escape sequences are reversed before the data is passed to Python as a bytes object. NUL bytes are permitted.
  • datetime. The input has to comply with RFC 3339 and ISO 8601. The timezone must be UTC (a.k.a. suffix Z).
  • float.
  • int. Arbitrary-length integers are allowed.
  • str. TSV escape sequences are reversed before the data is passed to Python as a str. NUL bytes are not allowed.
  • uuid.UUID. The input has to comply with RFC 4122, or be a string of 32 hexadecimal digits.

Internally, the implementation uses AVX2 instructions to

  • parse RFC 3339 date-time strings into Python datetime objects,
  • parse RFC 4122 UUID strings or 32-digit hexadecimal strings into Python UUID objects,
  • and find \t delimiters between fields in a line.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

tsv2py-0.3.tar.gz (16.2 kB view hashes)

Uploaded Source

Built Distributions

tsv2py-0.3-pp39-pypy39_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.1 kB view hashes)

Uploaded PyPy manylinux: glibc 2.17+ x86-64

tsv2py-0.3-pp39-pypy39_pp73-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl (13.3 kB view hashes)

Uploaded PyPy manylinux: glibc 2.17+ i686 manylinux: glibc 2.5+ i686

tsv2py-0.3-pp38-pypy38_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.1 kB view hashes)

Uploaded PyPy manylinux: glibc 2.17+ x86-64

tsv2py-0.3-pp38-pypy38_pp73-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl (13.3 kB view hashes)

Uploaded PyPy manylinux: glibc 2.17+ i686 manylinux: glibc 2.5+ i686

tsv2py-0.3-cp38-abi3-win_amd64.whl (14.3 kB view hashes)

Uploaded CPython 3.8+ Windows x86-64

tsv2py-0.3-cp38-abi3-musllinux_1_1_x86_64.whl (32.4 kB view hashes)

Uploaded CPython 3.8+ musllinux: musl 1.1+ x86-64

tsv2py-0.3-cp38-abi3-musllinux_1_1_i686.whl (31.9 kB view hashes)

Uploaded CPython 3.8+ musllinux: musl 1.1+ i686

tsv2py-0.3-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (30.2 kB view hashes)

Uploaded CPython 3.8+ manylinux: glibc 2.17+ x86-64

tsv2py-0.3-cp38-abi3-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl (29.2 kB view hashes)

Uploaded CPython 3.8+ manylinux: glibc 2.17+ i686 manylinux: glibc 2.5+ i686

tsv2py-0.3-cp38-abi3-macosx_13_0_x86_64.whl (10.9 kB view hashes)

Uploaded CPython 3.8+ macOS 13.0+ x86-64

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page