High-performance parser and generator for PostgreSQL-compatible tab-separated values (TSV)

Parse and generate tab-separated values (TSV) data

Tab-separated values (TSV) is a simple and popular format for storing and transferring data, and for exporting data from and importing data into relational databases. For example, PostgreSQL COPY moves data between PostgreSQL tables and standard file-system files or in-memory stores, and its text format (a text file with one line per table row) is a generic version of TSV. Meanwhile, packages like asyncpg help insert, update or query data efficiently in bulk, using binary data transfer between Python and PostgreSQL.

This package offers a high-performance way to convert data between a TSV text file and Python objects. The parser reads a TSV record into a Python tuple of built-in Python types, one element per field. The generator produces a TSV record from a tuple.

Quick start

from datetime import date, datetime
from uuid import UUID

from tsv.helper import Parser

# specify the column structure
parser = Parser(fields=(bytes, date, datetime, float, int, str, UUID, bool))

# path to the TSV file to read (placeholder)
tsv_path = "data.tsv"

# read and parse an entire file
with open(tsv_path, "rb") as f:
    py_records = parser.parse_file(f)

# read and parse a file line by line
with open(tsv_path, "rb") as f:
    for line in f:
        py_record = parser.parse_line(line)

TSV format

Text format is a simple tabular format in which each record (table row) occupies a single line.

  • Output always begins with a header row, which lists data field names.
  • Fields (table columns) are delimited by tab characters.
  • Non-printable characters and special values are escaped with backslash (\), as shown below:
    Escape   Interpretation
    \N       NULL value
    \0       NUL character (ASCII 0)
    \b       Backspace (ASCII 8)
    \f       Form feed (ASCII 12)
    \n       Newline (ASCII 10)
    \r       Carriage return (ASCII 13)
    \t       Tab (ASCII 9)
    \v       Vertical tab (ASCII 11)
    \\       Backslash (single character)

This format allows data to be easily imported into a database engine, e.g. with PostgreSQL COPY.

Output in this format is transmitted as media type text/plain or text/tab-separated-values in UTF-8 encoding.
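
The escape rules in the table above can be sketched in plain Python. This is an illustration only, not the library's implementation: escape_field and unescape_field are hypothetical names, the sketch works on str rather than bytes, and rejecting unknown escape sequences is an assumption made here.

```python
import re

# Escape table from the TSV format section: escaped letter -> decoded character.
_UNESCAPE = {
    "0": "\0",
    "b": "\b",
    "f": "\f",
    "n": "\n",
    "r": "\r",
    "t": "\t",
    "v": "\v",
    "\\": "\\",
}

# Inverse mapping for str.translate: code point -> two-character escape.
_ESCAPE_TABLE = str.maketrans({char: "\\" + letter for letter, char in _UNESCAPE.items()})

# A backslash followed by any single character.
_UNESCAPE_RE = re.compile(r"\\(.)", re.DOTALL)


def _replace(match):
    letter = match.group(1)
    try:
        return _UNESCAPE[letter]
    except KeyError:
        # Assumption of this sketch: unknown sequences are treated as errors.
        raise ValueError(f"unknown escape sequence: \\{letter}") from None


def escape_field(value):
    """Encode a str (or None) as the text of one TSV field."""
    if value is None:
        return "\\N"
    return value.translate(_ESCAPE_TABLE)


def unescape_field(field):
    """Decode the text of one TSV field into a str (or None for \\N)."""
    if field == "\\N":
        return None
    return _UNESCAPE_RE.sub(_replace, field)
```

Because translate processes the string in a single pass, a literal backslash is doubled exactly once, and a string whose content is a backslash followed by N is written with a doubled backslash, so it cannot be confused with the NULL marker.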

Parser

The parser understands the following Python types:

  • None. This special value is returned for the TSV escape sequence \N.
  • bool. A literal true or false is converted into a boolean value.
  • bytes. TSV escape sequences are reversed before the data is passed to Python as a bytes object. NUL bytes are permitted.
  • datetime. The input has to comply with RFC 3339 and ISO 8601. The timezone must be UTC, indicated by the suffix Z.
  • date. The input has to conform to the format YYYY-MM-DD.
  • time. The input has to conform to the format hh:mm:ssZ with no fractional seconds, or hh:mm:ss.ffffffZ with fractional seconds. Fractional seconds allow up to 6 digits of precision.
  • float. Interpreted as double precision floating point numbers.
  • int. Arbitrary-length integers are allowed.
  • str. TSV escape sequences are reversed before the data is passed to Python as a str. NUL bytes are not allowed.
  • uuid.UUID. The input has to comply with RFC 4122, or be a string of 32 hexadecimal digits.
  • decimal.Decimal. Interpreted as arbitrary precision decimal numbers.
  • ipaddress.IPv4Address.
  • ipaddress.IPv6Address.
  • list and dict. Both are understood as JSON; the parser invokes the equivalent of json.loads on the serialized JSON string.
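
For comparison, type-driven parsing of a record can be sketched with the Python standard library alone. This is not how tsv2py is implemented: parse_record and CONVERTERS are hypothetical names, only a few of the types listed above are covered, the input is a str rather than bytes, and escape sequences other than \N are not handled.

```python
from datetime import date
from decimal import Decimal
from ipaddress import IPv4Address, IPv6Address
from uuid import UUID


def _parse_bool(text):
    # Only the literals "true" and "false" are accepted.
    if text == "true":
        return True
    if text == "false":
        return False
    raise ValueError(f"expected 'true' or 'false', got {text!r}")


# One converter per supported field type (hypothetical table for this sketch).
CONVERTERS = {
    bool: _parse_bool,
    int: int,
    float: float,
    str: str,
    date: date.fromisoformat,
    UUID: UUID,  # accepts hyphenated form or 32 hexadecimal digits
    Decimal: Decimal,
    IPv4Address: IPv4Address,
    IPv6Address: IPv6Address,
}


def parse_record(line, fields):
    """Split one TSV line on tabs and convert each field by its declared type."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(fields):
        raise ValueError(f"expected {len(fields)} fields, got {len(values)}")
    return tuple(
        None if value == "\\N" else CONVERTERS[field_type](value)
        for value, field_type in zip(values, fields)
    )
```

A call such as parse_record(line, (bool, int, date)) returns a tuple with one converted value per column, or None where the field is \N.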

The backslash character \ initiates escape sequences in both TSV and JSON. When JSON data is written to TSV, several backslash characters may therefore be needed. For example, \\n inside a quoted JSON string in a TSV field translates to a single newline character: first, the TSV parser reads \\ as an escape sequence and produces a single \ character, which is followed by the n character; then the JSON parser reads the resulting \n as a newline embedded in a JSON string. By the same logic, you need four consecutive backslash characters in TSV to represent a single backslash in a JSON quoted string.
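
The two decoding stages can be traced with a small sketch. The helpers tsv_unescape and tsv_escape are hypothetical and handle only the backslash, newline and tab escapes; they stand in for the TSV layer so that the interaction with json.loads and json.dumps is visible.

```python
import json
import re

# Minimal subset of TSV escapes, enough for this illustration.
_TSV_ESCAPES = {"\\": "\\", "n": "\n", "t": "\t"}


def tsv_unescape(field):
    """Stage 1: reverse TSV escape sequences in a field."""
    return re.sub(r"\\(.)", lambda m: _TSV_ESCAPES[m.group(1)], field)


def tsv_escape(text):
    """Inverse of stage 1: the backslash must be escaped first."""
    return text.replace("\\", "\\\\").replace("\n", "\\n").replace("\t", "\\t")


# Raw field text as stored in the TSV file: "line1\\nline2"
tsv_field = r'"line1\\nline2"'
json_text = tsv_unescape(tsv_field)  # JSON text with a single backslash before n
value = json.loads(json_text)        # Stage 2: Python str with a real newline

# Four backslashes in TSV become one backslash in the decoded JSON string.
single_backslash = json.loads(tsv_unescape(r'"a\\\\b"'))

# Writing goes the other way: serialize to JSON, then apply TSV escaping.
round_trip = tsv_escape(json.dumps(single_backslash))
```

Here value contains a real newline, single_backslash contains one backslash between a and b, and round_trip reproduces the four-backslash field text.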

Internally, the implementation uses AVX2 instructions to

  • parse RFC 3339 date-time strings into Python datetime objects,
  • parse RFC 4122 UUID strings or 32-digit hexadecimal strings into Python UUID objects,
  • and find \t delimiters between fields in a line.

For parsing integers up to the range of the long type, the parser calls the C standard library function strtol.

For parsing IPv4 and IPv6 addresses, the parser calls the C function inet_pton in libc or Windows Sockets (WinSock2).

If orjson is installed, the parser uses it to speed up parsing of nested JSON structures. Otherwise, the library falls back to the built-in JSON decoder.
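
An optional-dependency fallback of this kind is commonly written as a guarded import. The following is a generic sketch of that pattern, not the library's actual code; json_loads is a hypothetical name.

```python
try:
    # Prefer orjson when it is installed.
    import orjson

    def json_loads(data):
        return orjson.loads(data)

except ImportError:
    # Fall back to the decoder in the standard library.
    import json

    def json_loads(data):
        return json.loads(data)
```

Both orjson.loads and json.loads accept bytes as well as str, so callers can use json_loads the same way regardless of which decoder is active.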

Date-time format

YYYY-MM-DDThh:mm:ssZ
YYYY-MM-DDThh:mm:ss.fZ
YYYY-MM-DDThh:mm:ss.ffZ
YYYY-MM-DDThh:mm:ss.fffZ
YYYY-MM-DDThh:mm:ss.ffffZ
YYYY-MM-DDThh:mm:ss.fffffZ
YYYY-MM-DDThh:mm:ss.ffffffZ

Date format

YYYY-MM-DD

Time format

hh:mm:ssZ
hh:mm:ss.fZ
hh:mm:ss.ffZ
hh:mm:ss.fffZ
hh:mm:ss.ffffZ
hh:mm:ss.fffffZ
hh:mm:ss.ffffffZ
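
The formats listed above can be parsed with datetime.strptime from the standard library, whose %f directive accepts one to six fractional digits. This is a sketch for reference rather than the library's AVX2-based implementation: the function names are hypothetical, and attaching timezone.utc to the result is a choice made here.

```python
from datetime import date, datetime, time, timezone


def parse_datetime(text):
    """Parse YYYY-MM-DDThh:mm:ss[.f]Z into a UTC datetime."""
    fmt = "%Y-%m-%dT%H:%M:%S.%fZ" if "." in text else "%Y-%m-%dT%H:%M:%SZ"
    return datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)


def parse_date(text):
    """Parse YYYY-MM-DD into a date."""
    return datetime.strptime(text, "%Y-%m-%d").date()


def parse_time(text):
    """Parse hh:mm:ss[.f]Z into a UTC time."""
    fmt = "%H:%M:%S.%fZ" if "." in text else "%H:%M:%SZ"
    return datetime.strptime(text, fmt).time().replace(tzinfo=timezone.utc)
```

Note that strptime is more lenient than the formats above (for example, it accepts a one-digit month), so a strict validator would need an additional length or pattern check.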

Performance

Depending on the field types, tsv2py parses TSV records up to 7 times faster than a functionally equivalent Python implementation based on the Python standard library. Savings in execution time are largest for dates, UUIDs and longer strings with special characters (up to 90%), and more moderate for simple types like small integers (approx. 60%).
