A simple format for typing TSVs with an implementation in Python 3

typedtsv

Typed TSV: A simple format for typing TSVs with an implementation in Python 3.

Available on PyPI: https://pypi.org/project/typedtsv/

Install with: pip install typedtsv

See code and leave feedback here: https://github.com/jimmybot/typedtsv

Why?

JSON, YAML, TOML and other simple formats aren't built for list-like or table-like sets of data.

YAML is particularly slow due to its expansive feature set, and JSON, being designed for single objects rather than collections, is not chunkable. I once stored all PyPI package info in a YAML file, and reading it back out was going to take half a day. Switching to a dead-simple newline-delimited JSON format made parsing take seconds.

Newline-delimited JSON is convenient, performs well, and leaves little room for parsing mistakes. The downsides are that the supported types are a bit too limited (no distinction between int and float) and that it is not easily human-readable or editable.

TOML is targeted at configuration files and, like JSON, parses into a single dictionary object rather than a collection.

CSV/TSV formats have too much ambiguity, which results in repetitive custom parsing logic that lives outside the file itself. CSV quote escaping can also lead to poor parsing performance.

Goals

  • Be simple
  • Be fast
  • Be easily parallelized
  • Be a better alternative to CSV/TSV/JSON and simple uses of YAML
  • Support open data and data sharing/archival. Push information about a dataset into the data file itself for future reproducibility

Use Cases in Mind

  • Database-agnostic, program-agnostic simple file format for open data
  • A quick go-to serialization format for sharing reproducible data science datasets
  • Easily-created, easily-editable, easily-understood database fixtures for tests

Non-Goals

  • Unlimited extensibility a la YAML
  • Config files. Focus is on lists of objects/tabular data

Format

The format is a normal TSV, except that the header row uses a colon to annotate each column's type:

<col_name>:<col_type>\t<col_name2>:<col_type2>...

For example:

# I'm a comment and will be ignored
url:str    n_times:int   score:float
https://www.example.com 5   1.6
https://archive.org 99  9.9
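
The header convention above can be illustrated with a short sketch. This is not the typedtsv implementation: parse_header is a hypothetical helper, and splitting each field on its last colon (so that a column name could itself contain a colon) is an assumption rather than documented behavior.

```python
from collections import OrderedDict


def parse_header(line):
    """Split a typed header line into an ordered name -> type mapping."""
    columns = OrderedDict()
    for field in line.rstrip("\n").split("\t"):
        # Assumption: the type follows the LAST colon in the field.
        name, _, type_name = field.rpartition(":")
        columns[name] = type_name
    return columns


header = parse_header("url:str\tn_times:int\tscore:float\n")
print(list(header.items()))
# [('url', 'str'), ('n_times', 'int'), ('score', 'float')]
```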

The initial pass is centered on Python's basic types plus JSON. The currently valid types are:

Type      Notes
int
float
bool      Valid values: true, false, t, f, yes, no, y, n, 1, 0
str       Newlines, tabs, \, and # must be escaped
datetime  Example: '2011-01-01 00:00:00'. A value without a timezone is assumed to be UTC
json
null      All types are nullable with value 'null'. To get the literal string 'null', use '\null'
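
To make the type rules concrete, here is a minimal sketch of how a single cell might be converted once its column type is known. It is an illustration, not the library's code: convert and the helper names are hypothetical, bool matching is assumed to be exact lowercase, only the timezone-less datetime form is handled, and of the str escapes only '\null' is covered.

```python
import json
from datetime import datetime, timezone

TRUE_VALUES = {"true", "t", "yes", "y", "1"}
FALSE_VALUES = {"false", "f", "no", "n", "0"}


def parse_bool(text):
    # Assumption: values are matched exactly, in lowercase.
    if text in TRUE_VALUES:
        return True
    if text in FALSE_VALUES:
        return False
    raise ValueError(f"not a valid bool: {text!r}")


def parse_datetime(text):
    # Handles only the timezone-less form, which is taken to be UTC.
    parsed = datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc)


def parse_str(text):
    # Only the '\null' escape is handled; other escapes are left untouched.
    return "null" if text == "\\null" else text


CONVERTERS = {
    "int": int,
    "float": float,
    "bool": parse_bool,
    "str": parse_str,
    "datetime": parse_datetime,
    "json": json.loads,
}


def convert(text, type_name):
    """Convert one raw cell; the bare value 'null' is None for every type."""
    if text == "null":
        return None
    return CONVERTERS[type_name](text)
```

Checking for 'null' before dispatching on the column type is what makes every type nullable in this sketch.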

Comments are supported, just prefix with #. Escape actual # in a string with a single backslash '\#'.

Row separators use '\n' only. Windows line breaks ('\r\n') are not valid.

We'll never allow quoted '\n' because this would make the file difficult to chunk and thus make it difficult to parallelize reading.
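
The chunking argument can be sketched as follows. Because a '\n' byte can only ever be a row separator, a reader can jump to an arbitrary offset and scan forward to the next '\n' to find a safe split point. The function chunk_boundaries is a hypothetical helper written for this illustration and is not part of typedtsv.

```python
def chunk_boundaries(data, n_chunks):
    """Return (start, end) byte ranges that each end on a row boundary."""
    size = len(data)
    step = max(1, size // n_chunks)
    ranges = []
    start = 0
    while start < size:
        guess = min(start + step, size)
        # Every newline byte is a row separator, so the first one at or
        # after the guessed offset is a safe place to cut.
        newline = data.find(b"\n", guess - 1)
        end = size if newline == -1 else newline + 1
        ranges.append((start, end))
        start = end
    return ranges


data = b"a\t1\nb\t2\nc\t3\nd\t4\n"
for start, end in chunk_boundaries(data, 2):
    print(data[start:end])
# b'a\t1\nb\t2\n'
# b'c\t3\nd\t4\n'
```

Each range could then be handed to a separate worker. With quoted newlines allowed, as in CSV, a worker could not tell whether a given '\n' ends a row without scanning from the start of the file.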

Gotchas:

  • In Python, you need to be careful about opening files that may contain Windows newlines:
infile = open('data.ttsv', 'r', newline='\n')   # set newline='\n': the default (newline=None) also treats '\r' and '\r\n' as line endings and translates them to '\n'
  • typedtsv.dumps can infer column types from the first row of your data, but not if that row contains any nulls. In that case, use the regular OrderedDict method to define column names and types
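
The first gotcha can be checked with the standard library alone. The sketch below writes a small file whose header line ends in a Windows line break, then reads it back both ways; the file contents are made up for the demonstration.

```python
import os
import tempfile

# Write raw bytes so the '\r\n' is stored exactly as given.
with tempfile.NamedTemporaryFile("wb", suffix=".ttsv", delete=False) as tmp:
    tmp.write(b"name:str\r\nalice\n")
    path = tmp.name

# Default (newline=None): '\r\n' is silently translated to '\n'.
with open(path, "r") as infile:
    default_lines = infile.readlines()

# newline='\n': only '\n' ends a line and the '\r' is preserved.
with open(path, "r", newline="\n") as infile:
    strict_lines = infile.readlines()

os.remove(path)

print(default_lines)  # ['name:str\n', 'alice\n']
print(strict_lines)   # ['name:str\r\n', 'alice\n']
```

With newline='\n' the stray '\r' stays visible, so a reader can detect and reject it instead of silently accepting an invalid file.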

TODO:

  • Add a boolean type
  • Add nulls
  • Add a datetime/date/time type: need to avoid ambiguity yet support common uses
  • Ergonomics: optionally read and dump single lists of data rather than dealing with a list of lists
  • Support units annotations such as degrees F or meters/second, using a syntax similar to F#'s: https://docs.microsoft.com/en-us/dotnet/fsharp/language-reference/units-of-measure
  • Maybe: extend format to support column comments / other common metadata
  • Maybe: support array and map types for compatibility with Postgres
  • Maybe: Support date, time, and/or timeinterval types

Developing

Make sure you have Poetry installed: https://github.com/sdispater/poetry

git clone git@github.com:jimmybot/typedtsv.git
cd typedtsv
poetry install
poetry shell
pytest
