Line-oriented, tab-separated value format
Linear TSV is a line-oriented, portable tabular data format. Tabular data – rows of tuples, each of the same length – is commonly stored as CSV and is the lingua franca of spreadsheets, databases and analysis tools.
CSV is almost but not quite line-oriented, because newlines are quoted, not escaped. In the TSV format presented here, escape codes are used for newlines and tabs in field data, allowing naive filtering with line-oriented shell tools like sort, fgrep and cut to work as expected. In all of its details, the format derives from the TEXT serialization mode of Postgres and MySQL.
from collections import namedtuple import sys import tsv # Simplest access mode: parse a text stream (strings are okay, too) to a # generator of lists of strings. lists = tsv.un(sys.stdin) # Parse each row as a particular class derived with namedtuple() class Stats(namedtuple('Stats', ['state', 'city', 'population', 'area'])): pass tuples = tsv.un(sys.stdin, Stats) # Format a collection of rows, getting back a generator of strings, one # each row. Any parseable type is okay. strings = tsv.to(lists) strings = tsv.to(tuples) # Write the rows to a handle: strings = tsv.to(tuples, sys.stdout)
In this format, all records are separated by ASCII newlines (0x0a) and fields within a record are separated with ASCII tab (0x09). It is permitted but discouraged to separate records with \r\n.
To include newlines, tabs, carriage returns and backslashes in field data, the following escape sequences must be used:
To indicate missing data for a field, the character sequence \N (bytes 0x5c and 0x4e) is used. Note that the N is capitalized. This character sequence is exactly that used by SQL databases to indicate SQL NULL in their tab-separated output mode.
There are no header lines specified by this format. One objection to them is that they break the naive concantenation of files. Another is that they are anithetical to stream processing. Yet another is that one generally wants more than column names – one wants at least column types. Better to do nothing than too little.
In advocating a shift to a line-oriented, tab-separated serialization format, we are endorsing an existing format: the default serialization format of both Postgres and MySQL. We propose to standardize a subset of the format common to both database systems.
A truly line-oriented format for tabular data, where newline, carriage return and the separator are always represented by escape sequences, offers many practical advantages, among them:
CSV is almost right and it’s worth talking about the disadvantages of CSV that motivate the author to promote another tabular data format:
A few sample parsers are included in the distribution.
This grammar is presented in the W3C EBNF format.
TSV ::= Row (NL Row)* /* This form may be read but not written by conforming implementations. */ TSVInput ::= Row (CR? NL Row)* Row ::= Field (Tab Field)* Field ::= (Escape|NoOpEscape|PlainChar)* Char ::= [http://www.w3.org/TR/xml#NT-Char] PlainChar ::= Char - (NL|Tab|CR|'\') NL ::= #x0A CR ::= #x0D Tab ::= #x09 Escape ::= '\n' | '\r' | '\t' | '\\' NoOpEscape ::= '\' (Char - ('n'|'r'|'t'|'\'))
A diagram of the grammar can be generated online with the Bottlecaps Railroad Diagram generator.