A python implementation of the flat-file streaming database

Objective

FSDB, the "flat-file streaming database", is a structured data file format that includes column names, formatting specifications (e.g. tab vs. space vs. comma separation), and a history of the commands that generated each file. PyFSDB is a Python implementation of the original functionality, which was implemented in Perl. Both the Perl and Python versions come with a long list of command-line tools that can be used to quickly process datasets using traditional Unix pipelines. There is also a C implementation and a Go implementation (ref needed) of FSDB.

Below is just getting-started documentation; see the full documentation on Read the Docs.

Installation

Using pip (or uv or pipx):

pip3 install pyfsdb

Example Usage

The FSDB file format contains headers and footers that supplement the data within a file. Data is most commonly tab-separated, but the format can also wrap CSVs and other data types (see the FSDB documentation for full details). The footers trace all the piped commands that were used to create a file, documenting the history of its creation within the file's own metadata.

Example FSDB file

#fsdb -F t col1 two andthree
1	key1	42.0
2	key2	123.0
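
To make the header line concrete, here is a minimal pure-Python sketch that splits it into options and column names. It is not how pyfsdb parses files: `parse_fsdb_header` is a hypothetical helper, and it assumes every option is a dash-prefixed flag followed by exactly one argument, as in the `-F t` and `-R C` headers shown in this document.

```python
def parse_fsdb_header(line):
    """Split an FSDB header line into (options, column_names).

    Illustrative only: assumes each option is "-X value", which is
    true of the examples in this README but may not cover the full format.
    """
    tokens = line.split()
    if not tokens or tokens[0] != "#fsdb":
        raise ValueError("not an FSDB header line")

    options = {}
    i = 1
    # Consume "-flag value" pairs until the first non-flag token.
    while i + 1 < len(tokens) and tokens[i].startswith("-"):
        options[tokens[i]] = tokens[i + 1]
        i += 2

    return options, tokens[i:]


options, columns = parse_fsdb_header("#fsdb -F t col1 two andthree")
print(options)  # {'-F': 't'}
print(columns)  # ['col1', 'two', 'andthree']
```

For real work, `pyfsdb.Fsdb` and its `column_names` attribute, shown in the next example, are the supported way to get at this information.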

Example pyfsdb code for reading an FSDB file

Reading in row by row:

import pyfsdb
db = pyfsdb.Fsdb("myfile.fsdb")
print(db.column_names)
for row in db:
    print(row)

Example of writing to an FSDB-formatted file:

import pyfsdb
db = pyfsdb.Fsdb(out_file="myfile.fsdb")
db.out_column_names = ('one', 'two')
db.append([4, 'hello world'])
db.close()

Read below for further usage details.

Additional Usage Details

The real power of FSDB comes from the buildup of tool suites that all interchange FSDB-formatted files. This allows chaining multiple commands together to achieve a goal. Though the original base set of tools is written in Perl, you don't need to know Perl to use most of them.

Let's create an executable ./mydemo.py script (mark it executable with chmod +x mydemo.py):

#!/usr/bin/env python3
import sys
import pyfsdb

db = pyfsdb.Fsdb(file_handle=sys.stdin, out_file_handle=sys.stdout)
value_column = db.get_column_number('value')

for row in db:     # reads a row from the input stream
    row[value_column] = float(row[value_column]) * 2
    db.append(row) # sends the row to the output stream

db.close()

And then feed it this file:

#fsdb -F t col1 value
1	42.0
2	123.0

We can run it like this:

# cat test.fsdb | ./mydemo.py
#fsdb -F t col1 value
1	84.0
2	246.0
#   | ./mydemo.py
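
To show what happens to the stream in the run above, the following pure-Python sketch mimics the same transformation without pyfsdb. It is an illustration under assumptions, not pyfsdb's implementation: `double_column` is a hypothetical helper, it only handles tab-separated data with a `#fsdb -F t ...` header, and it copies the trailer spacing from the output shown above.

```python
def double_column(lines, column, command):
    """Double one numeric column of a tab-separated FSDB stream.

    Illustrative only: passes the header through, rewrites data rows,
    keeps earlier trailer comments, and appends one new trailer line.
    """
    lines = iter(lines)
    header = next(lines).rstrip("\n")
    names = header.split()[3:]        # skip "#fsdb", "-F", "t"
    index = names.index(column)
    yield header

    trailers = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            trailers.append(line)     # command history from earlier stages
            continue
        fields = line.split("\t")
        fields[index] = str(float(fields[index]) * 2)
        yield "\t".join(fields)

    yield from trailers
    yield "#   | " + command          # record this stage in the history


source = ["#fsdb -F t col1 value\n", "1\t42.0\n", "2\t123.0\n"]
for out in double_column(source, "value", "./mydemo.py"):
    print(out)
```

Each stage in a pipeline appends its own trailer line this way, which is how the command history accumulates at the end of a file.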

Or chain it together with multiple FSDB commands:

# cat test.fsdb | ./mydemo.py | dbcolstats value | dbcol mean stddev sum min max | dbfilealter -R C
#fsdb -R C mean stddev sum min max
mean: 165
stddev: 114.55
sum: 330
min: 84
max: 246
#   | ./mydemo.py
#   | dbcolstats value
#   | dbcol mean stddev sum min max
#   | dbfilealter -R C
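
The figures in that output can be checked by hand with Python's standard library. The sketch below recomputes them from the two doubled values (84.0 and 246.0); it shows that the reported stddev of 114.55 matches the sample standard deviation (dividing by n - 1), not the population one. It makes no claim about how dbcolstats computes its results internally.

```python
import statistics

# The two values produced by ./mydemo.py in the example above.
values = [84.0, 246.0]

summary = {
    "mean": statistics.mean(values),
    "stddev": round(statistics.stdev(values), 2),  # sample stddev (n - 1)
    "sum": sum(values),
    "min": min(values),
    "max": max(values),
}

for name, value in summary.items():
    print(f"{name}: {value}")

# For comparison, the population standard deviation is a different number.
print("population stddev:", statistics.pstdev(values))  # 81.0
```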

Command line tools included

All the command-line utilities that come with pyfsdb start with p by convention, so as not to conflict with the utilities from the Perl package. The leading p also serves to flag the differences in CLI arguments (e.g. the Python versions allow file names to be specified on the command line, and most keys must be passed with a -k flag).

Data processing tools

  • pdbrow: select rows based on logical criteria
  • pdbroweval: modify rows based on python code
  • pdbtopn: given a key and a value column, print the top N rows with unique keys and the highest values.
  • pdbaugment: a fast way to merge two fsdb files, where one is stored entirely in memory for speed. For the same reason, and unlike other tools, it does not sort the data.
  • pdbcoluniq: find all unique values of a key column, optionally with counting. Requires no sorting (unlike dbrowuniq) at the cost of greater memory usage.
  • pdbzerofill: fills a column with zeros if the value is otherwise blank
  • pdbkeyedsort: sorts a potentially large file that is already "mostly" sorted by reading it in two passes. It becomes less efficient the more randomly ordered the rows are.
  • pdbfullpivot: description TBD
  • pdbreescape: converts a column of data to regex-quoted form for safety
  • pdbensure:
  • pdbcdf: performs cdf analysis on a column
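
The trade-off described for pdbcoluniq (no sorting, at the cost of more memory) can be sketched in a few lines of Python. This is not the tool's actual implementation: `count_unique` is a hypothetical helper that illustrates the general technique of counting keys in a hash table, which uses memory proportional to the number of distinct keys.

```python
from collections import Counter


def count_unique(rows, key_index):
    """Count occurrences of each key without sorting the input.

    Illustrative only: every distinct key is held in memory at once,
    which is the cost of skipping the sort.
    """
    counts = Counter()
    for row in rows:
        counts[row[key_index]] += 1
    return counts


rows = [["key1", 42.0], ["key2", 123.0], ["key1", 7.0]]
print(count_unique(rows, 0))  # Counter({'key1': 2, 'key2': 1})
```

A sort-based approach, by contrast, only needs to compare adjacent rows, but requires the input to be ordered by the key first.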

Conversion tools

  • bro2fsdb: converts a zeek/bro log into an fsdb
  • json2fsdb: converts a json file to fsdb
  • fsdb2json: converts an fsdb file to json
  • pdb2tex: converts an fsdb file to a LaTeX table
  • pdbformat: generically formats each row according to a python column specifier
  • pdbsplitter: splits a FSDB file into multiple sub-files based on a column set
  • pdbdatetoepoch: converts columns from a date string to an integer epoch column
  • pdbepochtodate: formats a Unix epoch-seconds timestamp as a human-readable date
  • pdbnormalize: normalizes a column to a limited range
  • pdbsum: tbd
  • pdbj2: formats results based on a jinja2 template
  • pdb2sql: converts an fsdb file into an sqlite3 database
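
As an illustration of the kind of conversion pdbdatetoepoch and pdbepochtodate perform, here is a standard-library sketch of a date-string/epoch round trip. The helper names, the default format string, and the choice of UTC are all assumptions made for this example; the actual tools' options and defaults may differ.

```python
from datetime import datetime, timezone

# Assumed format for this example only.
DEFAULT_FORMAT = "%Y-%m-%d %H:%M:%S"


def date_to_epoch(text, fmt=DEFAULT_FORMAT):
    """Parse a date string (interpreted as UTC) into integer epoch seconds."""
    parsed = datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


def epoch_to_date(seconds, fmt=DEFAULT_FORMAT):
    """Format integer epoch seconds as a UTC date string."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime(fmt)


print(date_to_epoch("1970-01-02 00:00:00"))  # 86400
print(epoch_to_date(86400))                  # 1970-01-02 00:00:00
```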

Graphical utilities

  • pdbheatmap: creates a heat map based on incoming data columns
  • pdbroc: creates a ROC graph for incoming fsdb data

Author

Wes Hardaker @ USC/ISI

See also

The FSDB website and manual page for the original perl module:

https://www.isi.edu/~johnh/SOFTWARE/FSDB/


This version

2.7
