pycolumns

A simple, efficient, column-oriented, pythonic data store.

The focus is currently on efficient reading and writing. The code is pure Python, but searching and reading data are fast because the fitsio package is used to store the column data and index data. Basic consistency is ensured for the columns in the table, but the database is not fully ACID.

The storage is a simple directory with files on disk.
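
For illustration, a store directory might contain files like these (a minimal sketch; the per-column file names and extensions here are only an assumption based on the "filename: ./id.array" output shown in the examples below, not a specification):

>>> import os
>>> sorted(os.listdir('/some/path/mydata.cols'))
['ccd.array', 'ccd.index', 'dec.array', 'id.array', 'id.index', ...]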

Examples

>>> import pycolumns as pyc

# instantiate a column database from the specified coldir
>>> c = pyc.Columns('/some/path/mydata.cols')

# display some info about the columns
>>> c
Columns Directory:

  mydata
  dir: /some/path/mydata.cols
  nrows: 64348146
  Columns:
    name             dtype index
    -----------------------------
    ccd                <i2 True
    dec                <f8 False
    exposurename      |S20 True
    id                 <i8 False
    imag               <f4 False
    ra                 <f8 False
    x                  <f4 False
    y                  <f4 False
    g                  <f8 False

  Dictionaries:
    name
    -----------------------------
    meta

  Sub-Columns Directories:
    name
    -----------------------------
    psfstars

# display info about column 'id'
>>> c['id']
Column:
  name: id
  filename: ./id.array
  type: array
  dtype: <i8
  has index: False
  nrows: 64348146

# number of rows in the table
>>> c.nrows
64348146

# read all columns into a single rec array.  By default the dict
# columns are not loaded

>>> data = c.read()

# using asdict=True puts the data into a dict.  The dict data
# are also loaded in this case
>>> data = c.read(asdict=True)

# specify columns
>>> data = c.read(columns=['id', 'imag'])

# dict columns can be specified if asdict is True.  Dicts can also
# be read as single columns, see below
>>> data = c.read(columns=['id', 'imag', 'meta'], asdict=True)

# specifying a set of rows as sequence/array or slice
>>> data = c.read(columns=['id', 'imag'], rows=[3, 225, 1235])
>>> data = c.read(columns=['id', 'imag'], rows=slice(10, 20))

# read all data from column 'id' as an array rather than a rec array,
# using any of these alternative syntaxes
>>> ind = c['id'][:]
>>> ind = c['id'].read()
>>> ind = c.read_column('id')

# read a subset of rows
# slicing
>>> ind = c['id'][25:125]

# specifying a set of rows
>>> rows = [3, 225, 1235]
>>> ind = c['id'][rows]
>>> ind = c.read_column('id', rows=rows)

# reading a dictionary column
>>> meta = c['meta'].read()

# Create indexes for fast searching
>>> c['id'].create_index()

# get indices for some conditions
>>> ind = c['id'] > 25
>>> ind = c['id'].between(25, 35)
>>> ind = c['id'] == 25

# read the corresponding data
>>> ccd = c['ccd'][ind]
>>> data = c.read(columns=['ra', 'dec'], rows=ind)

# composite searches over multiple columns
>>> ind = (c['id'] == 25) & (c['ra'] < 15.23)
>>> ind = c['id'].between(15, 25) | (c['id'] == 55)
>>> ind = c['id'].between(15, 250) & (c['id'] != 66) & (c['ra'] < 100)
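
# the result can be used like any other rows specification
>>> data = c.read(columns=['id', 'ra', 'dec'], rows=ind)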

# append columns from the fields in a rec array.  Names in the data
# correspond to column names.  If this is the first time writing data,
# the columns are created; on subsequent writes, the columns must match

>>> c.append(recdata)
>>> c.append(new_data)
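
# a minimal end-to-end sketch: build a small rec array with numpy and
# append it to a fresh Columns directory.  The path and values here are
# illustrative, and the sketch assumes the directory is created on
# first use; check the documentation for your version
>>> import numpy as np
>>> recdata = np.zeros(3, dtype=[('id', 'i8'), ('ra', 'f8')])
>>> recdata['id'] = [1, 2, 3]
>>> recdata['ra'] = [10.5, 20.1, 30.2]
>>> newc = pyc.Columns('/some/path/new.cols')
>>> newc.append(recdata)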

# append data from the fields in a FITS file
>>> c.from_fits(fitsfile_name)
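
# roughly equivalent by hand (a sketch assuming the fitsio package is
# installed; fitsio.read loads the FITS table as a rec array)
>>> import fitsio
>>> c.append(fitsio.read(fitsfile_name))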

# add a dictionary column.
>>> c.create_column('weather', 'dict')
>>> c['weather'].write({'temp': 30.1, 'humid': 0.5})

# overwrite dict column
>>> c['weather'].write({'temp': 33.2, 'humid': 0.3, 'windspeed': 60.5})
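
# read it back; the result is a python dict
>>> weather = c['weather'].read()
>>> weather['windspeed']
60.5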

# you should not generally create array columns directly, since they
# can get out of sync with existing columns.  By default this raises an
# exception, but you can send verify=False if you know what you are
# doing.  In the future, special support will be added for adding new
# columns
>>> c.create_column('test', 'array')

# update values for an array column
>>> c['id'][35] = 10
>>> c['id'][35:35+3] = [8, 9, 10]
>>> c['id'][rows] = idvalues
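
# in the last form, rows and idvalues must have matching lengths, e.g.
# (illustrative values)
>>> idvalues = np.array([100, 200, 300])
>>> c['id'][rows] = idvalues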

# get all names, including dictionary and sub Columns
# same as list(c.keys())
>>> c.names
['ccd', 'dec', 'exposurename', 'id', 'imag', 'ra', 'x', 'y', 'g',
 'meta', 'weather', 'psfstars']

# only array column names
>>> c.column_names
['ccd', 'dec', 'exposurename', 'id', 'imag', 'ra', 'x', 'y', 'g']

# only dict columns
>>> c.dict_names
['meta', 'weather']

# only sub Columns directories
>>> c.subcols_names
['psfstars']

# reload all columns, or a specified column or list of columns
>>> c.reload()

# delete all data.  This will ask for confirmation
>>> c.delete()

# delete column and its data
>>> c.delete_column('ra')

# to configure the amount of memory used during index creation, specify
# cache_mem in gigabytes.  The default is 1 gigabyte
>>> cols = pyc.Columns(fname, cache_mem=0.5)

# an entry can itself be another pycolumns directory (a sub-Columns
# directory)
>>> psfcols = cols['psfstars']
>>> psfcols
  dir: /some/path/mydata.cols/psfstars.cols
  Columns:
    name             type  dtype index  shape
    --------------------------------------------------
    ccd             array    <i2 True   (64348146,)
    id              array    <i8 True   (64348146,)
    imag            array    <f4 False  (64348146,)
    x               array    <f4 False  (64348146,)
    y               array    <f4 False  (64348146,)
    ...etc
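
# sub-Columns directories support the same interface, e.g. reading a
# column
>>> x = psfcols['x'][0:10]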

Dependencies

numpy
