pycolumns

A simple, efficient, pythonic column data store

Project description

A simple, efficient column-oriented, pythonic data store.

The current focus is on efficient reading and writing. The code is pure Python, but searching and reading are fast thanks to numpy memory maps and column indexes. Basic consistency is ensured, but the database is not fully ACID-compliant.
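To illustrate why a column index makes searches fast, here is a minimal sketch that uses only the standard library. It is not how pycolumns is implemented: build_index and between are hypothetical helpers, and the inclusive bounds are an assumption made for this example. The idea is that once the values are stored in sorted order alongside their row numbers, a range query becomes two binary searches instead of a full scan.

```python
import bisect


def build_index(values):
    """Return (sorted_values, row_numbers); a hypothetical stand-in for a column index."""
    order = sorted(range(len(values)), key=values.__getitem__)
    return [values[i] for i in order], order


def between(index, low, high):
    """Return the rows whose value v satisfies low <= v <= high (inclusive by assumption)."""
    sorted_values, rows = index
    start = bisect.bisect_left(sorted_values, low)
    stop = bisect.bisect_right(sorted_values, high)
    # Row numbers come back in ascending order so later reads are sequential.
    return sorted(rows[start:stop])


ids = [40, 25, 31, 90, 25, 12, 35]
index = build_index(ids)
matches = between(index, 25, 35)
print(matches)  # [1, 2, 4, 6]
```

Building the index costs one sort up front; each later query is then logarithmic in the number of rows, plus the size of the result.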

The storage is a simple directory with files on disk.
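As a rough picture of how one file per column allows cheap partial reads, the sketch below stores a column of 8-byte integers in a file and memory-maps it to fetch a few rows. This is an assumption-laden illustration built on the standard library's mmap and struct modules: write_column and read_rows are hypothetical names, and the real pycolumns files (which use numpy memory maps) may have a different layout.

```python
import mmap
import os
import struct
import tempfile

ITEM = struct.Struct("<q")  # little-endian 8-byte integer, comparable to dtype <i8


def write_column(path, values):
    """Write the values back to back as fixed-width binary items."""
    with open(path, "wb") as fobj:
        for value in values:
            fobj.write(ITEM.pack(value))


def read_rows(path, rows):
    """Memory-map the file and decode only the requested rows."""
    with open(path, "rb") as fobj:
        with mmap.mmap(fobj.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return [ITEM.unpack_from(mapped, row * ITEM.size)[0] for row in rows]


with tempfile.TemporaryDirectory() as coldir:
    path = os.path.join(coldir, "id.array")  # hypothetical file name for the example
    write_column(path, range(100, 200))
    subset = read_rows(path, [3, 25, 99])

print(subset)  # [103, 125, 199]
```

Because every item has the same width, the byte offset of row n is simply n times the item size, so only the pages holding the requested rows need to be touched.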

Examples

>>> import pycolumns as pyc

# instantiate a column database from the specified coldir
>>> c = pyc.Columns('/some/path/mydata.cols')

# display some info about the columns
>>> c
Column Directory:

  dir: /some/path/mydata.cols
  Columns:
    name             type  dtype index  shape
    --------------------------------------------------
    ccd             array    <i2 True   (64348146,)
    dec             array    <f8 False  (64348146,)
    exposurename    array   |S20 True   (64348146,)
    id              array    <i8 False  (64348146,)
    imag            array    <f4 False  (64348146,)
    ra              array    <f8 False  (64348146,)
    x               array    <f4 False  (64348146,)
    y               array    <f4 False  (64348146,)
    g               array    <f8 False  (64348146, 2)
    meta             dict


  Sub-Column Directories:
    name
    --------------------------------------------------
    psfstars

# display info about column 'id'
>>> c['id']
Column:
  "id"
  filename: ./id.array
  type: col
  shape: (64348146,)
  has index: False
  dtype: <i8

# get the column names
>>> c.colnames
['ccd', 'dec', 'exposurename', 'id', 'imag', 'ra', 'x', 'y', 'g', 'meta']

# reload all columns or specified column/column list
>>> c.reload(name=None)

# read all data from array column 'id'
# alternative syntaxes, including numpy style slicing
>>> ind = c['id'][:]
>>> ind = c['id'].read()
>>> ind = c.read_column('id')

# dict columns are read as a dict. No slicing for dicts
>>> meta = c['meta'].read()

# read a subset of rows
# slicing
>>> ind = c['id'][25:125]

# specifying a set of rows
>>> rows=[3, 225, 1235]
>>> ind = c['id'][rows]
>>> ind = c.read_column('id', rows=rows)

# read all columns into a single rec array.  By default the dict
# columns are not loaded

>>> data = c.read()

# using asdict=True puts the data into a dict.  The dict data
# are loaded in this case
>>> data = c.read(asdict=True)

# specify columns
>>> data = c.read(columns=['id', 'imag'], rows=rows)

# dict columns can be specified if asdict is True
>>> data = c.read(columns=['id', 'imag', 'meta'], asdict=True)

# Create indexes for fast searching
>>> c['id'].create_index()

# get indices for some condition
>>> ind = c['id'] > 25
>>> ind = c['id'].between(25, 35)
>>> ind = c['id'] == 25

# find all matches
>>> ind = c['id'].match([35, 77])

# read the corresponding data
>>> ccd = c['ccd'][ind]
>>> data = c.read(columns=['ra', 'dec'], rows=ind)

# composite searches over multiple columns
>>> ind = (c['id'] == 25) & (c['ra'] < 15.23)
>>> ind = c['id'].between(15, 25) | (c['id'] == 55)
>>> ind = c['id'].between(15, 250) & (c['id'] != 66) & (c['ra'] < 100)
>>> ind = c['id'].between(15, 250) & c['id'].match([35, 99])
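The composite searches above combine per-column results with & and |. One way to think about this, sketched below with plain Python sets, is that each condition yields a set of matching row numbers, so "and" is an intersection and "or" is a union. The where helper and the sample data are hypothetical, and pycolumns may represent its index results differently.

```python
def where(values, predicate):
    """Return the set of row numbers whose value satisfies the predicate."""
    return {row for row, value in enumerate(values) if predicate(value)}


ids = [10, 25, 30, 66, 25, 200]
ra = [5.0, 14.9, 20.1, 3.3, 15.5, 1.0]

# "and": rows must satisfy both conditions (set intersection)
both = where(ids, lambda v: v == 25) & where(ra, lambda v: v < 15.23)

# "or": rows may satisfy either condition (set union)
either = where(ids, lambda v: 15 <= v <= 25) | where(ids, lambda v: v == 66)

# Sorting the row numbers gives sequential access when reading from disk.
rows_sorted = sorted(either)
print(sorted(both), rows_sorted)  # [1] [1, 3, 4]
```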

# speed up reads by sorting indices
>>> ind.sort()
>>> data = c.read(columns=['ra', 'dec'], rows=ind)

# you can check if the index is already sorted
>>> if not ind.is_sorted:
...     ind.sort()

# update values for a column
>>> c['id'][35] = 10
>>> c['id'][35:35+3] = [8, 9, 10]
>>> c['id'][rows] = idvalues

# write multiple columns from the fields in a rec array
# names in the data correspond to column names.
# If columns are not present, they are created
# but row count consistency must be maintained for all array
# columns and this is checked.

>>> c.append(recdata)
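The row count rule described above can be pictured with a small in-memory model: before anything is written, compute the length every column would have after the append and refuse the operation if the lengths disagree. This is only a sketch of the idea. append_rows is a hypothetical function operating on a dict of lists, not the pycolumns API.

```python
def append_rows(table, new_data):
    """Append column chunks to table (a dict of name -> list), keeping row counts equal."""
    # Length each column would have after the append, including newly created ones.
    final = {name: len(column) for name, column in table.items()}
    for name, values in new_data.items():
        final[name] = final.get(name, 0) + len(values)

    # Validate before modifying anything, so a rejected append leaves the table intact.
    if len(set(final.values())) > 1:
        raise ValueError(f"row counts would differ: {final}")

    for name, values in new_data.items():
        table.setdefault(name, []).extend(values)
    return table


table = {}
append_rows(table, {"id": [1, 2], "ra": [10.5, 11.0]})  # creates both columns
append_rows(table, {"id": [3], "ra": [12.25]})          # extends both columns

try:
    append_rows(table, {"id": [4]})  # would leave 'ra' one row short
    rejected = False
except ValueError:
    rejected = True

print(table, rejected)
```

Checking first and writing second is what keeps the existing columns untouched when an append is rejected.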

# append data from the fields in a FITS file
>>> c.from_fits(fitsfile_name)

# add a dict column
>>> c.create_column('meta')
>>> c['meta'].write({'test': 'hello'})
>>> d = c['meta'].read()

# hierarchical sets of columns are also supported
>>> c['psfstars']
Column Directory:

  dir: /some/path/mydata.cols/psfstars.cols
  Columns:
    name             type  dtype index  shape
    --------------------------------------------------
    ccd             array    <i2 True   (348146,)
    exposurename    array   |S20 True   (348146,)
    x               array    <f4 False  (348146,)
    y               array    <f4 False  (348146,)

Dependencies

numpy

Release history

1.0.0

Apr 11, 2023

This version

0.9.2

Apr 7, 2023

Download files

Source Distribution

pycolumns-0.9.2.tar.gz (25.7 kB)

Uploaded Apr 7, 2023 Source

Hashes for pycolumns-0.9.2.tar.gz
Algorithm	Hash digest
SHA256	`47f79506f20d564b7d1b071342e7e3f8caa447a8eef709298ca6bfd47d2b89a1`
MD5	`dd367d2c58621d6d6f3964aaacaf7450`
BLAKE2b-256	`66fbb227792f1bb0b2a19c87b28e29abcaa3e5f9f5bd2221a32a911401f2c3f4`