A simple, efficient, column-oriented, pythonic data store.
The focus is currently on efficient reading and writing. The code is pure Python, but searching and reading data are fast thanks to the fitsio package, which is used for the column and index data. Basic consistency is ensured for the columns in a table, but the database is not fully ACID.
The storage is a simple directory with files on disk.
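The core idea of this layout is that each column lives in its own file inside the directory, so reading one column never touches the others. The following is a minimal, hypothetical sketch of that idea using only the standard library; it is not pycolumns's actual on-disk format (pycolumns stores column data in FITS files via fitsio):

```python
import array
import os
import tempfile

def append_rows(coldir, data, typecodes):
    """Append a dict of column-name -> list-of-values, one file per column."""
    for name, values in data.items():
        path = os.path.join(coldir, name + '.bin')
        with open(path, 'ab') as f:
            # appending rows just extends each column's file
            array.array(typecodes[name], values).tofile(f)

def read_column(coldir, name, typecode):
    """Read one column back as a list, without touching other columns."""
    a = array.array(typecode)
    with open(os.path.join(coldir, name + '.bin'), 'rb') as f:
        a.frombytes(f.read())
    return a.tolist()

coldir = tempfile.mkdtemp(suffix='.cols')
typecodes = {'id': 'q', 'ra': 'd'}  # hypothetical schema: int64 and float64
append_rows(coldir, {'id': [1, 2, 3], 'ra': [10.5, 11.2, 12.9]}, typecodes)
append_rows(coldir, {'id': [4], 'ra': [13.0]}, typecodes)
print(read_column(coldir, 'id', 'q'))  # [1, 2, 3, 4]
```

Because each append only extends files, writes stay cheap; the trade-off is that all column files must be kept the same length, which is the consistency guarantee mentioned above.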
Examples
>>> import pycolumns as pyc
# instantiate a column database from the specified coldir
>>> c = pyc.Columns('/some/path/mydata.cols')
# display some info about the columns
>>> c
Columns Directory:
mydata
dir: /some/path/mydata.cols
nrows: 64348146
Columns:
name dtype index
-----------------------------
ccd <i2 True
dec <f8 False
exposurename |S20 True
id <i8 True
imag <f4 False
ra <f8 False
x <f4 False
y <f4 False
g <f8 False
Dictionaries
name
-----------------------------
meta
Sub-Columns Directories:
name
-----------------------------
psfstars
# display info about column 'id'
>>> c['id']
Column:
name: id
filename: ./id.array
type: array
dtype: <i8
has index: False
nrows: 64348146
# number of rows in table
>>> c.nrows
64348146
# read all columns into a single rec array. By default the dict
# columns are not loaded
>>> data = c.read()
# using asdict=True puts the data into a dict. The dict data
# are also loaded in this case
>>> data = c.read(asdict=True)
# specify columns
>>> data = c.read(columns=['id', 'imag'])
# dict columns can be specified if asdict is True. Dicts can also
# be read as single columns, see below
>>> data = c.read(columns=['id', 'imag', 'meta'], asdict=True)
# specifying a set of rows as sequence/array or slice
>>> data = c.read(columns=['id', 'imag'], rows=[3, 225, 1235])
>>> data = c.read(columns=['id', 'imag'], rows=slice(10, 20))
# read all data from column 'id' as an array rather than a rec array
# alternative syntaxes
>>> ind = c['id'][:]
>>> ind = c['id'].read()
>>> ind = c.read_column('id')
# read a subset of rows
# slicing
>>> ind = c['id'][25:125]
# specifying a set of rows
>>> rows = [3, 225, 1235]
>>> ind = c['id'][rows]
>>> ind = c.read_column('id', rows=rows)
# reading a dictionary column
>>> meta = c['meta'].read()
# Create indexes for fast searching
>>> c['id'].create_index()
# get indices for some conditions
>>> ind = c['id'] > 25
>>> ind = c['id'].between(25, 35)
>>> ind = c['id'] == 25
# read the corresponding data
>>> ccd = c['ccd'][ind]
>>> data = c.read(columns=['ra', 'dec'], rows=ind)
# composite searches over multiple columns
>>> ind = (c['id'] == 25) & (c['ra'] < 15.23)
>>> ind = c['id'].between(15, 25) | (c['id'] == 55)
>>> ind = c['id'].between(15, 250) & (c['id'] != 66) & (c['ra'] < 100)
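Conceptually, an index supports these queries by keeping the column's values in sorted order alongside their original row numbers, so a range query becomes two binary searches. A minimal sketch of that idea using the standard-library bisect module (not pycolumns's actual index implementation):

```python
import bisect

def create_index(values):
    """Return (sorted_values, rows), where rows[i] is the original
    row number of sorted_values[i]."""
    order = sorted(range(len(values)), key=values.__getitem__)
    return [values[i] for i in order], order

def between(index, low, high):
    """Row numbers whose value is in [low, high], via two binary searches."""
    svals, rows = index
    lo = bisect.bisect_left(svals, low)
    hi = bisect.bisect_right(svals, high)
    return sorted(rows[lo:hi])

ids = [55, 12, 30, 25, 90, 25]
idx = create_index(ids)
print(between(idx, 25, 35))  # rows holding values 30, 25, 25 -> [2, 3, 5]
```

Composite queries like `a & b` or `a | b` then reduce to intersecting or unioning the row sets returned by each single-column search.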
# append data from the fields in a rec array; field names in the data
# correspond to column names. The first append creates the columns;
# subsequent appends must match the existing columns
>>> c.append(recdata)
>>> c.append(new_data)
# append data from the fields in a FITS file
>>> c.from_fits(fitsfile_name)
# add a dictionary column.
>>> c.create_column('weather', 'dict')
>>> c['weather'].write({'temp': 30.1, 'humid': 0.5})
# overwrite dict column
>>> c['weather'].write({'temp': 33.2, 'humid': 0.3, 'windspeed': 60.5})
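A dict column holds a single arbitrary mapping rather than one value per row, which is why each write replaces the whole stored dict. A hypothetical sketch of such a column backed by a JSON file (pycolumns's actual storage for dict columns may differ):

```python
import json
import os
import tempfile

class DictColumn:
    """A column holding one mapping, persisted as a JSON file."""

    def __init__(self, filename):
        self.filename = filename

    def write(self, data):
        # each write replaces the entire stored mapping
        with open(self.filename, 'w') as f:
            json.dump(data, f)

    def read(self):
        with open(self.filename) as f:
            return json.load(f)

path = os.path.join(tempfile.mkdtemp(), 'weather.json')
col = DictColumn(path)
col.write({'temp': 30.1, 'humid': 0.5})
col.write({'temp': 33.2, 'humid': 0.3, 'windspeed': 60.5})
print(col.read()['windspeed'])  # 60.5
```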
# you should not generally create array columns directly, since they
# can get out of sync with the existing columns. By default this
# raises an exception, but you can pass verify=False if you know
# what you are doing. In the future, special support will be added
# for adding new columns
>>> c.create_column('test', 'array')
# update values for an array column
>>> c['id'][35] = 10
>>> c['id'][35:35+3] = [8, 9, 10]
>>> c['id'][rows] = idvalues
# get all names, including dictionary and sub Columns
# same as list(c.keys())
>>> c.names
['ccd', 'dec', 'exposurename', 'id', 'imag', 'ra', 'x', 'y', 'g',
'meta', 'weather', 'psfstars']
# only array column names
>>> c.column_names
['ccd', 'dec', 'exposurename', 'id', 'imag', 'ra', 'x', 'y', 'g']
# only dict columns
>>> c.dict_names
['meta', 'weather']
# only sub Columns directories
>>> c.subcols_names
['psfstars']
# reload all columns or specified column/column list
>>> c.reload()
# delete all data. This will ask for confirmation
>>> c.delete()
# delete column and its data
>>> c.delete_column('ra')
# to configure the amount of memory used during index creation, specify
# cache_mem in gigabytes. The default is 1 gigabyte
>>> cols = pyc.Columns(fname, cache_mem=0.5)
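Creating an index on a large column means sorting more data than fits in memory, which is what a cache size like this bounds. The standard approach is an external merge sort: sort fixed-size chunks in memory, spill each sorted run to disk, then merge the runs. A hypothetical sketch of that technique (not pycolumns's exact implementation):

```python
import heapq
import pickle
import tempfile

def _spill(run):
    """Write one sorted run to a temporary file."""
    f = tempfile.TemporaryFile()
    pickle.dump(run, f)
    f.seek(0)
    return f

def _replay(f):
    """Iterate a spilled run back from disk."""
    yield from pickle.load(f)

def external_sort(values, chunk_size):
    """Sort values while holding at most ~chunk_size items in memory."""
    runs, chunk = [], []
    for v in values:
        chunk.append(v)
        if len(chunk) >= chunk_size:
            runs.append(_spill(sorted(chunk)))
            chunk = []
    if chunk:
        runs.append(_spill(sorted(chunk)))
    # heapq.merge streams the sorted runs together lazily
    return list(heapq.merge(*[_replay(r) for r in runs]))

print(external_sort([5, 3, 8, 1, 9, 2, 7], chunk_size=3))  # [1, 2, 3, 5, 7, 8, 9]
```

A smaller cache_mem means more, smaller runs and hence more merge work, trading speed for memory.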
# an entry in a Columns can itself be another Columns directory
>>> psfcols = cols['psfstars']
>>> psfcols
dir: /some/path/mydata.cols/psfstars.cols
Columns:
name type dtype index shape
--------------------------------------------------
ccd array <i2 True (64348146,)
id array <i8 True (64348146,)
imag array <f4 False (64348146,)
x array <f4 False (64348146,)
y array <f4 False (64348146,)
...etc
Dependencies
numpy
Download files
Source Distribution: pycolumns-1.0.0.tar.gz (25.4 kB)
File hashes for pycolumns-1.0.0.tar.gz:

Algorithm   | Hash digest
------------|------------
SHA256      | 4c715260a118bd8b24052158632c2a2e89e213210307c0d80a9220587b15d9a7
MD5         | 55e5c7ef095ac7a03bcfa34318287c97
BLAKE2b-256 | d81aa05614a0128b5018057d2badf505aec058c17db3815a8ee71cc7812e638a