A small library for taking the transpose of arbitrarily large .csvs
bigcsv: A small Python library to manipulate large csv files that can't fit in memory.
Transposition
Suppose you have a p x m matrix, where your original data is m samples with p features; that is, m points in p-dimensional space. Then we want the column space to be the features, i.e., we'd like to consider the m x p data matrix. This small library is for performing this calculation on arbitrarily large csv files.
It works in the following way:
- Read in chunks that fit in memory
- Transpose those in memory (which is fast)
- Write each transposed chunk to a .csv file
- Use `paste` to join the files horizontally (column-wise); this is why we don't need to save the index, since it will be the same as the columns of the original file
This process outputs the m x p matrix, as desired; a sketch of the procedure is given below. This is particularly useful for single-cell data, where expression matrices are often uploaded genewise, but you may want to work with machine learning models that learn cellwise :).
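For intuition, here is a minimal sketch of that procedure using pandas and the Unix `paste` utility. The function name `transpose_in_chunks` is hypothetical and for illustration only; it is not bigcsv's internal implementation.

```python
# A minimal, illustrative sketch of the chunked transpose -- not the
# bigcsv internals. Assumes the input csv has a header row.
import os
import subprocess
import tempfile

import pandas as pd

def transpose_in_chunks(infile: str, outfile: str, chunksize: int = 400) -> None:
    chunk_files = []
    with tempfile.TemporaryDirectory() as tmpdir:
        # Read the csv in row chunks that fit in memory
        for i, chunk in enumerate(pd.read_csv(infile, chunksize=chunksize)):
            # Transpose the chunk in memory (fast) and write it out.
            # Only the first chunk keeps the index: it is the same as
            # the columns of the original file.
            path = os.path.join(tmpdir, f'chunk_{i}.csv')
            chunk.T.to_csv(path, index=(i == 0), header=False)
            chunk_files.append(path)

        # Join the chunk files horizontally (column-wise) with `paste`
        with open(outfile, 'w') as out:
            subprocess.run(['paste', '-d', ',', *chunk_files], stdout=out, check=True)
```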
Converting to h5ad
If your data is purely numeric, it is much more efficient to store it in h5ad (readable by AnnData), which uses the amazing HDF5 format under the hood.
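For data that fits in memory, the conversion amounts to something like the following sketch using `anndata` and `pandas`; bigcsv itself presumably works chunk-by-chunk, and the file names here are illustrative.

```python
# Rough sketch of the csv -> h5ad conversion for in-memory data;
# an illustration, not bigcsv's implementation.
import anndata as ad
import pandas as pd

df = pd.read_csv('massive_dataset.csv', index_col=0)
adata = ad.AnnData(X=df.to_numpy())        # purely numeric matrix
adata.obs_names = df.index.astype(str)     # row labels
adata.var_names = df.columns.astype(str)   # column labels
adata.write_h5ad('converted.h5ad')
```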
Installation
To install, run `pip install bigcsv`.
How to use
All operations are methods of the BigCSV class, which contains the metadata used to do all calculations.
```python
from bigcsv import BigCSV

obj = BigCSV(
    file='massive_dataset.csv',
    chunksize=400,  # Number of rows to read in at each iteration
    # leave as default:
    # insep=',',
    # outsep=',',
    # save_chunks=False,
    # quiet=False,
)

obj.to_h5ad(outfile='converted.h5ad')

# Or maybe we want to keep it as a csv, but transpose it (in the case of non-numerical data)
obj.transpose(outfile='dataset_T.csv')
```
Then to upload to S3, we would run:

```python
obj.upload(
    file='converted.h5ad',
    bucket='braingeneersdev',
    endpoint_url='https://s3.nautilus.optiputer.net',
    aws_secret_key_id=secret,
    aws_secret_access_key=access,
    remote_file_name='jlehrer/massive_dataset_T.csv',
)
```
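Under the hood this presumably amounts to a standard S3 client call; a rough boto3 equivalent is sketched below (treating boto3 as the underlying client is an assumption, not a confirmed detail of bigcsv).

```python
# Assumed boto3 equivalent of obj.upload -- an illustration, not
# bigcsv's confirmed implementation.
import boto3

s3 = boto3.client(
    's3',
    endpoint_url='https://s3.nautilus.optiputer.net',
    aws_access_key_id=secret,        # credentials as in the example above
    aws_secret_access_key=access,
)
s3.upload_file(
    Filename='converted.h5ad',
    Bucket='braingeneersdev',
    Key='jlehrer/massive_dataset_T.csv',
)
```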
Documentation
bigcsv.BigCSV
Parameters:
- `file`: Path to input file
- `outfile`: Path to output file (the transposed input file)
- `sep=','`: Separator for the .csv, ',' by default
- `chunksize=400`: Number of lines per iteration
- `chunkfolder=None`: Optional, path to the chunk folder
- `quiet=False`: Boolean indicating whether to print progress
bigcsv.BigCSV.upload
Parameters:
- `bucket`: Bucket name
- `endpoint_url`: S3 endpoint
- `aws_secret_key_id`: AWS secret key for your account
- `aws_secret_access_key`: Secret key associated with the access key
- `remote_file_key`: Optional, key to upload the file to in S3. Must be a complete path, including the file name
- `remote_chunk_path`: Optional, key to upload chunks to in S3. Must be a folder-like path, where the chunks will be labeled as chunk_{outfile_name}_{l}.csv