A small library for taking the transpose of arbitrarily large .csvs

bigcsv: A small Python library to manipulate large csv files that can't fit in memory.

Transposition

Suppose you have a p x m matrix where your original data consists of m samples with p features, i.e. m points in p-dimensional space. Often we want the column space to be the features, that is, we'd like to consider the m x p data matrix instead. This small library performs that calculation on arbitrarily large csv files.
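
As a toy illustration (hypothetical data, not part of bigcsv), here is the layout change in pandas:

import pandas as pd

# p x m layout: 3 features (rows) by 2 samples (columns)
genes_by_cells = pd.DataFrame(
    [[0, 2], [1, 0], [3, 5]],
    index=['gene1', 'gene2', 'gene3'],  # p = 3 features
    columns=['cell1', 'cell2'],         # m = 2 samples
)

# m x p layout: samples as rows, features as columns
cells_by_genes = genes_by_cells.T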

It works in the following way:

  1. Read in chunks that fit in memory
  2. Transpose those in memory (which is fast)
  3. Write each transposed chunk to a .csv file
  4. Use paste to join the files horizontally (column-wise). This is also why we don't need to save the index of each transposed chunk: it will be the same as the columns of the original file.

This process outputs the m x p matrix, as desired. It is particularly useful for single-cell data, where expression matrices are often uploaded gene-wise, but you may want to work with machine learning models that learn cell-wise :).
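
For intuition, here is a minimal sketch of this chunked-transpose strategy using pandas and the Unix paste utility. It is an illustration under those assumptions, not bigcsv's actual implementation:

import os
import subprocess
import tempfile

import pandas as pd

def transpose_csv(infile, outfile, chunksize=400):
    # Sketch only: assumes a Unix-like system with paste available
    with tempfile.TemporaryDirectory() as chunkdir:
        chunk_files = []
        # Steps 1-3: read row chunks, transpose each in memory, write to disk
        for i, chunk in enumerate(pd.read_csv(infile, index_col=0, chunksize=chunksize)):
            path = os.path.join(chunkdir, f'chunk_{i}.csv')
            # Write the shared index (the original file's columns) only once
            chunk.T.to_csv(path, index=(i == 0))
            chunk_files.append(path)
        # Step 4: join the transposed chunks column-wise with paste
        with open(outfile, 'w') as out:
            subprocess.run(['paste', '-d', ',', *chunk_files], stdout=out, check=True)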

Converting to h5ad

If the data is purely numeric, it is much more efficient to store it in h5ad (readable by AnnData), which uses the amazing HDF5 format under the hood.
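
For data that does fit in memory, the conversion is roughly equivalent to the following anndata sketch (bigcsv itself reads the file chunksize rows at a time to stay within memory):

import anndata as ad
import pandas as pd

# In-memory sketch only: assumes the csv fits in RAM
df = pd.read_csv('massive_dataset.csv', index_col=0)
adata = ad.AnnData(df)  # X from values, obs/var names from row/column labels
adata.write_h5ad('converted.h5ad')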

Installation

To install, run pip install bigcsv

How to use

All operations are methods of the BigCSV class, which holds the metadata used in all calculations.

from bigcsv import BigCSV

obj = BigCSV(
    file='massive_dataset.csv',
    chunksize=400,  # Number of rows to read at each iteration
    # The rest can be left as defaults:
    # insep=',',
    # outsep=',',
    # save_chunks=False,
    # quiet=False,
)

obj.to_h5ad(outfile='converted.h5ad')

# Or maybe we want to keep it as a csv, but transpose it (in the case of non-numerical data)
obj.transpose(outfile='dataset_T.csv')

Then to upload to S3, we would run

obj.upload(
    file='converted.h5ad',
    bucket='braingeneersdev',
    endpoint_url='https://s3.nautilus.optiputer.net',
    aws_secret_key_id=secret,
    aws_secret_access_key=access,
    remote_file_name='jlehrer/massive_dataset.h5ad'
)

Documentation

  1. bigcsv.BigCSV

Parameters:

file: Path to input file
outfile: Path to output file (transposed input file)
sep=',': Separator for the .csv, ',' by default
chunksize=400: Number of lines to read per iteration
chunkfolder=None: Optional, Path to chunkfolder
quiet=False: Boolean indicating whether to print progress or not

  2. bigcsv.BigCSV.upload

Parameters:

bucket: Bucket name
endpoint_url: S3 endpoint
aws_secret_key_id: AWS secret key for your account
aws_secret_access_key: Specifies the secret key associated with the access key
remote_file_name: Optional, key to upload the file to in S3. Must be a complete path, including the file name
remote_chunk_path: Optional, key to upload chunks to in S3. Must be a folder-like path, where the chunks will be labeled as chunk_{outfile_name}_{l}.csv
