A small library for taking the transpose of arbitrarily large .csvs

Project description

transposecsv: A small Python library to transpose large csv files that can't fit in memory.

Suppose you have a p x m matrix, where your original data is m samples with p features, or equivalently m points in p-dimensional space. We want the columns to be the features, that is, we'd like to work with the m x p data matrix. This small library performs that transpose on arbitrarily large csv files.
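As a toy illustration of the shapes involved (the numbers are made up and small enough to fit in memory, so this does not use the library itself):

```python
# Hypothetical toy data: p = 2 features (rows) by m = 3 samples (columns).
features_by_samples = [
    [1.0, 2.0, 3.0],  # feature 0, measured on each of the 3 samples
    [4.0, 5.0, 6.0],  # feature 1
]

# Transposing gives the m x p data matrix: one row per sample, one column per feature.
samples_by_features = [list(column) for column in zip(*features_by_samples)]

print(len(samples_by_features), "x", len(samples_by_features[0]))
```

The library targets the same operation for files where the matrix cannot be held in memory all at once.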

It works in the following way:

  1. Read in chunks that fit in memory
  2. Transpose those in memory (which is fast)
  3. Write each transposed chunk to a .csv file
  4. Use paste to join the chunk files horizontally (columnwise). This is why we don't need to save the index of each chunk: it is the same as the columns of the original file.
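The steps above can be sketched with only the standard library. This is a minimal illustration, not the library's actual implementation: the function name transpose_csv_in_chunks is hypothetical, the final join is done in Python rather than by calling paste, and it assumes every row has the same number of fields.

```python
import csv
import itertools
import os
import tempfile
from contextlib import ExitStack


def transpose_csv_in_chunks(src, dst, chunksize=400, sep=","):
    """Transpose the CSV at `src` into `dst`, holding at most `chunksize` rows in memory."""
    with tempfile.TemporaryDirectory() as tmpdir:
        chunk_paths = []

        # Steps 1-3: read a chunk of rows, transpose it, write it to its own file.
        with open(src, newline="") as f:
            reader = csv.reader(f, delimiter=sep)
            while True:
                rows = list(itertools.islice(reader, chunksize))
                if not rows:
                    break
                path = os.path.join(tmpdir, f"chunk_{len(chunk_paths)}.csv")
                with open(path, "w", newline="") as chunk_file:
                    # zip(*rows) turns the chunk's rows into columns.
                    csv.writer(chunk_file, delimiter=sep).writerows(zip(*rows))
                chunk_paths.append(path)

        # Step 4: join the chunk files side by side, one output line per original column.
        with ExitStack() as stack, open(dst, "w", newline="") as out:
            readers = [
                csv.reader(stack.enter_context(open(p, newline="")), delimiter=sep)
                for p in chunk_paths
            ]
            writer = csv.writer(out, delimiter=sep)
            for parts in zip(*readers):
                # Line i of every chunk file holds values from original column i.
                writer.writerow(list(itertools.chain.from_iterable(parts)))
```

Every chunk file has one line per column of the original file, in the same order, so concatenating line i across the chunk files yields row i of the transpose. Note that this sketch keeps one open file handle per chunk during the join, so a very small chunksize on a very tall file could hit the operating system's open-file limit.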

This process outputs the m x p matrix, as desired. This is particularly useful for single-cell data, where expression matrices are often uploaded genewise, but you may want to work with machine learning models that learn cellwise :).

Installation

To install, run pip install transposecsv

How to use

The transpose operation lives in a lazy Transpose class: constructing the object does no work, and the transpose runs only when you call compute(). For example:

from transposecsv import Transpose 

transpose = Transpose(
    file_name='massive_dataset.csv',
    write_path='massive_dataset_T.csv',
    chunksize=400, # Number of rows to read in at each iteration
    # The remaining arguments can be left at their defaults:
    # insep=',',
    # outsep=',',
    # save_chunks=False,
    # quiet=False,
)

transpose.compute()

Then to upload to S3, we would run

transpose.upload(
    bucket='braingeneersdev',
    endpoint_url='https://s3.nautilus.optiputer.net',
    aws_secret_key_id=secret,
    aws_secret_access_key=access,
    remote_name='jlehrer/massive_dataset_T.csv'
)

Download files


Source Distribution

transposecsv-0.0.5.tar.gz (15.0 kB)

Uploaded Source

Built Distribution

transposecsv-0.0.5-py2.py3-none-any.whl (7.1 kB)

Uploaded Python 2 Python 3
