A small library for taking the transpose of arbitrarily large .csvs

Project description

transposecsv: A small Python library to transpose large csv files that can't fit in memory.

Suppose you have a p x m matrix, where the original data consists of m samples with p features (equivalently, m points in p-dimensional space). We want the columns to correspond to the features; that is, we want the m x p data matrix. This small library performs that transpose on arbitrarily large csv files.

It works in the following way:

  1. Read in chunks that fit in memory
  2. Transpose those in memory (which is fast)
  3. Write each transposed chunk to a .csv file
  4. Use paste to join the files horizontally (columnwise). This is why we don't need to save the index: it will be the same as the columns of the original file.

This process outputs the m x p matrix, as desired. This is particularly useful for single-cell data, where expression matrices are often uploaded genewise, but you may want to work with machine learning models that learn cellwise :).
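
The four steps above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the code transposecsv actually runs: the function name transpose_csv_in_chunks is hypothetical, the horizontal join is done in Python rather than by calling paste, and no field is assumed to contain an embedded newline.

```python
import csv
import itertools
import os
import tempfile


def transpose_csv_in_chunks(src, dst, chunksize=400, sep=","):
    """Transpose src into dst, holding at most `chunksize` input rows in memory.

    Hypothetical helper for illustration; not part of the transposecsv API.
    """
    chunk_paths = []
    with open(src, newline="") as f:
        reader = csv.reader(f, delimiter=sep)
        while True:
            # Step 1: read a chunk of rows that fits in memory.
            chunk = list(itertools.islice(reader, chunksize))
            if not chunk:
                break
            # Step 2: transpose the chunk in memory (rows become columns).
            transposed = zip(*chunk)
            # Step 3: write the transposed chunk to its own temporary file.
            fd, path = tempfile.mkstemp(suffix=".csv")
            with os.fdopen(fd, "w", newline="") as out:
                writer = csv.writer(out, delimiter=sep, lineterminator="\n")
                writer.writerows(transposed)
            chunk_paths.append(path)

    # Step 4: join the chunk files line by line, which is what
    # `paste -d, chunk_0.csv chunk_1.csv ...` does on the command line.
    files = [open(p, newline="") for p in chunk_paths]
    try:
        with open(dst, "w", newline="") as out:
            for lines in zip(*files):
                out.write(sep.join(line.rstrip("\n") for line in lines) + "\n")
    finally:
        for f in files:
            f.close()
        for p in chunk_paths:
            os.remove(p)
```

Every chunk file has one line per original column, so joining line i of each chunk yields row i of the transposed matrix. Note that this sketch keeps one file handle open per chunk, so a very small chunksize on a very tall file could hit the operating system's open-file limit.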

Installation

To install, run pip install transposecsv

How to use

The transpose operation is wrapped in a lazily-evaluated Transpose class: creating the object does not start any work, and the transpose only runs when you call compute(). For example:

from transposecsv import Transpose

transpose = Transpose(
    file_name='massive_dataset.csv',
    write_path='massive_dataset_T.csv',
    chunksize=400,  # Number of rows to read in at each iteration
    # Optional arguments, leave as default:
    # insep=',',
    # outsep=',',
    # save_chunks=False,
    # quiet=False,
)

transpose.compute()
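
Once compute() has finished, a quick sanity check is to confirm that the output has the input's dimensions swapped. The sketch below is an illustration only: csv_shape is a hypothetical helper, not part of transposecsv, and it assumes every row has the same number of fields as the first.

```python
import csv


def csv_shape(path, sep=","):
    """Return (rows, columns) by streaming the file once.

    Hypothetical helper for illustration; not part of the transposecsv API.
    """
    rows = 0
    cols = 0
    with open(path, newline="") as f:
        for record in csv.reader(f, delimiter=sep):
            if rows == 0:
                # Take the column count from the first row only.
                cols = len(record)
            rows += 1
    return rows, cols
```

If csv_shape('massive_dataset.csv') returns (p, m), then csv_shape('massive_dataset_T.csv') should return (m, p). The check reads each file one row at a time, so it does not need the data to fit in memory.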

Then, to upload the transposed file to S3, run:

transpose.upload(
    bucket='braingeneersdev',
    endpoint_url='https://s3.nautilus.optiputer.net',
    aws_secret_key_id=secret,
    aws_secret_access_key=access,
    remote_name='jlehrer/massive_dataset_T.csv'
)

Download files

Source Distribution

transposecsv-0.0.5.tar.gz (15.0 kB)

Uploaded Source

Built Distribution

transposecsv-0.0.5-py2.py3-none-any.whl (7.1 kB)

Uploaded Python 2 Python 3

File details

Details for the file transposecsv-0.0.5.tar.gz.

File metadata

  • Download URL: transposecsv-0.0.5.tar.gz
  • Size: 15.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.1 importlib_metadata/4.8.1 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.7

File hashes

Hashes for transposecsv-0.0.5.tar.gz
Algorithm    Hash digest
SHA256       274e0cb537d4eb7af51425eac03bf196f547e06caf0ead9815729a7675bdb947
MD5          23f66b3eb3ba28d09a03ceb2f409254d
BLAKE2b-256  bcc6a04ee1c0604909b6f8602c27a59d28bac7199f8f901a2d3a9fd83bf6f88f

File details

Details for the file transposecsv-0.0.5-py2.py3-none-any.whl.

File metadata

  • Download URL: transposecsv-0.0.5-py2.py3-none-any.whl
  • Size: 7.1 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.1 importlib_metadata/4.8.1 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.7

File hashes

Hashes for transposecsv-0.0.5-py2.py3-none-any.whl
Algorithm    Hash digest
SHA256       6b730ffbe55490a5a95d2c72cdece9fa66a7c8cdaf82310494825c49b1d965c0
MD5          6199915dca6663ef36c8f03e68a337d2
BLAKE2b-256  dd10e8137a1cadbc9156a6cca35821ffee14367f40545e99befceed4db136205
