
CLI for fast, flexible concatenation of tabular data using polars.

Project description


joinem provides a CLI for fast, flexible concatenation of tabular data using polars.

Install

python3 -m pip install joinem
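
To confirm the install, print the CLI version (the --version flag is documented under API below):

python3 -m joinem --version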

Features

  • Lazily streams I/O to expeditiously handle numerous large files.
  • Supports CSV and parquet input files.
    • Due to current polars limitations, JSON and feather files are not supported.
    • Input formats may be mixed.
  • Supports output to CSV, JSON, parquet, and feather file types.
  • Allows mismatched columns and/or empty data files with --how diagonal and --how diagonal_relaxed.
  • Provides a progress bar with --progress.
  • Adds programmatically generated columns to the output.

Example Usage

Pass input filenames via stdin, one filename per line.

find path/to/*.parquet path/to/*.csv | python3 -m joinem out.parquet

Output file type is inferred from the extension of the output file name. Supported output types are CSV, JSON, parquet, and feather.

find -name '*.parquet' | python3 -m joinem out.json

Use --progress to show a progress bar.

ls -1 path/{*.csv,*.pqt} | python3 -m joinem out.csv --progress

If file columns may mismatch, use --how diagonal.

find path/to/ -name '*.csv' | python3 -m joinem out.csv --how diagonal

If some files may be empty, use --how diagonal_relaxed.
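
For example, mirroring the command above,

find path/to/ -name '*.csv' | python3 -m joinem out.csv --how diagonal_relaxed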

To run via Singularity/Apptainer,

ls -1 *.csv | singularity run docker://ghcr.io/mmore500/joinem out.feather

Add a literal-value column to the output.

ls -1 *.csv | python3 -m joinem out.csv --with-column 'pl.lit(2).alias("two")'

Alias an existing column in the output.

ls -1 *.csv | python3 -m joinem out.csv --with-column 'pl.col("a").alias("a2")'

Apply a regex to source datafile paths to create a new column in the output.

ls -1 path/to/*.csv | python3 -m joinem out.csv \
  --with-column 'pl.lit(filepath).str.replace(r".*?([^/]*)\.csv", r"${1}").alias("filename stem")'

Read data from stdin and write data to stdout.

cat foo.csv | python3 -m joinem "/dev/stdout" --stdin --output-filetype csv --input-filetype csv
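
For orientation, joinem's lazy streaming corresponds roughly to the polars pattern sketched below. This is a conceptual sketch built on the public polars API (scan, concat, sink), not joinem's actual implementation:

import sys
import polars as pl

# One filepath per line on stdin, as in the examples above.
paths = [line.strip() for line in sys.stdin if line.strip()]

# Scan lazily; no file is fully loaded into memory here.
frames = [
    pl.scan_parquet(p) if p.endswith(".parquet") else pl.scan_csv(p)
    for p in paths
]

# how="diagonal" tolerates mismatched columns, as with --how diagonal.
combined = pl.concat(frames, how="diagonal")

# Stream the concatenated result to disk.
combined.sink_parquet("out.parquet")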

API

usage: __main__.py [-h] [--version] [--progress] [--stdin] [--with-column WITH_COLUMNS]
                   [--how {vertical,horizontal,diagonal,diagonal_relaxed}] [--input-filetype INPUT_FILETYPE]
                   [--output-filetype OUTPUT_FILETYPE] [--open-kwarg OPEN_KWARGS] [--sink-kwarg SINK_KWARGS]
                   output_file

Concatenate CSV and/or parquet tabular data files.

positional arguments:
  output_file           Output file name

options:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --progress            Show progress bar
  --stdin               Read data from stdin
  --with-column WITH_COLUMNS
                        Expression to be evaluated to add a column, with access to each datafile's filepath as
                        `filepath` and polars as `pl`. Example: 'pl.lit(filepath).str.replace(r".*?([^/]*)\.csv",
                        r"${1}").alias("filename stem")'
  --how {vertical,horizontal,diagonal,diagonal_relaxed}
                        How to concatenate frames. See <https://docs.pola.rs/py-
                        polars/html/reference/api/polars.concat.html> for more information.
  --input-filetype INPUT_FILETYPE
                        Filetype of input. Otherwise, inferred. Example: csv, parquet, json, feather
  --output-filetype OUTPUT_FILETYPE
                        Filetype of output. Otherwise, inferred. Example: csv, parquet
  --open-kwarg OPEN_KWARGS
                        Additional keyword arguments to pass to the file opening call. Provide as 'key=value'.
                        Specify multiple kwargs by using this flag multiple times. Arguments will be evaluated as
                        Python expressions. Example: 'infer_schema_length=None'
  --sink-kwarg SINK_KWARGS
                        Additional keyword arguments to pass to the file sink call. Provide as 'key=value'. Specify
                        multiple kwargs by using this flag multiple times. Arguments will be evaluated as Python
                        expressions. Example: 'compression="lz4"'

Provide input filepaths via stdin. Example: find path/to/ -name '*.csv' | python3 -m joinem out.csv
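
As a concrete illustration, the two kwarg flags can be combined; the keyword arguments below are simply the examples from the help text (infer_schema_length is forwarded to the CSV open call, compression to the parquet sink):

ls -1 *.csv | python3 -m joinem out.parquet \
  --open-kwarg 'infer_schema_length=None' \
  --sink-kwarg 'compression="lz4"'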

Citing

If joinem contributes to a scholarly work, please cite it as

Matthew Andres Moreno. (2024). mmore500/joinem. Zenodo. https://doi.org/10.5281/zenodo.10701182

@software{moreno2024joinem,
  author = {Matthew Andres Moreno},
  title = {mmore500/joinem},
  month = feb,
  year = 2024,
  publisher = {Zenodo},
  doi = {10.5281/zenodo.10701182},
  url = {https://doi.org/10.5281/zenodo.10701182}
}

And don't forget to leave a star on GitHub!

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

joinem-0.5.0.tar.gz (6.4 kB)

Uploaded: Source

Built Distribution

joinem-0.5.0-py2.py3-none-any.whl (6.8 kB)

Uploaded: Python 2, Python 3

File details

Details for the file joinem-0.5.0.tar.gz.

File metadata

  • Download URL: joinem-0.5.0.tar.gz
  • Upload date:
  • Size: 6.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.6

File hashes

Hashes for joinem-0.5.0.tar.gz
Algorithm Hash digest
SHA256 c2e2e876f96cbce78d507ae5bac251b11577ebd726388eec3d6003de85108aa4
MD5 5994dfc3c87bb5b0d86830cc78926308
BLAKE2b-256 89af76efb38be6626ad7710471abf35d5dd080a3c56dc5f2057f7bfe87c676df

File details

Details for the file joinem-0.5.0-py2.py3-none-any.whl.

File metadata

  • Download URL: joinem-0.5.0-py2.py3-none-any.whl
  • Upload date:
  • Size: 6.8 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.6

File hashes

Hashes for joinem-0.5.0-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 a49eabe352e9eb6a1622e9fa89465b3dd6e37226e69b89e1e96d7fc5af9d0ffc
MD5 df09ba9139a1d1805f7e36d10840f739
BLAKE2b-256 8b807ab3412035f6dbbff085b00fb56ad85c04b78b44bbd96fb2873ca073b037
