CLI for fast, flexible concatenation of tabular data using Polars.

joinem provides a CLI for fast, flexible concatenation of tabular data using Polars.

Install

python3 -m pip install joinem

Features

  • Lazily streams I/O to expeditiously handle numerous large files.
  • Supports CSV and parquet input files.
    • Due to current Polars limitations, JSON and feather files are not supported.
    • Input formats may be mixed.
  • Supports output to CSV, JSON, parquet, and feather file types.
  • Allows mismatched columns and/or empty data files with --how diagonal and --how diagonal_relaxed.
  • Provides a progress bar with --progress.
  • Adds programmatically generated columns to the output with --with-column.

Example Usage

Pass input filenames via stdin, one filename per line.

find path/to/*.parquet path/to/*.csv | python3 -m joinem out.parquet
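
The same pipeline can be driven from Python instead of a shell. The following is a minimal sketch, not part of joinem itself: the helper names `build_stdin_payload` and `run_joinem` are hypothetical, and the only joinem behavior assumed is what is documented here (filenames on stdin, one per line, with the output file as the positional argument).

```python
import subprocess
import sys
from pathlib import Path


def build_stdin_payload(directory, patterns=("*.csv", "*.parquet")):
    """Return matching file paths as newline-separated text, one per line."""
    root = Path(directory)
    paths = sorted(str(p) for pattern in patterns for p in root.glob(pattern))
    return "\n".join(paths) + "\n" if paths else ""


def run_joinem(directory, output_file):
    """Pipe the file list to `python -m joinem OUTPUT` (joinem must be installed)."""
    payload = build_stdin_payload(directory)
    subprocess.run(
        [sys.executable, "-m", "joinem", output_file],
        input=payload,
        text=True,
        check=True,  # raise if joinem exits with a nonzero status
    )
```

Calling `run_joinem("path/to", "out.parquet")` would then correspond roughly to the shell command above.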

Output file type is inferred from the extension of the output file name. Supported output types are feather, JSON, parquet, and CSV.

find . -name '*.parquet' | python3 -m joinem out.json

Use --progress to show a progress bar.

ls -1 path/{*.csv,*.pqt} | python3 -m joinem out.csv --progress

If file columns may mismatch, use --how diagonal.

find path/to/ -name '*.csv' | python3 -m joinem out.csv --how diagonal
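
To make the effect of --how diagonal concrete, here is a plain-Python illustration of the semantics the Polars documentation describes for diagonal concatenation: the output takes the union of all input columns, and values missing from a given input are filled with null. This is only a sketch of the behavior, not how joinem or Polars implement it; `diagonal_concat` is a hypothetical name, and rows are modeled as dictionaries.

```python
def diagonal_concat(tables):
    """Stack rows from several tables, filling absent columns with None."""
    # Collect the union of column names, in order of first appearance.
    columns = []
    for table in tables:
        for row in table:
            for name in row:
                if name not in columns:
                    columns.append(name)
    # Emit every row with the full column set; missing values become None.
    return [
        {name: row.get(name) for name in columns}
        for table in tables
        for row in table
    ]


first = [{"a": 1, "b": 2}]
second = [{"b": 3, "c": 4}]
rows = diagonal_concat([first, second])
```

Here `rows` holds two records, each with columns a, b, and c; the first has no value for c and the second has no value for a.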

If some files may be empty, use --how diagonal_relaxed.

To run via Singularity/Apptainer:

ls -1 *.csv | singularity run docker://ghcr.io/mmore500/joinem out.feather

Add a literal-value column to the output.

ls -1 *.csv | python3 -m joinem out.csv --with-column 'pl.lit(2).alias("two")'

Alias an existing column in the output.

ls -1 *.csv | python3 -m joinem out.csv --with-column 'pl.col("a").alias("a2")'

Apply a regex to each source data file's path to create a new column in the output.

ls -1 path/to/*.csv | python3 -m joinem out.csv \
  --with-column 'pl.lit(filepath).str.replace(r".*?([^/]*)\.csv", r"${1}").alias("filename stem")'
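
To see what that pattern extracts, it can be tried on a sample path with Python's standard `re` module. This is only an approximation for illustration: Polars uses a different regex engine, and the replacement syntax differs (`\1` in Python, `${1}` in the Polars expression). The helper name `filename_stem` is hypothetical.

```python
import re

# Same pattern as in the --with-column example above.
PATTERN = r".*?([^/]*)\.csv"


def filename_stem(filepath):
    """Return the file name without its directories or .csv extension."""
    # Replace the first match with capture group 1, the text between
    # the last "/" and ".csv".
    return re.sub(PATTERN, r"\1", filepath, count=1)


stem = filename_stem("path/to/foo.csv")
```

For the sample path `path/to/foo.csv`, `stem` is `foo`; a path that does not match the pattern is returned unchanged.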

Read data from stdin and write data to stdout.

cat foo.csv | python3 -m joinem "/dev/stdout" --stdin --output-filetype csv --input-filetype csv

Advanced usage. Write to parquet via stdout using pv to display progress, cast "myValue" column to categorical, and use lz4 for parquet compression.

ls -1 input/*.pqt | python3 -m joinem "/dev/stdout" \
  --output-filetype pqt \
  --with-column 'pl.col("myValue").cast(pl.Categorical)' \
  --write-kwarg 'compression="lz4"' \
  | pv > concat.pqt

API

usage: __main__.py [-h] [--version] [--progress] [--stdin] [--eager-read]
                   [--eager-write] [--with-column WITH_COLUMNS]
                   [--string-cache]
                   [--how {vertical,horizontal,diagonal,diagonal_relaxed}]
                   [--input-filetype INPUT_FILETYPE]
                   [--output-filetype OUTPUT_FILETYPE]
                   [--read-kwarg READ_KWARGS] [--write-kwarg WRITE_KWARGS]
                   output_file

CLI for fast, flexbile concatenation of tabular data using Polars.

positional arguments:
  output_file           Output file name

options:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --progress            Show progress bar
  --stdin               Read data from stdin
  --eager-read          Use read_* instead of scan_*. Can improve performance
                        in some cases.
  --eager-write         Use write_* instead of sink_*. Can improve performance
                        in some cases.
  --with-column WITH_COLUMNS
                        Expression to be evaluated to add a column, as access
                        to each datafile's filepath as `filepath` and polars
                        as `pl`. Example:
                        'pl.lit(filepath).str.replace(r".*?([^/]*)\.csv",
                        r"${1}").alias("filename stem")'
  --string-cache        Enable Polars global string cache.
  --how {vertical,horizontal,diagonal,diagonal_relaxed}
                        How to concatenate frames. See
                        <https://docs.pola.rs/py-
                        polars/html/reference/api/polars.concat.html> for more
                        information.
  --input-filetype INPUT_FILETYPE
                        Filetype of input. Otherwise, inferred. Example: csv,
                        parquet, json, feather
  --output-filetype OUTPUT_FILETYPE
                        Filetype of output. Otherwise, inferred. Example: csv,
                        parquet
  --read-kwarg READ_KWARGS
                        Additional keyword arguments to pass to pl.read_* or
                        pl.scan_* call(s). Provide as 'key=value'. Specify
                        multiple kwargs by using this flag multiple times.
                        Arguments will be evaluated as Python expressions.
                        Example: 'infer_schema_length=None'
  --write-kwarg WRITE_KWARGS
                        Additional keyword arguments to pass to pl.write_* or
                        pl.sink_* call. Provide as 'key=value'. Specify
                        multiple kwargs by using this flag multiple times.
                        Arguments will be evaluated as Python expressions.
                        Example: 'compression="lz4"'

Provide input filepaths via stdin. Example: find path/to/ -name '*.csv' |
python3 -m joinem out.csv
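
The 'key=value' form accepted by --read-kwarg and --write-kwarg, with the value evaluated as a Python expression, can be pictured with the sketch below. It is an assumption about the general approach rather than joinem's actual parsing code, and `parse_kwarg` is a hypothetical name. It also shows why the string "lz4" needs its own quotes inside the shell-quoted argument: without them, the value would be evaluated as a variable name.

```python
def parse_kwarg(spec):
    """Split 'key=value' at the first '=' and evaluate the value as Python."""
    key, _, value = spec.partition("=")
    # eval runs arbitrary code, so only pass strings you trust.
    return key.strip(), eval(value)


kwargs = dict(
    parse_kwarg(spec)
    for spec in ['compression="lz4"', "infer_schema_length=None"]
)
```

After this runs, `kwargs` maps `compression` to the string `lz4` and `infer_schema_length` to `None`, ready to be passed on as keyword arguments.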

Citing

If joinem contributes to a scholarly work, please cite it as

Matthew Andres Moreno. (2024). mmore500/joinem. Zenodo. https://doi.org/10.5281/zenodo.10701182

@software{moreno2024joinem,
  author = {Matthew Andres Moreno},
  title = {mmore500/joinem},
  month = feb,
  year = 2024,
  publisher = {Zenodo},
  doi = {10.5281/zenodo.10701182},
  url = {https://doi.org/10.5281/zenodo.10701182}
}

And don't forget to leave a star on GitHub!
