Ranchero: metadata wrangling for bioinformatics

Ranchero

Is your mycobacterial metadata a mess? Grab the M. bovis by the horns with Ranchero.

Ranchero is a Python solution to the dozens of different metadata formats used in genomic datasets. While it focuses on NCBI's collection of Mycobacterium tuberculosis complex metadata, it is also useful for other organisms. For information on what Ranchero considers "a sample" and the like, see ./docs/data_structure.md. For information on how to configure Ranchero, see ./docs/configuration.md.

Features

  • Input a TSV/JSON/CSV of new samples and their metadata into a dataframe
  • Merge columns of similar data types into a single column, filling in nulls/empty values as you go
  • Input a TSV of metadata to "inject" into an existing dataframe, optionally overriding metadata already present
  • Flatten all of those "missing" and "Not Applicable" strings into proper null values
  • Convert countries into three-letter country codes per ISO 3166
  • Convert dates into an ISO 8601-like YYYY-MM-DD format; missing months/days are denoted as NN
  • Convert common host animal names to a standardized Genus species "common name" format
  • (tuberculosis only) Convert old-school strain names to the modern lineage system
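
To make a few of the features above concrete, here is a minimal standalone sketch of null flattening, column merging, and partial-date padding in plain Python. It is not Ranchero's implementation and does not call Ranchero: the function names, the list of placeholder strings, and the accepted date layouts are all assumptions made for illustration.

```python
import re

# Hypothetical placeholder strings; Ranchero's real list may differ.
NULL_STRINGS = {"", "missing", "not applicable", "n/a", "na", "unknown"}


def flatten_null(value):
    """Return None for placeholder strings, otherwise the stripped text."""
    if value is None:
        return None
    text = str(value).strip()
    return None if text.lower() in NULL_STRINGS else text


def merge_columns(rows, sources, target):
    """Add `target` to each row: the first non-null value among `sources`."""
    merged = []
    for row in rows:
        values = (flatten_null(row.get(col)) for col in sources)
        first = next((v for v in values if v is not None), None)
        merged.append({**row, target: first})
    return merged


def normalize_date(value):
    """Pad YYYY, YYYY-MM or YYYY-MM-DD to YYYY-MM-DD, using NN for gaps."""
    text = flatten_null(value)
    if text is None:
        return None
    match = re.fullmatch(r"(\d{4})(?:-(\d{1,2}))?(?:-(\d{1,2}))?", text)
    if match is None:
        return None  # unrecognized layout; this sketch handles only three
    year, month, day = match.groups()
    month = month.zfill(2) if month else "NN"
    day = day.zfill(2) if day else "NN"
    return f"{year}-{month}-{day}"


# Hypothetical sample rows; the column names are made up for this example.
rows = [
    {"host": "Not Applicable", "host_sciname": "Bos taurus", "date": "2019-7"},
    {"host": "missing", "host_sciname": "missing", "date": "2019"},
]
for row in merge_columns(rows, ["host", "host_sciname"], "host_merged"):
    print(row["host_merged"], normalize_date(row["date"]))
```

The sketch does not check that months and days fall in valid ranges, and it ignores date layouts such as DD/MM/YYYY entirely.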

Dependencies

  • Python 3.11-ish (3.9+ should be okay)
  • pandas >= 2.0.0
  • pyarrow, even if not working with Apache Arrow datasets
  • polars for Python == 1.27.0
  • tqdm
  • xmltodict for working with Entrez Direct files
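
One way to see which of these packages are already present in an environment is the standard library's importlib.metadata. This is only a convenience sketch, not something Ranchero ships; the helper name installed_versions is made up for this example.

```python
from importlib import metadata

# Distribution names taken from the dependency list above.
DEPENDENCIES = ["pandas", "pyarrow", "polars", "tqdm", "xmltodict"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions(DEPENDENCIES)
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")

# The list above pins polars to an exact version, so flag any mismatch.
if report["polars"] not in (None, "1.27.0"):
    print("note: polars is installed, but not at the pinned 1.27.0")
```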

Supported inputs

Platform                   Expected format             Ranchero function
BigQuery                   newline-delimited JSONL     from_bigquery()
Entrez Direct (efetch)     XML                         from_efetch()
NCBI SRA web search        XML                         from_efetch()
Excel/LibreOffice          TSV (XLSX not supported)    from_tsv()
Google Sheets              TSV                         from_tsv()
NCBI Run Selector          CSV                         from_run_selector()
basically anything else    TSV                         from_tsv()

BigQuery typically outputs JSON in a format that polars does not like; from_bigquery() will fix it on the fly. efetch typically outputs invalid XML; from_efetch() will fix it on the fly. However, note that the only supported XML inputs are efetch output generated with -db sra -format native -mode xml and output from the NCBI SRA web search.
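
For readers unfamiliar with the newline-delimited layout, the sketch below parses it with only the standard library: one JSON object per line. It illustrates the file format, not what from_bigquery() does internally, and the field names in the sample (acc, organism) are hypothetical.

```python
import io
import json


def read_ndjson(handle):
    """Parse newline-delimited JSON: one JSON object per non-empty line."""
    records = []
    for line_number, line in enumerate(handle, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as err:
            raise ValueError(f"line {line_number} is not valid JSON") from err
    return records


# Hypothetical two-record sample standing in for a BigQuery export.
sample = io.StringIO(
    '{"acc": "SRR0000001", "organism": "Mycobacterium bovis"}\n'
    "\n"
    '{"acc": "SRR0000002", "organism": null}\n'
)
records = read_ndjson(sample)
print(len(records), "records")
```

Reporting the line number on failure makes it easier to find the offending record in a large export.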

Download files

Source Distribution

ranchero-0.1.0rc20.tar.gz (108.7 kB)

Built Distribution

ranchero-0.1.0rc20-py3-none-any.whl (113.9 kB)

File details

Details for the file ranchero-0.1.0rc20.tar.gz.

File metadata

  • Download URL: ranchero-0.1.0rc20.tar.gz
  • Size: 108.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.3

File hashes

Hashes for ranchero-0.1.0rc20.tar.gz
Algorithm Hash digest
SHA256 d9ab0d4f047942c9277f14c3ae7fb0c295aad1835ccbe6ef21d715d45d12bd22
MD5 e9fc4692e5ec698a8db96976b6ec7e48
BLAKE2b-256 beea604ec122815f240581b6060af94a88e8c1778d4d57318d8c7bd45394501c

File details

Details for the file ranchero-0.1.0rc20-py3-none-any.whl.

File metadata

  • Download URL: ranchero-0.1.0rc20-py3-none-any.whl
  • Size: 113.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.3

File hashes

Hashes for ranchero-0.1.0rc20-py3-none-any.whl
Algorithm Hash digest
SHA256 3e27026e8a53a622fba1981e8d50323a06aef2d4979b2c34e0962aad3ac65a65
MD5 38aed6a8642838dbdc878f68f08172b5
BLAKE2b-256 2e664874612fa237bea41c1088045c7389a9be6af4135cdcc5924d7de0e89ff5
