Ranchero: metadata wrangling for bioinformatics
Ranchero
Is your mycobacterial metadata a mess? Grab the M. bovis by the horns with Ranchero.
Ranchero is a Python solution to the dozens of different metadata formats used in genomic datasets. While it is specifically focused on NCBI's collection of Mycobacterium tuberculosis complex metadata, it still has utility for other organisms.
GitHub: github.com/aofarrel/ranchero
Features
- Pre-configured to standardize dozens of common NCBI metadata fields
- Input a TSV/JSON/CSV of new samples and their metadata into a dataframe
- Merge columns of similar data types into a single column, filling in nulls/empty values as you go
- Input a TSV of metadata to "inject" into an existing dataframe, optionally overriding metadata already present
- Flatten all of those "missing" and "Not Applicable" strings into proper null values
- Convert countries into three-letter country codes per ISO 3166
- Convert dates into an ISO 8601-like YYYY-MM-DD format
- Convert common host animal names to the standardized Genus species format when possible, as well as a common name
- (tuberculosis only) Convert old-school strain names (Beijing, LAM, etc) to the modern lineage system (L2.2.1, L4.3, etc)
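A few of the features above (flattening placeholder strings to nulls, merging similar columns, injecting new metadata with optional overriding, and normalizing dates) can be sketched in plain Python. This is a library-agnostic illustration using only the standard library, not Ranchero's actual implementation or API: the names `NULL_STRINGS`, `flatten_null`, `merge_columns`, `inject`, and `normalize_date` are hypothetical, as are the placeholder strings, the accepted date formats, and the sample row.

```python
from datetime import datetime

# Hypothetical placeholder strings; Ranchero's real list is not shown in this document.
NULL_STRINGS = {"", "missing", "not applicable", "n/a", "na", "not collected"}


def flatten_null(value):
    """Return None for placeholder strings, otherwise the stripped text."""
    if value is None:
        return None
    text = str(value).strip()
    return None if text.lower() in NULL_STRINGS else text


def merge_columns(row, sources, target):
    """Collapse several similar columns into `target`, keeping the first non-null value."""
    merged = {key: value for key, value in row.items() if key not in sources}
    candidates = (flatten_null(row.get(column)) for column in sources)
    merged[target] = next((v for v in candidates if v is not None), None)
    return merged


def inject(existing, incoming, overwrite=False):
    """Fill nulls in `existing` from `incoming`; replace real values only if overwrite=True."""
    result = dict(existing)
    for key, value in incoming.items():
        value = flatten_null(value)
        if value is None:
            continue  # never inject a null over anything
        if overwrite or flatten_null(result.get(key)) is None:
            result[key] = value
    return result


def normalize_date(value, formats=("%Y-%m-%d", "%d-%b-%Y", "%Y/%m/%d")):
    """Return the date as YYYY-MM-DD, or None if no assumed format matches."""
    text = flatten_null(value)
    if text is None:
        return None
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


# Made-up sample row for illustration only.
row = {"sample": "SAMPLE_1", "geo_loc_name": "Not Applicable",
       "country": "Peru", "collection_date": "missing"}
merged = merge_columns(row, ["geo_loc_name", "country"], "country")
updated = inject(merged, {"collection_date": "03-Mar-2015"})
updated["collection_date"] = normalize_date(updated["collection_date"])
```

Keeping the first non-null value when merging, and requiring an explicit `overwrite=True` before replacing existing metadata, are design choices made for this sketch; the source only says that overriding is optional.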
Dependencies
- Python 3.11-ish (3.9+ should be okay)
- pandas >= 2.0.0
- pyarrow, even if not working with Apache Arrow datasets
- polars for Python == 1.27.0
- tqdm
- xmltodict for working with Entrez Direct files
Supported inputs
| Platform | Expected format | Ranchero function |
|---|---|---|
| BigQuery | newline-delimited JSON (JSONL)† | from_bigquery() |
| Entrez Direct (efetch) | XML‡ | from_efetch() |
| NCBI SRA web search | XML‡ | from_efetch() |
| Excel/LibreOffice | TSV (XLSX not supported) | from_tsv() |
| Google Sheets | TSV | from_tsv() |
| NCBI Run Selector | CSV | from_run_selector() |
| basically anything else | TSV | from_tsv() |
† BQ typically outputs JSONs in a format polars does not like; from_bigquery() will fix it on the fly.
‡ efetch typically outputs invalid XML; from_efetch() will fix it on the fly. However, note that only output from -db sra -format native -mode xml and from NCBI SRA web search is supported.
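For readers unfamiliar with the newline-delimited JSON format expected from BigQuery, the sketch below shows what parsing it involves: one JSON object per line, with blank lines skipped. It uses only the standard library and is not how from_bigquery() works internally; the function name `read_ndjson` and the sample records are hypothetical.

```python
import io
import json


def read_ndjson(handle):
    """Parse newline-delimited JSON: one object per non-blank line."""
    rows = []
    for line_number, line in enumerate(handle, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            rows.append(json.loads(line))
        except json.JSONDecodeError as error:
            raise ValueError(f"line {line_number} is not valid JSON: {error}") from error
    return rows


# Made-up records for illustration only.
sample = io.StringIO(
    '{"acc": "SAMPLE_1", "organism": "Mycobacterium bovis"}\n'
    "\n"
    '{"acc": "SAMPLE_2", "organism": null}\n'
)
rows = read_ndjson(sample)
```

Reporting the line number on failure makes it easier to locate a malformed record in a large export.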
File details
Details for the file ranchero-0.1.0rc21.tar.gz.
File metadata
- Download URL: ranchero-0.1.0rc21.tar.gz
- Upload date:
- Size: 108.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5f1f8802b404e27855eff5feb49760b9c047b547dac3128c285fa688939bc1ac |
| MD5 | 568430e73f760bc963dfaa392653b571 |
| BLAKE2b-256 | fae22559815c93fe9b5658372318fdcd0998442688befbf4d9ac9068a8bc7e3b |
File details
Details for the file ranchero-0.1.0rc21-py3-none-any.whl.
File metadata
- Download URL: ranchero-0.1.0rc21-py3-none-any.whl
- Upload date:
- Size: 113.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | db449518f914d461950bb12fff9a511789fb2da1cc342a10b95d775d4f68627b |
| MD5 | 07f5ab969a3b0be36c3a1496cc2b0528 |
| BLAKE2b-256 | f6fc3793707b0d669a9ee43d8a1dc25f02f4e8c6e755e95ac5aa3d9f3577472b |