Harmonize messy NCBI BioSample metadata at scale

BioMetaHarmonizer

A Python package for fetching, parsing, and standardizing NCBI BioSample metadata for large-scale genomic epidemiology.

What it does

NCBI BioSample metadata is free-text, crowd-sourced, and inconsistent across submitters. BioMetaHarmonizer fetches BioSample XML records via the Entrez API, maps raw attribute names to a fixed set of standard columns, normalizes placeholder null values, parses dates and geographic strings, and assigns One Health categories. The result is a pandas DataFrame that can be written to CSV, TSV, Excel, or Parquet.

Input can be BioSample accessions (SAMN, SAME, SAMD), assembly accessions (GCF_, GCA_), or a mix of both. Assembly accessions are resolved to BioSample IDs through locally cached NCBI assembly summary flat files.
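
As a rough illustration of how a mixed input list could be partitioned before resolution, the sketch below sorts accessions by prefix. The helper names (classify_accession, split_accessions) and the regular expressions are assumptions made for this example; they are not taken from the package.

```python
import re

# Prefixes come from the text above; the exact patterns are an assumption.
_BIOSAMPLE_RE = re.compile(r"^SAM[NED][A-Z]?\d+$")    # SAMN..., SAME(A)..., SAMD...
_ASSEMBLY_RE = re.compile(r"^GC[AF]_\d+(\.\d+)?$")    # GCA_/GCF_ with optional version


def classify_accession(acc: str) -> str:
    """Return 'biosample', 'assembly', or 'unknown' for one accession string."""
    acc = acc.strip()
    if _BIOSAMPLE_RE.match(acc):
        return "biosample"
    if _ASSEMBLY_RE.match(acc):
        return "assembly"
    return "unknown"


def split_accessions(accessions):
    """Partition a mixed list of accessions by type, preserving input order."""
    groups = {"biosample": [], "assembly": [], "unknown": []}
    for acc in accessions:
        groups[classify_accession(acc)].append(acc.strip())
    return groups


groups = split_accessions(["SAMN12345678", "GCF_000001405.39", "SAMEA104228123"])
```

Only the "assembly" group would need the lookup through the cached assembly summary files; the "biosample" group can go to Entrez directly.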

Records submitted under NCBI pathogen packages (e.g. Pathogen.cl.1.0, Pathogen.env.1.0) often carry a structured <Antibiogram> section alongside standard attributes. BioMetaHarmonizer parses the antibiogram and serializes it as a compact JSON list in _extra_attributes["antibiogram"] so that MIC and phenotype data are never silently discarded.

Installation

git clone https://github.com/rustam-bioinfo/BioMetaHarmonizer.git
cd BioMetaHarmonizer
pip install -e .

Requires Python 3.9+. Dependencies are declared in pyproject.toml and installed automatically.

The package ships with pre-built schema files (unified.json, one_health_dictionaries.json, ncbi_attributes.xml). The rebuild scripts in scripts/ are only needed when you want to refresh those files from upstream sources — see Rebuilding schema files.

Quick start

Command line

biometaharmonizer run \
    --input  accessions.txt \
    --email  your@email.com \
    --output harmonized.csv

Flag	Default	Description
`--input FILE`	required	Path to accession list (one per line)
`--email EMAIL`	required	Valid contact email for NCBI Entrez — must contain `@` and a domain
`--output FILE`	required	Output file path
`--api-key KEY`	—	NCBI API key; raises rate limit from 3 to 10 requests/second
`--cache-dir DIR`	`~/.biometaharmonizer/cache/`	Directory for assembly summary flat files
`--format FORMAT`	inferred from file extension	`csv`, `tsv`, `excel`, `parquet`
`--summary FILE`	—	Write a per-column fill-rate CSV
`--fetch-batch-size N`	`200`	Number of records per efetch request
`--esearch-batch-size N`	`200`	Number of accessions per esearch term
`--refresh-cache`	off	Force re-download of assembly summary flat files regardless of age
`--verbose`	off	Enable DEBUG-level logging

Python API

from biometaharmonizer.ingestion import set_email, ingest
from biometaharmonizer import KeyMapper, DateEngine, GeoEngine, OneHealthClassifier
from biometaharmonizer import write, write_summary

# Ingest: accepts a file path, a Python list, or a mix of both accession types
set_email("your@email.com")
df = ingest("accessions.txt")
# or: df = ingest(["SAMN12345678", "GCF_000001405.39"])

# Force re-download of assembly summary flat files (bypasses 7-day TTL):
# df = ingest("accessions.txt", refresh_cache=True)

# Key harmonization — renames raw columns to standard keys, coalesces duplicates
# Needed only if you bring your own DataFrame; ingest() already applies the schema
mapper = KeyMapper()
df = mapper.map_columns(df)

# Date parsing: 40+ input formats -> ISO 8601 (YYYY / YYYY-MM / YYYY-MM-DD)
de = DateEngine()
date_df = de.parse_with_range(df["collection_date"])
df["collection_date"] = date_df["collection_date"]
df["collection_date_range"] = date_df["collection_date_range"]

# Geography: splits geo_loc_name into country, region, locality, ISO code, sea
ge = GeoEngine()
geo_df = ge.parse(df["geo_loc_name"])
for col in geo_df.columns:
    df[col] = geo_df[col]

# One Health classification across multiple source columns simultaneously
oh = OneHealthClassifier()
src = {col: df[col] for col in
       ["isolation_source", "env_broad_scale", "env_local_scale",
        "env_medium", "sample_type", "host"]
       if col in df.columns}
oh_df = oh.classify_multi_field(**src)
for col in oh_df.columns:
    df[col] = oh_df[col]

# Write output
write(df, "harmonized.csv")
write_summary(df, "fill_rates.csv")

Output columns

The output DataFrame contains the 58 columns listed below: 57 standard columns plus _extra_attributes. Columns with no data for a given dataset are present and filled with NaN. Attributes that do not map to any column are preserved as a JSON string in _extra_attributes.

Seven of these (columns 28–34) are added by OneHealthClassifier.classify_multi_field(). The Source column of the table identifies the origin of each remaining column.

#	Column	Source	Description
1	`biosample_accession`	BioSample XML	NCBI BioSample accession (e.g. `SAMN07597573`)
2	`biosample_id`	BioSample XML	NCBI internal numeric BioSample ID
3	`sra_accession`	BioSample XML	Linked SRA accession, if present
4	`bioproject_accession`	BioSample XML / assembly index	Parent BioProject accession
5	`assembly_accession_refseq`	Assembly index	RefSeq assembly accession (GCF_)
6	`assembly_accession_genbank`	Assembly index	GenBank assembly accession (GCA_)
7	`sample_name_id`	BioSample XML	Submitter sample name from `<Id db_label="Sample name">`
8	`taxonomy_id`	BioSample XML	NCBI Taxonomy numeric ID
9	`taxonomy_name`	BioSample XML	Taxon name for the assigned taxonomy_id
10	`organism_name`	BioSample XML	Organism name from `<OrganismName>`; falls back to taxonomy_name
11	`collection_date`	BioSample attribute → DateEngine	Collection date normalized to ISO 8601
12	`collection_date_range`	DateEngine	Inferred date range when only year or year-month was provided
13	`geo_loc_name`	BioSample attribute	Raw geographic location string as submitted
14	`lat_lon`	BioSample attribute	Decimal lat/lon as submitted
15	`geo_country`	GeoEngine	Country resolved from `geo_loc_name`
16	`geo_region`	GeoEngine	Sub-national region; populated only from colon-format inputs (`"Country: Region, Locality"`); `NaN` for comma-only inputs
17	`geo_locality`	GeoEngine	Locality after the region in colon format, or the part after the first comma in comma-only inputs
18	`geo_iso3166`	GeoEngine	ISO 3166-1 alpha-2 country code; historical names tagged `HISTORICAL`
19	`geo_sea_ocean`	GeoEngine	Sea or ocean name for marine locations
20	`geo_loc_raw`	GeoEngine	Preserved raw string for coordinate-only inputs (e.g. `"40.71 N, 74.00 W"`); `NaN` for all other inputs
21	`host`	BioSample attribute	Host organism name
22	`host_disease`	BioSample attribute	Disease associated with host at sampling
23	`host_age`	BioSample attribute	Age of host
24	`host_sex`	BioSample attribute	Biological sex of host
25	`host_tissue_sampled`	BioSample attribute	Tissue or body site sampled
26	`isolation_source`	BioSample attribute	Material or environment from which the isolate was obtained
27	`sample_type`	BioSample attribute	Sample type or specimen classification
28	`one_health_category`	OneHealthClassifier	One of: Human, Animal, Aquatic, Wildlife, Plant, Food, Environmental, Lab, Unclassified
29	`one_health_term`	OneHealthClassifier	The specific term or phrase that triggered the classification
30	`one_health_confidence`	OneHealthClassifier	Float in [0, 1] — see One Health classification
31	`one_health_evidence_level`	OneHealthClassifier	Discretized confidence: `high` (≥0.85), `medium` (≥0.60), `low` (≥0.30), `unresolved`
32	`one_health_processing`	OneHealthClassifier	Processing/handling term detected in the field text (e.g. `pasteurized`, `frozen`), if any
33	`one_health_setting`	OneHealthClassifier	Setting term detected in the field text (e.g. `clinical`, `farm`, `retail`), if any
34	`one_health_source_field`	OneHealthClassifier	Which input field produced the winning classification
35	`isolate`	BioSample attribute	Isolate identifier
36	`strain`	BioSample attribute	Strain designation
37	`sub_strain`	BioSample attribute	Sub-strain designation
38	`serotype`	BioSample attribute	Serotype
39	`serovar`	BioSample attribute	Serovar
40	`genotype`	BioSample attribute	Genotype or sequence type
41	`culture_collection`	BioSample attribute	Culture collection identifier
42	`outbreak`	BioSample attribute	Outbreak identifier
43	`env_broad_scale`	BioSample attribute	Broad environmental context (ENVO)
44	`env_local_scale`	BioSample attribute	Local environmental feature (ENVO)
45	`env_medium`	BioSample attribute	Environmental medium (ENVO)
46	`sequencing_method`	BioSample attribute	Sequencing platform
47	`assembly_method`	BioSample attribute	Genome assembly software
48	`collected_by`	BioSample attribute; `<Owner/Name>` fallback	Collector name or institution
49	`ncbi_package`	BioSample XML	NCBI BioSample package (e.g. `Microbe.1.0`)
50	`submission_date`	BioSample XML	Date first submitted
51	`last_update`	BioSample XML	Date last modified
52	`publication_date`	BioSample XML	Date made publicly available
53	`access`	BioSample XML	`public` or `controlled-access`
54	`status`	BioSample XML	Record status (e.g. `live`, `suppressed`)
55	`status_date`	BioSample XML	Date current status was assigned
56	`title`	BioSample XML	Free-text title of the BioSample record
57	`description_comment`	BioSample XML	Free-text description or comment block
58	`_extra_attributes`	JSON	All attributes that could not be mapped to a schema column, serialized as a JSON dict. Also contains `submission_owner` and `submission_contact` when `<Owner>` provenance is present alongside an explicit collector. For records submitted under pathogen packages, contains an `antibiogram` key (see Antibiogram data).

Antibiogram data

BioSample records submitted under NCBI pathogen packages (Pathogen.cl.1.0, Pathogen.env.1.0, etc.) may include a structured <Antibiogram> section that is a sibling of <Attributes> in the XML — not a child. Standard attribute parsers that only iterate <Attributes> silently drop this section. BioMetaHarmonizer parses it explicitly.

When an antibiogram is present, _extra_attributes["antibiogram"] contains a compact JSON-encoded list of dicts, one per antibiotic row. Each dict includes whichever of the following fields NCBI populated for that row:

Field	Description
`antibiotic_name`	Antibiotic name (e.g. `amikacin`)
`resistance_phenotype`	`susceptible`, `resistant`, or `intermediate`
`measurement_sign`	`==`, `<=`, `>=`, `<`, `>`
`measurement`	Numeric MIC or disk diffusion value
`measurement_units`	`mg/L`, `mm`, etc.
`laboratory_typing_method`	`MIC`, `disk diffusion`, etc.
`laboratory_typing_platform`	Instrument or method platform
`vendor`	Reagent/kit vendor
`laboratory_typing_method_version_or_reagent`	Version or reagent identifier
`testing_standard`	`CLSI`, `EUCAST`, etc.

Fields with null or missing values are omitted from each row dict so the JSON payload stays compact. Rows where all fields resolved to null are excluded entirely.
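
The compaction rule described above can be sketched as follows. This is a minimal illustration, not the package's code: compact_row and compact_antibiogram are hypothetical names, and treating empty strings as null is an assumption.

```python
import math


def compact_row(row: dict) -> dict:
    """Return a copy of `row` without null-valued fields."""
    def is_null(value):
        if value is None:
            return True
        if isinstance(value, float) and math.isnan(value):
            return True
        # Assumption: blank strings count as missing.
        return isinstance(value, str) and value.strip() == ""

    return {key: value for key, value in row.items() if not is_null(value)}


def compact_antibiogram(rows):
    """Compact every row and drop rows that end up empty."""
    compacted = (compact_row(row) for row in rows)
    return [row for row in compacted if row]


rows = [
    {"antibiotic_name": "amikacin", "resistance_phenotype": "susceptible", "vendor": None},
    {"antibiotic_name": None, "resistance_phenotype": None},
]
result = compact_antibiogram(rows)
```

Here the second row is excluded because every field is null, and the first row loses only its vendor key.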

Extracting antibiogram data from a result DataFrame:

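import json
import pandas as pd

def extract_antibiogram(df):
    """Return one row per antibiotic test, keyed by biosample_accession."""
    rows = []
    for _, rec in df.iterrows():
        extras = rec.get("_extra_attributes")
        if not isinstance(extras, str) or not extras:
            continue  # NaN or empty: no extra attributes for this record
        try:
            d = json.loads(extras)
        except ValueError:
            continue  # not valid JSON
        if not isinstance(d, dict):
            continue
        ab = d.get("antibiogram")
        if not ab:
            continue
        # The antibiogram may be a nested JSON string or an already-parsed list
        ab_rows = json.loads(ab) if isinstance(ab, str) else ab
        for row in ab_rows:
            # Build a new dict so the parsed payload is not mutated
            rows.append({"biosample_accession": rec["biosample_accession"], **row})
    return pd.DataFrame(rows)

antibiogram_df = extract_antibiogram(df)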

Attribute resolution order

For each <Attribute> element in BioSample XML, the column mapping is resolved in this order:

1. harmonized_name direct match: if the NCBI-assigned harmonized_name matches a schema column exactly, it is used without any synonym lookup.
2. Synonym lookup on harmonized_name: if there is no direct match, the harmonized_name is looked up in the synonym table. If the resolved key is in the schema, it is used; otherwise the resolved key is stored in _extra_attributes.
3. Synonym lookup on attribute_name: if harmonized_name is absent or unresolvable, the raw attribute_name is tried.
4. _extra_attributes: any attribute that could not be resolved by any of the above is written to _extra_attributes as a JSON key-value pair.

The synonym table is built from two layers in synonyms.py and cached for the lifetime of the process:

Layer 1 — schemas/unified.json — manually curated synonym lists for all standard keys.
Layer 2 — schemas/ncbi_attributes.xml — the official NCBI BioSample harmonization table. Optional; loaded only if present.

Both ingestion.py and key_mapper.py use the same build_synonym_lookup() function.
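
The resolution order can be sketched with two small lookup structures. Everything named here is illustrative: SCHEMA_COLUMNS and SYNONYMS are tiny hypothetical stand-ins for the real schema and synonym table, resolve_attribute is not the package's function, and lower-casing keys before lookup is an assumption.

```python
# Hypothetical, heavily reduced stand-ins for the real schema and synonym table.
SCHEMA_COLUMNS = {"host", "isolation_source", "collection_date"}
SYNONYMS = {
    "host_name": "host",
    "sample source": "isolation_source",
    "altitude_m": "altitude",  # resolves, but "altitude" is not a schema column here
}


def resolve_attribute(harmonized_name, attribute_name):
    """Return ("column", key) or ("extra", key) following the four steps above."""
    # Step 1: direct match on harmonized_name.
    if harmonized_name in SCHEMA_COLUMNS:
        return "column", harmonized_name
    # Step 2: synonym lookup on harmonized_name.
    if harmonized_name:
        resolved = SYNONYMS.get(harmonized_name.lower())
        if resolved is not None:
            target = "column" if resolved in SCHEMA_COLUMNS else "extra"
            return target, resolved
    # Step 3: synonym lookup on the raw attribute_name.
    if attribute_name:
        resolved = SYNONYMS.get(attribute_name.lower())
        if resolved in SCHEMA_COLUMNS:
            return "column", resolved
    # Step 4: nothing matched, keep the attribute under its raw name (assumption).
    return "extra", attribute_name or harmonized_name


print(resolve_attribute("host", "Host"))           # direct match
print(resolve_attribute(None, "Sample Source"))    # falls through to step 3
```

Returning a tag alongside the key keeps the caller's logic simple: "column" values go into the DataFrame row, "extra" values are collected into the _extra_attributes dict.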

Null normalization

During XML parsing, placeholder values are converted to None before any downstream processing. The full pattern list covers:

missing, missing: lab stock, missing: data agreement established pre-2023
N/A, na, null, none, nil, -, .
unknown, not provided, not collected, not applicable, not available, not determined, not recorded, not reported
unavailable, unspecified, undetermined, unidentified
restricted, restricted access, withheld, confidential
tbd, tba

Common misspellings (misssing, unkown, unknwon) are also matched. Matching is case-insensitive.
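
A minimal sketch of case-insensitive placeholder normalization is shown below. It uses only a subset of the patterns listed above, and it assumes whole-value matching after whitespace cleanup; normalize_null and _NULL_TOKENS are hypothetical names, and the package may match differently.

```python
import re

# Subset of the placeholder list above, stored in lower case.
_NULL_TOKENS = {
    "missing", "missing: lab stock", "n/a", "na", "null", "none", "nil", "-", ".",
    "unknown", "not provided", "not collected", "not applicable", "not available",
    "unavailable", "unspecified", "restricted", "withheld", "tbd", "tba",
    "misssing", "unkown", "unknwon",  # common misspellings
}


def normalize_null(value):
    """Return None for placeholder values, otherwise the cleaned string."""
    if value is None:
        return None
    # Collapse internal whitespace so "not   applicable" still matches.
    text = re.sub(r"\s+", " ", str(value)).strip()
    if not text or text.casefold() in _NULL_TOKENS:
        return None
    return text


values = ["Not Applicable", " N/A ", "blood", "Namibia", "unkown"]
cleaned = [normalize_null(v) for v in values]
```

Matching the whole value rather than a substring matters: "Namibia" begins with "na" but must not be treated as a null.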

Assembly summary cache

On the first run, ingest() downloads two NCBI flat files to resolve assembly accessions and BioProject links:

assembly_summary_refseq.txt (~100–300 MB)
assembly_summary_genbank.txt (~100–300 MB)

These are cached in ~/.biometaharmonizer/cache/ (overridable with --cache-dir or set_cache_dir()). Files older than 7 days are automatically deleted and re-downloaded on the next run.

To force a refresh before the 7-day TTL expires — for example, immediately after a large batch of new assemblies is added to NCBI — pass refresh_cache=True to ingest() or use --refresh-cache on the CLI:

biometaharmonizer run --input ids.txt --email you@example.com \
    --output out.csv --refresh-cache

df = ingest("ids.txt", email="you@example.com", refresh_cache=True)

In Google Colab, point the cache at a directory under /content:

from biometaharmonizer.ingestion import set_cache_dir
set_cache_dir("/content/bmh_cache")
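
The 7-day expiry rule described in this section amounts to comparing a file's modification time with the current time. The sketch below shows one way to express that check; is_stale and CACHE_TTL_SECONDS are hypothetical names, and using the file's mtime as the age reference is an assumption.

```python
import time
from pathlib import Path

CACHE_TTL_SECONDS = 7 * 24 * 60 * 60  # 7 days, as stated above


def is_stale(path, now=None, ttl=CACHE_TTL_SECONDS):
    """Return True if `path` is missing or older than `ttl` seconds."""
    p = Path(path)
    if not p.exists():
        return True  # nothing cached yet, so a download is needed
    now = time.time() if now is None else now
    return (now - p.stat().st_mtime) > ttl


def needs_download(path, refresh_cache=False):
    """A forced refresh bypasses the age check entirely."""
    return refresh_cache or is_stale(path)
```

The now parameter exists only to make the check testable without waiting; in normal use it is left as None.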

Entrez rate limits

Without an API key, NCBI allows 3 requests per second. With a key, the limit is 10 requests per second. BioMetaHarmonizer enforces inter-request sleep intervals automatically based on whether an API key is set.

biometaharmonizer run --input ids.txt --email you@example.com \
    --api-key YOUR_KEY --output out.csv

or:

df = ingest("ids.txt", email="you@example.com", api_key="YOUR_KEY")
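
For readers curious how an inter-request sleep can be derived from the two limits, here is a minimal throttle sketch. Throttle and min_interval are hypothetical names, not the package's internals, and the injected clock and sleep arguments are there only so the behaviour can be checked without real delays.

```python
import time


def min_interval(api_key=None):
    """Seconds between requests: 1/3 s without a key, 1/10 s with one."""
    return 1.0 / (10 if api_key else 3)


class Throttle:
    """Sleep just long enough to keep consecutive calls `interval` apart."""

    def __init__(self, api_key=None, clock=time.monotonic, sleep=time.sleep):
        self.interval = min_interval(api_key)
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A caller would invoke throttle.wait() immediately before each Entrez request; the first call returns at once, later calls block only when requests arrive faster than the limit.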

Geospatial parsing

GeoEngine splits geo_loc_name into geo_country, geo_region, geo_locality, geo_iso3166, geo_sea_ocean, and geo_loc_raw.

The parser recognizes two input formats:

Colon format "Country: Region, Locality" — the part before : becomes geo_country, the first segment after : becomes geo_region, and any remainder after the comma becomes geo_locality.
Comma-only format "Country, Locality" — the part before the first , becomes geo_country and the remainder becomes geo_locality. geo_region is left NaN.

Parenthetical qualifiers (e.g. "United Kingdom (England, Wales & N. Ireland)", "Pacific Ocean (NE)") are stripped from the country token before any lookup. This means ocean and sea names with qualifiers are still correctly routed to geo_sea_ocean rather than falling through to the country resolver.

Input	Result
`"USA: California, Los Angeles"`	country=USA, region=California, locality=Los Angeles, iso=US
`"USA: California"`	country=USA, region=California, iso=US
`"Germany, Bavaria"`	country=Germany, locality=Bavaria, iso=DE
`"France"`	country=France, iso=FR
`"Pacific Ocean"`	sea_ocean=Pacific Ocean
`"Pacific Ocean (NE)"`	sea_ocean=Pacific Ocean
`"Pacific Ocean: Mariana Trench"`	sea_ocean=Pacific Ocean, locality=Mariana Trench
`"Red Sea (sampling site 3): surface"`	sea_ocean=Red Sea, locality=surface
`"40.71 N, 74.00 W"`	geo_loc_raw preserved; all other geo columns NaN
`"Gaza Strip"`	country=Gaza Strip, iso=PS
`"West Bank"`	country=West Bank, iso=PS
`"United Kingdom (England, Wales & N. Ireland)"`	country=United Kingdom, iso=GB
`"not applicable"`	all geo columns NaN

Handling notes:

England, Scotland, Wales, Northern Ireland → United Kingdom, iso GB
United Kingdom (England, Wales & N. Ireland) and similar compound UK variants → United Kingdom, iso GB
Gaza Strip, West Bank, Gaza, Palestine, Palestinian territories → iso PS
Korea (bare, no qualifier) → South Korea (KR); logged at INFO level
Historical country names (USSR, Yugoslavia, Zaire, East Germany, etc.) → preserved in geo_country, geo_iso3166 = HISTORICAL
Coordinate-only strings are preserved in geo_loc_raw and not reverse-geocoded; all other geo columns are NaN
Turkey / Türkiye, Namibia, Burma, DR Congo and several aliases are resolved via a hardcoded table before pycountry fuzzy lookup
All unique geo_loc_name values are resolved once and cached; pycountry fuzzy lookup runs at most once per unique country string regardless of row count
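
The splitting step for the two input formats can be sketched on its own, leaving out country resolution, ocean routing, and coordinate detection. split_geo_loc_name is a hypothetical helper, not GeoEngine's API, and ignoring separators inside parentheses is an assumption made so that qualifiers such as "(England, Wales & N. Ireland)" stay attached to the first token.

```python
import re

_PAREN_RE = re.compile(r"\s*\([^)]*\)")


def _split_top_level(text, sep):
    """Split once on the first `sep` that is not inside parentheses."""
    depth = 0
    for i, ch in enumerate(text):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth = max(0, depth - 1)
        elif ch == sep and depth == 0:
            return text[:i], text[i + 1:]
    return text, ""


def split_geo_loc_name(raw):
    """Return (first_token, region, locality); missing parts are None."""
    head, tail = _split_top_level(raw, ":")
    if tail.strip():
        # Colon format: "Country: Region, Locality"
        region, locality = _split_top_level(tail, ",")
    else:
        # Comma-only format: "Country, Locality"; region stays empty
        head, locality = _split_top_level(head, ",")
        region = ""
    first = _PAREN_RE.sub("", head)  # strip parenthetical qualifiers
    return tuple(part.strip() or None for part in (first, region, locality))


print(split_geo_loc_name("USA: California, Los Angeles"))
print(split_geo_loc_name("Germany, Bavaria"))
```

In this sketch the first token is returned unresolved; deciding whether it is a country, a sea or ocean, or a historical name is a separate lookup step.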

One Health classification

OneHealthClassifier loads all biological knowledge from schemas/one_health_dictionaries.json and assigns each record one of nine categories: Human, Animal, Aquatic, Wildlife, Plant, Food, Environmental, Lab, Unclassified.

classify_multi_field() accepts up to six named pd.Series and returns a DataFrame with seven columns:

Column	Type	Description
`one_health_category`	str	Assigned category; always a string, never NaN
`one_health_term`	str / NaN	The specific term or phrase that triggered the classification
`one_health_confidence`	float	Score in [0, 1]; computed as `min(1.0, term_specificity × field_weight + corroboration_bonus)`
`one_health_evidence_level`	str	`high` (≥0.85), `medium` (≥0.60), `low` (≥0.30), `unresolved`
`one_health_processing`	str / NaN	Processing/handling term detected in the text (e.g. `pasteurized`, `frozen`)
`one_health_setting`	str / NaN	Setting term detected in the text (e.g. `clinical`, `farm`, `retail`)
`one_health_source_field`	str / NaN	Input field that produced the winning classification

Confidence model. For each field, confidence = min(1.0, term_specificity × field_weight + corroboration_bonus):

term_specificity: 1.0 for host dictionary or unambiguous list hits; 0.90/0.75/0.50 for tier1 phrases by length; WRatio / 100 for rapidfuzz fallback; 0.30 for ambiguous terms.
field_weight: isolation_source / host dict hit → 1.00; host text hit → 0.90; env_medium → 0.85; env_local_scale → 0.80; sample_type → 0.70; env_broad_scale → 0.50.
corroboration_bonus: +0.10 when a second independent field agrees with the same category.
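
A short worked example of the formula and the evidence-level thresholds, using the numbers given above. The function names are illustrative and are not part of the OneHealthClassifier API.

```python
def confidence(term_specificity, field_weight, corroborated=False):
    """min(1.0, term_specificity * field_weight + corroboration_bonus)."""
    bonus = 0.10 if corroborated else 0.0
    return min(1.0, term_specificity * field_weight + bonus)


def evidence_level(score):
    """Discretize a confidence score using the documented thresholds."""
    if score >= 0.85:
        return "high"
    if score >= 0.60:
        return "medium"
    if score >= 0.30:
        return "low"
    return "unresolved"


# Tier1 phrase (0.75) found in env_medium (weight 0.85): 0.75 * 0.85 = 0.6375
medium_hit = confidence(0.75, 0.85)

# Host dictionary hit (1.0 * 1.00) plus corroboration would be 1.10, capped at 1.0
capped_hit = confidence(1.0, 1.00, corroborated=True)

# Ambiguous term (0.30) in sample_type (weight 0.70): 0.30 * 0.70 = 0.21
weak_hit = confidence(0.30, 0.70)

levels = [evidence_level(s) for s in (medium_hit, capped_hit, weak_hit)]
```

The three cases land in the medium, high, and unresolved bands respectively, which shows how a low field weight can push an otherwise valid match below the reporting threshold.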

Classification pipeline per record:

host field: institution guard (strips culture collection prefixes; returns Lab if residual < 4 chars), then host_to_category dictionary lookup, then text classification fallback.
isolation_source, env_medium, env_local_scale: matched against unambiguous human/animal term lists, then tier1 patterns, then rapidfuzz fuzzy fallback against the ontology map.
sample_type: domain-level signal; used to set category if no specimen field matched.
env_broad_scale: supporting signal only; contributes a corroboration bonus but does not set the primary category on its own.
Pass 2 resolves the winning category from accumulated domain/specimen/supporting evidence.

`collected_by` priority

1. Explicit BioSample attribute: any <Attribute harmonized_name="collected_by"> or synonym is always preferred.
2. <Owner/Name> fallback: used only if no explicit collector attribute was found.

When both are present, the submission-side provenance is written to _extra_attributes:

submission_owner — <Owner/Name> value
submission_contact — full name from <Owner/Contacts/Contact>
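
The priority rule can be written as a small function. This is a sketch of the logic described above, with resolve_collected_by as a hypothetical name; the real parser works on XML elements rather than plain strings.

```python
def resolve_collected_by(explicit, owner_name, owner_contact=None):
    """Return (collected_by, extras) following the priority described above."""
    extras = {}
    if explicit:
        # An explicit attribute wins; owner details are kept as provenance.
        if owner_name:
            extras["submission_owner"] = owner_name
        if owner_contact:
            extras["submission_contact"] = owner_contact
        return explicit, extras
    # No explicit collector: fall back to <Owner/Name>, nothing extra to store.
    return owner_name, extras


collector, extras = resolve_collected_by("Dr. A. Example", "Example Institute", "B. Contact")
```

With both sources present, collector is the explicit value and extras carries the two submission-side keys that end up in _extra_attributes.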

Output formats

from biometaharmonizer import write, write_summary

write(df, "out.csv")                        # CSV
write(df, "out.tsv", fmt="tsv")             # TSV
write(df, "out.xlsx", fmt="excel")          # Excel
write(df, "out.parquet", fmt="parquet")     # Parquet

write_summary(df, "fill_rates.csv")         # column, non_null_count, fill_pct

Format strings are case-insensitive. If --format is not specified on the CLI, the format is inferred from the output file extension.
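
To show what a per-column fill-rate table looks like, the sketch below computes one directly with pandas. The column names follow the comment above (column, non_null_count, fill_pct); the fill_rates function itself is hypothetical, and expressing fill_pct on a 0 to 100 scale is an assumption.

```python
import pandas as pd


def fill_rates(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its non-null count and fill percentage."""
    total = len(df)
    non_null = df.notna().sum()
    # Guard against division by zero for an empty DataFrame.
    pct = (non_null / total * 100).round(2) if total else non_null * 0.0
    return pd.DataFrame({
        "column": non_null.index,
        "non_null_count": non_null.values,
        "fill_pct": pct.values,
    })


demo = pd.DataFrame({"host": ["Homo sapiens", None, "Bos taurus", "Sus scrofa"],
                     "outbreak": [None, None, None, None]})
summary = fill_rates(demo)
```

Columns that are present but entirely NaN, such as outbreak here, still appear in the summary with a fill percentage of zero.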

Rebuilding schema files

The package ships with pre-built schema files. Rebuild them only when you want to incorporate upstream ontology or NCBI updates.

`one_health_dictionaries.json`

Generated by scripts/build_dictionaries.py. It queries OLS4 (ENVO, FoodOn, UBERON, Plant Ontology), downloads the NCBI Taxonomy dump (~65 MB), and optionally queries the UMLS API for synonym expansion. Hand-curated entries in the base file always win over ontology-derived ones.

# Full rebuild (downloads taxdmp.zip from NCBI automatically)
python scripts/build_dictionaries.py \
    --base   src/biometaharmonizer/schemas/one_health_dictionaries.json \
    --output src/biometaharmonizer/schemas/one_health_dictionaries.json

# Use a pre-downloaded taxdmp.zip
python scripts/build_dictionaries.py --taxdmp /path/to/taxdmp.zip

# Skip NCBI Taxonomy entirely
python scripts/build_dictionaries.py --skip-ncbi

# Add UMLS synonym expansion (requires a free UMLS API key)
python scripts/build_dictionaries.py --umls-key YOUR_UMLS_KEY

`ncbi_attributes.xml`

Generated by scripts/build_ncbi_attribute_cache.py. Downloads the official NCBI BioSample attribute harmonization table and stores it as schemas/ncbi_attributes.xml, which becomes Layer 2 of the synonym lookup.

python scripts/build_ncbi_attribute_cache.py

Repository structure

BioMetaHarmonizer/
├── src/biometaharmonizer/
│   ├── __init__.py             # public API, version 0.6.0
│   ├── cli.py                  # CLI entrypoint
│   ├── ingestion.py            # Entrez fetching, XML parsing, schema definition
│   ├── synonyms.py             # two-layer synonym lookup (unified.json + NCBI XML)
│   ├── key_mapper.py           # column rename, coalesce, reindex
│   ├── date_engine.py          # date parsing, ISO 8601 output
│   ├── geo_engine.py           # geo_loc_name splitting, ISO-3166 resolution
│   ├── one_health.py           # One Health categorization
│   ├── output.py               # write CSV / TSV / Excel / Parquet
│   └── schemas/
│       ├── unified.json                      # standard keys + synonym lists
│       ├── one_health_dictionaries.json      # One Health keyword/ontology dict
│       └── ncbi_attributes.xml               # NCBI harmonization table (optional)
├── scripts/
│   ├── build_dictionaries.py               # rebuild one_health_dictionaries.json
│   └── build_ncbi_attribute_cache.py       # rebuild ncbi_attributes.xml
├── tests/
│   ├── test_ingestion.py
│   ├── test_key_mapper.py
│   ├── test_date_engine.py
│   ├── test_geo_engine.py
│   ├── test_one_health.py
│   ├── test_output.py
│   └── test_pipeline.py
└── pyproject.toml

Running tests

pip install pytest
pytest tests/ -v --tb=short

All tests use synthetic data — no live NCBI calls are made.

License

MIT
