Library-first extraction helpers for bioinformatics resource snapshots.

Project description

bioextract

Install

pip install bioextract

STRINGdb

from bioextract.stringdb import StringDb, StringResourceLimits

selection = (
    StringDb.from_files(
        file_aliases="9606.protein.aliases.v12.0.txt.gz",
        file_links="9606.protein.links.v12.0.txt.gz",
        limits=StringResourceLimits(num_input_ids_max=50_000),
    )
    .select_ids(["P04637", "EGFR", "CDK2"])
    .with_score_min(400)
)

df_mapping = selection.extract_string_mapping()
df_unmapped = selection.extract_unmapped_input_ids()
df_edges = selection.extract_edges()

print(df_mapping)
print(df_unmapped)
print(df_edges)

from bioextract.stringdb import StringDb

df_group_edges = (
    StringDb.from_files(
        file_aliases="9606.protein.aliases.v12.0.txt.gz",
        file_links="9606.protein.links.v12.0.txt.gz",
    )
    .select_groups(
        {
            "TumorA": ["TP53", "EGFR"],
            "TumorB": ["CDK2", "TP53"],
        }
    )
    .with_score_min(400)
    .extract_edges()
)
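
The two STRINGdb chains above map input identifiers through the aliases file and keep links at or above a score threshold. The following is a minimal standard-library sketch of that idea, not bioextract's implementation. It assumes the published STRING layouts (tab-separated aliases with string_protein_id, alias, source; space-separated links with protein1, protein2, combined_score) and assumes an edge is kept only when both endpoints are in the selection. The helper names map_ids and filter_edges are hypothetical, and the sample rows and scores are illustrative.

```python
import csv
import io

# Illustrative sample rows; real snapshots are gzip-compressed and far larger.
ALIASES_TXT = (
    "#string_protein_id\talias\tsource\n"
    "9606.ENSP00000269305\tTP53\tEnsembl_HGNC\n"
    "9606.ENSP00000269305\tP04637\tUniProt_AC\n"
    "9606.ENSP00000275493\tEGFR\tEnsembl_HGNC\n"
    "9606.ENSP00000266970\tCDK2\tEnsembl_HGNC\n"
)
LINKS_TXT = (
    "protein1 protein2 combined_score\n"
    "9606.ENSP00000269305 9606.ENSP00000275493 981\n"
    "9606.ENSP00000269305 9606.ENSP00000266970 350\n"
    "9606.ENSP00000275493 9606.ENSP00000266970 612\n"
)


def map_ids(aliases_txt, input_ids):
    """Return ({input_id: string_id}, [input ids with no alias row])."""
    alias_to_string = {}
    reader = csv.reader(io.StringIO(aliases_txt), delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader)  # skip the header line
    for string_id, alias, _source in reader:
        # Keep the first STRING id seen for each alias.
        alias_to_string.setdefault(alias, string_id)
    mapping = {i: alias_to_string[i] for i in input_ids if i in alias_to_string}
    unmapped = [i for i in input_ids if i not in alias_to_string]
    return mapping, unmapped


def filter_edges(links_txt, string_ids, score_min):
    """Keep links whose endpoints are both selected and whose score >= score_min."""
    keep = set(string_ids)
    edges = []
    for line in links_txt.splitlines()[1:]:  # skip the header line
        protein1, protein2, score = line.split(" ")
        if protein1 in keep and protein2 in keep and int(score) >= score_min:
            edges.append((protein1, protein2, int(score)))
    return edges


mapping, unmapped = map_ids(ALIASES_TXT, ["P04637", "EGFR", "CDK2", "NOT_A_GENE"])
edges = filter_edges(LINKS_TXT, mapping.values(), score_min=400)
print(mapping)
print(unmapped)
print(edges)
```

In this sketch the 350-score link is dropped by the threshold and NOT_A_GENE is reported as unmapped, which mirrors the roles of with_score_min and extract_unmapped_input_ids in the examples above.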

OmniPath

from bioextract.omnipath import OmniPathDb

selection = (
    OmniPathDb.from_files(
        file_enzsub="enzsub.tsv.gz",
        file_interactions="interactions.tsv.gz",
    )
    .select_ids(["P31749", "AKT1", "BAD"])
    .with_enzsub()
)

df_enzsub = selection.extract_enzsub()
df_unmapped = selection.extract_unmapped_input_ids()

print(df_enzsub)
print(df_unmapped)

from bioextract.omnipath import OmniPathDb

df_group_interactions = (
    OmniPathDb.from_files(file_interactions="interactions.tsv.gz")
    .select_groups(
        {
            "TumorA": ["AKT1", "MTOR"],
            "TumorB": ["EGFR", "ERBB2"],
        }
    )
    .with_interactions()
    .extract_interactions()
)

GO

from bioextract.go import GoDb

go = GoDb.from_obo("go-basic.obo")
tidy = go.build_tidy()

df_term = tidy.frames["term"]
df_edge = tidy.frames["edge"]
df_ancestor = tidy.frames["ancestor_all"]
df_subsets = go.list_subsets()
df_goslim_generic = go.select_terms(subset_id="goslim_generic")
df_subcell = go.extract_subcell()

report = tidy.write("out/go-basic")

GoDb.from_obo(...).write_tidy("out/go-basic") is also available as a convenience wrapper when only persisted parquet outputs are needed. Pass should_write_manifest=True to also write manifest.json. GoDb.from_obo(...).write_subcell("out/subcell.parquet") writes non-obsolete cellular component terms as a subcellular-location table. GO OBO subset memberships are available through select_terms(subset_id=...) and the subset_membership tidy frame.
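
To make the subset and subcellular-location selections concrete, here is a minimal standard-library sketch that reads [Term] stanzas from OBO text and applies the two filters described above. It is an illustration under stated assumptions, not how GoDb is implemented: the parser handles only simple tag: value lines, parse_terms is a hypothetical helper, and the sample stanzas (including the made-up GO:9999999) are illustrative.

```python
# Illustrative OBO fragment; GO:9999999 is a made-up obsolete term.
OBO_TXT = """\
format-version: 1.2
subsetdef: goslim_generic "Generic GO slim"

[Term]
id: GO:0005634
name: nucleus
namespace: cellular_component
subset: goslim_generic

[Term]
id: GO:0006915
name: apoptotic process
namespace: biological_process
subset: goslim_generic

[Term]
id: GO:9999999
name: example obsolete component
namespace: cellular_component
is_obsolete: true
"""


def parse_terms(obo_txt):
    """Collect [Term] stanzas as dicts; 'subset' is kept as a list."""
    terms = []
    current = None
    for raw in obo_txt.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are collected.
            current = {"subset": []} if line == "[Term]" else None
            if current is not None:
                terms.append(current)
        elif current is not None and ": " in line:
            tag, value = line.split(": ", 1)
            if tag == "subset":
                current["subset"].append(value)
            else:
                current.setdefault(tag, value)
    return terms


terms = parse_terms(OBO_TXT)

# Non-obsolete cellular component terms, as in the subcellular-location table.
subcell_ids = [
    t["id"]
    for t in terms
    if t.get("namespace") == "cellular_component" and t.get("is_obsolete") != "true"
]

# Terms that are members of one named subset.
slim_ids = [t["id"] for t in terms if "goslim_generic" in t["subset"]]

print(subcell_ids)
print(slim_ids)
```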

KEGG

from bioextract.kegg import KeggDb

tidy = KeggDb.from_brite_json("br08901.json").build_tidy()

df_pathway = tidy.frames["pathway"]

report = tidy.write("out/br08901")

The GO and KEGG tidy writers emit flat parquet files by default. See docs/architecture/go-kegg-tidy.md.
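
As a rough picture of what a flat pathway table derived from a BRITE hierarchy can look like, the sketch below walks a nested JSON tree and emits one row per leaf. It assumes the usual KEGG BRITE JSON shape (nodes with a name and optional children, and leaf names of the form "01100  Metabolic pathways"); the flatten helper and the row layout are hypothetical and are not the schema of the pathway frame.

```python
import json

# Illustrative fragment shaped like a BRITE hierarchy file such as br08901.json.
BRITE_JSON = json.dumps(
    {
        "name": "br08901",
        "children": [
            {
                "name": "Metabolism",
                "children": [
                    {
                        "name": "Global and overview maps",
                        "children": [{"name": "01100  Metabolic pathways"}],
                    }
                ],
            },
            {
                "name": "Cellular Processes",
                "children": [
                    {
                        "name": "Cell growth and death",
                        "children": [{"name": "04115  p53 signaling pathway"}],
                    }
                ],
            },
        ],
    }
)


def flatten(node, ancestors=()):
    """Yield (code, name, ancestor names) for every leaf below node."""
    children = node.get("children")
    if children is None:
        # Leaf names carry the pathway code, whitespace, then the pathway name.
        code, name = node["name"].split(None, 1)
        yield code, name, ancestors
    else:
        for child in children:
            yield from flatten(child, ancestors + (node["name"],))


root = json.loads(BRITE_JSON)
# Drop the root label so each row keeps only its category path.
rows = [(code, name, ancestors[1:]) for code, name, ancestors in flatten(root)]
for row in rows:
    print(row)
```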

Reactome

from bioextract.reactome import ReactomeDb

db = ReactomeDb.from_files(
    file_uniprot2reactome="UniProt2Reactome.txt",
    file_pathways="ReactomePathways.txt",
    file_relations="ReactomePathwaysRelation.txt",
)

selection = db.with_species("Homo sapiens").select_ids(["P04637", "Q9Y243"])

df_mapping = selection.extract_mapping()
df_unmapped = selection.extract_unmapped_input_ids()
df_term2gene = db.with_species("Homo sapiens").extract_term2gene()
df_term2name = db.with_species("Homo sapiens").extract_term2name()

ReactomeDb reads local Reactome mapping files and emits annotation tables plus standard enrichment inputs. The three raw files are composable: mapping-only snapshots can still emit mapping and term2gene, pathways-only snapshots can emit pathway and term2name, and relation extraction uses the relation file. It does not call Reactome web services or calculate enrichment p-values.
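
The composability described above can be illustrated with a small standard-library sketch: term2gene comes from the mapping file alone and term2name from the pathways file alone. This is not ReactomeDb's code. It assumes the six-column UniProt2Reactome.txt layout (accession, pathway stable ID, URL, pathway name, evidence code, species) and the three-column ReactomePathways.txt layout (stable ID, name, species); the function names and sample rows are illustrative.

```python
# Illustrative rows in the assumed tab-separated layouts.
MAPPING_TXT = (
    "P04637\tR-HSA-69541\thttps://reactome.org/PathwayBrowser/#/R-HSA-69541\t"
    "Stabilization of p53\tTAS\tHomo sapiens\n"
    "Q9Y243\tR-HSA-198693\thttps://reactome.org/PathwayBrowser/#/R-HSA-198693\t"
    "AKT phosphorylates targets in the nucleus\tTAS\tHomo sapiens\n"
    "P02340\tR-MMU-69541\thttps://reactome.org/PathwayBrowser/#/R-MMU-69541\t"
    "Stabilization of p53\tIEA\tMus musculus\n"
)
PATHWAYS_TXT = (
    "R-HSA-69541\tStabilization of p53\tHomo sapiens\n"
    "R-MMU-69541\tStabilization of p53\tMus musculus\n"
)


def build_term2gene(mapping_txt, species):
    """Sorted, de-duplicated (pathway_id, accession) pairs for one species."""
    pairs = set()
    for line in mapping_txt.splitlines():
        accession, pathway_id, _url, _name, _evidence, row_species = line.split("\t")
        if row_species == species:
            pairs.add((pathway_id, accession))
    return sorted(pairs)


def build_term2name(pathways_txt, species):
    """{pathway_id: pathway_name} for one species, from the pathways file only."""
    names = {}
    for line in pathways_txt.splitlines():
        pathway_id, name, row_species = line.split("\t")
        if row_species == species:
            names[pathway_id] = name
    return names


term2gene = build_term2gene(MAPPING_TXT, "Homo sapiens")
term2name = build_term2name(PATHWAYS_TXT, "Homo sapiens")
print(term2gene)
print(term2name)
```

Either function works without the other file being present, which is the sense in which the raw files are composable.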

WikiPathways

from bioextract.wikipathways import WikiPathwaysDb

db = WikiPathwaysDb.from_gmt(
    "wikipathways-20260510-gmt-Homo_sapiens.gmt",
    species="Homo sapiens",
)

df_pathway = db.extract_pathway()
df_term2gene = db.extract_term2gene()
df_term2name = db.extract_term2name()

selection = db.select_ids(["2687", "435", "MISSING"])
df_mapping = selection.extract_mapping()
df_unmapped = selection.extract_unmapped_input_ids()

report = db.write_tidy("out/wikipathways-hsa", should_write_manifest=True)

WikiPathwaysDb reads local WikiPathways GMT files. GMT gene content is treated as NCBI Entrez Gene IDs; the library does not perform identifier conversion or calculate enrichment p-values.
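
For readers unfamiliar with GMT, the sketch below shows how such a file can be turned into term2gene, term2name, and mapped or unmapped gene inputs using only the standard library. It is an illustration, not WikiPathwaysDb's implementation. GMT itself is tab-separated (set name, description, then members); the "name%version%WP id%species" layout of the name field is an assumption about WikiPathways exports, and the pathway IDs, URLs, and helper names below are made up.

```python
# Illustrative GMT rows; pathway IDs and URLs are made up.
GMT_TXT = (
    "Example pathway A%WikiPathways_20260510%WP0001%Homo sapiens\t"
    "https://example.org/WP0001\t2687\t435\n"
    "Example pathway B%WikiPathways_20260510%WP0002%Homo sapiens\t"
    "https://example.org/WP0002\t435\t7157\n"
)


def parse_gmt(gmt_txt):
    """Return (term2gene pairs, {pathway_id: pathway_name})."""
    term2gene = []
    term2name = {}
    for line in gmt_txt.splitlines():
        set_name, _description, *gene_ids = line.split("\t")
        # Assumed name layout: title%version%pathway id%species.
        title, _version, pathway_id, _species = set_name.split("%")
        term2name[pathway_id] = title
        term2gene.extend((pathway_id, gene_id) for gene_id in gene_ids)
    return term2gene, term2name


def select_gene_ids(term2gene, input_ids):
    """Split inputs into mapping rows and inputs absent from every gene set."""
    known = {gene_id for _pathway_id, gene_id in term2gene}
    wanted = set(input_ids)
    mapping = [(gene_id, pathway_id) for pathway_id, gene_id in term2gene if gene_id in wanted]
    unmapped = [i for i in input_ids if i not in known]
    return mapping, unmapped


term2gene, term2name = parse_gmt(GMT_TXT)
mapping, unmapped = select_gene_ids(term2gene, ["2687", "435", "MISSING"])
print(term2gene)
print(mapping)
print(unmapped)
```

Gene members are kept as strings and never converted, in line with the statement above that no identifier conversion is performed.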

eggNOG

from bioextract.eggnog import EggnogDb

db = EggnogDb.from_files(
    file_eggnog_db="eggnog.db.gz",
    file_cog_fun="cog-24.fun.tab",
)

df_mapping = db.extract_mapping()
report = db.write_tidy("out/eggnog", should_write_manifest=True)

EggnogDb reads local eggNOG mapper SQLite snapshots and optional COG function lookup tables. It emits a wide protein-to-COG mapping table for annotation and enrichment-input preparation.

InterPro

from bioextract.interpro import InterProDb

db = InterProDb.from_mapping_files(
    file_protein2ipr="protein2ipr.dat.gz",
    file_interpro_xml="interpro.xml.gz",
)

df_mapping = db.extract_mapping()
report = db.write_tidy("out/interpro", should_write_manifest=True)

InterProDb reads local protein2ipr mapping files and optional InterPro XML metadata, then emits one canonical UniProt-to-InterPro mapping parquet.
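
As a sketch of what reducing protein2ipr rows to a UniProt-to-InterPro mapping can involve, the example below collapses per-signature match rows into one row per accession and InterPro entry. It assumes the six-column tab-separated protein2ipr.dat layout (accession, InterPro ID, entry name, signature accession, start, end). Treating "canonical" as "one row per accession and entry" is an assumption made for illustration, canonical_pairs is a hypothetical helper, and the sample rows are illustrative.

```python
# Illustrative rows in the assumed protein2ipr.dat layout.
PROTEIN2IPR_TXT = (
    "P04637\tIPR002117\tp53 tumour suppressor family\tPR00386\t116\t142\n"
    "P04637\tIPR002117\tp53 tumour suppressor family\tPR00386\t158\t179\n"
    "P04637\tIPR011615\tp53, DNA-binding domain\tPF00870\t95\t288\n"
)


def canonical_pairs(protein2ipr_txt):
    """Collapse match-level rows to unique (accession, interpro_id, name) rows."""
    seen = {}
    for line in protein2ipr_txt.splitlines():
        accession, interpro_id, name, _signature, _start, _end = line.split("\t")
        # dict preserves first-seen order, so output order follows the file.
        seen.setdefault((accession, interpro_id), name)
    return [(acc, ipr, name) for (acc, ipr), name in seen.items()]


pairs = canonical_pairs(PROTEIN2IPR_TXT)
for row in pairs:
    print(row)
```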

UniProt

from bioextract.uniprot import UniprotDb

db = UniprotDb.from_files(
    file_idmapping_selected="idmapping_selected.tab.gz",
)

df_hsa = db.with_taxids("9606").extract_mapping()

report = db.with_taxids("9606", "10090").write_tidy(
    "out/uniprot-idmapping",
    should_write_manifest=True,
)

UniprotDb reads raw UniProt idmapping_selected.tab(.gz), single parquet files, or hive parquet dataset directories. Tidy writing emits a canonical mapping.parquet; all-taxid export requires should_allow_all=True. Use policy_existing="overwrite" or policy_existing="skip" when the output directory already exists.
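
The taxid filter can be pictured with a short standard-library sketch over raw idmapping_selected.tab rows. It assumes the documented 22-column layout, in which column 1 is UniProtKB-AC, column 2 is UniProtKB-ID, column 3 is GeneID, and column 13 is NCBI-taxon. The filter_by_taxid function and its should_allow_all parameter only imitate the behaviour described above; they are not the library's API, and the sample rows are illustrative.

```python
N_COLUMNS = 22
COL_AC, COL_ENTRY, COL_GENEID, COL_TAXID = 0, 1, 2, 12  # zero-based indexes


def make_row(accession, entry_name, gene_id, taxid):
    """Build one tab-separated sample row with the other columns left empty."""
    row = [""] * N_COLUMNS
    row[COL_AC], row[COL_ENTRY], row[COL_GENEID], row[COL_TAXID] = (
        accession, entry_name, gene_id, taxid,
    )
    return "\t".join(row)


LINES = [
    make_row("P04637", "P53_HUMAN", "7157", "9606"),
    make_row("P02340", "P53_MOUSE", "22059", "10090"),
    make_row("P10361", "P53_RAT", "24842", "10116"),
]


def filter_by_taxid(lines, taxids, should_allow_all=False):
    """Return (accession, entry, gene_id, taxid) rows for the requested taxids."""
    taxids = set(taxids)
    if not taxids and not should_allow_all:
        # Refuse an unbounded export unless the caller opts in explicitly.
        raise ValueError("no taxids given; pass should_allow_all=True to keep all rows")
    rows = []
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if not taxids or cols[COL_TAXID] in taxids:
            rows.append((cols[COL_AC], cols[COL_ENTRY], cols[COL_GENEID], cols[COL_TAXID]))
    return rows


human = filter_by_taxid(LINES, {"9606"})
human_and_mouse = filter_by_taxid(LINES, {"9606", "10090"})
print(human)
print(human_and_mouse)
```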

For UniProt knowledge-base flat files, write_eggnog_xref_tidy() can emit a canonical UniProt-to-eggNOG xref parquet from uniprot_sprot.dat(.gz).
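
To show where such cross-references live in a flat file, the sketch below scans entries for DR eggNOG lines and pairs each one with the entry's primary accession. It relies on the standard flat-file conventions (two-letter line codes, the first accession on the first AC line being primary, "//" ending an entry) and is not the implementation behind write_eggnog_xref_tidy(). The iter_eggnog_xrefs helper, the eggNOG identifier, and the second entry are made up for illustration.

```python
# Illustrative flat-file fragment; the eggNOG id and second entry are made up.
DAT_TXT = """\
ID   P53_HUMAN               Reviewed;         393 AA.
AC   P04637; Q15086; Q15087;
DR   eggNOG; ENOG502QVY3; Eukaryota.
DR   GO; GO:0005634; C:nucleus; IDA:UniProtKB.
//
ID   EXAMPLE_HUMAN           Reviewed;         100 AA.
AC   X00001;
//
"""


def iter_eggnog_xrefs(dat_txt):
    """Yield (primary_accession, eggnog_id, taxonomic_scope) per DR eggNOG line."""
    primary = None
    for line in dat_txt.splitlines():
        code, _sep, value = line.partition("   ")
        if code == "AC" and primary is None:
            # Only the first accession of the first AC line is the primary one.
            primary = value.split(";")[0].strip()
        elif code == "DR" and value.startswith("eggNOG;"):
            fields = [f.strip() for f in value.rstrip(".").split(";")]
            yield primary, fields[1], fields[2]
        elif line.startswith("//"):
            primary = None  # entry terminator: reset for the next entry


xrefs = list(iter_eggnog_xrefs(DAT_TXT))
print(xrefs)
```

Entries without a DR eggNOG line, such as the second one here, simply contribute no rows.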

Development

PYTHONPATH=src pytest
PYTHONPATH=src python scripts/benchmark_stringdb.py

Release

GitHub Actions now provides:
- .github/workflows/py-ci.yml for test-and-build checks on push and pull request
- .github/workflows/publish.yml for tag-triggered PyPI publishing

Release tags must be canonical PEP 440 versions such as 0.1.1. The publish workflow expects PyPI trusted publishing to be configured for the pypi environment.
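
A tag can be checked for canonical PEP 440 form before pushing. The sketch below uses the canonical-form regular expression given in the PEP 440 specification; the is_canonical helper is an illustrative name and is not part of this repository's tooling.

```python
import re

# Canonical-form pattern from the PEP 440 specification (Appendix B).
CANONICAL_PEP440 = re.compile(
    r"^([1-9][0-9]*!)?(0|[1-9][0-9]*)(\.(0|[1-9][0-9]*))*"
    r"((a|b|rc)(0|[1-9][0-9]*))?(\.post(0|[1-9][0-9]*))?"
    r"(\.dev(0|[1-9][0-9]*))?$"
)


def is_canonical(tag):
    """True when the tag is already a canonical PEP 440 version string."""
    return CANONICAL_PEP440.fullmatch(tag) is not None


for tag in ["0.1.1", "0.1.1rc1", "v0.1.1", "0.1.01", "1.0-beta"]:
    print(tag, is_canonical(tag))
```

Under this check, a leading "v", zero-padded segments, and spellings such as "-beta" are rejected even though some tools would normalize them.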

Project details

Release history

- 0.0.7 (this version): Jun 15, 2026
- 0.0.6: Jun 12, 2026
- 0.0.5: May 27, 2026
- 0.0.4: May 14, 2026
- 0.0.2: Apr 14, 2026
- 0.0.1: Apr 14, 2026

Download files

Source Distribution

bioextract-0.0.7.tar.gz (73.8 kB)

Uploaded Jun 15, 2026 Source

Built Distribution

bioextract-0.0.7-py3-none-any.whl (77.5 kB)

Uploaded Jun 15, 2026 Python 3

File details

Details for the file bioextract-0.0.7.tar.gz.

File metadata

Download URL: bioextract-0.0.7.tar.gz
Upload date: Jun 15, 2026
Size: 73.8 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for bioextract-0.0.7.tar.gz
Algorithm	Hash digest
SHA256	`be03c4d284c5ad16a237fa8eb0c50cb5f08bd813010d53e74fa67e1c0100f88c`
MD5	`d7e8b65ef6ae34925f04718bb0acaf6e`
BLAKE2b-256	`90839f17601b1c7103773177d5c721517cb18fbdc9d145141564880b27bcb51d`

Provenance

The following attestation bundles were made for bioextract-0.0.7.tar.gz:

Publisher: publish.yml on FuqingZh/bioextract

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: bioextract-0.0.7.tar.gz
- Subject digest: be03c4d284c5ad16a237fa8eb0c50cb5f08bd813010d53e74fa67e1c0100f88c
- Sigstore transparency entry: 1823263009
- Sigstore integration time: Jun 15, 2026
Source repository:
- Permalink: FuqingZh/bioextract@0679341e7809a608b5303bdcd2a64761d07cd325
- Branch / Tag: refs/tags/0.0.7
- Owner: https://github.com/FuqingZh
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@0679341e7809a608b5303bdcd2a64761d07cd325
- Trigger Event: push

File details

Details for the file bioextract-0.0.7-py3-none-any.whl.

File metadata

Download URL: bioextract-0.0.7-py3-none-any.whl
Upload date: Jun 15, 2026
Size: 77.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for bioextract-0.0.7-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b015ae4dd7d2313cb487ecd17fc34789834c7e389634b74884e84e3f9682072b`
MD5	`f73232127eb1d21f847ce6a6c74fc2f0`
BLAKE2b-256	`3ffb84956efec6fc6f41ecb1e4007e694c25659433034b9148341a2759a42c04`

Provenance

The following attestation bundles were made for bioextract-0.0.7-py3-none-any.whl:

Publisher: publish.yml on FuqingZh/bioextract

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: bioextract-0.0.7-py3-none-any.whl
- Subject digest: b015ae4dd7d2313cb487ecd17fc34789834c7e389634b74884e84e3f9682072b
- Sigstore transparency entry: 1823263080
- Sigstore integration time: Jun 15, 2026
Source repository:
- Permalink: FuqingZh/bioextract@0679341e7809a608b5303bdcd2a64761d07cd325
- Branch / Tag: refs/tags/0.0.7
- Owner: https://github.com/FuqingZh
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@0679341e7809a608b5303bdcd2a64761d07cd325
- Trigger Event: push
