bioextract
Library-first extraction helpers for bioinformatics resource snapshots.
Install
pip install bioextract
STRINGdb
from bioextract.stringdb import StringDb, StringResourceLimits

selection = (
    StringDb.from_files(
        file_aliases="9606.protein.aliases.v12.0.txt.gz",
        file_links="9606.protein.links.v12.0.txt.gz",
        limits=StringResourceLimits(num_input_ids_max=50_000),
    )
    .select_ids(["P04637", "EGFR", "CDK2"])
    .with_score_min(400)
)
df_mapping = selection.extract_string_mapping()
df_unmapped = selection.extract_unmapped_input_ids()
df_edges = selection.extract_edges()
print(df_mapping)
print(df_unmapped)
print(df_edges)

from bioextract.stringdb import StringDb

df_group_edges = (
    StringDb.from_files(
        file_aliases="9606.protein.aliases.v12.0.txt.gz",
        file_links="9606.protein.links.v12.0.txt.gz",
    )
    .select_groups(
        {
            "TumorA": ["TP53", "EGFR"],
            "TumorB": ["CDK2", "TP53"],
        }
    )
    .with_score_min(400)
    .extract_edges()
)
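To make the score threshold concrete, the following is a minimal standalone sketch of what filtering a STRING links file by combined score could look like. It does not use bioextract and is not its implementation: the protein IDs and scores are toy values, the helper name filter_links is hypothetical, and treating the threshold as inclusive is an assumption.

```python
import csv
import io

# Toy stand-in for a STRING protein.links file: space-delimited with a header.
# IDs and scores are made up for illustration.
LINKS_TEXT = """protein1 protein2 combined_score
9606.TOY_A 9606.TOY_B 999
9606.TOY_A 9606.TOY_C 350
9606.TOY_B 9606.TOY_C 420
"""


def filter_links(text, score_min):
    """Keep edges whose combined_score is at least score_min (inclusive, by assumption)."""
    reader = csv.DictReader(io.StringIO(text), delimiter=" ")
    return [
        (row["protein1"], row["protein2"], int(row["combined_score"]))
        for row in reader
        if int(row["combined_score"]) >= score_min
    ]


edges = filter_links(LINKS_TEXT, 400)
for edge in edges:
    print(edge)
```

STRING combined scores range from 0 to 1000, so a minimum of 400 corresponds to what STRING calls medium confidence.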
OmniPath
from bioextract.omnipath import OmniPathDb

selection = (
    OmniPathDb.from_files(
        file_enzsub="enzsub.tsv.gz",
        file_interactions="interactions.tsv.gz",
    )
    .select_ids(["P31749", "AKT1", "BAD"])
    .with_enzsub()
)
df_enzsub = selection.extract_enzsub()
df_unmapped = selection.extract_unmapped_input_ids()
print(df_enzsub)
print(df_unmapped)

from bioextract.omnipath import OmniPathDb

df_group_interactions = (
    OmniPathDb.from_files(file_interactions="interactions.tsv.gz")
    .select_groups(
        {
            "TumorA": ["AKT1", "MTOR"],
            "TumorB": ["EGFR", "ERBB2"],
        }
    )
    .with_interactions()
    .extract_interactions()
)
GO
from bioextract.go import GoDb
go = GoDb.from_obo("go-basic.obo")
tidy = go.build_tidy()
df_term = tidy.frames["term"]
df_edge = tidy.frames["edge"]
df_ancestor = tidy.frames["ancestor_all"]
df_subsets = go.list_subsets()
df_goslim_generic = go.select_terms(subset_id="goslim_generic")
df_subcell = go.extract_subcell()
report = tidy.write("out/go-basic")
GoDb.from_obo(...).write_tidy("out/go-basic") is also available as a
convenience wrapper when only persisted parquet outputs are needed.
Pass should_write_manifest=True to also write manifest.json.
GoDb.from_obo(...).write_subcell("out/subcell.parquet") writes non-obsolete
cellular component terms as a subcellular-location table.
GO OBO subset memberships are available through select_terms(subset_id=...)
and the subset_membership tidy frame.
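As a rough illustration of where subset memberships come from, this standalone sketch reads subset: tags out of OBO [Term] stanzas. It is not bioextract code: the helper name parse_subset_membership is hypothetical, the stanzas are abbreviated, and the (term, subset) row shape is an assumption about what a membership table might contain.

```python
# Abbreviated OBO stanzas; the subset tags shown are illustrative.
OBO_TEXT = """\
[Term]
id: GO:0005634
name: nucleus
namespace: cellular_component
subset: goslim_generic
subset: goslim_yeast

[Term]
id: GO:0005739
name: mitochondrion
namespace: cellular_component
subset: goslim_generic

[Term]
id: GO:0000001
name: mitochondrion inheritance
namespace: biological_process
"""


def parse_subset_membership(text):
    """Return (term_id, subset_id) pairs found in [Term] stanzas."""
    rows = []
    term_id = None
    in_term = False
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are of interest here.
            in_term = line == "[Term]"
            term_id = None
        elif in_term and line.startswith("id: "):
            term_id = line[len("id: "):]
        elif in_term and term_id and line.startswith("subset: "):
            # Drop any trailing "! comment" part of the tag value.
            subset_id = line[len("subset: "):].split(" !")[0].strip()
            rows.append((term_id, subset_id))
    return rows


membership = parse_subset_membership(OBO_TEXT)
goslim_generic = [term for term, subset in membership if subset == "goslim_generic"]
print(goslim_generic)
```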
KEGG
from bioextract.kegg import KeggDb
tidy = KeggDb.from_brite_json("br08901.json").build_tidy()
df_pathway = tidy.frames["pathway"]
report = tidy.write("out/br08901")
The GO and KEGG tidy writers emit flat parquet files by default. See
docs/architecture/go-kegg-tidy.md.
Reactome
from bioextract.reactome import ReactomeDb
db = ReactomeDb.from_files(
    file_uniprot2reactome="UniProt2Reactome.txt",
    file_pathways="ReactomePathways.txt",
    file_relations="ReactomePathwaysRelation.txt",
)
selection = db.with_species("Homo sapiens").select_ids(["P04637", "Q9Y243"])
df_mapping = selection.extract_mapping()
df_unmapped = selection.extract_unmapped_input_ids()
df_term2gene = db.with_species("Homo sapiens").extract_term2gene()
df_term2name = db.with_species("Homo sapiens").extract_term2name()
ReactomeDb reads local Reactome mapping files and emits annotation tables plus
standard enrichment inputs. The three raw files are composable: mapping-only
snapshots can still emit mapping and term2gene, pathways-only snapshots can
emit pathway and term2name, and relation extraction uses the relation file.
It does not call Reactome web services or calculate enrichment p-values.
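The sketch below shows, under stated assumptions, how term2gene and term2name tables can be derived from a mapping-only snapshot. It is plain Python rather than bioextract: the rows are toy data, the helper name build_enrichment_inputs is hypothetical, and the six-column layout (accession, pathway ID, URL, pathway name, evidence code, species) reflects the usual UniProt2Reactome.txt format rather than anything this README states.

```python
import csv
import io

# Toy rows in the assumed UniProt2Reactome.txt layout (tab-separated, no header):
# accession, pathway ID, URL, pathway name, evidence code, species.
UNIPROT2REACTOME_TEXT = (
    "TOY001\tR-HSA-0000001\t-\tToy pathway A\tTAS\tHomo sapiens\n"
    "TOY002\tR-HSA-0000001\t-\tToy pathway A\tIEA\tHomo sapiens\n"
    "TOY001\tR-MMU-0000001\t-\tToy pathway A\tIEA\tMus musculus\n"
)


def build_enrichment_inputs(text, species):
    """Return (term2gene, term2name) restricted to one species."""
    pairs = set()
    term2name = {}
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    for accession, pathway_id, _url, name, _evidence, row_species in reader:
        if row_species != species:
            continue
        pairs.add((pathway_id, accession))
        term2name[pathway_id] = name
    # Sort for a deterministic, de-duplicated term-to-gene table.
    return sorted(pairs), term2name


term2gene, term2name = build_enrichment_inputs(UNIPROT2REACTOME_TEXT, "Homo sapiens")
print(term2gene)
print(term2name)
```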
WikiPathways
from bioextract.wikipathways import WikiPathwaysDb
db = WikiPathwaysDb.from_gmt(
    "wikipathways-20260510-gmt-Homo_sapiens.gmt",
    species="Homo sapiens",
)
df_pathway = db.extract_pathway()
df_term2gene = db.extract_term2gene()
df_term2name = db.extract_term2name()
selection = db.select_ids(["2687", "435", "MISSING"])
df_mapping = selection.extract_mapping()
df_unmapped = selection.extract_unmapped_input_ids()
report = db.write_tidy("out/wikipathways-hsa", should_write_manifest=True)
WikiPathwaysDb reads local WikiPathways GMT files. GMT gene content is treated
as NCBI Entrez Gene IDs; the library does not perform identifier conversion or
calculate enrichment p-values.
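For orientation, here is a minimal sketch of reading GMT lines into term-to-gene rows and splitting input IDs into mapped and unmapped. It is independent of bioextract: the pathways are toy data, the helper names are hypothetical, and the "name%version%ID%species" layout of the first GMT field is an assumption about WikiPathways exports.

```python
# Toy GMT content: first field is the set name, second a description,
# the remaining fields are gene IDs (treated as Entrez Gene IDs).
GMT_TEXT = (
    "Toy pathway A%WikiPathways_20260510%WP0001%Homo sapiens\tdescription-a\t2687\t435\n"
    "Toy pathway B%WikiPathways_20260510%WP0002%Homo sapiens\tdescription-b\t435\t5594\n"
)


def parse_gmt(text):
    """Return (term2gene rows, term2name dict) from GMT text."""
    term2gene = []
    term2name = {}
    for line in text.splitlines():
        if not line:
            continue
        name_field, _description, *genes = line.split("\t")
        parts = name_field.split("%")
        # Fall back to the raw field if the assumed "%" layout is absent.
        pathway_id = parts[2] if len(parts) >= 3 else name_field
        term2name[pathway_id] = parts[0]
        term2gene.extend((pathway_id, gene) for gene in genes)
    return term2gene, term2name


def split_input_ids(input_ids, term2gene):
    """Partition input IDs by whether they occur in any gene set."""
    known = {gene for _, gene in term2gene}
    mapped = [i for i in input_ids if i in known]
    unmapped = [i for i in input_ids if i not in known]
    return mapped, unmapped


term2gene, term2name = parse_gmt(GMT_TEXT)
mapped, unmapped = split_input_ids(["2687", "435", "MISSING"], term2gene)
print(mapped, unmapped)
```

Because no identifier conversion takes place, an input such as a gene symbol would land in the unmapped list in this sketch.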
eggNOG
from bioextract.eggnog import EggnogDb
db = EggnogDb.from_files(
    file_eggnog_db="eggnog.db.gz",
    file_cog_fun="cog-24.fun.tab",
)
df_mapping = db.extract_mapping()
report = db.write_tidy("out/eggnog", should_write_manifest=True)
EggnogDb reads local eggNOG mapper SQLite snapshots and optional COG
function lookup tables. It emits a wide protein-to-COG mapping table for
annotation and enrichment-input preparation.
InterPro
from bioextract.interpro import InterProDb
db = InterProDb.from_mapping_files(
    file_protein2ipr="protein2ipr.dat.gz",
    file_interpro_xml="interpro.xml.gz",
)
df_mapping = db.extract_mapping()
report = db.write_tidy("out/interpro", should_write_manifest=True)
InterProDb reads local protein2ipr mapping files and optional InterPro XML
metadata, then emits one canonical UniProt-to-InterPro mapping parquet.
UniProt
from bioextract.uniprot import UniprotDb
db = UniprotDb.from_files(
    file_idmapping_selected="idmapping_selected.tab.gz",
)
df_hsa = db.with_taxids("9606").extract_mapping()
report = db.with_taxids("9606", "10090").write_tidy(
    "out/uniprot-idmapping",
    should_write_manifest=True,
)
UniprotDb reads raw UniProt idmapping_selected.tab(.gz), single parquet
files, or hive parquet dataset directories. Tidy writing emits a canonical
mapping.parquet; all-taxid export requires should_allow_all=True.
Use policy_existing="overwrite" or policy_existing="skip" when the output
directory already exists.
For UniProt knowledge-base flat files, write_eggnog_xref_tidy() can emit a
canonical UniProt-to-eggNOG xref parquet from uniprot_sprot.dat(.gz).
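The taxid restriction can be pictured with the standalone sketch below, which filters rows of an idmapping_selected-style table by NCBI taxon. It is not bioextract code: the rows are toy data, the helper names are hypothetical, and the column positions (accession first, Entrez GeneID third, NCBI taxon thirteenth of 22 tab-separated columns) are an assumption based on the UniProt idmapping README.

```python
N_COLUMNS = 22          # assumed width of idmapping_selected.tab
COL_ACCESSION = 0       # UniProtKB-AC
COL_GENE_ID = 2         # GeneID (EntrezGene)
COL_TAXID = 12          # NCBI-taxon


def make_row(accession, gene_id, taxid):
    """Build one toy tab-separated row with the other columns left empty."""
    fields = [""] * N_COLUMNS
    fields[COL_ACCESSION] = accession
    fields[COL_GENE_ID] = gene_id
    fields[COL_TAXID] = taxid
    return "\t".join(fields)


IDMAPPING_TEXT = "\n".join(
    [
        make_row("TOY_HUMAN", "1001", "9606"),
        make_row("TOY_MOUSE", "1002", "10090"),
        make_row("TOY_RAT", "1003", "10116"),
    ]
)


def filter_by_taxids(text, taxids):
    """Return (accession, gene_id, taxid) tuples for the requested taxids."""
    wanted = set(taxids)
    rows = []
    for line in text.splitlines():
        fields = line.split("\t")
        if fields[COL_TAXID] in wanted:
            rows.append((fields[COL_ACCESSION], fields[COL_GENE_ID], fields[COL_TAXID]))
    return rows


selected = filter_by_taxids(IDMAPPING_TEXT, ["9606", "10090"])
print(selected)
```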
Development
PYTHONPATH=src pytest
PYTHONPATH=src python scripts/benchmark_stringdb.py
Release
- GitHub Actions now provides:
  - .github/workflows/py-ci.yml for test-and-build checks on push and pull request
  - .github/workflows/publish.yml for tag-triggered PyPI publishing
- Release tags must be canonical PEP 440 versions such as 0.1.1
- The publish workflow expects PyPI trusted publishing to be configured for the pypi environment
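A tag can be checked locally before pushing. The sketch below uses the canonical-form regular expression given in the appendix of PEP 440; whether the publish workflow performs this exact check is not stated here, so treat it as an illustrative pre-flight test, and note that the function name is_canonical_version is a placeholder.

```python
import re

# Canonical-version pattern from the PEP 440 appendix.
CANONICAL_VERSION = re.compile(
    r"^([1-9][0-9]*!)?(0|[1-9][0-9]*)(\.(0|[1-9][0-9]*))*"
    r"((a|b|rc)(0|[1-9][0-9]*))?(\.post(0|[1-9][0-9]*))?(\.dev(0|[1-9][0-9]*))?$"
)


def is_canonical_version(tag):
    """Return True if the tag is a canonical PEP 440 version string."""
    return CANONICAL_VERSION.match(tag) is not None


for tag in ["0.1.1", "1.0rc1", "v0.1.1", "0.1.1-rc1"]:
    print(tag, is_canonical_version(tag))
```

A leading "v" or a hyphenated pre-release suffix makes a tag non-canonical, even though some tools would normalize it.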