Audited Library Integration for External Namespaces


ALIEN

ALIEN: Audited Library Integration for External Namespaces is a fully Python-based tool for building namespace-specific GMT libraries for human gene-set workflows.

ALIEN is centered on one job: take configured source libraries, normalize their gene memberships into a canonical table, project them into configured target namespaces, and write combined GMT files with audit metadata.

Install

Install ALIEN from PyPI:

pip install bioalien

The PyPI distribution name is bioalien; the Python import and command-line tool are alien.

Quick Start

Prepare a config with sources and a target annotation, then run:

alien build --config examples/cancer_dependency.yml --workers 16

Ready-made configs are available for GTEx v11 / GENCODE v47 and TCGA recount3 / GENCODE v29 targets. Of the four configs listed below, the first three write both GTEx and TCGA GMTs; the cancer dependency config is TCGA-only. TCGA configs use GENCODE v29 as the primary annotation helper and recount3 G029 as a metadata fallback for measured IDs missing from GENCODE.

examples/pathways.yml: Reactome, WikiPathways, KEGG MEDICUS, and GO biological process terms.
examples/function_location.yml: GO molecular function and cellular component terms.
examples/disease_phenotype.yml: HPO, DisGeNET, ClinVar, GWAS Catalog, and Jensen disease libraries.
examples/cancer_dependency.yml: cancer and dependency signatures for TCGA recount3.

The primary outputs are:

gmt/<target_namespace>.gmt: combined GMT for each configured target namespace.
metadata/: source manifest, term manifest, gene mapping tables, removed terms, unmapped genes, ambiguity logs, and provenance.
qc/: collection, mapping, redundancy, target coverage, and warning summaries.
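For downstream use, the combined GMT can be read with a few lines of Python. This is a minimal sketch that assumes the standard GMT layout (term name, description, then tab-separated members); the helper name read_gmt and the demo term names are hypothetical, and what ALIEN writes in the description column is not specified here.

```python
from pathlib import Path
import tempfile


def read_gmt(path):
    """Parse a GMT file into {term_name: (description, [gene_ids])}."""
    terms = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2 or not fields[0]:
                continue  # skip blank or malformed lines
            name, description, *genes = fields
            terms[name] = (description, [g for g in genes if g])
    return terms


# Tiny demo file; term names and descriptions are made up for illustration.
demo = Path(tempfile.mkdtemp()) / "human_gencode49.gmt"
demo.write_text(
    "TERM_A\tsource=DEMO\tENSG00000141510\tENSG00000012048\n"
    "TERM_B\tsource=DEMO\tENSG00000139618\n",
    encoding="utf-8",
)
terms = read_gmt(demo)
print({name: len(genes) for name, (_, genes) in terms.items()})
```

Keeping the description alongside the members makes it easier to join terms back to the tables under metadata/.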

The two filesystem roots are configured in YAML: project.source_dir is for downloaded/cached source and mapping resources, while project.outdir is the output root containing gmt/, metadata/, and qc/.

Downstream enrichment reports using these GMTs are included in docs/sex_contrast/gtex_thyroid/analysis.md and docs/sex_contrast/tcga_lung/analysis.md. Scripts to regenerate the report data, text, and figures are available in scripts/.

How It Works

ALIEN builds one canonical membership table from configured sources, audits source gene symbols against human mapping resources, then projects each term into the requested output namespaces.
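The projection step can be pictured with a small sketch. It is not ALIEN's implementation: the function name project_terms, the dictionary shapes, and the fake symbol are assumptions chosen to show how symbol memberships might be mapped into a target namespace while unmapped genes are logged for audit.

```python
def project_terms(memberships, symbol_to_ids):
    """Map {term_id: [symbols]} into target IDs and collect unmapped symbols."""
    projected, unmapped = {}, []
    for term_id, symbols in memberships.items():
        ids = set()
        for symbol in symbols:
            hits = symbol_to_ids.get(symbol)
            if hits:
                ids.update(hits)
            else:
                unmapped.append((term_id, symbol))  # kept for the audit trail
        projected[term_id] = sorted(ids)
    return projected, unmapped


# Hypothetical inputs: one term and a symbol-to-Ensembl map from an annotation.
memberships = {"DEMO_TERM": ["TP53", "BRCA1", "NOT_A_GENE"]}
symbol_to_ids = {"TP53": ["ENSG00000141510"], "BRCA1": ["ENSG00000012048"]}

projected, unmapped = project_terms(memberships, symbol_to_ids)
print(projected)
print(unmapped)
```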

The target namespace is defined in the config. For an Ensembl-style namespace, provide a target name and a GTF annotation:

targets:
  - name: human_gencode49
    type: ensembl_gtf
    annotation:
      source: GENCODE
      version: "49"

This writes gmt/human_gencode49.gmt.

For human GENCODE releases, any numeric version is enough; ALIEN builds the official FTP URL and caches the GTF under data/alien_sources/gencode/ if it is missing. You can still provide annotation.path to pin a local file explicitly.
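As an illustration of the version-to-URL step, the sketch below follows the public GENCODE human release layout as commonly published. The exact URL and file flavour ALIEN requests are not stated here, so the host, the filename pattern, and the helper names are assumptions.

```python
from pathlib import PurePosixPath

# Assumed public layout of GENCODE human releases.
GENCODE_BASE = "https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human"


def gencode_gtf_name(version):
    """Return the assumed GTF filename for a numeric human GENCODE release."""
    release = str(version).strip()
    if not release.isdigit():
        raise ValueError(f"expected a numeric GENCODE version, got {version!r}")
    return f"gencode.v{release}.annotation.gtf.gz"


def gencode_gtf_url(version):
    """Build the download URL for the release under the assumed layout."""
    release = str(version).strip()
    return f"{GENCODE_BASE}/release_{release}/{gencode_gtf_name(release)}"


def cached_gtf_path(version, cache_dir="data/alien_sources/gencode"):
    """Where a cached copy would sit if the cache mirrors the filename."""
    return PurePosixPath(cache_dir) / gencode_gtf_name(version)


print(gencode_gtf_url("49"))
print(cached_gtf_path("49"))
```

Pinning annotation.path avoids depending on any URL convention at all, which is the safer choice for archived builds.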

source: GENCODE is not special to the config shape; other Ensembl-style GTF origins can use the same adapter by providing a local path or URL. Fully different target ID systems, such as Entrez or UniProt GMT output, are planned as future target adapters.

Targets separate the output gene set from the annotation helper. By default the GTF supplies both. Use output_genes when a dataset file supplies the final Ensembl ID namespace, or gene_filter when the annotation namespace should be intersected with a file or list of allowed IDs. In output_genes builds, id_column names the column holding Ensembl IDs, and the optional symbol_column supplies dataset-provided symbol metadata for IDs absent from the GTF. annotation.metadata_fallbacks can add secondary GTF metadata, but only for output genes missing from the primary annotation; primary annotation mappings keep priority.
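The priority rule for annotation metadata can be sketched as a simple lookup chain. The function name resolve_symbols, the record shape, and the placeholder ID are hypothetical; the sketch only shows the stated behaviour that a fallback is consulted for output genes missing from the primary annotation.

```python
def resolve_symbols(output_ids, primary, fallback):
    """Return {gene_id: (symbol, origin)}, preferring the primary annotation."""
    resolved = {}
    for gene_id in output_ids:
        if gene_id in primary:
            resolved[gene_id] = (primary[gene_id], "primary")
        elif gene_id in fallback:
            # Fallback metadata is used only when the primary has no entry.
            resolved[gene_id] = (fallback[gene_id], "fallback")
        else:
            resolved[gene_id] = (None, "unannotated")
    return resolved


# Hypothetical maps; "DEMO_ONLY_IN_FALLBACK" is a placeholder, not a real ID.
primary = {"ENSG00000141510": "TP53"}
fallback = {"ENSG00000141510": "OLD_NAME", "DEMO_ONLY_IN_FALLBACK": "DEMO1"}
output_ids = ["ENSG00000141510", "DEMO_ONLY_IN_FALLBACK", "DEMO_MISSING"]

resolved = resolve_symbols(output_ids, primary, fallback)
for gene_id, (symbol, origin) in resolved.items():
    print(gene_id, symbol, origin)
```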

The build keeps audit outputs beside the GMTs so each namespace projection can be traced back to source terms, symbol repairs, unmapped genes, filters, redundancy decisions, and provenance. The compact metadata/source_manifest.tsv table is the main record of the exact source collections used in a build; for regex-matched Enrichr libraries it records the resolved library, match method, and candidate names.

term_id values must identify one source term unambiguously. If two source libraries reuse the same term_id for different term metadata, ALIEN fails by default and writes metadata/term_id_collisions.tsv so the IDs can be renamed or prefixed before rebuilding.
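A collision check of this kind might look like the sketch below. The record fields (term_id, source, name) and the function name are assumptions for illustration, not the columns of metadata/term_id_collisions.tsv.

```python
from collections import defaultdict


def find_term_id_collisions(records):
    """Return term_ids that appear with more than one distinct (source, name)."""
    by_id = defaultdict(set)
    for record in records:
        by_id[record["term_id"]].add((record["source"], record["name"]))
    return {
        term_id: sorted(variants)
        for term_id, variants in by_id.items()
        if len(variants) > 1
    }


# Hypothetical records: two libraries reuse the ID "T1" for different terms.
records = [
    {"term_id": "T1", "source": "LIB_A", "name": "Apoptosis"},
    {"term_id": "T1", "source": "LIB_B", "name": "Cell cycle"},
    {"term_id": "T2", "source": "LIB_A", "name": "DNA repair"},
]

collisions = find_term_id_collisions(records)
print(collisions)
```

Prefixing each ID with its source name (for example LIB_A:T1) is one way to resolve such a collision before rebuilding.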

Redundancy filtering removes exact duplicate terms and then clusters highly overlapping terms by Jaccard similarity within each namespace and family. The default cutoff is 0.85:

redundancy:
  jaccard_cutoff: 0.85

Within each redundant cluster, ALIEN keeps one representative using the configured source priority, then term size and name-based tie-breaks.
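To make the cutoff concrete, here is a minimal sketch of Jaccard-based grouping. The Jaccard similarity of two sets is the size of their intersection divided by the size of their union. The single-linkage grouping via union-find is an assumption for illustration; ALIEN's actual clustering method is not described here.

```python
def jaccard(a, b):
    """Jaccard similarity: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_terms(terms, cutoff=0.85):
    """Group term names whose gene sets reach the cutoff (single linkage)."""
    names = sorted(terms)
    parent = {name: name for name in names}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path compression
            name = parent[name]
        return name

    for i, left in enumerate(names):
        for right in names[i + 1:]:
            if jaccard(terms[left], terms[right]) >= cutoff:
                parent[find(right)] = find(left)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(clusters.values())


# Hypothetical terms: B is A minus one gene (19/20 = 0.95), C is unrelated.
genes = [f"g{i}" for i in range(20)]
terms = {"A": genes, "B": genes[:19], "C": ["x1", "x2", "x3"]}

clusters = cluster_terms(terms, cutoff=0.85)
print(clusters)
```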

Source priority is configured per term family and controls which library wins when redundant terms overlap. For example, pathway terms can prefer REACTOME over broader ontology-derived terms:

source_priority:
  biology_process_pathway: [REACTOME, WIKIPATHWAYS, KEGG_MEDICUS, GOBP]
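Choosing a representative from a redundant cluster can be sketched as a sort key. The priority list is taken from the config above; the term record shape, the preference for larger terms, and the alphabetical name tie-break are assumptions, since the exact tie-break rules are not spelled out here.

```python
PRIORITY = ["REACTOME", "WIKIPATHWAYS", "KEGG_MEDICUS", "GOBP"]


def pick_representative(cluster, priority=PRIORITY):
    """Pick one term: best source rank, then larger size, then name."""
    rank = {source: index for index, source in enumerate(priority)}

    def sort_key(term):
        return (
            rank.get(term["source"], len(rank)),  # unlisted sources rank last
            -len(term["genes"]),                  # assumption: larger term wins
            term["name"],                         # assumption: alphabetical
        )

    return min(cluster, key=sort_key)


# Hypothetical cluster of overlapping pathway terms.
cluster = [
    {"name": "GOBP_APOPTOTIC_PROCESS", "source": "GOBP", "genes": ["a", "b", "c", "d"]},
    {"name": "REACTOME_APOPTOSIS", "source": "REACTOME", "genes": ["a", "b", "c"]},
    {"name": "WP_APOPTOSIS", "source": "WIKIPATHWAYS", "genes": ["a", "b", "c"]},
]

winner = pick_representative(cluster)
print(winner["name"])
```

Here the Reactome term wins even though the GO term is larger, because source rank is compared before size.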

See docs/usage.md for the full configuration reference.

Source Inputs

ALIEN 0.1.5 supports managed MSigDB and Enrichr download/cache sources and local file sources:

msigdb_remote: a Python downloader/reader for configured MSigDB release archives, with normalized Parquet caches for repeated builds.
enrichr_remote: a Python downloader/reader for Enrichr libraries by public library name.
symbol_gmt: a GMT file whose members are gene symbols.

Additional source metadata fields are documented in docs/usage.md.

Python API

from alien import build

result = build("examples/pathways.yml", workers=16)
print(result.namespaces)

You can also pass the same configuration as a Python dictionary. Target output genes work the same way, for example output_genes: {"path": "expression.tsv.gz", "id_column": "feature_id", "symbol_column": "gene_symbol"} when feature_id is known to hold Ensembl IDs.
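A dictionary config might be assembled as in the sketch below. The keys are the ones shown in this README, but the outdir value and the placement of output_genes inside a target entry are assumptions; docs/usage.md is the reference for the real schema. The build call is left commented out so the snippet runs without ALIEN installed and without triggering downloads.

```python
# Keys follow the YAML fragments in this README; nesting is an assumption.
config = {
    "project": {
        "source_dir": "data/alien_sources",  # download/cache root
        "outdir": "results/alien_demo",      # hypothetical output root
    },
    "sources": [
        {
            "type": "msigdb_remote",
            "version": "2026.1",
            "db_species": "HS",
            "collection": "C2",
        }
    ],
    "targets": [
        {
            "name": "human_gencode49",
            "type": "ensembl_gtf",
            "annotation": {"source": "GENCODE", "version": "49"},
            "output_genes": {
                "path": "expression.tsv.gz",
                "id_column": "feature_id",
                "symbol_column": "gene_symbol",
            },
        }
    ],
}

# With ALIEN installed, the dictionary would be passed in place of a path:
# from alien import build
# result = build(config, workers=16)
# print(result.namespaces)

print(sorted(config))
```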

MSigDB

sources:
  - type: msigdb_remote
    version: "2026.1"
    db_species: HS
    collection: C2

This stores the configured MSigDB release archive under project.source_dir/msigdb_remote/ by default, or under a source-level cache_dir override when one is provided. ALIEN verifies the archive by MD5, extracts the matching RDS files, and writes normalized Parquet memberships for later builds.
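Checksum verification of a downloaded archive can be sketched with the standard library. This is not ALIEN's code: the helper names are hypothetical, and the demo uses a small temporary file in place of a real MSigDB archive.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file without loading it all at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path, expected):
    """Return True when the file's MD5 matches the expected hex digest."""
    return md5_of_file(path) == expected.strip().lower()


# Demo with a stand-in file; a real build would check the release archive.
payload = b"demo archive bytes"
archive = Path(tempfile.mkdtemp()) / "archive.bin"
archive.write_bytes(payload)
expected = hashlib.md5(payload).hexdigest()

print(verify_md5(archive, expected))
print(verify_md5(archive, "0" * 32))
```

Reading in chunks keeps memory use flat for large archives.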

Remote caches are reused by default. ALIEN also caches normalized MSigDB memberships and prepared NCBI rescue maps, so repeated large builds avoid expensive source-format conversion. Use alien build --force-download or per-source force: true to refresh managed downloads such as MSigDB, Enrichr, GENCODE, HGNC, and NCBI resources.

For publication configs, prefer exact Enrichr library names and keep metadata/source_manifest.tsv with the released GMTs. Enabled sources are always required; use enabled: false to exclude a source deliberately.

Scope

The 0.1.5 release officially supports human gene sets using HGNC symbols, Python MSigDB and Enrichr cache integration, optimized repeated-build caches, and Ensembl-style target namespaces. The code is organized so broader namespace integrations can be added later without tying the package to any single downstream analysis project.

Contributing

Contributions are welcome. Useful areas include additional tests, documentation, source adapters, target namespace adapters, mapping-audit improvements, curated filtering and source-priority defaults, and validation against established gene-set resources. See docs/development.md for development setup and planned work.

Project details

Maintainers

deminden

Release history

0.1.6: May 19, 2026
0.1.5: May 15, 2026 (this version)
0.1.4: May 15, 2026

Download files

Source Distribution

bioalien-0.1.5.tar.gz (60.6 kB view details)

Uploaded May 15, 2026 Source

Built Distribution

bioalien-0.1.5-py3-none-any.whl (49.1 kB view details)

Uploaded May 15, 2026 Python 3

File details

Details for the file bioalien-0.1.5.tar.gz.

File metadata

Download URL: bioalien-0.1.5.tar.gz
Upload date: May 15, 2026
Size: 60.6 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for bioalien-0.1.5.tar.gz
Algorithm	Hash digest
SHA256	`a8187940b15903012e57b9625f513b02bb784485230e7ce92a4fb03b993ae1e4`
MD5	`1dddbdf877ebdcccd5ddf427d543a82c`
BLAKE2b-256	`612adf3982c7b2ce1553f5d02905ca23b027b7df5293489a1ae3da042cd67821`

Provenance

The following attestation bundles were made for bioalien-0.1.5.tar.gz:

Publisher: release.yml on deminden/alien

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: bioalien-0.1.5.tar.gz
- Subject digest: a8187940b15903012e57b9625f513b02bb784485230e7ce92a4fb03b993ae1e4
- Sigstore transparency entry: 1550905648
- Sigstore integration time: May 15, 2026
Source repository:
- Permalink: deminden/alien@a3d151ab2028b94f68a809307f051d893bcbccb2
- Branch / Tag: refs/tags/v0.1.5
- Owner: https://github.com/deminden
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@a3d151ab2028b94f68a809307f051d893bcbccb2
- Trigger Event: push

File details

Details for the file bioalien-0.1.5-py3-none-any.whl.

File metadata

Download URL: bioalien-0.1.5-py3-none-any.whl
Upload date: May 15, 2026
Size: 49.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for bioalien-0.1.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1ed49b7e5e20d8ecdb1265e49dfc16cc5a795a9254ee3cfc4761cecab569d6ae`
MD5	`136432f549903f15abff7bc2eb1bf92d`
BLAKE2b-256	`51ff427a212d22b768bdd0b494a62d14afbd9be781898a433f351723df36abdf`

Provenance

The following attestation bundles were made for bioalien-0.1.5-py3-none-any.whl:

Publisher: release.yml on deminden/alien

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: bioalien-0.1.5-py3-none-any.whl
- Subject digest: 1ed49b7e5e20d8ecdb1265e49dfc16cc5a795a9254ee3cfc4761cecab569d6ae
- Sigstore transparency entry: 1550905724
- Sigstore integration time: May 15, 2026
Source repository:
- Permalink: deminden/alien@a3d151ab2028b94f68a809307f051d893bcbccb2
- Branch / Tag: refs/tags/v0.1.5
- Owner: https://github.com/deminden
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@a3d151ab2028b94f68a809307f051d893bcbccb2
- Trigger Event: push
