Infer genetic sex from variant data — Python port of the SauersML/infer_sex Rust crate.
infer_sex (Python)
Pythonic port of SauersML/infer_sex.
Counts heterozygous calls per chromosome region and applies a linear
decision boundary on (x_autosome_het_ratio, y_genome_density) to call
genetic sex. The algorithm and every numeric constant are kept
identical to those in the Rust crate.
from infer_sex import SexInferer, platform_from_bim
inferer = SexInferer(
    build="hg38",
    platform=platform_from_bim("/data/cohort.bim", build="hg38"),
)
# Pick your input — they're all type-checked and return the same shape:
result = inferer.infer_from_vcf("/data/sample.vcf.gz")
# or inferer.infer_from_plink("/data/cohort")
# or inferer.infer_from_records([("X", 100_000_000, True), ...])
# or inferer.infer_from_arrays(chrom_codes, positions, is_het)
print(result.final_call) # InferredSex.MALE / .FEMALE / .INDETERMINATE
print(result.report.composite_sex_index)
Install
pip install infer_sex
Pure Python + numpy. No Rust toolchain required.
Platform definitions
The algorithm normalises observed counts by the attempted counts on
the platform — pass them in via PlatformDefinition. Two helpers
compute them for you in one call:
from infer_sex import platform_from_bim, platform_from_vcf
platform = platform_from_bim("/data/cohort.bim", build="hg38")
platform = platform_from_vcf("/data/cohort.vcf.gz", build="hg38")
These walk the file once, counting autosomal rows and Y-non-PAR rows, which is exactly the locus set the inference algorithm uses for normalisation. Everything else (X, Y-PAR, MT, alt contigs) is ignored.
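As a rough illustration of that single pass, the sketch below counts autosomal and Y-non-PAR rows in `.bim`-formatted lines. It is not the package's code: `count_attempted_loci` and `normalise_chrom` are hypothetical names, and the PAR intervals are the commonly cited GRCh38 values, assumed here rather than taken from the package.

```python
from typing import Iterable, Tuple

# Assumption: commonly cited GRCh38 Y-chromosome PAR intervals (1-based, inclusive).
# The package keeps its own constants; these are for illustration only.
HG38_Y_PARS = ((10_001, 2_781_479), (56_887_903, 57_217_415))

AUTOSOMES = {str(i) for i in range(1, 23)}
Y_NAMES = {"Y", "24"}  # PLINK encodes Y as 24


def normalise_chrom(raw: str) -> str:
    """Strip an optional 'chr' prefix and upper-case the name."""
    name = raw.strip()
    if name.lower().startswith("chr"):
        name = name[3:]
    return name.upper()


def count_attempted_loci(bim_lines: Iterable[str]) -> Tuple[int, int]:
    """Return (n_autosomal_rows, n_y_nonpar_rows) for .bim-formatted lines."""
    n_auto = 0
    n_y_nonpar = 0
    for line in bim_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed rows
        chrom = normalise_chrom(fields[0])
        pos = int(fields[3])  # .bim column 4 is the base-pair position
        if chrom in AUTOSOMES:
            n_auto += 1
        elif chrom in Y_NAMES:
            in_par = any(lo <= pos <= hi for lo, hi in HG38_Y_PARS)
            if not in_par:
                n_y_nonpar += 1
        # X, MT, PLINK's XY code (25) and alt contigs fall through and are ignored.
    return n_auto, n_y_nonpar


example = [
    "1\trs1\t0\t12345\tA\tG",
    "chr22\trs2\t0\t500000\tC\tT",
    "X\trs3\t0\t100000000\tA\tC",
    "Y\trs4\t0\t1000000\tA\tG",    # inside the assumed PAR1, ignored
    "Y\trs5\t0\t15000000\tG\tT",   # Y non-PAR, counted
    "MT\trs6\t0\t100\tA\tG",
]
print(count_attempted_loci(example))  # (2, 1)
```

The two numbers returned correspond to the two fields a PlatformDefinition expects, so a manifest in another format could be reduced to the same pair.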
If you already know the counts (e.g. from a manifest), construct
PlatformDefinition directly:
from infer_sex import PlatformDefinition
platform = PlatformDefinition(
    n_attempted_autosomes=2_000,
    n_attempted_y_nonpar=1_000,
)
Shortcuts: pass what you already know
infer_sex never touches the network. Skip build detection by passing
build= directly. Use custom decision thresholds (e.g. one fit on your
own labelled data) via DecisionThresholds:
from infer_sex import DecisionThresholds, PlatformDefinition, SexInferer
inferer = SexInferer(
    build="hg38",
    platform=PlatformDefinition(...),
    thresholds=DecisionThresholds(slope=0.30, intercept=0.25),
)
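The text here says only that the classifier is a linear boundary on (x_autosome_het_ratio, y_genome_density) defined by a slope and an intercept. The sketch below shows one way such a boundary could be evaluated. The orientation (points above the line are called male), the `margin` dead band, and the names `ToyThresholds` and `classify` are all assumptions for illustration, not the package's formula.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToyThresholds:
    """Stand-in for a (slope, intercept) pair; not the package's class."""
    slope: float
    intercept: float


def classify(x_ratio: Optional[float], y_density: Optional[float],
             t: ToyThresholds, margin: float = 0.0) -> str:
    """Call sex from a point's signed vertical distance to the line.

    Assumption: points above y = slope * x + intercept are male, points
    below are female. `margin` is a hypothetical dead band around the line.
    """
    if x_ratio is None or y_density is None:
        return "indeterminate"  # a metric could not be computed
    distance = y_density - (t.slope * x_ratio + t.intercept)
    if distance > margin:
        return "male"
    if distance < -margin:
        return "female"
    return "indeterminate"


defaults = ToyThresholds(slope=0.3566, intercept=0.2738)  # defaults quoted in this document
print(classify(0.05, 0.95, defaults))  # low X heterozygosity, dense Y
print(classify(1.00, 0.01, defaults))  # autosome-like X heterozygosity, sparse Y
print(classify(None, 0.50, defaults))  # missing metric
```

Refitting the line on labelled data then amounts to choosing a new slope and intercept; nothing else in the decision changes.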
Inputs
- infer_from_vcf(path): .vcf / .vcf.gz. Multi-sample files default to the first column with a UserWarning; pass sample= to pick by ID (str) or 0-based index (int).
- infer_from_plink(prefix): variant-major .bed/.bim/.fam. Pass sample= to pick a specific row of the FAM (string IID/FID or 0-based index). Reads via np.memmap; biobank-scale .beds are fine.
- infer_from_records(iterable): accepts (chrom, pos, is_het) triples. Useful when reading from a custom source.
- infer_from_arrays(chrom, pos, is_het): parallel numpy arrays; ~10× faster than infer_from_records for the same data.
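For a custom source, the reader only has to yield `(chrom, pos, is_het)` triples. A minimal sketch of such a reader for VCF-style text lines follows. The function names are hypothetical, and skipping haploid calls is a simplification that may differ from what the package does.

```python
from typing import Iterable, Iterator, Optional, Tuple


def gt_is_het(gt: str) -> Optional[bool]:
    """True/False for a called diploid genotype, None if missing or not diploid."""
    alleles = gt.replace("|", "/").split("/")
    if len(alleles) != 2 or "." in alleles:
        return None
    return alleles[0] != alleles[1]


def records_from_vcf_lines(lines: Iterable[str], sample_index: int = 0
                           ) -> Iterator[Tuple[str, int, bool]]:
    """Yield (chrom, pos, is_het) triples for one sample column."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # header or blank line
        fields = line.rstrip("\n").split("\t")
        fmt_keys = fields[8].split(":")          # FORMAT column
        if "GT" not in fmt_keys:
            continue
        sample_field = fields[9 + sample_index].split(":")
        is_het = gt_is_het(sample_field[fmt_keys.index("GT")])
        if is_het is None:
            continue  # missing (or haploid) calls are skipped in this sketch
        yield fields[0], int(fields[1]), is_het


vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
    "X\t100000000\t.\tA\tG\t.\tPASS\t.\tGT:DP\t0/1:12\t1/1:9",
    "1\t5000\t.\tC\tT\t.\tPASS\t.\tGT\t./.\t0|1",
    "2\t7000\t.\tG\tA\t.\tPASS\t.\tGT\t1|1\t0/0",
]
print(list(records_from_vcf_lines(vcf, sample_index=0)))
# [('X', 100000000, True), ('2', 7000, False)]
```

A generator like this could be handed to infer_from_records as the iterable; the same triples transposed into three parallel sequences are the shape infer_from_arrays takes.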
Missing genotypes (./. in VCF, 0b01 in PLINK) are dropped — they
don't count toward the denominator, matching the Rust crate.
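To make the PLINK case concrete: a variant-major `.bed` file packs four samples per byte, two bits each, starting from the low-order bits, and the code `0b01` marks a missing call. The sketch below unpacks one variant's bytes in pure Python and drops the missing entries. The helper names are hypothetical, and the package itself reads via np.memmap rather than this loop.

```python
from typing import List, Optional

# PLINK 1 .bed two-bit genotype codes (lowest bit pair = first sample in the byte).
HOM_A1, MISSING, HET, HOM_A2 = 0b00, 0b01, 0b10, 0b11


def decode_variant_block(block: bytes, n_samples: int) -> List[int]:
    """Unpack one variant's bytes into per-sample two-bit codes."""
    codes = []
    for byte in block:
        for shift in (0, 2, 4, 6):
            codes.append((byte >> shift) & 0b11)
    return codes[:n_samples]  # pad bits in the last byte are discarded


def is_het_or_none(code: int) -> Optional[bool]:
    """Map a code to True/False, or None for a missing call."""
    if code == MISSING:
        return None
    return code == HET


# One byte holding four samples: hom A1, missing, het, hom A2.
codes = decode_variant_block(bytes([0b11_10_01_00]), n_samples=4)
calls = [is_het_or_none(c) for c in codes]
kept = [c for c in calls if c is not None]  # missing calls never reach the denominator
print(codes)  # [0, 1, 2, 3]
print(kept)   # [False, True, False]
```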
Streaming API
SexInferenceAccumulator is the streaming primitive:
acc = inferer.accumulator()
for chrom, pos, is_het in my_stream:
    acc.add(chrom, pos, is_het)
# bulk path:
acc.add_batch(chrom_array, pos_array, is_het_array)
# any time:
print(acc.snapshot()) # raw counts, no classification
result = acc.finish() # full inference; does not consume the accumulator
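The counting side of a streaming accumulator can be pictured with a small stand-in. `ToyAccumulator` below is hypothetical and deliberately simplified: it ignores PAR boundaries (so `pos` is unused) and performs no classification; it only shows how single-record and batch updates can feed the same counters and how a snapshot can be taken without resetting state.

```python
from collections import Counter
from typing import Dict, Iterable


class ToyAccumulator:
    """Hypothetical stand-in for a streaming het counter; not the package's class."""

    def __init__(self) -> None:
        self._valid: Counter = Counter()
        self._het: Counter = Counter()

    @staticmethod
    def _region(chrom: str) -> str:
        name = chrom[3:] if chrom.lower().startswith("chr") else chrom
        if name.isdigit() and 1 <= int(name) <= 22:
            return "autosome"
        return name.upper() if name.upper() in ("X", "Y") else "other"

    def add(self, chrom: str, pos: int, is_het: bool) -> None:
        # `pos` is accepted for API symmetry; PAR handling is omitted in this sketch.
        region = self._region(chrom)
        self._valid[region] += 1
        if is_het:
            self._het[region] += 1

    def add_batch(self, chroms: Iterable[str], positions: Iterable[int],
                  is_het: Iterable[bool]) -> None:
        for c, p, h in zip(chroms, positions, is_het):
            self.add(c, p, h)

    def snapshot(self) -> Dict[str, Dict[str, int]]:
        """Return copies of the raw counts; internal state is left untouched."""
        return {"valid": dict(self._valid), "het": dict(self._het)}


acc = ToyAccumulator()
acc.add("1", 1_000, True)
acc.add("chrX", 100_000_000, False)
acc.add_batch(["2", "X", "Y"], [5, 6, 7], [False, True, False])
snap = acc.snapshot()
print(snap["valid"])  # {'autosome': 2, 'X': 2, 'Y': 1}
print(snap["het"])    # {'autosome': 1, 'X': 1}
```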
Results
result.final_call # InferredSex enum
result.is_male / .is_female / .is_indeterminate
result.report.y_genome_density # Optional[float]
result.report.x_autosome_het_ratio # Optional[float]
result.report.composite_sex_index # Optional[float]
result.report.auto_valid_count # int (and many more counts)
result.report.as_dict() # plain dict, JSON-ready
Errors
- InvalidPlatformCounts: the PlatformDefinition is unusable (e.g. zero autosomes).
- ObservedExceedsAttempted: more observations than the platform claims (your platform definition doesn't match the input stream).
Both subclass InferenceError.
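Because both errors share a base class, one `except` clause can cover either failure. The sketch below defines local stand-ins with the same names so that it runs without the package installed. `check_counts` and its exact conditions are assumptions based on the descriptions above, not the package's validation code.

```python
class InferenceError(Exception):
    """Local stand-in for the package's base error class."""


class InvalidPlatformCounts(InferenceError):
    """The platform definition itself is unusable."""


class ObservedExceedsAttempted(InferenceError):
    """More loci were observed than the platform claims to attempt."""


def check_counts(attempted_auto: int, attempted_y: int,
                 observed_auto: int, observed_y: int) -> None:
    """Hypothetical validation step; the package's exact rules may differ."""
    if attempted_auto <= 0 or attempted_y < 0:
        raise InvalidPlatformCounts(
            f"unusable platform: autosomes={attempted_auto}, y_nonpar={attempted_y}")
    if observed_auto > attempted_auto or observed_y > attempted_y:
        raise ObservedExceedsAttempted(
            f"observed ({observed_auto}, {observed_y}) exceeds "
            f"attempted ({attempted_auto}, {attempted_y})")


def describe(attempted_auto: int, attempted_y: int,
             observed_auto: int, observed_y: int) -> str:
    """Catch either failure through the shared base class."""
    try:
        check_counts(attempted_auto, attempted_y, observed_auto, observed_y)
    except InferenceError as exc:
        return type(exc).__name__
    return "ok"


print(describe(2_000, 1_000, 1_950, 40))   # ok
print(describe(0, 1_000, 0, 0))            # InvalidPlatformCounts
print(describe(2_000, 1_000, 2_500, 40))   # ObservedExceedsAttempted
```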
Cross-language guarantee
Every PAR/non-PAR coordinate, the 1e-9 epsilon, the default
DecisionThresholds(slope=0.3566, intercept=0.2738), and the
classification formula are kept byte-identical to src/lib.rs in this
same repo. Calls match across languages — feed the same variant stream
to the Rust accumulator and the Python SexInferer and you get the
same InferredSex.