
Infer genetic sex from variant data — Python port of the SauersML/infer_sex Rust crate.


infer_sex (Python)

Pythonic port of SauersML/infer_sex. Counts heterozygous calls per chromosome region and applies a linear decision boundary on (x_autosome_het_ratio, y_genome_density) to call genetic sex. The algorithm and every numeric constant are kept byte-identical to the Rust crate.

from infer_sex import SexInferer, platform_from_bim

inferer = SexInferer(
    build="hg38",
    platform=platform_from_bim("/data/cohort.bim", build="hg38"),
)

# Pick your input — they're all type-checked and return the same shape:
result = inferer.infer_from_vcf("/data/sample.vcf.gz")
# or  inferer.infer_from_plink("/data/cohort")
# or  inferer.infer_from_records([("X", 100_000_000, True), ...])
# or  inferer.infer_from_arrays(chrom_codes, positions, is_het)

print(result.final_call)   # InferredSex.MALE / .FEMALE / .INDETERMINATE
print(result.report.composite_sex_index)

Install

pip install infer_sex

Pure Python + numpy. No Rust toolchain required.

Platform definitions

The algorithm normalises observed counts by the attempted counts on the platform — pass them in via PlatformDefinition. Two helpers compute them for you in one call:

from infer_sex import platform_from_bim, platform_from_vcf

platform = platform_from_bim("/data/cohort.bim", build="hg38")
platform = platform_from_vcf("/data/cohort.vcf.gz", build="hg38")

Both helpers walk the file once and count only autosomal rows and Y non-PAR rows, which is exactly the locus set the inference algorithm uses for normalisation. Everything else (X, Y-PAR, MT, alt contigs) is ignored.
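
To make the counting rule concrete, here is a minimal standalone sketch of a one-pass .bim scan. It is not the package's implementation: the helper names (normalise_chrom, count_platform_loci) are hypothetical, and the chrY PAR coordinates are the commonly cited GRCh38 values, assumed here rather than copied from the crate.

```python
from typing import Iterable, Tuple

# GRCh38 pseudoautosomal regions on chrY (1-based, inclusive).
# ASSUMPTION: commonly cited coordinates; the package keeps its own table.
Y_PAR_HG38 = ((10_001, 2_781_479), (56_887_903, 57_217_415))

AUTOSOMES = {str(i) for i in range(1, 23)}
Y_NAMES = {"Y", "24"}  # PLINK may encode Y as 24


def normalise_chrom(raw: str) -> str:
    """Strip an optional 'chr' prefix and upper-case the name."""
    name = raw.strip()
    if name.lower().startswith("chr"):
        name = name[3:]
    return name.upper()


def count_platform_loci(bim_lines: Iterable[str]) -> Tuple[int, int]:
    """Return (autosomal rows, Y non-PAR rows) from .bim-formatted lines."""
    n_auto = 0
    n_y_nonpar = 0
    for line in bim_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed rows
        chrom = normalise_chrom(fields[0])
        pos = int(fields[3])  # .bim column 4 is the base-pair position
        if chrom in AUTOSOMES:
            n_auto += 1
        elif chrom in Y_NAMES:
            in_par = any(start <= pos <= end for start, end in Y_PAR_HG38)
            if not in_par:
                n_y_nonpar += 1
        # X, XY, MT and alt contigs fall through and are ignored
    return n_auto, n_y_nonpar


example = [
    "1\trs1\t0\t12345\tA\tG",
    "22\trs2\t0\t500000\tC\tT",
    "X\trs3\t0\t100000000\tA\tC",
    "Y\trs4\t0\t1000000\tG\tT",    # inside PAR1, ignored
    "Y\trs5\t0\t15000000\tA\tG",   # Y non-PAR, counted
    "MT\trs6\t0\t100\tC\tT",
]
print(count_platform_loci(example))  # (2, 1)
```

The two numbers correspond to n_attempted_autosomes and n_attempted_y_nonpar. In practice platform_from_bim and platform_from_vcf should be preferred, since they use the package's own coordinate table.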

If you already know the counts (e.g. from a manifest), construct PlatformDefinition directly:

from infer_sex import PlatformDefinition

platform = PlatformDefinition(
    n_attempted_autosomes=2_000,
    n_attempted_y_nonpar=1_000,
)

Shortcuts: pass what you already know

infer_sex never touches the network. Skip build detection by passing build= directly. To use custom decision thresholds (e.g. ones fitted on your own labelled data), pass a DecisionThresholds instance:

from infer_sex import DecisionThresholds, PlatformDefinition, SexInferer

inferer = SexInferer(
    build="hg38",
    platform=PlatformDefinition(...),
    thresholds=DecisionThresholds(slope=0.30, intercept=0.25),
)

Inputs

  • infer_from_vcf(path): .vcf / .vcf.gz. Multi-sample files default to the first sample column with a UserWarning; pass sample= to pick by ID (str) or 0-based index (int).
  • infer_from_plink(prefix): variant-major .bed/.bim/.fam. Pass sample= to pick a specific row of the .fam file (string IID/FID or 0-based index). Reads via np.memmap, so biobank-scale .bed files are fine.
  • infer_from_records(iterable): accepts (chrom, pos, is_het) triples. Useful when reading from a custom source.
  • infer_from_arrays(chrom, pos, is_het): parallel numpy arrays; ~10× faster than infer_from_records for the same data.
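
When reading from a custom source, the (chrom, pos, is_het) triples have to be built by hand. The sketch below shows one way to derive is_het from VCF-style GT strings. The helpers gt_is_het and records_from_calls are hypothetical, and treating half-missing calls such as ./1 as missing is an assumption, not documented package behaviour.

```python
import re
from typing import Iterable, Iterator, Optional, Tuple

_GT_SPLIT = re.compile(r"[/|]")  # unphased "/" or phased "|" separator


def gt_is_het(gt: str) -> Optional[bool]:
    """Classify a GT string: True = het, False = not het, None = missing."""
    alleles = _GT_SPLIT.split(gt.strip())
    if any(a in (".", "") for a in alleles):
        return None  # missing or half-missing call (ASSUMPTION: both dropped)
    if len(alleles) < 2:
        return False  # a haploid call cannot be heterozygous
    return len(set(alleles)) > 1


def records_from_calls(
    calls: Iterable[Tuple[str, int, str]]
) -> Iterator[Tuple[str, int, bool]]:
    """Yield (chrom, pos, is_het) triples, skipping missing genotypes."""
    for chrom, pos, gt in calls:
        is_het = gt_is_het(gt)
        if is_het is not None:
            yield (chrom, pos, is_het)


calls = [
    ("X", 100_000_000, "0/1"),
    ("1", 12_345, "1|1"),
    ("Y", 15_000_000, "./."),   # missing, skipped
    ("Y", 15_000_100, "1"),     # haploid, not het
]
records = list(records_from_calls(calls))
print(records)
```

The resulting list (or the generator itself) is the kind of iterable that infer_from_records accepts.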

Missing genotypes (./. in VCF, 0b01 in PLINK) are dropped — they don't count toward the denominator, matching the Rust crate.
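
For readers unfamiliar with the PLINK encoding, the following sketch unpacks one variant's worth of .bed bytes with numpy and masks out the 0b01 (missing) code. It follows the PLINK 1 variant-major layout (four samples per byte, lowest two bits first); decode_variant_block is a hypothetical helper, not part of infer_sex.

```python
import numpy as np

# PLINK 1 .bed two-bit genotype codes
HOM_A1, MISSING, HET, HOM_A2 = 0b00, 0b01, 0b10, 0b11


def decode_variant_block(block: bytes, n_samples: int) -> np.ndarray:
    """Unpack one variant's packed genotypes into an array of 2-bit codes."""
    packed = np.frombuffer(block, dtype=np.uint8)
    # Each byte holds four samples; the lowest two bits belong to the first.
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    codes = (packed[:, None] >> shifts) & 0b11
    # Trailing slots in the last byte are padding and are cut off here.
    return codes.reshape(-1)[:n_samples]


# Five samples: hom A1, missing, het, hom A2, het (last byte is padded).
block = bytes([0b11100100, 0b00000010])
codes = decode_variant_block(block, n_samples=5)

called = codes != MISSING   # missing calls leave the denominator
is_het = codes == HET

print(codes.tolist())
print(int(called.sum()), int(is_het[called].sum()))
```

Only the called genotypes contribute to the counts, which is the behaviour described above for both VCF and PLINK input.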

Streaming API

SexInferenceAccumulator is the streaming primitive:

acc = inferer.accumulator()
for chrom, pos, is_het in my_stream:
    acc.add(chrom, pos, is_het)
# bulk path:
acc.add_batch(chrom_array, pos_array, is_het_array)
# any time:
print(acc.snapshot())       # raw counts, no classification
result = acc.finish()        # full inference; does not consume the accumulator
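
If the stream yields one record at a time but the bulk path is preferred, records can be grouped into parallel arrays first. This is a minimal sketch: batched_arrays is a hypothetical helper, and the use of string chromosome names in the array is an assumption, since the representation add_batch expects is not specified here.

```python
from itertools import islice
from typing import Iterable, Iterator, Tuple

import numpy as np

Record = Tuple[str, int, bool]
Batch = Tuple[np.ndarray, np.ndarray, np.ndarray]


def batched_arrays(stream: Iterable[Record], size: int) -> Iterator[Batch]:
    """Group (chrom, pos, is_het) triples into parallel arrays of <= size rows."""
    it = iter(stream)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        chroms, positions, hets = zip(*chunk)
        yield (
            np.array(chroms, dtype=object),   # ASSUMPTION: string chrom names
            np.array(positions, dtype=np.int64),
            np.array(hets, dtype=bool),
        )


stream = [("1", 100, False), ("X", 200, True), ("Y", 300, False)]
batches = list(batched_arrays(stream, size=2))

# With a real accumulator, each batch would be forwarded like this:
#     for chrom_array, pos_array, is_het_array in batched_arrays(my_stream, 100_000):
#         acc.add_batch(chrom_array, pos_array, is_het_array)
print([len(chrom_array) for chrom_array, _, _ in batches])
```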

Results

result.final_call           # InferredSex enum
result.is_male / .is_female / .is_indeterminate
result.report.y_genome_density          # Optional[float]
result.report.x_autosome_het_ratio      # Optional[float]
result.report.composite_sex_index       # Optional[float]
result.report.auto_valid_count          # int (and many more counts)
result.report.as_dict()                  # plain dict, JSON-ready

Errors

  • InvalidPlatformCounts — the PlatformDefinition is unusable (e.g. zero autosomes).
  • ObservedExceedsAttempted — more observations than the platform claims (your platform definition doesn't match the input stream).

Both subclass InferenceError.
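
Because both errors share a base class, a single except clause can handle either. The sketch below uses local stand-in classes that only mirror the names and hierarchy described above, so that it runs without the package installed; check_counts is a hypothetical function and its conditions are a guess at what each error represents.

```python
class InferenceError(Exception):
    """Local stand-in for the package's base error (not the real class)."""


class InvalidPlatformCounts(InferenceError):
    """Stand-in: the platform definition is unusable."""


class ObservedExceedsAttempted(InferenceError):
    """Stand-in: more observations than the platform claims."""


def check_counts(n_attempted: int, n_observed: int) -> None:
    """ASSUMED checks, for illustration only."""
    if n_attempted <= 0:
        raise InvalidPlatformCounts("platform reports no attempted loci")
    if n_observed > n_attempted:
        raise ObservedExceedsAttempted(
            f"observed {n_observed} > attempted {n_attempted}"
        )


def describe(n_attempted: int, n_observed: int) -> str:
    """Catch either failure through the shared base class."""
    try:
        check_counts(n_attempted, n_observed)
    except InferenceError as exc:
        return type(exc).__name__
    return "ok"


print(describe(0, 0))
print(describe(10, 11))
print(describe(10, 5))
```

With the real package, the same pattern would wrap a call such as inferer.infer_from_vcf(...) and import the exception classes from infer_sex.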

Cross-language guarantee

Every PAR/non-PAR coordinate, the 1e-9 epsilon, the default DecisionThresholds(slope=0.3566, intercept=0.2738), and the classification formula are kept byte-identical to src/lib.rs in this same repo. Calls match across languages — feed the same variant stream to the Rust accumulator and the Python SexInferer and you get the same InferredSex.
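
As a rough illustration of what a linear decision boundary in the (x_autosome_het_ratio, y_genome_density) plane means, the sketch below classifies a point by the side of the line it falls on. The exact formula, which side maps to which call, and the role of the 1e-9 epsilon are all assumptions here; src/lib.rs remains the authoritative definition. Call and classify are hypothetical names.

```python
from enum import Enum
from typing import Optional


class Call(Enum):
    MALE = "male"
    FEMALE = "female"
    INDETERMINATE = "indeterminate"


def classify(
    x_autosome_het_ratio: Optional[float],
    y_genome_density: Optional[float],
    slope: float = 0.3566,       # default slope quoted above
    intercept: float = 0.2738,   # default intercept quoted above
) -> Call:
    """ASSUMED rule: points above the line y = slope * x + intercept are male."""
    if x_autosome_het_ratio is None or y_genome_density is None:
        return Call.INDETERMINATE  # a metric could not be computed
    boundary = slope * x_autosome_het_ratio + intercept
    return Call.MALE if y_genome_density > boundary else Call.FEMALE


print(classify(0.05, 0.90))   # high Y density, little X heterozygosity
print(classify(1.00, 0.00))   # no Y signal, autosome-like X heterozygosity
print(classify(None, 0.50))   # missing metric
```

Custom DecisionThresholds would shift or tilt this line; the inputs themselves come from result.report.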
