
Human CHromosome Accession CHAnge - Convert between different human chromosome naming systems (of the same assembly/version)


hchacha - Human CHromosome Accession CHange

Translate among the different naming systems used for human chromosomes (of the same assembly)

Background

Several groups participate in and/or distribute the human reference sequence data from the Genome Reference Consortium. However, the same reference sequence for each chromosome gets accessioned under different identifiers. This script converts among these identifiers (within a single assembly version only; this is not a CrossMap or liftOver) for several commonly used file formats, including VCF, SAM, FASTA, and chain files.

Why? Well, there are several conventions for naming human chromosomes. The "ensembl" style numbers them 1-22, then X and Y. The "ucsc" style (named after the UCSC genome browser, and also used in GATK's reference bundles) prepends 'chr' to these. A downside of both is that '11' or 'chr11' does not uniquely identify a sequence (although it may in the context of a specific assembly version like GRCh38.p13). On the other hand, 'NC_000011.10' is a specific accessioned sequence (which happens to be the chromosome 11 sequence version used in the GRCh38 primary assembly). Likewise, the GenBank accession could be used instead of the RefSeq accession.
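To make the ambiguity concrete, here is a minimal Python sketch (not part of hchacha) showing that 'chr11' only resolves to one accession once an assembly is fixed. The table is hand-written for illustration. The GRCh37 accession NC_000011.9 comes from NCBI's records rather than from the text above, and the helper names are hypothetical.

```python
# Hand-written illustration only; hchacha itself takes these values
# from the NCBI assembly reports described under "Data used".
REFSEQ_BY_ASSEMBLY = {
    ("GRCh37", "11"): "NC_000011.9",
    ("GRCh38", "11"): "NC_000011.10",
}


def strip_ucsc_prefix(name: str) -> str:
    """Turn a UCSC-style name such as 'chr11' into the short form '11'."""
    return name[3:] if name.startswith("chr") else name


def refseq_accession(name: str, assembly: str) -> str:
    """Look up the RefSeq accession for a short or UCSC-style name.

    Raises KeyError for anything missing from the tiny table above.
    """
    return REFSEQ_BY_ASSEMBLY[(assembly, strip_ucsc_prefix(name))]
```

The same name, 'chr11', yields NC_000011.9 under GRCh37 and NC_000011.10 under GRCh38, which is why the accession is the less ambiguous identifier.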

Examples

hchacha --help
zcat input.vcf.gz | hchacha vcf -a 37 -t ensembl | bgzip -c > output.vcf.gz
samtools view -h input.bam | hchacha sam -a 38 -t refseq | samtools view -b > output.bam
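For readers curious what a conversion like the VCF example involves, the following Python sketch renames contigs in a stream of VCF lines. It is an assumption about the general technique, not hchacha's actual implementation: the function name and the pass-through behaviour for unmapped names are choices made for this example.

```python
import re
from typing import Dict, Iterable, Iterator

# Captures the text up to and including "ID=" and then the contig name.
CONTIG_ID = re.compile(r"^(##contig=<.*?\bID=)([^,>]+)")


def rename_vcf_lines(lines: Iterable[str], mapping: Dict[str, str]) -> Iterator[str]:
    """Yield VCF lines with contig names translated through `mapping`.

    Names missing from `mapping` are passed through unchanged.
    """
    for line in lines:
        if line.startswith("##contig="):
            # Header line: rewrite the ID=... field only.
            match = CONTIG_ID.match(line)
            if match and match.group(2) in mapping:
                line = match.group(1) + mapping[match.group(2)] + line[match.end():]
        elif not line.startswith("#"):
            # Data line: the first tab-separated column is CHROM.
            chrom, sep, rest = line.partition("\t")
            line = mapping.get(chrom, chrom) + sep + rest
        yield line
```

Used as a filter, this would look like `sys.stdout.writelines(rename_vcf_lines(sys.stdin, {"chr11": "11"}))`. Other header fields that mention contig names are not handled in this sketch.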

Data used

NCBI provides a useful file (*.assembly_report.txt) for each GRCh reference version and patch level that maps among these names. To get the data included in the repository (for GRCh versions 37 and 38), I did the following:

curl https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14/GCF_000001405.40_GRCh38.p14_assembly_report.txt | gzip -9 > src/hchacha/data/GCF_000001405.39_GRCh38.p14_assembly_report.txt.gz

curl https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_assembly_report.txt | gzip -9 > src/hchacha/data/GCF_000001405.25_GRCh37.p13_assembly_report.txt.gz

Changing the above (like for new patch levels) would also require changing the relevant filenames in the script.
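As a rough illustration of how such a report can be turned into a lookup table, here is a Python sketch. It assumes the layout NCBI uses at the time of writing: tab-separated rows, '#'-prefixed comment lines, "na" for missing values, and a final comment line that names columns such as RefSeq-Accn and UCSC-style-name. The function names and the abridged sample are made up for this example and are not taken from hchacha.

```python
import gzip
from typing import Dict, Iterable, List


def parse_assembly_report(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Parse assembly-report lines into one dict per sequence."""
    header: List[str] = []
    rows: List[Dict[str, str]] = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        if line.startswith("#"):
            # The last comment line before the data carries the column names.
            header = line.lstrip("# ").split("\t")
            continue
        rows.append(dict(zip(header, line.split("\t"))))
    return rows


def load_report(path: str) -> List[Dict[str, str]]:
    """Read a gzip-compressed report like those stored in the repository."""
    with gzip.open(path, "rt") as handle:
        return parse_assembly_report(handle)


def build_mapping(rows: List[Dict[str, str]], source: str, target: str) -> Dict[str, str]:
    """Map one naming column onto another, skipping 'na' entries."""
    return {
        row[source]: row[target]
        for row in rows
        if row[source] != "na" and row[target] != "na"
    }


# Abridged sample in the assumed layout (two chromosome rows only).
SAMPLE = (
    "# Assembly name:  GRCh38.p14\n"
    "# Sequence-Name\tSequence-Role\tAssigned-Molecule\t"
    "Assigned-Molecule-Location/Type\tGenBank-Accn\tRelationship\t"
    "RefSeq-Accn\tAssembly-Unit\tSequence-Length\tUCSC-style-name\n"
    "1\tassembled-molecule\t1\tChromosome\tCM000663.2\t=\t"
    "NC_000001.11\tPrimary Assembly\t248956422\tchr1\n"
    "11\tassembled-molecule\t11\tChromosome\tCM000673.2\t=\t"
    "NC_000011.10\tPrimary Assembly\t135086622\tchr11\n"
)

rows = parse_assembly_report(SAMPLE.splitlines())
ucsc_to_refseq = build_mapping(rows, "UCSC-style-name", "RefSeq-Accn")
```

Taking the header from the file rather than hard-coding column positions keeps the sketch tolerant of added columns, though any real change in the report format would still need checking.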

The mapping to Ensembl names is not quite as straightforward. It looks like Ensembl uses the "short" names (like 1, 2, 3, ... X, Y) for the primary chromosomes, then RefSeq accessions for the others, so that is what this script does.
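The rule just described can be sketched in Python as follows. This is a guess at the logic, not the script's code: it assumes "primary chromosomes" means rows whose Sequence-Role column in the assembly report is "assembled-molecule", and the scaffold row uses placeholder values rather than a real accession.

```python
from typing import Dict


def ensembl_style_name(row: Dict[str, str]) -> str:
    """Return the short name for primary chromosomes, else the RefSeq accession.

    Assumption: a primary chromosome is a row with Sequence-Role
    "assembled-molecule"; the real script may decide this differently.
    """
    if row["Sequence-Role"] == "assembled-molecule":
        return row["Sequence-Name"]
    return row["RefSeq-Accn"]


chromosome_row = {
    "Sequence-Name": "11",
    "Sequence-Role": "assembled-molecule",
    "RefSeq-Accn": "NC_000011.10",
}

# Placeholder values for illustration; not a real scaffold or accession.
scaffold_row = {
    "Sequence-Name": "EXAMPLE_SCAFFOLD",
    "Sequence-Role": "unplaced-scaffold",
    "RefSeq-Accn": "NT_PLACEHOLDER.1",
}
```

How the mitochondrial sequence should be named under this rule is not stated above, so the sketch leaves that case to whatever the report's Sequence-Role says.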

License

MIT license, but I am open to re-licensing this simple script some other way if you have a good reason.

It is my understanding that data derived from RefSeq/NCBI are in the public domain as the work product of an institution of the government of the United States of America.
