Skip to main content

Human CHromosome Accession CHAnge - Convert between different human chromosome naming systems (of the same assembly/version)

Project description

hchacha - Human CHromosome Accession CHange

Translate among the different naming systems used for human chromosomes (of the same assembly)

Background

There are a number of different groups that participant in and/or provide reference human sequence data from the Genome Reference Consortium. However, the same reference sequence data for each chromosome get accessioned under different identifiers. This script converts among these identifiers (just within versions-- this is not a crossMap or liftOver), for several commonly-used file formats, including VCF, SAM, FASTA, chain files...

Why? Well, there are several conventions for the naming of human chromosomes. The "ensembl" style numbers them 1-22 then X and Y. The "ucsc" style (named after the UCSC genome browser, also used in GATK's reference bundles) prepends these with 'chr'. However, a downside of both of these is that '11' or 'chr11' do not uniquely identify a sequence (although they may in the context of a specific assembly version like GRCh38.p13. On the other hand, 'NC_000011.10' is a specific accessioned sequence (which happens to be the chromosome 11 sequence version used in the GRCh38 primary assembly. Likewise, the genbank accession rather than the refseq accession could be used.

Examples

hchacha --help
zcat input.vcf.gz | hchacha vcf -a 37 -t ensembl | bgzip -c > output.vcf.gz
samtools view -h input.bam | hchacha sam -a 38 -t refseq | samtools view -b > output.bam

Smarter handling for BAM/CRAM files

Since all you are doing is really renaming the sequences in the header (and individual BAM/CRAM records refer back to those sequence names by an integer index), you can do things much more quickly and with less CPU usage using samtools reheader if it is available on your system.

For example:

samtools reheader -P -c 'hchacha sam -a 38 -t ucsc -s' input.bam > output.bam

With some clever use of the tee command to output the new bam file and continue the shell pipeline going, you can even make the new index at the same time:

samtools reheader -P -c 'hchacha sam -a 38 -t ucsc -s' input.bam | tee output.bam | samtools index - output.bam.bai

Data used

NCBI provides a useful file (*.assembly_report.txt) for different GRCh reference versions and patch levels, for instance here, that maps among these names. To get the data included in the repository (for GRCh versions 37 and 38), I ran the bash script get-assembly-reports.bash (requires curl) whichh writes files to src/hchacha/data.

Mapping for ensembl names

The mapping to ensEMBL names is not quite as straightforward. It looks like they use the "short" names (like 1, 2, 3, ... X, Y) for the primary chromosomes, then RefSeq accessions for the others, so that is what this script does.

License

MIT license, but I am open to re-licensing this simple to script some other way if you have a good reason.

It is my understandig that data derived from RefSeq/NCBI are in the public domain as the work product of an institution of the governement of the United States of America.

Samtools (included as part of the docker image) is MIT/Expat licensed and Copyright (C) 2008-2023 Genome Research Ltd.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

hchacha-1.0.4.tar.gz (21.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

hchacha-1.0.4-py3-none-any.whl (19.8 kB view details)

Uploaded Python 3

File details

Details for the file hchacha-1.0.4.tar.gz.

File metadata

  • Download URL: hchacha-1.0.4.tar.gz
  • Upload date:
  • Size: 21.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.5.1 CPython/3.11.4 Darwin/22.5.0

File hashes

Hashes for hchacha-1.0.4.tar.gz
Algorithm Hash digest
SHA256 79aaf8e590d287d1cb5a2e218204c0ebd12b3d6fd97d663e4bff0c3fbe336da3
MD5 28c3a5ab66bb65071a733e8c5d4775a6
BLAKE2b-256 6ac04b9af5b5bbafb63fe84c3420519c10d7c04a80efa88966360c1e9dc245d9

See more details on using hashes here.

File details

Details for the file hchacha-1.0.4-py3-none-any.whl.

File metadata

  • Download URL: hchacha-1.0.4-py3-none-any.whl
  • Upload date:
  • Size: 19.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.5.1 CPython/3.11.4 Darwin/22.5.0

File hashes

Hashes for hchacha-1.0.4-py3-none-any.whl
Algorithm Hash digest
SHA256 78bbff444832c841e2a20e34ad1d445d19e7a62835ede96b7a488c6d9e534d7e
MD5 8647f2532b0d88e5f225b4d63013d7ed
BLAKE2b-256 66eeb573ef241bbd65d011c70d0cafea77ef744904514a1a6af19ae56de80c67

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page