Package to convert a vcf into a pandas dataframe.

vcf2pandas

vcf2pandas is a python package to convert vcf files to pandas dataframes.

Install

pip install vcf2pandas

Dependencies

  • pandas (2.1.0)
  • pysam (0.22.1)

Usage

Selecting all columns (default behaviour)

from vcf2pandas import vcf2pandas
import pandas

df = vcf2pandas("path_to_vcf.vcf")

Remove all empty columns

Sometimes the VCF header declares INFO or FORMAT fields that none of the variants or samples actually carry. You can choose to remove all of these empty columns from the pandas dataframe.

df = vcf2pandas("path_to_vcf.vcf", remove_empty_columns=True)
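
To illustrate what an "empty" column means here, the sketch below reproduces the effect with plain pandas on a hand-built dataframe. It assumes that missing values appear as `.` (see "INFO or FORMAT fields not present for some variants"). The helper name `drop_all_missing` and the toy column names are hypothetical and not part of vcf2pandas, whose own implementation may differ.

```python
import pandas as pd


def drop_all_missing(df: pd.DataFrame, missing: str = ".") -> pd.DataFrame:
    """Return a copy of df without the columns whose every cell equals `missing`.

    Hypothetical helper for illustration only; not part of vcf2pandas.
    """
    keep = [col for col in df.columns if not (df[col] == missing).all()]
    return df[keep].copy()


# Hand-built stand-in for a converted VCF; the values are made up.
toy = pd.DataFrame(
    {
        "CHROM": ["chr1", "chr1"],
        "INFO:DP": ["35", "."],    # partly filled, so it is kept
        "INFO:GENE": [".", "."],   # empty for every variant, so it is dropped
    }
)

cleaned = drop_all_missing(toy)
print(list(cleaned.columns))
```

A column that is only partly filled, such as `INFO:DP` above, is kept; only columns with no value for any variant are dropped.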

Selecting custom columns and samples

info_fields = ["info_field_1", "info_field_2"]
sample_list = ["sample_name_1", "sample_name_2"]
format_fields = ["format_name_1", "format_name_2"]

df_selected = vcf2pandas(
    "path_to_vcf.vcf",
    info_fields=info_fields,
    sample_list=sample_list,
    format_fields=format_fields,
)

Renaming custom columns and samples

From v0.2.0, renaming columns and samples is supported. Instead of a list, pass a dictionary that maps each original name to its new name. See the example below.

info_fields = {
    "info_field_1": "renamed_info_field_1",
    "info_field_2": "renamed_info_field_2"
}
sample_list = {
    "sample_name_1": "renamed_sample_name_1",
    "sample_name_2": "renamed_sample_name_2"
}
format_fields = {
    "format_name_1": "renamed_format_name_1",
    "format_name_2": "renamed_format_name_2"
}

df_renamed = vcf2pandas(
    "path_to_vcf.vcf",
    info_fields=info_fields,
    sample_list=sample_list,
    format_fields=format_fields,
)

[!NOTE] The three arguments do not all have to be lists or all dictionaries. You can mix and match defaults, lists and dictionaries for info_fields, sample_list and format_fields.
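
One way to think about mixing lists and dictionaries is that a list behaves like a dictionary that maps every name to itself. The sketch below expresses that mental model in plain Python. The helper `as_mapping` is hypothetical and is not claimed to be how vcf2pandas handles its arguments; `None` stands in for "argument left at its default" as an assumption.

```python
from typing import Dict, List, Optional, Union

Selection = Union[List[str], Dict[str, str], None]


def as_mapping(selection: Selection) -> Optional[Dict[str, str]]:
    """Normalise a list or dict selection to an {original: output_name} dict.

    Hypothetical helper for illustration only; not part of vcf2pandas.
    """
    if selection is None:
        return None  # assumed to mean "no selection given, keep everything"
    if isinstance(selection, dict):
        return dict(selection)  # a dict already carries the renaming
    return {name: name for name in selection}  # a list keeps names unchanged


info_fields = ["DP", "MQM"]            # list: select, keep names
sample_list = {"HG002": "proband"}     # dict: select and rename
format_fields = None                   # default: nothing selected explicitly

print(as_mapping(info_fields))
print(as_mapping(sample_list))
print(as_mapping(format_fields))
```

Dictionaries preserve insertion order in Python 3.7 and later, so the order of the original list or dictionary is retained in the normalised mapping.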

Custom column ordering

vcf2pandas can select custom/specific:

  • INFO fields
  • samples
  • FORMAT fields

The selected columns are ordered as they appear in the input list.

For example, the following list:

info_fields = ["DP", "MQM", "QA"]

produces these columns, in that order:

INFO:DP    INFO:MQM    INFO:QA

Output

INFO and FORMAT headings

INFO:INFO_FIELD                     e.g. INFO:DP
FORMAT:SAMPLE_NAME:FORMAT_FIELD     e.g. FORMAT:HG002:GT

The info field, format field and sample names can also be mapped to custom values by using a dictionary. See Renaming custom columns and samples.
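
Because the sample name is embedded in each FORMAT heading, the columns for one sample can be picked out by prefix. The following sketch does this with plain pandas on a hand-built dataframe. It assumes the default (not renamed) headings shown above and sample names that contain no `:`. The helper `format_columns_for` and the toy values are hypothetical.

```python
import pandas as pd

# Hand-built stand-in that follows the heading pattern described above.
toy = pd.DataFrame(
    {
        "INFO:DP": ["35"],
        "FORMAT:HG002:GT": ["0/1"],
        "FORMAT:HG002:AO": ["12"],
        "FORMAT:HG003:GT": ["0/0"],
    }
)


def format_columns_for(df: pd.DataFrame, sample: str) -> list:
    """List the FORMAT columns that belong to `sample` (hypothetical helper)."""
    prefix = f"FORMAT:{sample}:"
    return [col for col in df.columns if col.startswith(prefix)]


# Keep only HG002's FORMAT columns, then strip the "FORMAT:HG002:" prefix.
hg002 = toy[format_columns_for(toy, "HG002")]
hg002 = hg002.rename(columns=lambda col: col.split(":", 2)[2])
print(list(hg002.columns))
```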

INFO or FORMAT fields not present for some variants

When an INFO or FORMAT field is not present for a variant, vcf2pandas inserts a . in that cell. For example, in vcf3_all.txt the INFO:GENE column contains . for the first 7 variants.
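
For downstream analysis it is often convenient to turn the `.` placeholder into a proper missing value. The sketch below shows one way to do that with plain pandas on a hand-built dataframe. It assumes the cells hold strings; the column values are made up.

```python
import pandas as pd

# Hand-built stand-in; "." marks a field that was absent for that variant.
toy = pd.DataFrame(
    {
        "INFO:DP": ["35", ".", "12"],
        "INFO:GENE": [".", "BRCA1", "."],
    }
)

# Numeric field: anything that cannot be parsed, including ".", becomes NaN.
depth = pd.to_numeric(toy["INFO:DP"], errors="coerce")

# Text field: blank out the "." cells so that isna()/notna() work as expected.
genes = toy["INFO:GENE"].mask(toy["INFO:GENE"] == ".")

print(depth.sum())          # NaN values are skipped by default
print(genes.notna().sum())  # number of variants with a gene annotation
```

Note that `errors="coerce"` also hides genuinely malformed values, so it is worth checking a column's contents before converting it.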

Examples

Example VCF files and their outputs (dataframes saved as .txt files) are available in examples/.

Example Usage

df1_all = vcf2pandas("examples/vcf1.vcf")
df2_all = vcf2pandas("examples/vcf2.vcf")

df3_all = vcf2pandas("examples/vcf3.vcf")

info_fields = ["DP"]
sample_list = ["HG002"]
format_fields = ["GT", "AO"]

df3_selected = vcf2pandas(
    "examples/vcf3.vcf",
    info_fields=info_fields,
    sample_list=sample_list,
    format_fields=format_fields
)

To print to a text file:

with open("path_to_txt_file.txt", "w", encoding='utf-8') as f:
    f.write(df.to_string())
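
`to_string()` produces a human-readable, space-aligned table. If the file is meant to be read back by a program, a tab-separated file may be easier to work with. The sketch below uses the standard pandas `to_csv` and `read_csv` functions on a hand-built dataframe; the file name and values are made up.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hand-built stand-in for a dataframe returned by vcf2pandas.
toy = pd.DataFrame({"CHROM": ["chr1", "chr2"], "INFO:DP": ["35", "."]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "variants.tsv"

    # Write one tab-separated row per variant, without the pandas index.
    toy.to_csv(out_path, sep="\t", index=False)

    # Read everything back as strings so that "." is preserved as-is.
    round_trip = pd.read_csv(out_path, sep="\t", dtype=str)

print(round_trip)
```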

For more examples, see tests/run_examples.py.

To recreate the examples in the examples/ folder, run:

cd vcf2pandas
poetry run python tests/run_examples.py

Changelog

v0.1.0

  • Initial project.

v0.1.1

  • Fixed converting variant filter into string properly.

v0.1.2

  • Updated pysam version to 0.22.1.

v0.2.0

  • Fixed bug where some info/format fields would be overwritten with . if not all samples/variants had all the info/format values.
  • Changed how info/format fields are discovered: they are now taken from the vcf headers.
  • Added functionality to rename columns using dictionaries. This is a non-breaking change; all existing uses of this package will still work.
  • Added functionality to remove columns that are completely empty. This is also a non-breaking change.
  • Updated README with more examples.
  • Added more tests for renaming columns.
  • Added unit testing with pytest.

Issues

Please open an issue if you encounter any problems! Thanks!
