Convert GFF3-formatted data to BED format

gff2bed

Overview

GFF3 and BED are common formats for storing the coordinates of genomic features such as genes. GFF3 format is more versatile, but BED format is simpler and enjoys a rich ecosystem of utilities such as bedtools. For this reason, it is often convenient to store genomic features in GFF3 format and convert them to BED format for genome arithmetic.

This module provides two convenience functions to streamline converting data from GFF3 to BED format for bioinformatics analysis: parse(), which reads data from a GFF3 file, and convert(), which converts GFF3-formatted data to BED-formatted data that can be passed on to other tools such as pybedtools.
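The central step in such a conversion is a coordinate shift: GFF3 coordinates are 1-based and inclusive, while BED coordinates are 0-based and half-open, so the start decreases by one and the end is unchanged. The sketch below illustrates the idea on a single GFF3 line. Note that `gff3_line_to_bed` is a hypothetical helper written for this illustration and is not part of gff2bed; the example line is hand-made, and using the `ID` attribute as the BED name with a score of 0 is an assumption modelled on the tutorial output further down.

```python
def gff3_line_to_bed(line):
    """Convert one GFF3 feature line to a 6-field BED tuple (illustration only)."""
    seqid, _source, _type, start, end, _score, strand, _phase, attributes = (
        line.rstrip("\n").split("\t")
    )
    # Parse "key=value;key=value" attribute pairs into a dict.
    attrs = dict(pair.split("=", 1) for pair in attributes.split(";") if pair)
    # GFF3 start is 1-based; BED start is 0-based, so subtract one.
    # The end coordinate is numerically the same in both conventions.
    return (seqid, int(start) - 1, int(end), attrs.get("ID", "."), 0, strand)


# Hand-made example line; the source and type columns are assumptions.
example = "Chr1\t.\tgene\t7489\t9757\t.\t+\t.\tID=AT1G01010"
bed_row = gff3_line_to_bed(example)
print(bed_row)
```

In practice gff2bed.parse() and gff2bed.convert() handle these details for you; the sketch is only meant to show why the start coordinates differ between the two formats.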

Documentation

See full online documentation at http://salk-tm.gitlab.io/gff2bed

Installation

With conda

gff2bed is available from bioconda and can be installed with conda:

conda install -c bioconda gff2bed

With pip

gff2bed is available from PyPI and can be installed with pip:

pip install gff2bed

Tutorial

To follow this tutorial, first ensure you have the modules it imports installed in addition to gff2bed: urllib3, pandas, and pybedtools.

This tutorial will involve working with some files on disk, so we'll make a temporary directory for easy cleanup later.

from tempfile import TemporaryDirectory
temp_dir = TemporaryDirectory()

Next, download an example GFF3 file:

import urllib3
import shutil
import os.path
GFF3_URL = 'https://gitlab.com/salk-tm/gff2bed/-/raw/main/test/data/ColCEN_AT1G01010-20_TAIR10.gff3.gz'
GFF3_FILE = os.path.join(temp_dir.name, 'ColCEN_AT1G01010-20_TAIR10.gff3.gz')
http = urllib3.PoolManager()
with http.request('GET', GFF3_URL, preload_content=False) as r, open(GFF3_FILE, 'wb') as dest_file:
    shutil.copyfileobj(r, dest_file)

To read the GFF3 file into a pandas data frame without converting to BED, use gff2bed.parse():

import pandas as pd
import gff2bed
gff_data = pd.DataFrame(gff2bed.parse(GFF3_FILE))
gff_data.head()
      0     1      2  3                                                  4
0  Chr1  7489   9757  +  {'ID': 'AT1G01010', 'Note': 'protein_coding_ge...
1  Chr1  9786  12596  -  {'ID': 'AT1G01020', 'Note': 'protein_coding_ge...

Note: The implementation of gff2bed follows a philosophy of simplicity. It depends on nothing but the Python standard library, and it includes nothing but the parse() and convert() functions. In practice, you will typically use gff2bed in conjunction with other modules such as pandas or pybedtools.

To create a data frame of BED-formatted data, pass the output of gff2bed.parse() through gff2bed.convert() before passing it to pd.DataFrame():

bed_data = pd.DataFrame(gff2bed.convert(gff2bed.parse(GFF3_FILE)))
bed_data.head()
      0     1      2          3  4  5
0  Chr1  7488   9757  AT1G01010  0  +
1  Chr1  9785  12596  AT1G01020  0  -
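Compared with the parse() output earlier, the only coordinate change is the start, which drops by one (7489 becomes 7488). This reflects the move from 1-based, inclusive GFF3 coordinates to 0-based, half-open BED coordinates. As a quick sanity check, the sketch below uses the rows shown above as hard-coded tuples (copied by hand, not produced by gff2bed) to confirm that feature lengths agree under both conventions.

```python
# Rows copied by hand from the tutorial output: (seqid, start, end).
gff_rows = [("Chr1", 7489, 9757), ("Chr1", 9786, 12596)]  # 1-based, inclusive
bed_rows = [("Chr1", 7488, 9757), ("Chr1", 9785, 12596)]  # 0-based, half-open

# Inclusive intervals need "+ 1"; half-open intervals do not.
gff_lengths = [end - start + 1 for _, start, end in gff_rows]
bed_lengths = [end - start for _, start, end in bed_rows]

print(gff_lengths)
print(bed_lengths)
```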

You can similarly create a BedTool with pybedtools:

from pybedtools import BedTool
bed_data = BedTool(gff2bed.convert(gff2bed.parse(GFF3_FILE))).saveas()
bed_data.head()
Chr1    7488    9757    AT1G01010       0       +
Chr1    9785    12596   AT1G01020       0       -
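If pybedtools is not available, BED rows can also be written as plain tab-separated text using only the standard library. The following is a minimal sketch under that assumption: the rows are copied by hand from the output above rather than generated by gff2bed, and an in-memory buffer stands in for a real file.

```python
import csv
import io

# Rows copied by hand from the tutorial output above.
bed_rows = [
    ("Chr1", 7488, 9757, "AT1G01010", 0, "+"),
    ("Chr1", 9785, 12596, "AT1G01020", 0, "-"),
]

# BED is plain tab-separated text with no header row.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerows(bed_rows)

bed_text = buffer.getvalue()
print(bed_text, end="")
```

To write to disk instead, the same writer can be given a file opened with open(path, 'w', newline='').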

To complete the tutorial, clean up the temporary directory:

temp_dir.cleanup()
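In a script, as opposed to an interactive session, TemporaryDirectory can also be used as a context manager so that cleanup happens automatically when the block exits. The sketch below shows this pattern; the file name `scratch.txt` is a placeholder chosen for illustration.

```python
import os.path
from tempfile import TemporaryDirectory

with TemporaryDirectory() as temp_path:
    # temp_path is the directory path as a string.
    scratch_file = os.path.join(temp_path, "scratch.txt")  # placeholder name
    with open(scratch_file, "w") as handle:
        handle.write("temporary data\n")
    existed_inside = os.path.exists(scratch_file)

# After the block exits, the directory and its contents have been removed.
exists_after = os.path.exists(temp_path)
print(existed_inside, exists_after)
```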

Download files

Source Distribution

gff2bed-1.0.3.tar.gz (5.0 kB)

Uploaded Source

Built Distribution

gff2bed-1.0.3-py3-none-any.whl (5.5 kB)

Uploaded Python 3
