
pythonic access to fasta sequence files


Author:

Brent Pedersen (brentp)

License:

MIT

Implementation

Requires Python >= 2.5. Stores a flattened version of the fasta file, without spaces or headers, together with a pickle of the start and stop locations (for fseek) of each header in the fasta file. Both files are for internal use. Now supports the numpy array interface.
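The flatten-and-index idea can be sketched roughly as follows. This is an illustration only, not pyfasta's actual code: the function names `flatten_fasta` and `fetch` are hypothetical, and using the full header line as the key and a plain pickled dict for the index are assumptions.

```python
import os
import pickle
import tempfile


def flatten_fasta(fasta_path, flat_path, index_path):
    """Write only the sequence bytes to flat_path; pickle {header: (start, stop)}."""
    index = {}
    name, start, pos = None, 0, 0
    with open(fasta_path) as src, open(flat_path, "wb") as flat:
        for line in src:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Close the previous record before starting a new one.
                if name is not None:
                    index[name] = (start, pos)
                name, start = line[1:].strip(), pos
            elif name is not None:
                flat.write(line.encode("ascii"))
                pos += len(line)
        if name is not None:
            index[name] = (start, pos)
    with open(index_path, "wb") as fh:
        pickle.dump(index, fh)
    return index


def fetch(flat_path, index, name, start=0, stop=None):
    """Read a 0-based, half-open slice of one record by seeking into the flat file."""
    rec_start, rec_stop = index[name]
    length = rec_stop - rec_start
    stop = length if stop is None else min(stop, length)
    with open(flat_path, "rb") as flat:
        flat.seek(rec_start + start)
        return flat.read(max(stop - start, 0)).decode("ascii")


# Tiny demonstration on a throwaway file (made-up data).
workdir = tempfile.mkdtemp()
fasta_path = os.path.join(workdir, "demo.fasta")
with open(fasta_path, "w") as fh:
    fh.write(">chr1\nACTGACTG\nACTG\n>chr2\nAAAA\n")
flat_path = fasta_path + ".flat"
index = flatten_fasta(fasta_path, flat_path, fasta_path + ".gdx")
first_ten = fetch(flat_path, index, "chr1", 0, 10)
```

Because the flat file has no newlines or headers, a slice of a record maps directly to a byte range, so only the requested bases are read from disk.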

Usage

>>> from pyfasta import Fasta

>>> f = Fasta('tests/data/three_chrs.fasta')
>>> sorted(f.keys())
['chr1', 'chr2', 'chr3']

>>> f['chr1']
FastaRecord('tests/data/three_chrs.fasta.flat', 0..80)

Slicing

>>> f['chr1'][:10]
'ACTGACTGAC'

# get the 1st basepair in every codon (it's python yo)
>>> f['chr1'][::3]
'AGTCAGTCAGTCAGTCAGTCAGTCAGT'


# the index stores the start and stop offsets of each record in the
# flattened file, keyed by its fasta header.
# (you should never need this)
>>> f.index
{'chr3': (160, 3760), 'chr2': (80, 160), 'chr1': (0, 80)}


# can query by a 'feature' dictionary
>>> f.sequence({'chr': 'chr1', 'start': 2, 'stop': 9})
'CTGACTGA'

# with reverse complement for - strand
>>> f.sequence({'chr': 'chr1', 'start': 2, 'stop': 9, 'strand': '-'})
'TCAGTCAG'
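Judging from the output above, 'start' and 'stop' appear to be 1-based and inclusive, unlike Python slices. A minimal sketch of that conversion, assuming plain strings as sequences; `feature_sequence` and `COMPLEMENT` are hypothetical names, not part of pyfasta:

```python
# Translation table for complementing DNA bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def feature_sequence(sequences, feature):
    """Return the bases for a feature with 1-based, inclusive start/stop."""
    seq = sequences[feature["chr"]]
    # 1-based inclusive [start, stop] becomes the 0-based slice [start - 1, stop).
    sub = seq[feature["start"] - 1:feature["stop"]]
    if feature.get("strand") == "-":
        # Reverse complement for the minus strand.
        sub = sub.translate(COMPLEMENT)[::-1]
    return sub


sequences = {"chr1": "ACTG" * 20}
plus = feature_sequence(sequences, {"chr": "chr1", "start": 2, "stop": 9})
minus = feature_sequence(
    sequences, {"chr": "chr1", "start": 2, "stop": 9, "strand": "-"}
)
```

With the 80-base 'ACTG' repeat used here, `plus` and `minus` match the two results shown in the doctest.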

Numpy Array Interface

# FastaRecords support the numpy array interface.
>>> import numpy as np
>>> a = np.array(f['chr2'])
>>> a.shape[0] == len(f['chr2'])
True

>>> a[10:14]
array(['A', 'A', 'A', 'A'],
      dtype='|S1')
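For readers unfamiliar with how an object becomes convertible by np.array, here is one possible mechanism, sketched with the `__array__` hook. The class `FlatRecord` is a hypothetical stand-in backed by an in-memory bytes object; whether pyfasta uses this hook or the `__array_interface__` attribute is not stated here.

```python
import numpy as np


class FlatRecord(object):
    """Hypothetical record holding one byte per base."""

    def __init__(self, data):
        self._data = data  # bytes object, e.g. b"ACTG..."

    def __len__(self):
        return len(self._data)

    def __array__(self, dtype=None, copy=None):
        # View the bytes as an array of single-character byte strings.
        arr = np.frombuffer(self._data, dtype="S1")
        return arr if dtype is None else arr.astype(dtype)


record = FlatRecord(b"ACTG" * 5)
bases = np.array(record)  # numpy calls record.__array__() here
```

The resulting array has one element per base, so its first dimension equals `len(record)`, mirroring the check in the doctest above.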


# cleanup (though for real use these will remain for faster access)
>>> import os
>>> os.unlink('tests/data/three_chrs.fasta.gdx')
>>> os.unlink('tests/data/three_chrs.fasta.flat')


Source Distribution

pyfasta-0.2.5.tar.gz (6.0 kB)
