fast, memory-efficient, pythonic access to fasta sequence files
- Email:
- License: MIT
Implementation
Requires Python >= 2.5. pyfasta stores a flattened copy of the fasta file, with headers and whitespace removed, and reads it through either a numpy memory map (numpy binary format) or fseek/fread, so the full sequence is never loaded into memory. It also saves a pickle (.gdx) that maps each header to the start and stop offsets of its sequence in the flattened file; the fseek/mmap readers use this index internally.
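The flatten-and-index idea described above can be sketched in a few lines. This is not pyfasta's own code: the function names `flatten_fasta` and `fetch`, the choice of the first header word as the key, and the Python 3 syntax are all assumptions made for illustration.

```python
import os
import pickle
import tempfile


def flatten_fasta(fasta_path, flat_path, index_path):
    """Write only the sequence bytes to flat_path and pickle {name: (start, stop)}.

    Hypothetical helper, not part of pyfasta.
    """
    index = {}
    name, start, pos = None, 0, 0
    with open(fasta_path, "rb") as src, open(flat_path, "wb") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue
            if line.startswith(b">"):
                if name is not None:
                    index[name] = (start, pos)
                # Assumption: key each record by the first word of its header.
                name = line[1:].split()[0].decode("ascii")
                start = pos
            elif name is not None:
                seq = b"".join(line.split())  # drop any internal whitespace
                dst.write(seq)
                pos += len(seq)
        if name is not None:
            index[name] = (start, pos)
    with open(index_path, "wb") as fh:
        pickle.dump(index, fh)
    return index


def fetch(flat_path, index, name, lo, hi):
    """Return bases [lo:hi) of one record using seek/read only."""
    start, stop = index[name]
    lo = min(start + lo, stop)
    hi = min(start + hi, stop)
    with open(flat_path, "rb") as fh:
        fh.seek(lo)
        return fh.read(hi - lo).decode("ascii")


# Tiny demo file written to a temporary directory.
workdir = tempfile.mkdtemp()
fasta = os.path.join(workdir, "demo.fasta")
with open(fasta, "w") as fh:
    fh.write(">chr1 first\nACTG\nACTG\n>chr2\nAAAA\nTT\n")

demo_index = flatten_fasta(fasta, fasta + ".flat", fasta + ".gdx")
print(demo_index)                                        # {'chr1': (0, 8), 'chr2': (8, 14)}
print(fetch(fasta + ".flat", demo_index, "chr2", 2, 6))  # AATT
```

Because every record is a contiguous byte range in the flat file, a slice request only has to seek to `start + lo` and read `hi - lo` bytes, whatever the size of the original fasta file.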
Usage
>>> from pyfasta import Fasta
>>> f = Fasta('tests/data/three_chrs.fasta')
>>> sorted(f.keys())
['chr1', 'chr2', 'chr3']
>>> f['chr1']
NpyFastaRecord(0..80)
Slicing
>>> f['chr1'][:10]
'ACTGACTGAC'

# get the 1st basepair in every codon (it's python yo)
>>> f['chr1'][::3]
'AGTCAGTCAGTCAGTCAGTCAGTCAGT'

# the index stores the start and stop of each header from the flattened
# fasta file. (you should never need this)
>>> f.index
{'chr3': (160, 3760), 'chr2': (80, 160), 'chr1': (0, 80)}

# can query by a 'feature' dictionary
>>> f.sequence({'chr': 'chr1', 'start': 2, 'stop': 9})
'CTGACTGA'

# same as:
>>> f['chr1'][1:9]
'CTGACTGA'

# with reverse complement for - strand
>>> f.sequence({'chr': 'chr1', 'start': 2, 'stop': 9, 'strand': '-'})
'TCAGTCAG'
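The feature-dictionary query above treats `start` and `stop` as 1-based, inclusive coordinates, which is why `start=2, stop=9` matches the slice `[1:9]`. A minimal sketch of that conversion, plus the reverse complement for the `-` strand, follows. The helpers `reverse_complement` and `feature_sequence` are hypothetical (not pyfasta functions), the code assumes Python 3, and `chr1` is a stand-in string built to match the outputs shown above.

```python
# Complement table for upper- and lower-case bases; other characters pass through.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def feature_sequence(seq, feature):
    """Slice seq using 1-based, inclusive 'start'/'stop' keys (hypothetical helper)."""
    start = feature["start"] - 1  # 1-based inclusive -> 0-based
    stop = feature["stop"]        # inclusive end == exclusive Python end
    sub = seq[start:stop]
    if feature.get("strand") == "-":
        sub = reverse_complement(sub)
    return sub


# Stand-in for f['chr1']: 80 bases of repeated ACTG (an assumption).
chr1 = "ACTG" * 20

print(feature_sequence(chr1, {"start": 2, "stop": 9}))                 # CTGACTGA
print(feature_sequence(chr1, {"start": 2, "stop": 9, "strand": "-"}))  # TCAGTCAG
```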
Numpy
The default backend is a memmapped numpy array, in which case it’s possible to get back an array directly…
>>> f['chr1'].tostring = False
>>> f['chr1'][:10] # doctest: +NORMALIZE_WHITESPACE
memmap(['A', 'C', 'T', 'G', 'A', 'C', 'T', 'G', 'A', 'C'], dtype='|S1')

>>> import numpy as np
>>> a = np.array(f['chr2'])
>>> a.shape[0] == len(f['chr2'])
True

>>> a[10:14]
array(['A', 'A', 'A', 'A'], dtype='|S1')
- mask a sub-sequence:
>>> a[11:13] = np.array('N', dtype='c')
>>> a[10:14].tostring()
'ANNA'
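The same masking pattern can be tried without pyfasta on a plain flat file. The sketch below assumes Python 3 and a current NumPy, where single-byte elements come back as `bytes` and `tobytes()` takes the place of `tostring()`; the file name and contents are made up for the demo.

```python
import os
import tempfile

import numpy as np

# Demo flat file: 20 bytes of repeated ACTG followed by 20 A's (made-up data).
path = os.path.join(tempfile.mkdtemp(), "demo.flat")
with open(path, "wb") as fh:
    fh.write(b"ACTG" * 5 + b"A" * 20)

# Map the file read-only as an array of single bytes; nothing is loaded up front.
mm = np.memmap(path, dtype="S1", mode="r")

# np.array() copies the selected region into memory, so edits never touch the file.
a = np.array(mm[20:40])
a[1:3] = b"N"  # mask two bases
masked = a[:4].tobytes()

print(masked)              # b'ANNA'
print(mm[20:24].tobytes()) # b'AAAA' (the file itself is unchanged)
```

Copying before masking keeps the on-disk sequence intact; opening the memmap with a writable mode instead would write the mask back to the file.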
Backends (Record class)
It’s also possible to specify another record class as the underlying work-horse for slicing and reading. Currently there are two: NpyFastaRecord (the default), which uses a numpy memmap, and FastaRecord, which uses fseek/fread. It’s possible to create your own by sub-classing FastaRecord; see the source for details. The next addition will be a pytables/hdf5 backend.
>>> from pyfasta import FastaRecord # default is NpyFastaRecord
>>> f = Fasta('tests/data/three_chrs.fasta', record_class=FastaRecord)
>>> f['chr1']
FastaRecord('tests/data/three_chrs.fasta.flat', 0..80)
Other than the repr, it should behave exactly like the Npy record class backend.
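To illustrate how two backends can share slicing behaviour while differing only in how bytes are read, here is a standalone sketch. The classes `BaseRecord`, `SeekRecord` and `MmapRecord` are hypothetical and do not reproduce pyfasta's FastaRecord interface; the code assumes Python 3 and supports slice keys only.

```python
import mmap
import os
import tempfile


class BaseRecord:
    """Hypothetical record covering bytes [start, stop) of a flat file."""

    def __init__(self, path, start, stop):
        self.path, self.start, self.stop = path, start, stop

    def __len__(self):
        return self.stop - self.start

    def _read(self, lo, hi):
        raise NotImplementedError  # each backend supplies its own reader

    def __getitem__(self, key):
        if not isinstance(key, slice):
            raise TypeError("this sketch only supports slices")
        r = range(*key.indices(len(self)))
        if len(r) == 0:
            return ""
        # Read the smallest contiguous span covering the slice, then apply the step.
        lo, hi = min(r[0], r[-1]), max(r[0], r[-1]) + 1
        chunk = self._read(self.start + lo, self.start + hi)
        return chunk[::r.step].decode("ascii")


class SeekRecord(BaseRecord):
    """Backend that uses seek/read."""

    def _read(self, lo, hi):
        with open(self.path, "rb") as fh:
            fh.seek(lo)
            return fh.read(hi - lo)


class MmapRecord(BaseRecord):
    """Backend that uses a read-only memory map."""

    def _read(self, lo, hi):
        with open(self.path, "rb") as fh:
            with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
                return mm[lo:hi]


# Demo flat file: 80 bases of ACTG repeats, then 80 A's (made-up data).
path = os.path.join(tempfile.mkdtemp(), "demo.flat")
with open(path, "wb") as fh:
    fh.write(b"ACTG" * 20 + b"A" * 80)

seek_rec = SeekRecord(path, 0, 80)
mmap_rec = MmapRecord(path, 0, 80)
print(seek_rec[:10])                    # ACTGACTGAC
print(seek_rec[::3] == mmap_rec[::3])   # True
```

Keeping the slice arithmetic in the base class means a new backend only has to implement `_read`.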
Clean up the generated files (in real use, keep them for faster access):
>>> import os
>>> os.unlink('tests/data/three_chrs.fasta.gdx')
>>> os.unlink('tests/data/three_chrs.fasta.npy')
>>> os.unlink('tests/data/three_chrs.fasta.flat')