pythonic access to fasta sequence files

Project description

Author: Brent Pedersen (brentp)

License: MIT

Implementation

Requires Python >= 2.5. Stores a flattened version of the fasta file, without spaces or headers, and reads it through either an mmap of the numpy binary format or fseek/fread, so the sequence data is never loaded into memory as a whole. Saves a pickle (.gdx) of the start and stop locations (for fseek/mmap) of each header in the fasta file for internal use. Supports the numpy array interface. When the underlying sequence file contains fewer than 150 headers (e.g. fewer than 150 chromosomes), the numpy binary format is used and access is significantly faster. For more than 150 sequences, fseek/fread is used.
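
The flatten-and-index idea described above can be sketched with the standard library alone. This is a minimal illustration written for Python 3, not pyfasta's actual code: the function names (flatten_fasta, read_with_seek, read_with_mmap), the ".flat" suffix, and the choice of the full header line as the key are all assumptions made for the example. It uses a plain mmap rather than the numpy binary format.

```python
import mmap
import os
import pickle
import tempfile


def flatten_fasta(fasta_path, flat_path):
    """Write the sequence data without headers or newlines.

    Returns {header: (start, stop)} offsets into the flat file.
    Assumption: the key is the full header text after '>'.
    """
    index = {}
    name, start, pos = None, 0, 0
    with open(fasta_path, "rb") as src, open(flat_path, "wb") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue
            if line.startswith(b">"):
                if name is not None:
                    index[name] = (start, pos)  # close the previous record
                name = line[1:].decode("ascii")
                start = pos
            else:
                dst.write(line)
                pos += len(line)
        if name is not None:
            index[name] = (start, pos)  # close the last record
    return index


def read_with_seek(flat_path, index, name, lo, hi):
    """Fetch bases [lo, hi) of one record using seek/read only."""
    start, stop = index[name]
    hi = min(hi, stop - start)  # never read past the end of the record
    with open(flat_path, "rb") as fh:
        fh.seek(start + lo)
        return fh.read(max(hi - lo, 0)).decode("ascii")


def read_with_mmap(flat_path, index, name, lo, hi):
    """Fetch the same range through a read-only memory map."""
    start, stop = index[name]
    hi = min(hi, stop - start)
    with open(flat_path, "rb") as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[start + lo:start + max(hi, lo)].decode("ascii")


# Build a tiny two-record fasta file in a temporary directory.
workdir = tempfile.mkdtemp()
fasta_path = os.path.join(workdir, "demo.fasta")
flat_path = fasta_path + ".flat"
with open(fasta_path, "w") as fh:
    fh.write(">chr1\nACTGACTG\nACTG\n>chr2\nAAAACCCC\n")

index = flatten_fasta(fasta_path, flat_path)

# Persist the offsets next to the fasta file, in the spirit of the .gdx pickle.
with open(fasta_path + ".gdx", "wb") as fh:
    pickle.dump(index, fh)

print(index)
print(read_with_seek(flat_path, index, "chr1", 0, 10))
print(read_with_mmap(flat_path, index, "chr2", 2, 6))
```

Both readers touch only the requested byte range, which is why the size of the fasta file has little effect on the cost of a single slice.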

Usage

>>> from pyfasta import Fasta

>>> f = Fasta('tests/data/three_chrs.fasta')
>>> sorted(f.keys())
['chr1', 'chr2', 'chr3']

>>> f['chr1']
NpyFastaRecord('tests/data/three_chrs.fasta.flat.npy', 0..80)

Slicing

>>> f['chr1'][:10]
'ACTGACTGAC'

# get the 1st basepair in every codon (it's python yo)
>>> f['chr1'][::3]
'AGTCAGTCAGTCAGTCAGTCAGTCAGT'


# the index stores the start and stop of each header from the flattened
# fasta file. (you should never need this)
>>> f.index
{'chr3': (160, 3760), 'chr2': (80, 160), 'chr1': (0, 80)}


# can query by a 'feature' dictionary
>>> f.sequence({'chr': 'chr1', 'start': 2, 'stop': 9})
'CTGACTGA'

# same as:
>>> f['chr1'][1:9]
'CTGACTGA'

# with reverse complement for - strand
>>> f.sequence({'chr': 'chr1', 'start': 2, 'stop': 9, 'strand': '-'})
'TCAGTCAG'

# for files with < 150 sequences, it's possible to get back a numpy array directly
>>> f['chr1'].tostring = False
>>> f['chr1'][:10] # doctest: +NORMALIZE_WHITESPACE
memmap(['A', 'C', 'T', 'G', 'A', 'C', 'T', 'G', 'A', 'C'], dtype='|S1')
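
The feature-dictionary queries above treat 'start' and 'stop' as 1-based and inclusive, so start 2 and stop 9 correspond to the Python slice [1:9]. The sketch below illustrates that mapping and the reverse complement on a plain string. The helper names (reverse_complement, feature_sequence) are hypothetical and this is not pyfasta's implementation. The stand-in sequence assumes chr1 is the pattern ACTG repeated over 80 bases, which is consistent with the slices shown above.

```python
# Hypothetical helpers for illustration; written for Python 3.
_COMPLEMENT = str.maketrans("ACGTacgtNn", "TGCAtgcaNn")


def reverse_complement(seq):
    """Complement each base, then reverse the result."""
    return seq.translate(_COMPLEMENT)[::-1]


def feature_sequence(seq, feature):
    """Extract a feature from seq.

    Assumption: 'start' and 'stop' are 1-based and inclusive, and a
    'strand' of '-' requests the reverse complement.
    """
    sub = seq[feature["start"] - 1:feature["stop"]]
    if feature.get("strand") == "-":
        sub = reverse_complement(sub)
    return sub


# Stand-in for chr1: 80 bases of the repeating pattern ACTG (an assumption).
chr1 = "ACTG" * 20

plus = feature_sequence(chr1, {"chr": "chr1", "start": 2, "stop": 9})
minus = feature_sequence(
    chr1, {"chr": "chr1", "start": 2, "stop": 9, "strand": "-"}
)
print(plus)   # CTGACTGA
print(minus)  # TCAGTCAG
```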

Numpy Array Interface

# FastaRecords support the numpy array interface.
>>> import numpy as np
>>> a = np.array(f['chr2'])
>>> a.shape[0] == len(f['chr2'])
True

>>> a[10:14]
array(['A', 'A', 'A', 'A'],
      dtype='|S1')

# mask a sub-sequence:
>>> a[11:13] = np.array('N', dtype='c')
>>> a[10:14].tostring()
'ANNA'



# cleanup (though for real use these will remain for faster access)
>>> import os
>>> os.unlink('tests/data/three_chrs.fasta.gdx')
>>> os.unlink('tests/data/three_chrs.fasta.flat.npy')

Source Distribution

pyfasta-0.2.8.tar.gz (6.1 kB)

File hashes

Hashes for pyfasta-0.2.8.tar.gz
Algorithm    Hash digest
SHA256       f11e5a65402a814fea8017de07d676fb4f59c0c0099aab99a919994aefd46598
MD5          a022dcea3a66bd0a9d3b6a30eed4498b
BLAKE2b-256  a3e98c7de2a15185350fd8f02141b29ee2f9d472dd833e4fa30bb2f746287dd9
