
file_chunk_iterators

Python classes to iterate through files in chunks

Installation

pip install file_chunk_iterators

Usage

We provide two iterator classes:

  1. iterate_file_in_chunks: an iterator class to iterate over the chunks of a file. At initialization you specify how many lines (or chunks) to use; the object is then used as an iterator several times, and each time it iterates over the lines of one chunk only.

  2. iterate_file_in_chunks_with_key: iterates through a file in chunks while keeping together groups of lines defined by a key. It operates like iterate_file_in_chunks (see its docstring) but with an additional constraint: each line is characterized by a key, and consecutive lines with the same key are never split across different chunks.

These classes can be used in conjunction with pandas.read_csv to read a pandas dataframe one chunk at a time, which may save memory.

Below are the docstrings of the two classes:

iterate_file_in_chunks(fname, nlines=None, nchunks=None):
    Arguments:
      fname:   file to be opened
      nlines:  max number of lines per chunk
      nchunks: number of chunks (overrides nlines)

    The boolean attribute self.finished tells whether the whole file has been iterated through

    Example #1: specifying number of lines
    ======================================
      iterator=iterate_file_in_chunks(fname, nlines=4)

      # first time: iterates over lines 1-4
      for line in iterator:        print (line)

      # 2nd time: iterates over lines 5-8
      for line in iterator:        print (line)

      # etc

    Example #2: specifying number of lines, iterate whole file
    ==========================================================
      iterator=iterate_file_in_chunks(fname, nlines=4)
      chunkindex=0  #optionally, keeping track of chunkindex
      while not iterator.finished:
        for line in iterator:
          print(line)
        chunkindex+=1


    Example #3: specifying number of chunks
    =======================================
      nchunks=3
      iterator=iterate_file_in_chunks(fname, nchunks=nchunks)
      for chunkindex in range(nchunks):
        for line in iterator:
          print(line)

    When nchunks is given, the number of lines per chunk is estimated from the size of the file.
    Note that lines are never split into fractions; so if you divide a 10-line file into 3 chunks, you get
    chunk1= 4 lines       chunk2= 3 lines        chunk3= 3 lines
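
    For intuition, balanced sizes like these can be derived with simple arithmetic;
    a minimal sketch of that computation (not necessarily the package's exact code):

      nlines_total=10
      nchunks=3
      base, extra = divmod(nlines_total, nchunks)      # base=3, extra=1
      sizes = [base+1]*extra + [base]*(nchunks-extra)
      print(sizes)   # [4, 3, 3] -- matches the 10-line / 3-chunk example above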

    How to use in pandas
    ====================
    This code was written specifically to read tabular files in chunks for use with pandas.
    In principle this is accomplished by pandas.read_csv(..., chunksize=50), but in practice that can crash with certain files.
    So to accomplish this, use:

    import pandas as pd
    from file_chunk_iterators import iterate_file_in_chunks

    nchunks=4
    iterator=iterate_file_in_chunks(fname, nchunks=nchunks)

    # determine column names somehow, e.g. with:
    with open(fname) as fh:
      colnames=fh.readline().strip().split('\t')

    for chunkindex in range(nchunks):
      # read a chunk of lines as dataframe (engine='python' is what allows
      # read_csv to consume the iterator; header=0 drops the header line
      # present in the first chunk)
      chunkdf=pd.read_csv(iterator, engine='python', names=colnames, sep='\t',
                          header=(0 if chunkindex==0 else None) )

      # use chunkdf somehow, to obtain resdf
      resdf=process(chunkdf)

      # append resdf to a file called outfile, one chunk at a time
      resdf.to_csv(outfile, sep='\t', index=False, header=(chunkindex == 0),
                mode=('w' if chunkindex == 0 else 'a') )
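
    Here process and outfile are placeholders; for illustration, a hypothetical
    process function (any chunk-wise transformation works):

      def process(chunkdf):
        # hypothetical example: drop rows with missing values in the first column
        return chunkdf.dropna(subset=[colnames[0]])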

    Warning
    =======
     pandas.read_csv has unexpected behavior when nlines is small, say <3:
     it can ignore some StopIteration signals and concatenate what are supposed to be different chunks.

     pandas version tested: 1.3.3
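
    If you prefer not to rely on the nchunks estimate, the finished attribute can
    drive the pandas loop as well; a sketch, reusing fname, colnames and process
    from above:

      iterator=iterate_file_in_chunks(fname, nlines=1000)   # keep nlines well above the problematic range
      chunkindex=0
      while not iterator.finished:
        chunkdf=pd.read_csv(iterator, engine='python', names=colnames, sep='\t',
                            header=(0 if chunkindex==0 else None))
        process(chunkdf)
        chunkindex+=1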

iterate_file_in_chunks_with_key(fname, nlines, keyfn=lambda x:x[:x.index('\t')]):

    Arguments:
      fname:   file to be opened
      nlines:  max number of lines per chunk
      keyfn:   function to be applied to each line to derive its key  (default: get first tab-separated field)

    Note:
      - a chunk may be larger than nlines if there are more than nlines consecutive lines with the same key (see the sketch below)
      - the same-key condition is tested only for consecutive lines; if global grouping matters, sort the file by key beforehand
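
      For instance, with nlines=2 on a hypothetical tab-separated file sample.tsv
      whose first three lines share the key 'geneA', the first chunk holds all three:

        # sample.tsv (hypothetical contents):
        #   geneA <tab> 1
        #   geneA <tab> 2
        #   geneA <tab> 3
        #   geneB <tab> 4
        iterator=iterate_file_in_chunks_with_key('sample.tsv', nlines=2)
        for line in iterator: print(line)   # prints all three geneA lines: a same-key run is never split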

    Boolean attribute self.finished tells whether the whole file has been iterated through

    Example #1: specifying number of lines
    ======================================
      iterator=iterate_file_in_chunks_with_key(fname, nlines=10, keyfn=lambda x:x.split()[0])

      # first time: iterates over the first chunk (lines 1-10, or more if a same-key run crosses the boundary)
      for line in iterator:        print (line)

      # 2nd time: iterates over the next chunk
      for line in iterator:        print (line)

      # etc

    Example #2: specifying number of lines, iterate whole file
    ==========================================================
      iterator=iterate_file_in_chunks_with_key(fname, nlines=10, keyfn=lambda x:x.split()[0])
      chunkindex=0  #optionally, keeping track of chunkindex
      while not iterator.finished:
        for line in iterator:
          print(line)
        chunkindex+=1

    Warning
    =======
     pandas.read_csv has unexpected behavior when nlines is small, say <3:
     it can ignore some StopIteration signals and concatenate what are supposed to be different chunks.

     pandas version tested: 1.3.3
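
     Note that the default keyfn takes everything up to the first tab and raises
     ValueError on lines containing no tab; for other formats, pass a custom key
     function, e.g. for a hypothetical comma-separated file:

       iterator=iterate_file_in_chunks_with_key('data.csv', nlines=100,
                                                keyfn=lambda line: line.split(',', 1)[0])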
