
file_chunk_iterators

Python classes to iterate through files in chunks

This package provides two iterator classes:

  1. iterate_file_in_chunks: an iterator over the chunks of a file. At initialization you specify how many lines per chunk (or how many chunks) to make; the object is then used as an iterator several times, and each pass iterates over the lines of one chunk only.

  2. iterate_file_in_chunks_with_key: iterates through a file in chunks while keeping together groups of lines defined by a key. It operates like iterate_file_in_chunks (see its docstring) with one additional constraint: each line is characterized by a key, and consecutive lines with the same key are never split across different chunks.

These iterators can be used in conjunction with pandas.read_csv to read a pandas dataframe one chunk at a time, which may save memory.

The docstrings of the two classes follow:

iterate_file_in_chunks(fname, nlines=None, nchunks=None):
    Arguments:
      fname:   file to be opened
      nlines:  max number of lines per chunk
      nchunks: number of chunks (overrides nlines)

    Boolean attribute self.finished tells whether the whole file has been iterated through

    Example #1: specifying number of lines
    ======================================
      iterator=iterate_file_in_chunks(fname, nlines=4)

      # first time: iterates over lines 1-4
      for line in iterator:        print (line)

      # 2nd time: iterates over lines 5-8
      for line in iterator:        print (line)

      # etc

    Example #2: specifying number of lines, iterate whole file
    ==========================================================
      iterator=iterate_file_in_chunks(fname, nlines=4)
      chunkindex=0  #optionally, keeping track of chunkindex
      while not iterator.finished:
        for line in iterator:
          print(line)
        chunkindex+=1


    Example #3: specifying number of chunks
    =======================================
      nchunks=3
      iterator=iterate_file_in_chunks(fname, nchunks=nchunks)
      for chunkindex in range(nchunks):
        for line in iterator:
          print(line)

    When nchunks is given, the number of lines per chunk is estimated from the size of the file.
    Note that lines are never split into sub-fractions; so if you divide a 10-line file into 3 chunks, you get
    chunk1= 4 lines       chunk2= 3 lines        chunk3= 3 lines
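
    For intuition, this uneven split matches a plain divide-and-round scheme. A minimal
    sketch of the arithmetic (an illustration only, not necessarily the package's
    internal code):

      def chunk_sizes(total_lines, nchunks):
        # distribute total_lines over nchunks as evenly as possible,
        # giving the earlier chunks one extra line when the division is not exact
        base, extra = divmod(total_lines, nchunks)
        return [base + 1 if i < extra else base for i in range(nchunks)]

      print(chunk_sizes(10, 3))   # [4, 3, 3]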

    How to use in pandas
    ====================
    This code was written specifically to read tabular files in chunks with pandas.
    In principle this is accomplished by pandas.read_csv(..., chunksize=50), but in practice that will crash with certain files.
    To accomplish this instead, use:

    import pandas as pd

    nchunks=4
    iterator=iterate_file_in_chunks(fname, nchunks=nchunks)

    # determine column names somehow, e.g. from the header line:
    with open(fname) as fh:
      colnames=fh.readline().strip().split('\t')

    for chunkindex in range(nchunks):
      # read a chunk of lines as a dataframe; engine='python' lets read_csv
      # consume the line iterator. In the first chunk, header=0 consumes the
      # header line (its names are then overridden by names=colnames)
      chunkdf=pd.read_csv(iterator, engine='python', names=colnames, sep='\t',
                          header=(0 if chunkindex==0 else None) )

      # use chunkdf somehow (process is a placeholder for your own code), to obtain resdf
      resdf=process(chunkdf)

      # write all resdf to a file called outfile, one chunk at a time
      resdf.to_csv(outfile, sep='\t', index=False, header=(chunkindex == 0),
                mode=('w' if chunkindex == 0 else 'a') )

    Warning
    =======
     pandas.read_csv has unexpected behavior when nlines is small, say <3:
     it will ignore some StopIteration exceptions, and concatenate what are supposed to be different chunks

     pandas version tested 1.3.3
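
Putting the pieces together, here is a minimal self-contained run of iterate_file_in_chunks (a sketch only: it assumes the class is imported from the file_chunk_iterators package, and the sample file name is made up):

  from file_chunk_iterators import iterate_file_in_chunks

  # build a small sample file with 10 numbered lines
  with open('sample.txt', 'w') as fh:
    for i in range(1, 11):
      fh.write(f'line {i}\n')

  iterator = iterate_file_in_chunks('sample.txt', nlines=4)
  chunkindex = 0
  while not iterator.finished:
    for line in iterator:
      print(chunkindex, line.rstrip())
    chunkindex += 1
  # expected: chunk 0 -> lines 1-4, chunk 1 -> lines 5-8, chunk 2 -> lines 9-10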

iterate_file_in_chunks_with_key(fname, nlines, keyfn=lambda x:x[:x.index('\t')]):

    Arguments:
      fname:   file to be opened
      nlines:  max number of lines per chunk
      keyfn:   function to be applied to each line to derive its key  (default: get first tab-separated field)

    Note:
      - a chunk may be larger than nlines if there are more than nlines consecutive lines with the same key (a sketch after this docstring demonstrates this)
      - the same-key condition is tested only for consecutive lines

    Boolean attribute self.finished tells whether the whole file has been iterated through

    Example #1: specifying number of lines
    ======================================
      iterator=iterate_file_in_chunks_with_key(fname, nlines=10, keyfn=lambda x:x.split()[0])

      # first time: iterates over lines 1-10 (possibly more, to keep same-key lines together)
      for line in iterator:        print (line)

      # 2nd time: iterates over the next chunk of lines
      for line in iterator:        print (line)

      # etc

    Example #2: specifying number of lines, iterate whole file
    ==========================================================
      iterator=iterate_file_in_chunks_with_key(fname, nlines=10, keyfn=lambda x:x.split()[0])
      chunkindex=0  #optionally, keeping track of chunkindex
      while not iterator.finished:
        for line in iterator:
          print(line)
        chunkindex+=1

    Warning
    =======
     pandas.read_csv has unexpected behavior when nlines is small, say <3:
     it will ignore some StopIteration exceptions, and concatenate what are supposed to be different chunks

     pandas version tested 1.3.3
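
To make the grouping constraint concrete, here is a minimal self-contained sketch (same assumed import path as above; the data is invented). With nlines=2, the three consecutive 'a' lines stay in one chunk even though that exceeds nlines:

  from file_chunk_iterators import iterate_file_in_chunks_with_key

  # sample tab-separated file: the key is the first column (the default keyfn)
  with open('keyed.tsv', 'w') as fh:
    fh.write('a\t1\na\t2\na\t3\nb\t4\nb\t5\nc\t6\n')

  iterator = iterate_file_in_chunks_with_key('keyed.tsv', nlines=2)
  chunkindex = 0
  while not iterator.finished:
    for line in iterator:
      print(chunkindex, line.rstrip())
    chunkindex += 1
  # expected, per the constraint above: chunk 0 -> the three 'a' lines,
  # chunk 1 -> the two 'b' lines, chunk 2 -> the 'c' line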
