Python classes to iterate through files in chunks

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

file_chunk_iterators

Python classes to iterate through files in chunks

Installation

pip install file_chunk_iterators

Usage

The package provides two iterator classes:

- iterate_file_in_chunks: iterates over a file one chunk at a time. It is initialized with either the maximum number of lines per chunk or the number of chunks, and is then iterated over repeatedly: each pass yields the lines of one chunk only.
- iterate_file_in_chunks_with_key: iterates over a file in chunks while keeping together groups of lines defined by a key. It works like iterate_file_in_chunks (see its docstring), with one additional constraint: each line has a key, and consecutive lines with the same key are never split across chunks.
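To make the "iterate several times, one chunk per pass" behavior concrete, here is a minimal sketch of the idea. It is not the package's code: the class name ChunkedLines is hypothetical, it works on any iterable of lines rather than a file name, and the exact moment at which finished becomes true is an assumption.

```python
class ChunkedLines:
    """Hypothetical stand-in: each pass yields at most `nlines` lines
    taken from one shared underlying iterator."""

    def __init__(self, lines, nlines):
        self._lines = iter(lines)
        self.nlines = nlines
        self.finished = False
        self._next = None
        self._peek()

    def _peek(self):
        # Look one line ahead so `finished` is accurate before the next pass.
        try:
            self._next = next(self._lines)
        except StopIteration:
            self._next = None
            self.finished = True

    def __iter__(self):
        count = 0
        while not self.finished and count < self.nlines:
            line = self._next
            self._peek()
            count += 1
            yield line


lines = [f"line{i}" for i in range(1, 11)]
reader = ChunkedLines(lines, nlines=4)

chunks = []
while not reader.finished:
    chunks.append(list(reader))  # one pass consumes one chunk

print([len(chunk) for chunk in chunks])
```

Looking one line ahead is one way to let the while-not-finished loop stop without producing a trailing empty chunk; the real classes may handle this differently.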

These classes can be used in conjunction with pandas.read_csv to read a pandas dataframe one chunk at a time, which may save memory.

The docstrings of the two classes follow:

iterate_file_in_chunks(fname, nlines=None, nchunks=None):
    Arguments:
      fname:   file to be opened
      nlines:  max number of lines per chunk
      nchunks: number of chunks (overrides nlines)

    Boolean attribute self.finished tells whether the whole file has been iterated through

    Example #1: specifying number of lines
    ======================================
      iterator=iterate_file_in_chunks(fname, nlines=4)

      # first time: iterates over lines 1-4
      for line in iterator:
        print(line)

      # 2nd time: iterates over lines 5-8
      for line in iterator:
        print(line)

      # etc

    Example #2: specifying number of lines, iterate whole file
    ==========================================================
      iterator=iterate_file_in_chunks(fname, nlines=4)
      chunkindex=0  #optionally, keeping track of chunkindex
      while not iterator.finished:
        for line in iterator:
          print(line)
        chunkindex+=1


    Example #3: specifying number of chunks
    =======================================
      nchunks=3
      iterator=iterate_file_in_chunks(fname, nchunks=nchunks)
      for chunkindex in range(nchunks):
        for line in iterator:
          print(line)

    The nchunks argument estimates the number of lines per chunk from the size of the file.
    Note that lines are never split; so if you divide a 10-line file into 3 chunks, you get
    chunk1= 4 lines       chunk2= 3 lines        chunk3= 3 lines
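The arithmetic behind the 10-line example can be sketched as follows. This only reproduces the 4/3/3 split shown above; the helper name chunk_sizes is hypothetical, and since the package estimates chunk sizes from the file size, its actual computation may differ.

```python
def chunk_sizes(total_lines, nchunks):
    """Split total_lines into nchunks whole-line chunks, as evenly as possible."""
    base, extra = divmod(total_lines, nchunks)
    # The first `extra` chunks take one additional line each.
    return [base + 1 if i < extra else base for i in range(nchunks)]


sizes = chunk_sizes(10, 3)
print(sizes)  # [4, 3, 3]
```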

    How to use in pandas
    ====================
    This code was written specifically to read tabular files in chunks for use in pandas.
    In theory, pandas.read_csv(.. , chunksize=50) does this, but in practice it crashes with certain files.
    Instead, use:

    import pandas as pd

    nchunks=4
    iterator=iterate_file_in_chunks(fname, nchunks=nchunks)

    # determine column names somehow, e.g. with:
    with open(fname) as fh:
      colnames=fh.readline().strip().split('\t')

    for chunkindex in range(nchunks):
      # read a chunk of lines as dataframe
      chunkdf=pd.read_csv(iterator, engine='python', names=colnames, sep='\t',
                          header=(1 if chunkindex==0 else None) )

      # use chunkdf somehow, to obtain resdf
      resdf=process(chunkdf)

      # write all resdf to a file called outfile, one chunk at the time
      resdf.to_csv(outfile, sep='\t', index=False, header=(chunkindex == 0),
                mode=('w' if chunkindex == 0 else 'a') )
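The output step above follows a common pattern: open the file in write mode and emit the header for the first chunk only, then append for every later chunk. A standard-library sketch of that pattern, with the hypothetical helper write_chunk and made-up rows standing in for the processed dataframes:

```python
import csv
import os
import tempfile


def write_chunk(path, header, rows, chunkindex):
    """Write one chunk of rows; only the first chunk creates the file and its header."""
    mode = "w" if chunkindex == 0 else "a"
    with open(path, mode, newline="") as fh:
        writer = csv.writer(fh, delimiter="\t")
        if chunkindex == 0:
            writer.writerow(header)
        writer.writerows(rows)


# Illustrative data only: two chunks of already-processed rows.
processed_chunks = [[("a", 1), ("b", 2)], [("c", 3)]]

with tempfile.TemporaryDirectory() as tmpdir:
    outfile = os.path.join(tmpdir, "out.tsv")
    for chunkindex, rows in enumerate(processed_chunks):
        write_chunk(outfile, ["name", "value"], rows, chunkindex)
    with open(outfile, newline="") as fh:
        written = list(csv.reader(fh, delimiter="\t"))

print(written)
```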

    Warning
    =======
     pandas.read_csv behaves unexpectedly when nlines is small, say <3:
     it ignores some StopIteration exceptions and concatenates what should be separate chunks

     Tested with pandas version 1.3.3

iterate_file_in_chunks_with_key(fname, nlines, keyfn=lambda x:x[:x.index('\t')]):

    Arguments:
      fname:   file to be opened
      nlines:  max number of lines per chunk
      keyfn:   function to be applied to each line to derive its key  (default: get first tab-separated field)

    Note:
      - the chunk may have size greater than nlines if there are more than nlines consecutive lines with the same key
      - the same-key condition is tested only for consecutive lines
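The two notes above can be illustrated with a small sketch built on itertools.groupby, which also groups only consecutive items. This is not the package's code: the function name chunks_with_key is hypothetical, and the policy of closing a chunk before a group that would push it past nlines is an assumption.

```python
from itertools import groupby


def chunks_with_key(lines, nlines, keyfn):
    """Yield lists of lines, never splitting a run of consecutive same-key lines."""
    chunk = []
    for _, group in groupby(lines, key=keyfn):
        group = list(group)
        # Close the current chunk if adding this group would exceed nlines.
        if chunk and len(chunk) + len(group) > nlines:
            yield chunk
            chunk = []
        # A group longer than nlines still stays whole, giving an oversized chunk.
        chunk.extend(group)
    if chunk:
        yield chunk


lines = ["a\t1", "a\t2", "b\t1", "b\t2", "b\t3", "c\t1"]


def first_field(line):
    # Same idea as the default keyfn: the first tab-separated field.
    return line[:line.index("\t")]


result = list(chunks_with_key(lines, nlines=2, keyfn=first_field))
print([len(chunk) for chunk in result])  # [2, 3, 1]
```

With nlines=2, the three consecutive "b" lines stay together, so the second chunk is larger than nlines.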

    Boolean attribute self.finished tells whether the whole file has been iterated through

    Example #1: specifying number of lines
    ======================================
      iterator=iterate_file_in_chunks_with_key(fname, nlines=10, keyfn=lambda x:x.split()[0])

      # first time: iterates over the first chunk (up to 10 lines, more only if a same-key group is longer)
      for line in iterator:
        print(line)

      # 2nd time: iterates over the next chunk
      for line in iterator:
        print(line)

      # etc

    Example #2: specifying number of lines, iterate whole file
    ==========================================================
      iterator=iterate_file_in_chunks_with_key(fname, nlines=10, keyfn=lambda x:x.split()[0])
      chunkindex=0  #optionally, keeping track of chunkindex
      while not iterator.finished:
        for line in iterator:
          print(line)
        chunkindex+=1

    Warning
    =======
     pandas.read_csv behaves unexpectedly when nlines is small, say <3:
     it ignores some StopIteration exceptions and concatenates what should be separate chunks

     Tested with pandas version 1.3.3


Release history

- 0.0.4 (this version): Jun 21, 2023
- 0.0.3: Jun 21, 2023
- 0.0.1: Mar 22, 2023

Download files

Source Distribution

file_chunk_iterators-0.0.4.tar.gz (17.5 kB)

Uploaded Jun 21, 2023 Source

File details

Details for the file file_chunk_iterators-0.0.4.tar.gz.

File metadata

Download URL: file_chunk_iterators-0.0.4.tar.gz
Upload date: Jun 21, 2023
Size: 17.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.4.2 requests/2.22.0 setuptools/45.2.0 requests-toolbelt/0.8.0 tqdm/4.30.0 CPython/3.8.10

File hashes

Hashes for file_chunk_iterators-0.0.4.tar.gz
Algorithm	Hash digest
SHA256	`5be4d10f0b58b7b818cc5ae9b9031f0b8d4e5246757cd4d28d7b00149d3d4614`
MD5	`ec6afd205e79958187c1086ad179eb6d`
BLAKE2b-256	`294c45b1d7bb503a62fb8b46b0f7e8c4ed55d6632cb1e13f9304d32e44fd679c`
