file_chunk_iterators
Python classes to iterate through files in chunks
We provide two iterator classes:
- iterate_file_in_chunks: an iterator over the chunks of a file. It is initialized by specifying how many lines per chunk (or how many chunks) to make. It is then used as an iterator several times: each time, it iterates over the lines of one chunk only.
- iterate_file_in_chunks_with_key: iterates through a file in chunks, keeping together groups of lines defined by a key. This class works like iterate_file_in_chunks (see its docstring) with an additional constraint: each line is characterized by a key, and consecutive lines with the same key are never split into different chunks.
These classes can be used in conjunction with pandas.read_csv to read a pandas dataframe one chunk at a time, which may save memory.
Below are the docstrings of the two classes:
iterate_file_in_chunks(fname, nlines=None, nchunks=None):
Arguments:
fname: file to be opened
nlines: max number of lines per chunk
nchunks: number of chunks (overrides nlines)
Boolean attribute self.finished tells whether the whole file has been iterated through
Example #1: specifying number of lines
======================================
iterator=iterate_file_in_chunks(fname, nlines=4)
# first time: iterates over lines 1-4
for line in iterator: print(line)
# 2nd time: iterates over lines 5-8
for line in iterator: print(line)
# etc
Example #2: specifying number of lines, iterate whole file
==========================================================
iterator=iterate_file_in_chunks(fname, nlines=4)
chunkindex=0  # optionally, keep track of the chunk index
while not iterator.finished:
    for line in iterator:
        print(line)
    chunkindex+=1
Example #3: specifying number of chunks
=======================================
nchunks=3
iterator=iterate_file_in_chunks(fname, nchunks=nchunks)
for chunkindex in range(nchunks):
    for line in iterator:
        print(line)
When nchunks is given, the number of lines per chunk is estimated from the size of the file.
Note that lines are never split across chunks, so if you divide a 10-line file into 3 chunks you get:
chunk1 = 4 lines, chunk2 = 3 lines, chunk3 = 3 lines
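For reference, the 4/3/3 split above can be reproduced with a simple calculation; this is only an illustrative sketch of the rounding behavior, not the package's actual implementation:
def split_sizes(n_lines, n_chunks):
    # distribute n_lines over n_chunks without splitting any line;
    # the first `remainder` chunks get one extra line
    base, remainder = divmod(n_lines, n_chunks)
    return [base + 1 if i < remainder else base for i in range(n_chunks)]

print(split_sizes(10, 3))   # [4, 3, 3], as in the note above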
How to use in pandas
====================
This code was written specifically to read tabular files in chunks for use with pandas.
In principle this is accomplished by pandas.read_csv(..., chunksize=50), but in practice that can crash with certain files.
To do this with this package, use:
import pandas as pd
nchunks=4
iterator=iterate_file_in_chunks(fname, nchunks=nchunks)
# determine column names somehow, e.g. with:
with open(fname) as fh:
    colnames=fh.readline().strip().split('\t')
for chunkindex in range(nchunks):
    # read a chunk of lines as a dataframe; in the first chunk, row 0 is the
    # header line and is skipped (header=0), since colnames is passed explicitly
    chunkdf=pd.read_csv(iterator, engine='python', names=colnames, sep='\t',
                        header=(0 if chunkindex==0 else None))
    # use chunkdf somehow, to obtain resdf (process() is a placeholder for your own code)
    resdf=process(chunkdf)
    # write each resdf to a file called outfile, one chunk at a time
    resdf.to_csv(outfile, sep='\t', index=False, header=(chunkindex == 0),
                 mode=('w' if chunkindex == 0 else 'a'))
Warning
=======
pandas.read_csv has unexpected behavior when nlines is small (say, <3):
it may ignore some StopIteration signals and concatenate what are supposed to be different chunks.
pandas version tested: 1.3.3
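If you need very small chunks, one possible workaround (not part of the package itself) is to materialize each chunk into an in-memory buffer before handing it to pandas, so that read_csv never consumes the iterator across chunk boundaries. A minimal sketch, assuming colnames was obtained as in the example above:
import io
import pandas as pd

iterator = iterate_file_in_chunks(fname, nlines=2)
chunkindex = 0
while not iterator.finished:
    # drain one chunk into a string buffer (re-adding newlines in case the
    # iterator strips them), then parse the buffer instead of the iterator
    text = ''.join(line if line.endswith('\n') else line + '\n' for line in iterator)
    chunkdf = pd.read_csv(io.StringIO(text), sep='\t', names=colnames,
                          header=(0 if chunkindex == 0 else None))
    chunkindex += 1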
iterate_file_in_chunks_with_key(fname, nlines, keyfn=lambda x:x[:x.index('\t')]):
Arguments:
fname: file to be opened
nlines: max number of lines per chunk
keyfn: function to be applied to each line to derive its key (default: get first tab-separated field)
Note:
- the chunk may have size greater than nlines if there are more than nlines consecutive lines with the same key
- the same-key condition is tested only for consecutive lines
Boolean attribute self.finished tells whether the whole file has been iterated through
Example #1: specifying number of lines
======================================
iterator=iterate_file_in_chunks_with_key(fname, nlines=10, keyfn=lambda x:x.split()[0])
# first time: iterates over the first chunk (~10 lines, extended so that lines sharing a key stay together)
for line in iterator: print(line)
# 2nd time: iterates over the next chunk
for line in iterator: print(line)
# etc
Example #2: specifying number of lines, iterate whole file
==========================================================
iterator=iterate_file_in_chunks_with_key(fname, nlines=10, keyfn=lambda x:x.split()[0])
chunkindex=0  # optionally, keep track of the chunk index
while not iterator.finished:
    for line in iterator:
        print(line)
    chunkindex+=1
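To see the same-key constraint in action, here is a small self-contained sketch; the file name, file contents and expected grouping are illustrative assumptions, not taken from the package's own tests:
# hypothetical 5-line tab-separated file whose first field is the key
with open('example.tsv', 'w') as fh:
    fh.write('geneA\t1\n'
             'geneA\t2\n'
             'geneA\t3\n'
             'geneB\t4\n'
             'geneB\t5\n')

# with nlines=2, the first chunk should grow to 3 lines, because the three
# consecutive geneA lines share a key and cannot be split across chunks
iterator = iterate_file_in_chunks_with_key('example.tsv', nlines=2)
while not iterator.finished:
    chunk = list(iterator)
    print(len(chunk), chunk)
# expected: a 3-line geneA chunk followed by a 2-line geneB chunk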
Warning
=======
pandas.read_csv has unexpected behavior when nlines is small (say, <3):
it may ignore some StopIteration signals and concatenate what are supposed to be different chunks.
pandas version tested: 1.3.3