
A reader for large files with custom delimiters and encodings


AlphaReader

After several attempts to use the csv package or pandas for reading large files with custom delimiters, I ended up writing a small program that does the job without complaints.

AlphaReader is a high-performance, pure-Python library of about 15 lines of code that reads your files in chunks of bytes and returns their content line by line.
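To illustrate the chunked approach described above, here is a minimal sketch of a generator that reads fixed-size chunks of bytes and yields one record at a time. It is not AlphaReader's actual source: the function name `read_records` and the `chunk_size` parameter are hypothetical, and splitting on raw bytes before decoding assumes a single-byte encoding such as CP1252.

```python
import io

def read_records(handle, encoding="cp1252", terminator=10, delimiter=44, chunk_size=4096):
    """Yield one list of decoded fields per record, reading the file in chunks."""
    term = bytes([terminator])   # e.g. 10 -> b"\n"
    delim = bytes([delimiter])   # e.g. 44 -> b","
    pending = b""
    while True:
        chunk = handle.read(chunk_size)
        if not chunk:
            break
        pending += chunk
        # Everything before the last terminator is complete; keep the tail for later.
        *complete, pending = pending.split(term)
        for raw in complete:
            yield [field.decode(encoding) for field in raw.split(delim)]
    if pending:
        # Last record when the file does not end with a terminator.
        yield [field.decode(encoding) for field in pending.split(delim)]

# In-memory stand-in for open('file.csv', 'rb'); a tiny chunk size forces
# records to span several chunks.
sample = io.BytesIO(b"1,John,Doe,2010\n2,Mary,Smith,2011\n3,Peter,Jones,2012\n")
rows = list(read_records(sample, chunk_size=8))
```

Only one chunk plus the unfinished tail of the current record is held in memory at a time, which is what makes this style of reader suitable for large files.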

The library was inspired by the need to extract data from an MS SQL Server database and to deal with the CP1252 encoding. AlphaReader uses this encoding by default, as it suited our use case.

It also works with HDFS through the pyarrow library, which is not a required dependency.

CSVs

# !cat file.csv
# 1,John,Doe,2010
# 2,Mary,Smith,2011
# 3,Peter,Jones,2012

> reader = AlphaReader(open('file.csv', 'rb'), encoding='cp1252', terminator=10, delimiter=44)
> next(reader)
> ['1','John','Doe','2010']

TSVs

# !cat file.tsv
# 1    John    Doe    2010
# 2    Mary    Smith  2011
# 3    Peter   Jones  2012

> reader = AlphaReader(open('file.tsv', 'rb'), encoding='cp1252', terminator=10, delimiter=9)
> next(reader)
> ['1','John','Doe','2010']

XSVs

# !cat file.xsv
# 1¦John¦Doe¦2010
# 2¦Mary¦Smith¦2011
# 3¦Peter¦Jones¦2012

> ord('¦')
> 166
> chr(166)
> '¦'
> reader = AlphaReader(open('file.xsv', 'rb'), encoding='cp1252', terminator=10, delimiter=166)
> next(reader)
> ['1','John','Doe','2010']
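The `ord('¦')` lookup above works because the Unicode code point of `¦` happens to match its CP1252 byte value. That is not true for every character, so if the delimiter is interpreted as a byte value in the file's encoding (an assumption, suggested by the files being opened in `'rb'` mode), it may be safer to derive the number from the encoding itself. The helper name `byte_code` below is hypothetical and not part of AlphaReader.

```python
def byte_code(char, encoding="cp1252"):
    """Return the single byte value that `char` encodes to in `encoding`."""
    encoded = char.encode(encoding)
    if len(encoded) != 1:
        raise ValueError(f"{char!r} is not a single byte in {encoding}")
    return encoded[0]

broken_bar = byte_code("¦")  # 166, the same as ord("¦")
euro = byte_code("€")        # 128 in CP1252, whereas ord("€") is 8364
```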

HDFS

# !hdfs dfs -cat /raw/tsv/file.tsv
# 1    John    Doe    2010
# 2    Mary    Smith  2011
# 3    Peter   Jones  2012

> import pyarrow as pa
> fs = pa.hdfs.connect()
> reader = AlphaReader(fs.open('/raw/tsv/file.tsv', 'rb'), encoding='cp1252', terminator=10, delimiter=9)
> next(reader)
> ['1','John','Doe','2010']

Transformations

# !cat file.csv
# 1,2,3
# 10,20,30
# 100,200,300

> fn = lambda x: x+1
> reader = AlphaReader(open('file.csv', 'rb'), encoding='cp1252', terminator=10, delimiter=44, fn_tranform=fn)
> next(reader)
> [2,3,4]
> next(reader)
> [11,21,31]

Chain Transformations

# !cat file.csv
# 1,2,3
# 10,20,30
# 100,200,300

> fn_1 = lambda x: x+1
> fn_2 = lambda x: x*10
> reader = AlphaReader(open('file.csv', 'rb'), encoding='cp1252', terminator=10, delimiter=44, fn_tranform=[fn_1, fn_2])
> next(reader)
> [20,30,40]
> next(reader)
> [110,210,310]
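The outputs above are consistent with each function being applied in list order to every field: `fn_1` first, then `fn_2`. The sketch below reproduces that behaviour on plain lists. The helper `apply_chain` is hypothetical and is not AlphaReader's implementation. It also starts from integers, whereas fields read from a file arrive as text, so whether a conversion step is needed before arithmetic is an open assumption here.

```python
from functools import reduce

def apply_chain(fields, fns):
    """Apply each function in `fns`, in order, to every field."""
    return [reduce(lambda value, fn: fn(value), fns, field) for field in fields]

fn_1 = lambda x: x + 1
fn_2 = lambda x: x * 10

first = apply_chain([1, 2, 3], [fn_1, fn_2])      # (x + 1) * 10 per field
second = apply_chain([10, 20, 30], [fn_1, fn_2])
```

Order matters: swapping the two functions would compute `x * 10 + 1` instead.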

Caution with large files

> reader = AlphaReader(open('large_file.xsv', 'rb'), encoding='cp1252', terminator=172, delimiter=173)
> records = list(reader) # Avoid this: it loads the whole file into memory
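One way to avoid materialising every record is to consume the reader lazily, either with a plain `for` loop or in fixed-size batches. The sketch below uses a generator as a stand-in for the reader, on the assumption that the reader behaves like any other Python iterator (it is used with `next()` above). The helper `in_batches` is hypothetical.

```python
from itertools import islice

def in_batches(records, size):
    """Yield lists of at most `size` records from any iterator."""
    iterator = iter(records)
    # islice pulls only `size` items at a time; an empty batch ends the loop.
    while batch := list(islice(iterator, size)):
        yield batch

# Stand-in for a reader: any iterator of records works the same way.
fake_reader = ([str(i), "name"] for i in range(5))
batch_sizes = [len(batch) for batch in in_batches(fake_reader, 2)]
```

Memory use is then bounded by the batch size rather than by the size of the file.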


