
A reader for large files with custom delimiters and encodings

Project description

AlphaReader


After several attempts at using the csv package and pandas to read large files with custom delimiters, I ended up writing a small program that does the job without complaints.

AlphaReader is a high-performance, pure-Python library of roughly 15 lines of code that reads chunks of bytes from your files and yields their content line by line.

The inspiration for this library came from having to extract data from an MS-SQL Server database and deal with its CP1252 encoding. AlphaReader uses this encoding by default, as it was what our use case needed.

It also works with HDFS through the pyarrow library, although pyarrow is not a dependency.
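
Conceptually, the reader pulls fixed-size chunks of bytes from the file handle, splits them on the terminator byte, and yields each decoded line split on the delimiter. The sketch below only illustrates that idea; it is not AlphaReader's actual source, and names such as chunk_size are made up:

def read_records(file_handle, encoding='cp1252', terminator=10, delimiter=44, chunk_size=8192):
    """Illustrative sketch of chunked, byte-level record reading."""
    buffer = b''
    while True:
        chunk = file_handle.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Everything before the last terminator is a complete line; keep the remainder.
        *complete, buffer = buffer.split(bytes([terminator]))
        for line in complete:
            yield line.decode(encoding).split(chr(delimiter))
    if buffer:
        yield buffer.decode(encoding).split(chr(delimiter))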

CSVs

# !cat file.csv
# 1,John,Doe,2010
# 2,Mary,Smith,2011
# 3,Peter,Jones,2012

> reader = AlphaReader(open('file.csv', 'rb'), encoding='cp1252', terminator=10, delimiter=44)
> next(reader)
> ['1','John','Doe','2010']
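
The terminator and delimiter arguments are plain integer byte values, i.e. ord() of the line-ending and separator characters:

> ord('\n')
> 10
> ord(',')
> 44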

TSVs

# !cat file.tsv
# 1    John    Doe    2010
# 2    Mary    Smith  2011
# 3    Peter   Jones  2012

> reader = AlphaReader(open('file.tsv', 'rb'), encoding='cp1252', terminator=10, delimiter=9)
> next(reader)
> ['1','John','Doe','2010']

XSVs

# !cat file.xsv
# 1¦John¦Doe¦2010
# 2¦Mary¦Smith¦2011
# 3¦Peter¦Jones¦2012

> ord('¦')
> 166
> chr(166)
> '¦'
> reader = AlphaReader(open('file.xsv', 'rb'), encoding='cp1252', terminator=10, delimiter=166)
> next(reader)
> ['1','John','Doe','2010']

HDFS

# !hdfs dfs -cat /raw/tsv/file.tsv
# 1    John    Doe    2010
# 2    Mary    Smith  2011
# 3    Peter   Jones  2012

> import pyarrow as pa
> fs = pa.hdfs.connect()
> reader = AlphaReader(fs.open('/raw/tsv/file.tsv', 'rb'), encoding='cp1252', terminator=10, delimiter=9)
> next(reader)
> ['1','John','Doe','2010']
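
Note that pa.hdfs.connect() is pyarrow's legacy HDFS interface and has been removed in recent pyarrow releases. With a newer pyarrow, a rough equivalent using the pyarrow.fs API might look like this (a sketch, assuming the returned stream is accepted by AlphaReader the same way a local binary file handle is):

> from pyarrow import fs
> hdfs = fs.HadoopFileSystem('default')
> reader = AlphaReader(hdfs.open_input_stream('/raw/tsv/file.tsv'), encoding='cp1252', terminator=10, delimiter=9)
> next(reader)
> ['1','John','Doe','2010']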

Transformations

# !cat file.csv
# 1,2,3
# 10,20,30
# 100,200,300

> fn = lambda x: int(x)
> reader = AlphaReader(open('file.csv', 'rb'), encoding='cp1252', terminator=10, delimiter=44, fn_transform=fn)
> next(reader)
> [1,2,3]
> next(reader)
> [10,20,30]
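
The transform is applied to every field of the row, so the callable has to cope with every column. For mixed columns, a defensive transform could look like this (a hypothetical helper, applied here to the name/year file from the CSV example above and assuming a single callable is accepted exactly like the lambda above):

> def to_int(x):
>     return int(x) if x.isdigit() else x
> reader = AlphaReader(open('file.csv', 'rb'), encoding='cp1252', terminator=10, delimiter=44, fn_transform=to_int)
> next(reader)
> [1, 'John', 'Doe', 2010]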

Chain Transformations

# !cat file.csv
# 1,2,3
# 10,20,30
# 100,200,300

> fn_1 = lambda x: x+1
> fn_2 = lambda x: x*10
> reader = AlphaReader(open('file.csv', 'rb'), encoding='cp1252', terminator=10, delimiter=44, fn_transform=[int, fn_1, fn_2])
> next(reader)
> [20,30,40]
> next(reader)
> [110,210,310]
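
The chained callables are applied left to right to every field, so '1' becomes int('1') + 1 = 2, then 2 * 10 = 20. The composition is equivalent to the plain-Python reduce below (an illustration, not the library's internals):

> from functools import reduce
> transforms = [int, lambda x: x + 1, lambda x: x * 10]
> apply_all = lambda value: reduce(lambda acc, fn: fn(acc), transforms, value)
> [apply_all(field) for field in ['1', '2', '3']]
> [20, 30, 40]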

Caution

> reader = AlphaReader(open('large_file.xsv', 'rb'), encoding='cp1252', terminator=172, delimiter=173)
> records = list(reader) # Avoid this: it loads the entire file into memory
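
Since AlphaReader yields one record at a time, the memory-friendly pattern is to iterate over it rather than materialising the whole file; process() below is just a placeholder for your own handling:

> reader = AlphaReader(open('large_file.xsv', 'rb'), encoding='cp1252', terminator=172, delimiter=173)
> for record in reader:        # streams one record at a time
>     process(record)          # process() is a placeholder for your own logic

itertools.islice(reader, n) is an alternative when only a bounded batch is needed.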

Limitations

  • No support for multi-byte delimiters (see the workaround sketch after this list)
  • Slower than the built-in csv library; use csv and its dialects when your files have \r\n terminators
  • Transformations are applied per row; vectorization could perhaps aid performance
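
A possible workaround for multi-byte delimiters (a sketch, not part of AlphaReader) is to rewrite the file once, replacing the multi-byte delimiter with a single unused byte, and then point AlphaReader at the rewritten file. This assumes newline-terminated records of manageable length:

# Sketch: replace a two-character delimiter '||' with the ASCII unit separator (byte 31)
with open('multi.xsv', 'rb') as src, open('single.xsv', 'wb') as dst:
    for raw_line in src:                              # binary iteration splits on b'\n'
        dst.write(raw_line.replace(b'||', bytes([31])))

reader = AlphaReader(open('single.xsv', 'rb'), encoding='cp1252', terminator=10, delimiter=31)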

Performance

  • 24MB file loaded with list(AlphaReader(file_handle))
tests/test_profile.py::test_alphareader_with_encoding
--------------------------------------------------------------------------------- live log call 
INFO     root:test_profile.py:22          252343 function calls in 0.386 seconds

    Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   119605    0.039    0.000    0.386    0.000 .\alphareader\__init__.py:39(AlphaReader)
   122228    0.266    0.000    0.266    0.000 {method 'split' of 'str' objects}
     2625    0.005    0.000    0.054    0.000 {method 'decode' of 'bytes' objects}
     2624    0.001    0.000    0.049    0.000 .\Python-3.7.4\lib\encodings\cp1252.py:14(decode)
     2624    0.048    0.000    0.048    0.000 {built-in method _codecs.charmap_decode}
     2625    0.027    0.000    0.027    0.000 {method 'read' of '_io.BufferedReader' objects}
        1    0.000    0.000    0.000    0.000 .\__init__.py:5(_validate)
        1    0.000    0.000    0.000    0.000 {built-in method _codecs.lookup}
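
The numbers above come from the project's own profiling test. A comparable measurement on your own files can be taken with the standard-library profiler; the snippet below is just one way to do it (file name and parameters are placeholders):

> import cProfile
> profiler = cProfile.Profile()
> profiler.enable()
> records = list(AlphaReader(open('file.csv', 'rb'), encoding='cp1252', terminator=10, delimiter=44))
> profiler.disable()
> profiler.print_stats(sort='cumulative')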

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

alphareader-0.0.7.tar.gz (4.2 kB)

Uploaded Source

Built Distribution

alphareader-0.0.7-py3-none-any.whl (15.9 kB)

Uploaded Python 3

File details

Details for the file alphareader-0.0.7.tar.gz.

File metadata

  • Download URL: alphareader-0.0.7.tar.gz
  • Upload date:
  • Size: 4.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.43.0 CPython/3.7.4

File hashes

Hashes for alphareader-0.0.7.tar.gz

Algorithm    Hash digest
SHA256       60fbf96ba313a2492ad8aee83815c9fee9c700e422bf951a602dfc073a1cb726
MD5          92a9b3a8460b4f2bcfabe930dfb23c9a
BLAKE2b-256  e362c94a65e19dae4522e83d4c622b039d2a7f45fcbf240f0762fa63744dcdae


File details

Details for the file alphareader-0.0.7-py3-none-any.whl.

File metadata

  • Download URL: alphareader-0.0.7-py3-none-any.whl
  • Upload date:
  • Size: 15.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.43.0 CPython/3.7.4

File hashes

Hashes for alphareader-0.0.7-py3-none-any.whl

Algorithm    Hash digest
SHA256       26448aa8bcf46eebff4c9d720893796fdd404bd045b27d4379782eadfbee9e00
MD5          c7075f4108e74264f1a5c95a9ed3a7c6
BLAKE2b-256  82b5fc0331ff678a662922451408cabd5ad1d741dfa99fd29bd9c2df4c5e0811

