A reader for large files with custom delimiters and encodings
Project description
AlphaReader
After several attempts to use the csv package or pandas for reading large files with custom delimiters, I ended up writing a little program that does the job without complaints.
AlphaReader is a high-performance, pure-Python, 15-line library that reads chunks of bytes from your files and returns their content line by line.
The inspiration for this library came from having to extract data from an MS SQL Server database and deal with the CP1252 encoding. AlphaReader uses this encoding by default, as it was useful in our use case.
It also works with HDFS through the pyarrow library, but pyarrow is not a dependency.
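The chunk-based reading described above can be sketched roughly as follows. This is not AlphaReader's actual source: the function name `iter_rows` and the `chunk_size` parameter are assumptions made for illustration, and only the `encoding`, `terminator` and `delimiter` arguments mirror the ones used in the examples in this description.

```python
import io


def iter_rows(handle, encoding="cp1252", terminator=10, delimiter=44, chunk_size=8192):
    """Hypothetical sketch: yield rows from a binary handle, one chunk at a time."""
    # Turn the integer byte values into the characters they decode to.
    # This only makes sense for single-byte encodings such as cp1252.
    term = bytes([terminator]).decode(encoding)
    delim = bytes([delimiter]).decode(encoding)
    tail = ""  # incomplete line carried over from the previous chunk
    while True:
        chunk = handle.read(chunk_size)
        if not chunk:
            break
        lines = (tail + chunk.decode(encoding)).split(term)
        tail = lines.pop()  # last piece may be an unfinished line
        for line in lines:
            yield line.split(delim)
    if tail:  # file did not end with a terminator
        yield tail.split(delim)


sample = io.BytesIO(b"1,John,Doe,2010\n2,Mary,Smith,2011\n")
print(next(iter_rows(sample, chunk_size=4)))  # ['1', 'John', 'Doe', '2010']
```

Only one chunk plus the carried-over tail is held in memory at a time, which is what keeps this style of reader usable on large files.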
CSVs
# !cat file.csv
# 1,John,Doe,2010
# 2,Mary,Smith,2011
# 3,Peter,Jones,2012
> reader = AlphaReader(open('file.csv', 'rb'), encoding='cp1252', terminator=10, delimiter=44)
> next(reader)
> ['1','John','Doe','2010']
TSVs
# !cat file.tsv
# 1 John Doe 2010
# 2 Mary Smith 2011
# 3 Peter Jones 2012
> reader = AlphaReader(open('file.tsv', 'rb'), encoding='cp1252', terminator=10, delimiter=9)
> next(reader)
> ['1','John','Doe','2010']
XSVs
# !cat file.tsv
# 1¦John¦Doe¦2010
# 2¦Mary¦Smith¦2011
# 3¦Peter¦Jones¦2012
> ord('¦')
> 166
> chr(166)
> '¦'
> reader = AlphaReader(open('file.tsv', 'rb'), encoding='cp1252', terminator=10, delimiter=166)
> next(reader)
> ['1','John','Doe','2010']
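Since the delimiter is passed as a single byte value, it can help to check that the character really is one byte in the file's encoding before using it. The helper below is a hypothetical convenience, not part of AlphaReader; it relies only on Python's built-in `str.encode`.

```python
def delimiter_code(char, encoding="cp1252"):
    """Hypothetical helper: return the byte value of a single-byte delimiter."""
    encoded = char.encode(encoding)
    if len(encoded) != 1:
        raise ValueError(
            f"{char!r} is {len(encoded)} bytes in {encoding}; "
            "a single-byte delimiter is required"
        )
    return encoded[0]


print(delimiter_code("\u00a6"))  # broken bar in cp1252: 166
print(delimiter_code(","))       # 44
print(delimiter_code("\t"))      # 9
```

The same broken bar character encodes to two bytes in UTF-8 (`b'\xc2\xa6'`), so the helper raises `ValueError` when called with `encoding='utf-8'`.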
HDFS
# !hdfs dfs -cat /raw/tsv/file.tsv
# 1 John Doe 2010
# 2 Mary Smith 2011
# 3 Peter Jones 2012
> import pyarrow as pa
> fs = pa.hdfs.connect()
> reader = AlphaReader(fs.open('/raw/tsv/file.tsv', 'rb'), encoding='cp1252', terminator=10, delimiter=9)
> next(reader)
> ['1','John','Doe','2010']
Transformations
# !cat file.csv
# 1,2,3
# 10,20,30
# 100,200,300
> fn = lambda x: int(x)
> reader = AlphaReader(open('file.csv', 'rb'), encoding='cp1252', terminator=10, delimiter=44, fn_transform=fn)
> next(reader)
> [1,2,3]
> next(reader)
> [10,20,30]
Chain Transformations
# !cat file.csv
# 1,2,3
# 10,20,30
# 100,200,300
> fn_1 = lambda x: x+1
> fn_2 = lambda x: x*10
> reader = AlphaReader(open('file.csv', 'rb'), encoding='cp1252', terminator=10, delimiter=44, fn_transform=[int, fn_1, fn_2])
> next(reader)
> [20,30,40]
> next(reader)
> [110,210,310]
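One way to picture the chaining is that each field is passed through the functions from left to right. The sketch below reproduces the outputs shown above under that assumption; `apply_chain` is a hypothetical name, and AlphaReader's own implementation may differ.

```python
from functools import reduce


def apply_chain(row, fn_transform):
    """Hypothetical sketch: apply one function or a list of functions to each field."""
    if isinstance(fn_transform, (list, tuple)):
        fns = list(fn_transform)
    else:
        fns = [fn_transform]
    # Feed each field through the functions in order: fn_n(...fn_2(fn_1(field)))
    return [reduce(lambda value, fn: fn(value), fns, field) for field in row]


fn_1 = lambda x: x + 1
fn_2 = lambda x: x * 10
print(apply_chain(['1', '2', '3'], [int, fn_1, fn_2]))     # [20, 30, 40]
print(apply_chain(['10', '20', '30'], [int, fn_1, fn_2]))  # [110, 210, 310]
```

Order matters: `int` has to come first because the fields arrive as strings.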
Caution
> reader = AlphaReader(open('large_file.xsv', 'rb'), encoding='cp1252', terminator=172, delimiter=173)
> records = list(reader) # Avoid this: it loads the whole file into memory
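A lazier alternative is to consume the reader in bounded batches. The sketch below assumes only that the reader behaves like a plain Python iterator, as the `next(reader)` calls in this description suggest; `batches` is a hypothetical helper built on `itertools.islice`.

```python
from itertools import islice


def batches(rows, size):
    """Hypothetical helper: yield lists of at most `size` rows from any iterator."""
    rows = iter(rows)
    while True:
        batch = list(islice(rows, size))
        if not batch:
            return
        yield batch


# A small stand-in iterator; an AlphaReader instance could be passed instead.
fake_reader = iter([['1', 'a'], ['2', 'b'], ['3', 'c']])
for batch in batches(fake_reader, 2):
    print(len(batch))  # 2, then 1
```

Memory use is then bounded by the batch size rather than by the size of the file.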
Limitations
- No support for multi-byte delimiters
- Relatively slower performance than the csv library. Use csv and dialects when your files have \r\n terminators
- Transformations are per row; perhaps vectorization could aid performance
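For files with \r\n terminators, the standard-library route mentioned above might look like the following. The dialect name `PipeDialect` and the sample bytes are made up for illustration; the calls themselves (`csv.Dialect`, `csv.reader`, `io.TextIOWrapper`) are standard library.

```python
import csv
import io


class PipeDialect(csv.Dialect):
    """Illustrative dialect: pipe-delimited fields, CRLF line endings."""
    delimiter = "|"
    quotechar = '"'
    doublequote = True
    skipinitialspace = False
    lineterminator = "\r\n"
    quoting = csv.QUOTE_MINIMAL


raw = io.BytesIO(b"1|John|Doe|2010\r\n2|Mary|Smith|2011\r\n")
# newline="" lets the csv module handle the line endings itself
text = io.TextIOWrapper(raw, encoding="cp1252", newline="")
rows = list(csv.reader(text, dialect=PipeDialect))
print(rows[0])  # ['1', 'John', 'Doe', '2010']
```

For a real file, `open(path, encoding='cp1252', newline='')` would take the place of the `TextIOWrapper` line.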
Performance
- 24MB file loaded with list(AlphaReader(file_handle))
tests/test_profile.py::test_alphareader_with_encoding
--------------------------------------------------------------------------------- live log call
INFO root:test_profile.py:22 252343 function calls in 0.386 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
119605 0.039 0.000 0.386 0.000 .\alphareader\__init__.py:39(AlphaReader)
122228 0.266 0.000 0.266 0.000 {method 'split' of 'str' objects}
2625 0.005 0.000 0.054 0.000 {method 'decode' of 'bytes' objects}
2624 0.001 0.000 0.049 0.000 .\Python-3.7.4\lib\encodings\cp1252.py:14(decode)
2624 0.048 0.000 0.048 0.000 {built-in method _codecs.charmap_decode}
2625 0.027 0.000 0.027 0.000 {method 'read' of '_io.BufferedReader' objects}
1 0.000 0.000 0.000 0.000 .\__init__.py:5(_validate)
1 0.000 0.000 0.000 0.000 {built-in method _codecs.lookup}
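A profile in the same format as the log above can be produced with the standard-library `cProfile` and `pstats` modules. The sketch below is not the project's tests/test_profile.py; `profile_consume` is a hypothetical helper that works with any callable returning an iterator.

```python
import cProfile
import io
import pstats


def profile_consume(make_rows, top=10):
    """Hypothetical helper: profile the full consumption of an iterator."""
    profiler = cProfile.Profile()
    profiler.enable()
    count = sum(1 for _ in make_rows())  # exhaust the iterator without storing rows
    profiler.disable()
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(top)
    return count, out.getvalue()


# Stand-in iterator; something like lambda: AlphaReader(open(path, 'rb')) could be used instead.
count, report = profile_consume(lambda: iter([["1", "2"], ["3", "4"]]))
print(count)  # 2
```

The returned `report` string holds the table of calls sorted by cumulative time.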
File details
Details for the file alphareader-0.0.7.tar.gz.
File metadata
- Download URL: alphareader-0.0.7.tar.gz
- Upload date:
- Size: 4.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.43.0 CPython/3.7.4
File hashes
Algorithm | Hash digest
---|---
SHA256 | 60fbf96ba313a2492ad8aee83815c9fee9c700e422bf951a602dfc073a1cb726
MD5 | 92a9b3a8460b4f2bcfabe930dfb23c9a
BLAKE2b-256 | e362c94a65e19dae4522e83d4c622b039d2a7f45fcbf240f0762fa63744dcdae
File details
Details for the file alphareader-0.0.7-py3-none-any.whl.
File metadata
- Download URL: alphareader-0.0.7-py3-none-any.whl
- Upload date:
- Size: 15.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.43.0 CPython/3.7.4
File hashes
Algorithm | Hash digest
---|---
SHA256 | 26448aa8bcf46eebff4c9d720893796fdd404bd045b27d4379782eadfbee9e00
MD5 | c7075f4108e74264f1a5c95a9ed3a7c6
BLAKE2b-256 | 82b5fc0331ff678a662922451408cabd5ad1d741dfa99fd29bd9c2df4c5e0811