A Python library to convert COBOL EBCDIC files to Parquet format based on a copybook

Project description

pycobol2parquet

pycobol2parquet is a Python library that converts COBOL EBCDIC files to Parquet format.

I released pycobol2csv back in 2021, and it has been deployed to multiple production systems. One piece of feedback I received asked about converting COBOL files to Parquet directly for analytical workloads.

It is straightforward to reuse the same underlying knowledge and code to generate Parquet files.

Install the Python module:

pip install pycobol2parquet

To use the module:

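from pycobol2parquet import convert_cobol_file, decode_copybook_file

# Parse the copybook to get the record length and the field structure
row_length, cobol_struc = decode_copybook_file(copybook_file)

# Convert the EBCDIC data file to Parquet using the copybook layout and the given code page
convert_cobol_file(copybook_file, data_file, output_file, codepage, debug=False)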

Please refer to convert_cobol_test_main.py for details.
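
As a rough illustration of what a code page means for EBCDIC data, the sketch below decodes a few EBCDIC bytes with Python's built-in cp037 codec (EBCDIC US/Canada). It uses only the standard library and does not call pycobol2parquet. Whether the codepage argument above expects exactly this codec name is an assumption, so check convert_cobol_test_main.py for the values the library actually accepts.

```python
# Bytes for the text "HELLO 123" in EBCDIC code page 037.
# In EBCDIC, letters and digits have different byte values than in ASCII.
ebcdic_bytes = bytes([0xC8, 0xC5, 0xD3, 0xD3, 0xD6, 0x40, 0xF1, 0xF2, 0xF3])

# "cp037" is a codec shipped with Python; the name is used here only
# for illustration and is not taken from the pycobol2parquet documentation.
text = ebcdic_bytes.decode("cp037")
print(text)  # HELLO 123

# Encoding the text back with the same code page restores the original bytes.
round_trip = text.encode("cp037")
print(round_trip == ebcdic_bytes)  # True
```

Decoding the same bytes with the wrong code page can silently produce different characters, which is why the code page has to match the system that produced the data file.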

test

Two sets of test data have been created from scratch. Each set includes a copybook and an EBCDIC data file.

To test:

python convert_cobol_test_main.py --copybook testdata\test2\DWSTUB.txt --data testdata\test2\DWSTUB_DATA.DAT --output DWSTUB_DATA_output.parquet

known issues and limitations

  • Be aware of the resources available in your runtime environment, and make sure the COBOL file is not so large that it exceeds their limits or causes performance issues.

To handle large COBOL files, you can split them into smaller chunks and process the chunks in parallel. Please refer to the Medium post for details.

  • When creating Parquet files, the library detects data types automatically. This simplifies the parameters passed to the conversion function.
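
The chunking approach mentioned above for large files could look roughly like the sketch below. It is not taken from the Medium post or from the library: split_fixed_length_file is a hypothetical helper, and it assumes every record in the data file has exactly row_length bytes (the value returned by decode_copybook_file), with no record separators or variable-length records.

```python
import os
import tempfile


def split_fixed_length_file(data_path, row_length, rows_per_chunk, out_dir):
    """Split a fixed-length-record file into chunk files on record boundaries.

    Hypothetical helper for illustration; assumes every record is exactly
    row_length bytes long.
    """
    chunk_bytes = row_length * rows_per_chunk
    chunk_paths = []
    with open(data_path, "rb") as src:
        index = 0
        while True:
            block = src.read(chunk_bytes)
            if not block:
                break  # end of file reached
            chunk_path = os.path.join(out_dir, f"chunk_{index:05d}.dat")
            with open(chunk_path, "wb") as dst:
                dst.write(block)
            chunk_paths.append(chunk_path)
            index += 1
    return chunk_paths


# Small demonstration with made-up data: 10 records of 4 bytes each.
demo_dir = tempfile.mkdtemp()
demo_file = os.path.join(demo_dir, "demo.dat")
with open(demo_file, "wb") as f:
    f.write(b"ABCD" * 10)

chunks = split_fixed_length_file(demo_file, row_length=4, rows_per_chunk=3, out_dir=demo_dir)
print([os.path.getsize(p) for p in chunks])  # [12, 12, 12, 4]
```

Each chunk file could then be passed to convert_cobol_file with the same copybook, for example from separate worker processes, producing one Parquet file per chunk.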


Source Distributions

No source distribution files available for this release.

Built Distribution

pycobol2parquet-0.0.3-py3-none-any.whl (9.1 kB)
