File Genie parses various file types and transforms them according to the provided configuration.
FileGenie
FileGenie is a Python library that simplifies parsing files from AWS S3 across various formats (e.g. TEXT, CSV, EXCEL, ZIP, XML, and PDF). It also lets users define custom functions for data massaging and transformation, so that processing and output generation follow the provided configuration.
Features
- Multi-format Support: Effortlessly parse files in formats such as TEXT, CSV, EXCEL, ZIP, XML, and PDF directly from AWS S3.
- Flexible Response Types: Generate responses tailored to user needs, including DATAFRAME, JSON, or FILE outputs.
- Password-Protected Files: Seamlessly parse files secured with passwords.
- Custom Edge Case Handling: Apply user-defined functions to address specific data massaging and transformation requirements, such as sanitizing data, converting values, or reformatting date fields.
- AWS S3 Integration: Fetch files directly from AWS S3 buckets using IAM roles for secure access.
- Streamlined Configuration: Set up easily with minimal configuration, eliminating the need to write a parser for each file type.
Installation
Install the SDK using pip:
pip install file_genie
Prerequisites
- Your application should be deployed on AWS EKS to enable the SDK to utilize AWS S3 credentials.
- Python: >= 3.6
- Pandas: 2.0.0
Getting Started
Define Custom Edge Cases: If you need to sanitize columns during file parsing (e.g. standardise column values to a common format before applying custom logic), you can define custom functions for the SDK to use.
To implement this:
- Create an edgeCases folder in your project.
- Add a file named user_edge_cases.py.
- Define your custom functions in this file.
- Reference these functions in the edge_case section of the file_config.
- The SDK will automatically import and apply these functions during file parsing or transformation.
from edgeCases import user_edge_cases
self.edge_cases = user_edge_cases
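As a rough illustration of the lookup described above, the sketch below resolves each name in an `edge_case` mapping to a function with `getattr` and applies it. The `(rows, param)` signature, the list-of-dicts row format, the divide-by-100 conversion, and the `apply_edge_cases` helper are all assumptions made for this example; the SDK's real calling convention is not documented here. A `SimpleNamespace` stands in for the `user_edge_cases` module so the snippet runs on its own.

```python
from types import SimpleNamespace


def convert_amount_as_per_currency(rows, column):
    """Hypothetical edge case: convert minor currency units to major units."""
    for row in rows:
        row[column] = float(row[column]) / 100
    return rows


# Stand-in for `from edgeCases import user_edge_cases`.
user_edge_cases = SimpleNamespace(
    convert_amount_as_per_currency=convert_amount_as_per_currency
)


def apply_edge_cases(rows, edge_case_config, edge_cases_module):
    """Look up each configured function by name and apply it in order."""
    for func_name, param in edge_case_config.items():
        func = getattr(edge_cases_module, func_name, None)
        if func is None:
            raise AttributeError(f"edge case function not found: {func_name}")
        rows = func(rows, param)
    return rows


rows = [{"Amount": "1250"}, {"Amount": "99"}]
result = apply_edge_cases(
    rows, {"convert_amount_as_per_currency": "Amount"}, user_edge_cases
)
print(result)
```

Failing loudly on an unknown function name makes a typo in the `edge_case` section visible immediately instead of silently skipping the transformation.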
Define the configuration required for the file parsing logic, along with the S3 bucket names:
s3_config = {
    "upload_bucket": "s3_bucket_name",    # replace with your bucket name
    "download_bucket": "s3_bucket_name",  # replace with your bucket name
}
file_config = {
    "file_source_1": {
        "read_from_s3_func": "read_complete_excel_file",
        "parameters_for_read_s3": None,
        "file_dtype": {
            "Order_Number": str,
            "Added On": str,
            "Added By": str,
        },
        "columns_mapping": {
            # "Column name in file": "Column name required in output"
            "Transaction Type": "TransactionType",
            "Cust Name": "CustomerName",
            "Cust ID": "CustomerId",
            "Transaction Amount": "Amount",
            "OrderNumber": "TransactionReference",
            "Reference ID": "CustomerReferenceId",
            "Target Date": "TargetDate",
            "TransactionDate": "TransactionDate",
            "FeeAmount": "ServiceCharge",
            "TaxAmount": "ServiceTax",
            "NetAmount": "NetAmount",
        },
        "edge_case": {
            # "<function name defined in user_edge_cases.py>": <params for that function>
            # Params can be of different types, e.g. dict, list, or str.
            # Here convert_amount_as_per_currency is the edge case function applied
            # while transforming the entries, and "Amount" is the column on which
            # the currency conversion is performed.
            "convert_amount_as_per_currency": "Amount",
        },
    },
}
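To make the direction of `columns_mapping` concrete (file column name on the left, output column name on the right), here is a minimal sketch that renames the keys of a single record. The `rename_columns` helper and the sample record are hypothetical, and keeping unmapped columns unchanged is an assumption; the SDK may treat them differently.

```python
def rename_columns(record, columns_mapping):
    """Return a copy of record with mapped keys renamed; unmapped keys are kept."""
    return {columns_mapping.get(key, key): value for key, value in record.items()}


columns_mapping = {
    "Cust Name": "CustomerName",
    "Transaction Amount": "Amount",
}

# Hypothetical row as it might appear in the source file.
record = {"Cust Name": "Asha", "Transaction Amount": "1250", "Remarks": "ok"}

renamed = rename_columns(record, columns_mapping)
print(renamed)
```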
read_from_s3_func: This field in the FileGenie configuration specifies the function used to parse a specific file type from AWS S3. Depending on the file format, choose one of the following functions:
- readFromS3 - parses TXT, EXCEL, CSV, XML, and PDF files.
- readZipFromS3 - parses ZIP files.
- read_complete_excel_file - parses EXCEL files that contain multiple sheets.
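The choice between these three functions can be sketched as a small helper that picks a value for `read_from_s3_func` from the object key. The helper name `pick_read_func`, the `multi_sheet` flag, and the list of file extensions are assumptions for illustration; only the three function names come from the list above.

```python
from pathlib import PurePosixPath

# Extensions assumed to correspond to the formats handled by readFromS3.
GENERIC_SUFFIXES = {".txt", ".csv", ".xlsx", ".xls", ".xml", ".pdf"}
EXCEL_SUFFIXES = {".xlsx", ".xls"}


def pick_read_func(key, multi_sheet=False):
    """Return the read_from_s3_func name suited to the given S3 object key."""
    suffix = PurePosixPath(key).suffix.lower()
    if suffix == ".zip":
        return "readZipFromS3"
    if suffix in EXCEL_SUFFIXES and multi_sheet:
        return "read_complete_excel_file"
    if suffix in GENERIC_SUFFIXES:
        return "readFromS3"
    raise ValueError(f"unsupported file type: {suffix or key}")


print(pick_read_func("reports/2024/orders.CSV"))
print(pick_read_func("reports/archive.zip"))
print(pick_read_func("reports/summary.xlsx", multi_sheet=True))
```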
parameters_for_read_s3: This field in the FileGenie configuration specifies additional parameters for reading the file, such as password_protected, sep, or skiprows. The following parameters are available:
- password_protected: Whether the file is password protected.
- passowrd_secret_key: Secret key name for password.
- skiprows: Rows to skip at the start.
- sep: Delimiter for CSV parsing.
- header: Row number(s) to use as column names.
- has_header: Specify if the file has a header.
- skip_header: Skip the header row during processing.
- sheet_name: Target sheet in an Excel file.
- parser_func: Custom parser function.
- chunksize: Number of rows to read per chunk.
- skip_footer: Rows to skip at the end.
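Because these parameters are passed as plain dictionary keys, a misspelled name is easy to miss. The sketch below checks a `parameters_for_read_s3` dictionary against the names listed above, spelled exactly as they appear in that list. The `unknown_params` helper and the sample values are hypothetical, and the value types shown are assumptions; the SDK itself may or may not perform such a check.

```python
# Parameter names copied from the list above, spelled as documented.
KNOWN_PARAMS = {
    "password_protected",
    "passowrd_secret_key",
    "skiprows",
    "sep",
    "header",
    "has_header",
    "skip_header",
    "sheet_name",
    "parser_func",
    "chunksize",
    "skip_footer",
}


def unknown_params(parameters):
    """Return the sorted parameter names that are not in the documented list."""
    if parameters is None:
        return []
    return sorted(set(parameters) - KNOWN_PARAMS)


# Hypothetical settings for a pipe-delimited file; "seperator" is a deliberate typo.
parameters_for_read_s3 = {
    "sep": "|",
    "skiprows": 2,
    "skip_footer": 1,
    "seperator": ",",
}

print(unknown_params(parameters_for_read_s3))
```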
Import and initialise FileGenie:
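from file_genie import FileGenie
# ParsedDataResponseType is used below; its import is not shown in this snippet.
file_genie = FileGenie(config={"s3_config": s3_config, "file_config": file_config})
file_source = "file_source_1"  # key of the matching entry in file_config
# By default the SDK returns the response as a DATAFRAME.
parsed_data = file_genie.parse("s3://your-bucket-name/path/to/your/file.csv", file_source, ParsedDataResponseType.DATAFRAME.value)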
File details
Details for the file file_genie-0.0.4.tar.gz.
File metadata
- Download URL: file_genie-0.0.4.tar.gz
- Upload date:
- Size: 15.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.9.20
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4f52b5c23041043bc0d3ee1bc4a51067f7f74f312a74f312c49664af9baf5a9b |
| MD5 | 64f85e7d4cb68fdfeb635c890a4e513d |
| BLAKE2b-256 | 2bc242816fd4b42457a7ecda363389f59d05d5e780641af8f760897457f99d5b |
File details
Details for the file file_genie-0.0.4-py3-none-any.whl.
File metadata
- Download URL: file_genie-0.0.4-py3-none-any.whl
- Upload date:
- Size: 15.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.9.20
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d4e839dfa286b7fec56ee1a677fb0f5f2b93db9877bc885e2153d18b38570e98 |
| MD5 | c80771f803478c5cd509a5eb9a51c3f2 |
| BLAKE2b-256 | c9a60babeec729d455ec84ae59a65a0fdc319d345611c9c62381781c6fa2a800 |