ABN Amro technical assignment package
Project description
Project Title
Overview
This project provides a framework for reading, transforming, and writing CSV datasets using PySpark. It reads data from multiple CSV files, applies a selected transformation, and writes the resulting data to a target location.
The user specifies the input datasets, the transformation to apply, and the output file path via command-line arguments. The project uses a flexible and modular design, allowing for multiple transformations.
Features
- Read CSV files into PySpark DataFrames.
- Perform a series of predefined transformations on the data.
- Write the transformed data to a specified output CSV file.
- Easily configurable via command-line arguments.
- Modular and extendable for additional transformations.
Requirements
- Python 3.x
- PySpark
- Pydantic (for model validation)
- argparse (for argument parsing)
Project Structure
.
|-- README.md
|-- assignment_files
| |-- dataset_one.csv
| |-- dataset_three.csv
| |-- dataset_two.csv
| `-- exercise.md
|-- framework
| |-- __init__.py
| |-- __pycache__
| | `-- __init__.cpython-310.pyc
| |-- base
| | |-- __init__.py
| | |-- __pycache__
| | | |-- __init__.cpython-310.pyc
| | | `-- base.cpython-310.pyc
| | `-- base.py
| |-- reader
| | |-- __init__.py
| | |-- __pycache__
| | | |-- __init__.cpython-310.pyc
| | | `-- csv_reader.cpython-310.pyc
| | `-- csv_reader.py
| |-- transform
| | |-- __init__.py
| | |-- __pycache__
| | | |-- __init__.cpython-310.pyc
| | | |-- transform1.cpython-310.pyc
| | | |-- transform2.cpython-310.pyc
| | | |-- transform3.cpython-310.pyc
| | | |-- transform4.cpython-310.pyc
| | | |-- transform5.cpython-310.pyc
| | | |-- transform6.cpython-310.pyc
| | | |-- transform7.cpython-310.pyc
| | | `-- transform8.cpython-310.pyc
| | |-- transform1.py
| | |-- transform2.py
| | |-- transform3.py
| | |-- transform4.py
| | |-- transform5.py
| | |-- transform6.py
| | |-- transform7.py
| | `-- transform8.py
| `-- writer
| |-- __init__.py
| |-- __pycache__
| | |-- __init__.cpython-310.pyc
| | `-- csv_writer.cpython-310.pyc
| `-- csv_writer.py
|-- main.py
|-- output
| |-- best_salesperson
| | |-- _SUCCESS
| | `-- part-00000-edb1c614-a0f5-46f2-a86a-5345f7e81d64-c000.csv
| |-- department_breakdown
| | |-- _SUCCESS
| | `-- part-00000-dabece4d-5e3f-48e7-827b-d1003546d134-c000.csv
| |-- it_data
| | |-- _SUCCESS
| | `-- part-00000-b0f7ec6d-ec64-457c-901b-b50991b7c774-c000.csv
| |-- marketing_address_info
| | |-- _SUCCESS
| | `-- part-00000-d04799be-d17b-4a2e-aeb7-1019aba578b4-c000.csv
| |-- most_order_age_group
| | |-- _SUCCESS
| | `-- part-00000-8095a6c4-cfbd-4498-abbb-b1d4c88f7671-c000.csv
| |-- top_3
| | |-- _SUCCESS
| | `-- part-00000-8eab2190-8b5a-416b-816a-fc89f226e4ae-c000.csv
| |-- top_3_most_order_company_by_dept
| | |-- _SUCCESS
| | `-- part-00000-ef18c3bb-fa4e-4013-a40a-7860dfdd5f31-c000.csv
| `-- top_3_most_sold_per_department_netherlands
| |-- _SUCCESS
| `-- part-00000-95a49831-4245-4e44-b490-7996d3c6844d-c000.csv
|-- requirements.txt
|-- setup.cfg
|-- setup.py
|-- test
| |-- __init__.py
| `-- transform
| |-- __init__.py
| |-- test_transform1.py
| |-- test_transform2.py
| |-- test_transform3.py
| |-- test_transform4.py
| |-- test_transform5.py
| `-- test_transform6.py
Installation
Run the following command to install the package
pip install abn-amro-assessment-2024
Usage
python3 main.py --dataset_one_path <file1 full_path>
--dataset_two_path <file2 full_path>
--dataset_three_path <file3 full_path>
--transform_name <name of transformation>
--target_path <target full_path>
For example:
python3 main.py --dataset_one_path /assignment_files/dataset_one.csv
--dataset_two_path /assignment_files/dataset_two.csv
--dataset_three_path /assignment_files/dataset_three.csv
--transform_name Transform1
--target_path output/it_data
What is Implemented?
Output #1 - IT Data
Output #2 - Marketing Address Information
Output #3 - Department Breakdown
Output #4 - Top 3 best performers per department
Output #5 - Top 3 most sold products per department in the Netherlands
Output #6 - Who is the best overall salesperson per country
Additional Scenarios: Extra Bonus
Extra Bonus: Output #7 - Which age group of recipient has the highest number of quantity ordered per department
- The output directory should be called most_order_age_group and you must use PySpark to save only to one CSV file.
Extra Bonus: Output #8 - Top 3 companies those have placed the most order quantity per department
- The output directory should be called top_3_most_order_company_by_dept and you must use PySpark to save only to one CSV file.
Testing
Unit Test Cases are added inside test/transform utilizing chispa Pyspark Test Helper.
Build and Release
This application have an automated build pipeline using GitHub Actions that releases PyPI package to PyPi repository
Contact
Release details can be found here: https://pypi.org/project/abn-amro-assessment-2024/
For any questions, feel free to contact me at sumanta.dutta2012@gmail.com.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file abn_amro_assessment_2024-0.0.5.tar.gz
.
File metadata
- Download URL: abn_amro_assessment_2024-0.0.5.tar.gz
- Upload date:
- Size: 8.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.7
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 79fc7dc0264bba952ed5e658508bd6fdd6f6f6b472b4066dddc8e2f4053fbaa8 |
|
MD5 | 4c4fb65f5caacece7f3e6815510ded1f |
|
BLAKE2b-256 | 6be4ed4784cd54cc338f6de6ebfff8e3c70680e97b4911737af74cea05edf1ec |
File details
Details for the file abn_amro_assessment_2024-0.0.5-py3-none-any.whl
.
File metadata
- Download URL: abn_amro_assessment_2024-0.0.5-py3-none-any.whl
- Upload date:
- Size: 15.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.7
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 5c745d74242eb0f730786efc9cd2f686b79c24f26ef288bec02cf52e47436b9f |
|
MD5 | 231d085af99ded4cec2c8a2d8a3ed426 |
|
BLAKE2b-256 | 30dbfdd4bab2a0af173cb8c1a769aaabfd6682f67eb904d2ab0e6994d6b12db4 |