This package calculates average student performances
bigDataSML
This project was created as an assignment in BigDataProgramming in November 2021 at DHBW Stuttgart. The abbreviation SML comes from my first name, "Samuel". To test its main functionality locally, go into the "src" folder and run
python3 main.py
This prints the output table and creates an output CSV. If you want to run it again, you first need to delete this CSV! Otherwise the write will fail with an error.
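Instead of deleting the old result by hand, a small cleanup sketch like the following could be run first (it assumes the result lands in an "output" folder in the working directory, as described in the CI section below):

```python
import shutil
from pathlib import Path

# main.py writes its result into an "output" folder (path assumed here);
# Spark's CSV writer fails if that folder already exists, so remove any
# previous result before re-running.
output_dir = Path("output")
if output_dir.exists():
    shutil.rmtree(output_dir)
```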
There is also a GitHub-Repository for this project:
https://git.dhbw-stuttgart.de/wi20067/bigDataSML
Enjoy this package and the functionality it contains! (:D it's more about Spark, creating packages and using GitLab CI ...)
Functionality Description of main.py
The following part describes the simple functionality of "main.py".
Data Source:
The data used in this project comes from the data-science portal Kaggle. It was downloaded from there and stored as CSV files in the "/data" folder. https://www.kaggle.com/spscientist/students-performance-in-exams
Project Goal:
The goal of the project is to use the Spark skills learned to build a table of average scores per ethnic group, first overall and then also split by gender. This makes it possible to analyse differences between the genders and the ethnic groups and to identify the best- and worst-performing groups.
Data Description:
The basis is a 1002 x 8 table in which each row represents one student. The CSV file provides information on gender, ethnic group, the parents' level of education, and so on, and above all a math score, a reading score and a writing score. For later use and easier comparison, the program first computes the average of the math, reading and writing scores in a new column. It then creates two new tables split by gender and, for all three tables (female, male and all), computes the average score over all students per ethnic group (grouping is required first!). The computed results are combined with the original DataFrame and saved to a new CSV file.
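The pipeline just described can be sketched in plain Python on a tiny inline sample (the project itself uses PySpark DataFrames; the column names here are simplified stand-ins for the Kaggle columns):

```python
from statistics import mean
from collections import defaultdict

# Tiny inline sample mirroring the Kaggle columns (gender, ethnic group,
# math/reading/writing scores); the real dataset has about 1000 rows.
students = [
    {"gender": "female", "group": "group E", "math": 80, "reading": 90, "writing": 85},
    {"gender": "male",   "group": "group E", "math": 70, "reading": 72, "writing": 68},
    {"gender": "female", "group": "group A", "math": 60, "reading": 65, "writing": 70},
]

# Step 1: per-student average of the three scores (the new column).
for s in students:
    s["average"] = mean([s["math"], s["reading"], s["writing"]])

# Step 2: group by ethnic group and average again, overall and per gender.
def group_average(rows):
    by_group = defaultdict(list)
    for row in rows:
        by_group[row["group"]].append(row["average"])
    return {g: mean(v) for g, v in by_group.items()}

overall = group_average(students)
female = group_average([s for s in students if s["gender"] == "female"])
male = group_average([s for s in students if s["gender"] == "male"])
```

In Spark the same two steps would be a `withColumn` followed by a `groupBy(...).avg(...)` per table.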
Result:
The first result of the analysis is that, across all ethnic groups, girls achieved better results than boys. Group E always has the best results, and so the girls from ethnic group E achieved the best overall average score.
Output-Table:
Ethnic Group | Average Score | Average Score - Female | Average Score - Male |
---|---|---|---|
group E | 72.75238095238097 | 74.06280193236712 | 71.47887323943662 |
group D | 69.17938931297705 | 71.43927648578813 | 66.98746867167922 |
group C | 67.13166144200628 | 68.58518518518518 | 65.24940047961628 |
group B | 65.46842105263156 | 67.50961538461539 | 63.00000000000001 |
group A | 62.992509363295866 | 65.12962962962963 | 61.54088050314464 |
How to package everything and upload it to PyPI
This project is also uploaded as a pip-package. You can find it at:
https://pypi.org/project/bigDataSML/0.1.3/
From a functionality standpoint this doesn't make much sense, but the point was learning how to do it. The steps are described below ...
I. Create your file structure
- I would always recommend creating projects like this inside a Git repo. Look for a tutorial if you don't know how.
- Create your main.py. I created mine inside the "/src" folder. It may contain whatever functionality you want.
- Create an "__init__.py" (two underscores at the beginning and at the end). Here you can re-export the functions from "main.py". In my case this is just:
from main import main
- Create a "setup.py". This will contain all the information about the package and its requirements. You may use mine as a template.
from setuptools import setup, find_packages
import os

VERSION = '0.0.1'
DESCRIPTION = 'This package calculates average student performances'

this_directory = os.path.abspath(os.path.dirname(__file__))
with open(os.path.join(this_directory, "doc", "ReadMe.md"), encoding="utf-8") as f:
    long_description = f.read()

# Setting up
setup(
    name="bigDataSML",
    version=VERSION,
    author="SML (Samuel Schlenker)",
    author_email="wi20067@lehre.dhbw-stuttgart.de",
    description=DESCRIPTION,
    url="https://github.com/Samu2021/bigDataSML",
    long_description_content_type="text/markdown",
    long_description=long_description,
    python_requires=">=2.7.18",
    packages=find_packages(),
    install_requires=["pyspark >= 2.3.0"],
    keywords=["python", "spark", "pyspark", "student", "performance",
              "calculation", "bigdata", "programming", "fun", "sml"],
    classifiers=[
        "Development Status :: 4 - Beta",
        "Intended Audience :: Education",
        "Programming Language :: Python :: 2.7",
        "Natural Language :: English",
        "Natural Language :: German",
        "Operating System :: Unix",
        "Operating System :: MacOS :: MacOS X",
        "Operating System :: Microsoft :: Windows",
    ]
)
- It might also make sense to add a "ReadMe" and a "License". I created them inside the "/doc" folder.
II. Registration at PyPI
Just create a normal account at:
And remember your credentials; you will need them later ...
III. Create Package and upload to PyPI
- If everything is fine, run setup.py, which creates all the necessary distribution files:
python3 setup.py sdist bdist_wheel
- If you haven't already, install twine first:
pip3 install twine
- Then you should be able to upload the content of the "dist" folder with the twine package:
python3 -m twine upload dist/*
This will ask for your username and password; use the login credentials created above.
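Being prompted on every upload can be avoided with a ~/.pypirc file; a minimal sketch, using a PyPI API token instead of a plain password (the token value is a placeholder):

```ini
[distutils]
index-servers =
    pypi

[pypi]
username = __token__
password = pypi-YOUR-API-TOKEN-HERE
```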
Further information:
For any questions regarding the distribution and installation of Python packages, there is great documentation:
Gitlab-CI
Additionally, I created a GitLab CI pipeline. This, too, was for training purposes and isn't used in the most sensible way; I just wanted to try out and show how this great tool can be used. For this I had to create the ".gitlab-ci.yml" file.
- Inside it, I first specified an image. As my files are in Python, I used Python; this way a Docker container with Python will be used.
- The "stages" section is an overview of all the steps that GitLab will execute.
- before_script: to check that Python works, this section just prints the Python version that will be used.
- The build stage in my case creates the files that will be uploaded to PyPI. The files in the "dist" folder are needed later, so I declared them as artifacts.
- In "run" I first install pyspark via pip and then execute my "main.py". Here the "output" folder is the artifact.
- In a last "test" stage I just check for the existence of the output file of "main.py"; further tests would be possible and would also make sense, but are in my view beyond the scope of this work.
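The file itself is not reproduced here, so the following is only a minimal sketch of what a .gitlab-ci.yml along these lines could look like (the stage names and paths are assumptions based on the description, not the exact file):

```yaml
image: python:3.9

stages:
  - build
  - run
  - test

before_script:
  - python --version

build:
  stage: build
  script:
    - pip install setuptools wheel
    - python setup.py sdist bdist_wheel
  artifacts:
    paths:
      - dist/

run:
  stage: run
  script:
    - pip install pyspark
    - python src/main.py
  artifacts:
    paths:
      - output/

test:
  stage: test
  script:
    - test -e output
```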
This way, many things can be automated and made much easier. For such a small project the pipeline is obviously overkill.