A library of tools that I use to manage files, clean datasets and do exploratory data analysis
DSToolkit - utilities for better analytics projects
General Info
This library is a set of tools for managing files, cleaning data and doing exploratory data analysis.
This all started because I found myself creating lots and lots of versions of data files in various states of completeness. I would scrape some data, write it to a file (in the data/raw folder), then work on it some and save it to the data/processed folder. After a few iterations, I couldn't remember whether data/raw/scraped_page1.csv or data/raw/scraped_page101.csv was the latest. So I started naming the files with a timestamp suffix, for example scraped_page_01011850.csv for a file created on Jan 1 at 6:50pm. That meant I needed a utility to create the timestamps and then get the latest version of a file. I copied this code so often that I decided to use it as a way to learn about creating real Python projects, GitHub hooks, Visual Studio Code, Docker containers and more.
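The timestamp naming scheme described above can be sketched with the standard library alone. This is only an illustration, not the library's own code: the helper name `timestamped_name` and its parameters are hypothetical, and the `MMDDHHMM` layout is inferred from the `scraped_page_01011850.csv` example.

```python
from datetime import datetime
from typing import Optional


def timestamped_name(stem: str, suffix: str = ".csv",
                     when: Optional[datetime] = None) -> str:
    """Hypothetical helper: append an MMDDHHMM timestamp to a file stem."""
    # Fall back to the current time when no explicit moment is given.
    when = when or datetime.now()
    return f"{stem}_{when.strftime('%m%d%H%M')}{suffix}"


# A file created on Jan 1 at 6:50pm (the year is not part of the name).
example = timestamped_name("scraped_page", when=datetime(2020, 1, 1, 18, 50))
print(example)  # scraped_page_01011850.csv
```

Because the year is not encoded, names produced this way only sort chronologically within a single calendar year.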
Technologies
- Python 3.7
- Pandas
Usage
pip install -U mlderes.dstoolkit
In your module:
from mlderes.dstoolkit import get_latest_data_filename, DataFolder, make_ts_filename, write_data
data_folder = DataFolder('./data') # root data folder
DATA_RAW = data_folder.RAW
DATA_EXTERNAL = data_folder/'external'
# Get the path of the latest file matching foo* in the ./data/raw directory
fp = get_latest_data_filename(DATA_RAW, 'foo')
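To make the idea of "the latest file like foo*" concrete, here is a minimal standard-library sketch of one way such a lookup could work. It is an assumption about the approach, not the implementation behind `get_latest_data_filename`: the function `latest_by_timestamp` and the `TS_PATTERN` regex are hypothetical, and they assume file names ending in an eight-digit `MMDDHHMM` suffix.

```python
import re
import tempfile
from pathlib import Path
from typing import Optional

# Assumed naming convention: <prefix>..._<MMDDHHMM>.<ext>
TS_PATTERN = re.compile(r"_(\d{8})$")


def latest_by_timestamp(folder: Path, prefix: str) -> Optional[Path]:
    """Hypothetical lookup: return the matching file with the largest timestamp."""
    candidates = []
    for path in Path(folder).glob(f"{prefix}*"):
        match = TS_PATTERN.search(path.stem)
        if match:
            candidates.append((match.group(1), path))
    if not candidates:
        return None
    # Fixed-width digit strings compare correctly as plain strings.
    return max(candidates, key=lambda pair: pair[0])[1]


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp)
    for name in ("foo_01011850.csv", "foo_01021200.csv", "bar_01030900.csv"):
        (raw / name).touch()
    latest_name = latest_by_timestamp(raw, "foo").name
    missing = latest_by_timestamp(raw, "nothing")

print(latest_name)  # foo_01021200.csv
```

Parsing the timestamp out of the name, rather than relying on file modification times, keeps the result stable when files are copied or checked out again.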
Contributions
This project was developed using Visual Studio Code and leverages the platform's support for developing in containers. If you have Docker Desktop installed, you should be able to fork this repo, clone it locally and open the folder in a container. All the dependencies are there: nothing to install, and no need to worry about specific library versions or creating venvs on your machine. Heck, you don't even need Python installed!
Contributions to documentation and utilities, as well as issue reports, are welcome. All pull requests must include unit tests, and all existing tests must pass before a pull request is considered.
Todo
- Generate documentation as part of the build
- Add more samples to documentation
License
This work is licensed under the GPL, which guarantees end users the freedom to study, share, and modify the software for their own use.
Download files
Source Distribution
File details
Details for the file mlderes.dstoolkit-0.2.5.tar.gz.
File metadata
- Download URL: mlderes.dstoolkit-0.2.5.tar.gz
- Upload date:
- Size: 10.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/47.1.1 requests-toolbelt/0.9.1 tqdm/4.48.0 CPython/3.8.3
File hashes
Algorithm | Hash digest
---|---
SHA256 | 50104d245eb1011f87558d659ce10aeda51838448e77967ae727456f44fa9198
MD5 | 4f869294c9c14ba036e6d8680b5dfc6b
BLAKE2b-256 | 18c65998c8b2a0ea55ab3cd01493038689b783ec6897f7c380d4e86174526d4b