Data Science Toolkit (DST) is a Python library that helps implement data science projects with ease.
Data Science Toolkit (DST)
Data Science Toolkit (DST) is a Python library that helps implement data science projects with ease: from data ingestion and preprocessing to modeling, geospatial analysis, computer vision, text vectorization, and reinforcement learning.
It bundles practical, production-friendly utilities and higher-level abstractions so you can move faster while keeping control over the details.
Key Features
- Data handling: DataFrame for loading CSV/JSON/Excel/Parquet, cleaning, transforming, and streaming large datasets.
- Modeling: Model for traditional ML and deep learning training, cross-validation, metrics, and GPU helpers.
- Text & NLP: Vectorizer for bag-of-words/TF-IDF, tokenization, cosine similarity, and projections.
- Charts: Chart utilities for quick exploratory visuals with Matplotlib/Seaborn/Plotly.
- GIS: GIS for geospatial data layers, joins, CRS transforms, area/perimeter, and exports.
- Computer Vision: ImageFactory for resizing, cropping, contour detection, blending, and basic filters.
- Reinforcement Learning: Environment and R3 tools to explore policies and custom environments.
- Crop Simulation: CSM modules for crop water requirement, ET simulations, and monitoring pipelines.
- Utilities: Lib with climate, math, text processing, IO helpers, and more.
Installation
DST is published as data-science-toolkit.
pip install data-science-toolkit
If you’re installing from source (for development):
git clone https://github.com/elhachimi-ch/dst.git
cd dst
pip install -e .
Notes:
- Requires Python 3.5+.
- Some features (e.g., deep learning, GIS, CV) pull heavier dependencies (TensorFlow, CatBoost, OpenCV, Geo stack). Install times may vary.
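Since the heavier dependencies may not all be present in a given environment, it can help to check what is importable before calling the corresponding features. The sketch below uses only the standard library. The import names it lists (tensorflow, catboost, cv2, geopandas) are assumptions about which packages back those features, and `available_modules` is a hypothetical helper, not part of DST.

```python
from importlib.util import find_spec

# Assumed import names for the heavier dependencies mentioned above;
# adjust them to whatever your DST version actually requires.
OPTIONAL_MODULES = ["tensorflow", "catboost", "cv2", "geopandas"]


def available_modules(names):
    """Map each top-level module name to True if it can be imported."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, present in available_modules(OPTIONAL_MODULES).items():
        print(f"{name}: {'installed' if present else 'missing'}")
```

`find_spec` locates a module without importing it, so the check stays fast even when TensorFlow is installed.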
Quickstart
from data_science_toolkit.dataframe import DataFrame
from data_science_toolkit.model import Model
# Load a toy dataset
data = DataFrame()
data.load_dataset('iris')
y = data.get_column('target')
data.drop_column('target')
# Fit a decision tree
model = Model(data_x=data.get_dataframe(), data_y=y, model_type='dt', training_percent=0.8)
model.train()
model.report() # classification metrics
model.cross_validation(5)
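For readers unfamiliar with the last call: k-fold cross-validation splits the rows into k folds and holds each fold out once for evaluation. The following is a generic, library-independent sketch of how fold indices can be formed. `k_fold_indices` is a hypothetical helper and does not describe how `Model.cross_validation` works internally.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread the remainder over the first folds so sizes differ by at most 1.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# The classic iris dataset has 150 rows, so 5 folds hold out 30 rows each.
folds = list(k_fold_indices(150, 5))
print([len(test) for _, test in folds])
```

In practice the rows are usually shuffled (or stratified by class) before splitting, which this sketch leaves out for brevity.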
Work with Parquet (large data)
from data_science_toolkit.dataframe import DataFrame
# Stream a Parquet dataset efficiently
df = DataFrame(data_path="path/to/parquet/dir", data_type="parquet", n_workers="auto")
summary = df.describe() # computes per-column stats without loading entire data into RAM
print(summary)
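One common way to compute per-column statistics without holding the whole dataset in memory is to update running aggregates chunk by chunk. The sketch below shows that idea with Welford's online algorithm in plain Python. How DST computes `describe()` is not shown in this README, and `RunningStats` and `describe_chunks` are hypothetical names used for illustration only.

```python
import math


class RunningStats:
    """Accumulate count, mean, and variance one value at a time (Welford)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, value):
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def std(self):
        # Sample standard deviation; undefined for fewer than two values.
        if self.count < 2:
            return float("nan")
        return math.sqrt(self._m2 / (self.count - 1))


def describe_chunks(chunks):
    """Fold an iterable of numeric chunks into a single RunningStats."""
    stats = RunningStats()
    for chunk in chunks:
        for value in chunk:
            stats.update(value)
    return stats


# Chunks stand in for row groups or files read one at a time.
stats = describe_chunks([[1.0, 2.0], [3.0], [4.0, 5.0]])
print(stats.count, stats.mean, stats.std)
```

Only three numbers per column are kept between chunks, so memory use does not grow with the number of rows.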
Text Vectorization
from data_science_toolkit.vectorizer import Vectorizer
documents = [
    "data science is fun",
    "toolkits help data workflows",
    "science advances with good tools",
]
vec = Vectorizer(documents_as_list=documents, vectorizer_type='tfidf', ngram_tuple=(1,2))
matrix = vec.get_matrix()
features = vec.get_features_names()
print(len(features), features[:10])
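To make the TF-IDF and cosine-similarity terms concrete, here is a minimal pure-Python sketch over the same three documents with unigrams and bigrams. It uses the textbook weighting (raw count times log(N / df)). That is an assumption for illustration: this README does not state which formula `Vectorizer` applies, and implementations such as scikit-learn's `TfidfVectorizer` add smoothing and L2 normalization by default, so the numbers will differ. The helper names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n_min, n_max):
    """Return all word n-grams with n_min <= n <= n_max, joined by spaces."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def tfidf(documents, ngram_range=(1, 2)):
    """Return (features, rows) using raw term counts times log(N / df)."""
    term_lists = [ngrams(doc.split(), *ngram_range) for doc in documents]
    doc_freq = Counter(term for terms in term_lists for term in set(terms))
    features = sorted(doc_freq)
    n_docs = len(documents)
    rows = []
    for terms in term_lists:
        counts = Counter(terms)
        rows.append([counts[f] * math.log(n_docs / doc_freq[f]) for f in features])
    return features, rows


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


documents = [
    "data science is fun",
    "toolkits help data workflows",
    "science advances with good tools",
]
features, rows = tfidf(documents)
print(len(features))
print(cosine(rows[0], rows[1]))  # both documents contain "data"
print(cosine(rows[1], rows[2]))  # no shared terms
```

Documents that share no terms get a similarity of zero, while terms that appear in every document would receive zero weight under this unsmoothed formula.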
Geospatial Utilities
from data_science_toolkit.gis import GIS
gis = GIS()
gis.add_data_layer("parcels", "data/parcels.geojson", data_type="sf")
gis.add_area_column("parcels", unit="ha")
gis.to_crs("parcels", epsg="3857")
gis.export("parcels", "out/parcels_3857", file_format="geojson")
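As background for the area column: planar area formulas are only meaningful when coordinates are in a projected, metre-based CRS, and Web Mercator (EPSG:3857) increasingly distorts areas away from the equator. The sketch below shows the shoelace formula for a simple polygon with a conversion to hectares. It is a generic illustration, and whether `add_area_column` reprojects or uses this formula internally is not stated in this README. `polygon_area_ha` is a hypothetical helper.

```python
def polygon_area_ha(ring):
    """Area in hectares of a simple polygon given as (x, y) pairs in metres."""
    twice_area = 0.0
    # Pair each vertex with the next one, wrapping around to the first.
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0 / 10_000.0  # 1 ha = 10,000 m^2


# A 200 m x 100 m rectangle in a projected (metre-based) CRS.
parcel = [(0.0, 0.0), (200.0, 0.0), (200.0, 100.0), (0.0, 100.0)]
print(polygon_area_ha(parcel))  # 2.0
```

The absolute value makes the result independent of whether the vertices are listed clockwise or counter-clockwise.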
Computer Vision Helpers
from data_science_toolkit.imagefactory import ImageFactory
img = ImageFactory("data/sample.jpg")
img.to_gray_scale()
img.gaussian_blur((5,5))
img.save("out/processed.jpg")
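For intuition about the two operations above, the sketch below builds a normalized Gaussian kernel of the same 5 x 5 size and converts one RGB pixel to gray with the ITU-R BT.601 luma weights. It is plain Python for illustration only. The `sigma=1.0` value is an assumption (the call above passes only a kernel size), and neither helper is part of `ImageFactory`.

```python
import math


def gaussian_kernel(size, sigma):
    """Return a normalized size x size Gaussian kernel as nested lists."""
    if size % 2 == 0:
        raise ValueError("size must be odd")
    half = size // 2
    weights = [
        [math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
         for x in range(-half, half + 1)]
        for y in range(-half, half + 1)
    ]
    total = sum(sum(row) for row in weights)
    # Dividing by the total keeps overall image brightness unchanged.
    return [[w / total for w in row] for row in weights]


def to_gray(pixel):
    """Convert an (R, G, B) pixel to luma with the ITU-R BT.601 weights."""
    r, g, b = pixel
    return 0.299 * r + 0.587 * g + 0.114 * b


kernel = gaussian_kernel(5, sigma=1.0)
print(round(sum(sum(row) for row in kernel), 6))
print(to_gray((255, 0, 0)))
```

Blurring replaces each pixel with the kernel-weighted average of its neighbourhood, so a larger kernel or sigma gives a softer result.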
Documentation
Full API docs and tutorials live at: https://data-science-toolkit.readthedocs.io
Contributing
Contributions and suggestions are welcome via GitHub pull requests.
Typical workflow:
- Fork the repo and create a feature branch.
- Install dev dependencies: pip install -e .
- Add tests or notebook snippets where relevant.
- Open a PR with a clear description and examples.
Maintainership
We’re actively enhancing the repo with new algorithms and utilities. Feedback on priorities is appreciated.
License
MIT License. See the LICENSE file for details.
Citation
If you use DST in academic work, please cite the repository and (optionally) reference the Code Ocean capsule for reproducibility: https://codeocean.com/capsule/1309232/tree