Data Science Toolkit (DST) is a Python library that makes data science projects easier to implement.

Data Science Toolkit (DST)

Data Science Toolkit (DST) is a Python library that helps implement data science projects with ease: from data ingestion and preprocessing to modeling, geospatial analysis, computer vision, text vectorization, and reinforcement learning.

It bundles practical, production-friendly utilities and higher-level abstractions so you can move faster while keeping control over the details.

Key Features

  • Data handling: DataFrame for loading CSV/JSON/Excel/Parquet, cleaning, transforming, and streaming large datasets.
  • Modeling: Model for traditional ML and deep learning training, cross-validation, metrics, and GPU helpers.
  • Text & NLP: Vectorizer for bag-of-words/TF-IDF, tokenization, cosine similarity, and projections.
  • Charts: Chart utilities for quick exploratory visuals with Matplotlib/Seaborn/Plotly.
  • GIS: GIS for geospatial data layers, joins, CRS transforms, area/perimeter, and exports.
  • Computer Vision: ImageFactory for resizing, cropping, contour detection, blending, and basic filters.
  • Reinforcement Learning: Environment and R3 tools to explore policies and custom environments.
  • Crop Simulation: CSM modules for crop water requirement, ET simulations, and monitoring pipelines.
  • Utilities: Lib with climate, math, text processing, IO helpers, and more.

Installation

DST is published as data-science-toolkit.

pip install data-science-toolkit

If you’re installing from source (for development):

git clone https://github.com/elhachimi-ch/dst.git
cd dst
pip install -e .

Notes:

  • Requires Python 3.5+.
  • Some features (e.g., deep learning, GIS, CV) pull in heavier dependencies (TensorFlow, CatBoost, OpenCV, the geospatial stack), so install times may vary.
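
Because the heavier dependencies are tied to specific features, it can help to check which ones are importable before using a module. The sketch below uses only the standard library. The import names (tensorflow, catboost, cv2, geopandas) are assumptions for illustration and may not match DST's actual dependency list; available_features is a hypothetical helper, not part of DST.

```python
import importlib.util

# Import names are assumptions for illustration; DST's real dependency
# list may differ (the package metadata is the authoritative source).
OPTIONAL_MODULES = {
    "deep learning": ["tensorflow"],
    "gradient boosting": ["catboost"],
    "computer vision": ["cv2"],
    "geospatial": ["geopandas"],
}


def available_features(modules=OPTIONAL_MODULES):
    """Map each feature group to True if all of its modules can be located."""
    status = {}
    for feature, names in modules.items():
        status[feature] = all(
            importlib.util.find_spec(name) is not None for name in names
        )
    return status


if __name__ == "__main__":
    for feature, ok in available_features().items():
        print(f"{feature}: {'available' if ok else 'missing'}")
```

find_spec only locates a module without importing it, so the check stays fast even when TensorFlow is installed.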

Quickstart

from data_science_toolkit.dataframe import DataFrame
from data_science_toolkit.model import Model

# Load a toy dataset
data = DataFrame()
data.load_dataset('iris')
y = data.get_column('target')
data.drop_column('target')

# Fit a decision tree
model = Model(data_x=data.get_dataframe(), data_y=y, model_type='dt', training_percent=0.8)
model.train()
model.report()          # classification metrics
model.cross_validation(5)
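
To make training_percent=0.8 and cross_validation(5) concrete, here is a minimal standard-library sketch of a hold-out split and 5-fold index generation. It illustrates the general technique only; how Model shuffles or splits internally is not confirmed by this README, and holdout_split and k_fold_indices are hypothetical names.

```python
import random


def holdout_split(n_samples, training_percent=0.8, seed=0):
    """Shuffle sample indices and split them into train and test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    cut = round(n_samples * training_percent)
    return indices[:cut], indices[cut:]


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, validation) index lists, one pair per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal indices round-robin so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# Iris has 150 samples.
train, test = holdout_split(150, training_percent=0.8)
print(len(train), len(test))
for fold_train, fold_val in k_fold_indices(150, k=5):
    print(len(fold_train), len(fold_val))
```

Each sample appears in exactly one validation fold, so every row is used for both training and evaluation across the five runs.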

Work with Parquet (large data)

from data_science_toolkit.dataframe import DataFrame

# Stream a Parquet dataset efficiently
df = DataFrame(data_path="path/to/parquet/dir", data_type="parquet", n_workers="auto")
summary = df.describe()  # computes per-column stats without loading entire data into RAM
print(summary)
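
Per-column statistics can be computed without holding the full dataset in memory by updating running aggregates one chunk at a time. The sketch below shows that idea with Welford's online algorithm in plain Python. It is not DST's implementation; RunningStats and describe_in_chunks are hypothetical names, and the in-memory lists stand in for chunks read from disk.

```python
import math


class RunningStats:
    """Welford's online algorithm for count, mean, variance, min, and max."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean
        self.minimum = math.inf
        self.maximum = -math.inf

    def update(self, value):
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    @property
    def std(self):
        """Sample standard deviation (ddof=1); NaN for fewer than two values."""
        if self.count < 2:
            return math.nan
        return math.sqrt(self._m2 / (self.count - 1))


def describe_in_chunks(chunks):
    """Consume an iterable of chunks, keeping only the running aggregates."""
    stats = RunningStats()
    for chunk in chunks:
        for value in chunk:
            stats.update(value)
    return stats


# Stand-in for chunks streamed from disk (hypothetical data).
chunks = ([1.0, 2.0, 3.0], [4.0, 5.0])
stats = describe_in_chunks(chunks)
print(stats.count, stats.mean, stats.std, stats.minimum, stats.maximum)
```

Memory use depends on the chunk size rather than the dataset size, which is the property the describe() comment above refers to.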

Text Vectorization

from data_science_toolkit.vectorizer import Vectorizer

documents = [
    "data science is fun",
    "toolkits help data workflows",
    "science advances with good tools",
]

vec = Vectorizer(documents_as_list=documents, vectorizer_type='tfidf', ngram_tuple=(1,2))
matrix = vec.get_matrix()
features = vec.get_features_names()
print(len(features), features[:10])
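
For intuition about what a TF-IDF matrix with ngram_tuple=(1,2) contains, the following standard-library sketch builds unigram and bigram features, weights them with a common smoothed IDF, L2-normalizes each row, and compares documents by cosine similarity. The weighting variant is an assumption; Vectorizer may tokenize or weight differently, and tfidf_vectors and cosine_similarity here are hypothetical helpers, not DST functions.

```python
import math
from collections import Counter


def ngrams(tokens, ngram_range=(1, 2)):
    """Return all n-grams (joined by spaces) for n in the inclusive range."""
    low, high = ngram_range
    return [
        " ".join(tokens[i:i + n])
        for n in range(low, high + 1)
        for i in range(len(tokens) - n + 1)
    ]


def tfidf_vectors(documents, ngram_range=(1, 2)):
    """Return (vocabulary, rows) with L2-normalized TF-IDF rows."""
    term_counts = [Counter(ngrams(doc.split(), ngram_range)) for doc in documents]
    vocabulary = sorted({term for counts in term_counts for term in counts})
    n_docs = len(documents)
    # Iterating a Counter yields each term once, so this counts documents.
    doc_freq = Counter(term for counts in term_counts for term in counts)
    # Smoothed IDF (assumed variant): ln((1 + N) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}
    rows = []
    for counts in term_counts:
        row = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocabulary, rows


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


documents = [
    "data science is fun",
    "toolkits help data workflows",
    "science advances with good tools",
]
vocabulary, rows = tfidf_vectors(documents)
print(len(vocabulary))
print(cosine_similarity(rows[0], rows[1]))
```

Documents that share no unigram or bigram, such as the second and third here, get a similarity of zero.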

Geospatial Utilities

from data_science_toolkit.gis import GIS

gis = GIS()
gis.add_data_layer("parcels", "data/parcels.geojson", data_type="sf")
gis.add_area_column("parcels", unit="ha")
gis.to_crs("parcels", epsg="3857")
gis.export("parcels", "out/parcels_3857", file_format="geojson")
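
The two operations in this example, reprojecting to EPSG:3857 and computing area in hectares, reduce to short formulas. The sketch below implements the spherical Web Mercator forward projection and the shoelace area formula in plain Python. It is an illustration under stated assumptions, not how GIS computes these values, and both function names are hypothetical.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS 84 semi-major axis, used by EPSG:3857


def lonlat_to_web_mercator(lon_deg, lat_deg):
    """Project WGS 84 longitude/latitude (degrees) to EPSG:3857 metres."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(
        math.tan(math.pi / 4 + math.radians(lat_deg) / 2)
    )
    return x, y


def polygon_area_ha(ring):
    """Shoelace area of a ring of (x, y) points in metres, in hectares."""
    twice_area = 0.0
    # Pair each vertex with the next one, wrapping around to the first.
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2 / 10_000  # 1 ha = 10,000 square metres


# A 100 m x 100 m square in a projected, metre-based CRS.
square = [(0.0, 0.0), (100.0, 0.0), (100.0, 100.0), (0.0, 100.0)]
print(polygon_area_ha(square))
print(lonlat_to_web_mercator(180.0, 0.0))
```

Web Mercator inflates areas increasingly with distance from the equator, so areas measured on EPSG:3857 coordinates are distorted; an equal-area or local projected CRS is generally the safer basis for area figures.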

Computer Vision Helpers

from data_science_toolkit.imagefactory import ImageFactory

img = ImageFactory("data/sample.jpg")
img.to_gray_scale()
img.gaussian_blur((5,5))
img.save("out/processed.jpg")
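
As a rough picture of what the two steps compute, the sketch below converts an RGB pixel to gray with the common BT.601 luminance weights and builds a normalized 5x5 Gaussian kernel in plain Python. The sigma value is an assumption chosen for illustration, and neither function reflects ImageFactory's internals.

```python
import math


def to_gray(pixel):
    """Convert an (R, G, B) pixel to one gray level using BT.601 weights."""
    r, g, b = pixel
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def gaussian_kernel(size=5, sigma=1.0):
    """Return a size x size Gaussian kernel whose weights sum to 1."""
    half = size // 2
    weights = [
        [
            math.exp(-(x * x + y * y) / (2 * sigma * sigma))
            for x in range(-half, half + 1)
        ]
        for y in range(-half, half + 1)
    ]
    total = sum(sum(row) for row in weights)
    return [[w / total for w in row] for row in weights]


kernel = gaussian_kernel(5, sigma=1.0)  # sigma is an assumed value
print(to_gray((255, 0, 0)))
for row in kernel:
    print(" ".join(f"{w:.4f}" for w in row))
```

Blurring replaces each pixel with the kernel-weighted average of its neighbourhood; normalizing the kernel keeps overall brightness unchanged.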

Documentation

Full API docs and tutorials live at: https://data-science-toolkit.readthedocs.io

Contributing

Contributions and suggestions are welcome via GitHub pull requests.

Typical workflow:

  • Fork the repo and create a feature branch.
  • Install the package in editable mode from the repository root: pip install -e .
  • Add tests or notebook snippets where relevant.
  • Open a PR with a clear description and examples.

Maintainership

We’re actively enhancing the repo with new algorithms and utilities. Feedback on priorities is appreciated.

License

MIT License. See the LICENSE file for details.

Citation

If you use DST in academic work, please cite the repository and (optionally) reference the Code Ocean capsule for reproducibility: https://codeocean.com/capsule/1309232/tree
