Data Studio annotation tool
What is Data Studio?
Data Studio is an open source data labeling tool. It lets you label data types like audio, text, images, videos, and time series with a simple and straightforward UI and export to various model formats. It can be used to prepare raw data or improve existing training data to get more accurate ML models.
- Try out Data Studio
- What you get from Data Studio
- Included templates for labeling data in Data Studio
- Set up machine learning models with Data Studio
- Integrate Data Studio with your existing tools
Have a custom dataset? You can customize Data Studio to fit your needs. Read an introductory blog post to learn more.
Try out Data Studio
Install Data Studio locally, or deploy it in a cloud instance. Or, sign up for a free trial of our Enterprise edition.
- Install locally with Docker
- Run with Docker Compose (Data Studio + Nginx + PostgreSQL)
- Install locally with pip
- Install locally with Anaconda
- Install for local development
- Deploy in a cloud instance
Install locally with Docker
The official Data Studio Docker image is published as heartexlabs/data-studio and can be downloaded with docker pull. Run Data Studio in a Docker container and access it at http://localhost:8080.
docker pull heartexlabs/data-studio:latest
docker run -it -p 8080:8080 -v $(pwd)/mydata:/data-studio/data heartexlabs/data-studio:latest
You can find all the generated assets, including the SQLite3 database storage data_studio.sqlite3 and uploaded files, in the ./mydata directory.
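To confirm that the container actually wrote its database into the mounted directory, you can inspect the SQLite file from the host. The following is a minimal sketch, not part of Data Studio itself: the helper name list_tables is hypothetical, and it makes no assumption about which tables exist. It only lists whatever it finds and opens the file read-only so the application data is not modified.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path):
    """Return the sorted names of all tables in the SQLite database at db_path."""
    path = Path(db_path)
    if not path.is_file():
        raise FileNotFoundError(f"No database file at {path}")
    # Open in read-only mode so inspecting the file cannot change its contents.
    uri = path.resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

For the Docker command above, the call would be list_tables("mydata/data_studio.sqlite3"), run from the directory where you started the container.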
Override default Docker install
You can override the default launch command by appending the new arguments:
docker run -it -p 8080:8080 -v $(pwd)/mydata:/data-studio/data heartexlabs/data-studio:latest data-studio --log-level DEBUG
Build a local image with Docker
If you want to build a local image, run:
docker build -t heartexlabs/data-studio:latest .
Run with Docker Compose
The Docker Compose script provides a production-ready stack consisting of the following components:
- Data Studio
- Nginx - proxy web server used to load various static data, including uploaded audio, images, etc.
- PostgreSQL - production-ready database that replaces less performant SQLite3.
To start using the app at http://localhost, run this command:
docker-compose up
Run with Docker Compose + MinIO
You can also run it with an additional MinIO server for local S3 storage. This is particularly useful when you want to test the behavior with S3 storage on your local system. To start Data Studio in this way, you need to run the following command:
# Add sudo on Linux if you are not a member of the docker group
docker compose -f docker-compose.yml -f docker-compose.minio.yml up -d
If you do not have a static IP address, you must create an entry in your hosts file so that both Data Studio and your browser can access the MinIO server. For more detailed instructions, please refer to our guide on storing data.
Install locally with pip
# Requires Python >=3.8
pip install data-studio
# Start the server at http://localhost:8080
data-studio
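If you start the server from a script, it can help to wait until the port accepts connections before opening the browser or sending requests. The sketch below is a generic TCP readiness check using only the Python standard library. The function name wait_for_port and its defaults are illustrative, and a successful connection only shows that something is listening on the port, not that Data Studio has finished starting.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds.

    Returns False if no connection could be made within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means a process is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For the default install, wait_for_port("localhost", 8080) would block until the server is reachable or 30 seconds have passed.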
Install locally with Anaconda
conda create --name data-studio
conda activate data-studio
conda install psycopg2
pip install data-studio
Install for local development
You can run the latest Data Studio version locally without installing the package with pip.
# Install all package dependencies
pip install -e .
# Run database migrations
python data_studio/manage.py migrate
python data_studio/manage.py collectstatic
# Start the server in development mode at http://localhost:8080
python data_studio/manage.py runserver
Deploy in a cloud instance
You can deploy Data Studio with one click in Heroku, Microsoft Azure, or Google Cloud Platform.
Apply frontend changes
The frontend part of the Data Studio app lives in the frontend/ folder and is written in React JSX. If you've made changes there, run the following commands before building or starting the instance:
cd data_studio/frontend/
yarn install --frozen-lockfile
npx webpack
cd ../..
python data_studio/manage.py collectstatic --no-input
Troubleshoot installation
If you see any errors during installation, try rerunning the installation:
pip install --ignore-installed data-studio
Install dependencies on Windows
To run Data Studio on Windows, download and install the following wheel packages from Gohlke builds to ensure you're using the correct version of Python:
# Upgrade pip
pip install -U pip
# If you're running Win64 with Python 3.8, install the packages downloaded from Gohlke:
pip install lxml-4.5.0-cp38-cp38-win_amd64.whl
# Install Data Studio
pip install data-studio
Run test suite
To add the tests' dependencies to your local install:
pip install -r deploy/requirements-test.txt
Alternatively, it is possible to run the unit tests from a Docker container in which the test dependencies are installed:
make build-testing-image
make docker-testing-shell
In either case, to run the unit tests:
cd data_studio
# sqlite3
DJANGO_DB=sqlite DJANGO_SETTINGS_MODULE=core.settings.data_studio pytest -vv
# postgres (assumes default postgres user,db,pass. Will not work in Docker
# testing container without additional configuration)
DJANGO_DB=default DJANGO_SETTINGS_MODULE=core.settings.data_studio pytest -vv
What you get from Data Studio
- Multi-user labeling with sign up and login: when you create an annotation, it's tied to your account.
- Multiple projects to work on all your datasets in one instance.
- Streamlined design helps you focus on your task, not how to use the software.
- Configurable label formats let you customize the visual interface to meet your specific labeling needs.
- Support for multiple data types including images, audio, text, HTML, time-series, and video.
- Import from files (JSON, CSV, TSV, or RAR and ZIP archives) or from cloud storage in Amazon AWS S3 or Google Cloud Storage.
- Integration with machine learning models so that you can visualize and compare predictions from different models and perform pre-labeling.
- Embed it in your data pipeline: the REST API makes it easy to make Data Studio a part of your pipeline.
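As a rough illustration of calling the REST API from a pipeline, the sketch below builds an authenticated JSON request with the Python standard library without sending it. Everything specific here is an assumption rather than documented behavior: the helper name build_api_request, the "Token" authorization scheme, and any endpoint path you pass in (such as /api/projects) are placeholders. Check the API documentation for the real endpoints and authentication before sending anything.

```python
import json
import urllib.request


def build_api_request(base_url, path, token, payload=None):
    """Build (but do not send) a JSON request against a Data Studio server.

    The "Token <value>" authorization scheme is an assumption, not a
    documented contract; adjust it to match the actual API.
    """
    url = base_url.rstrip("/") + "/" + path.lstrip("/")
    headers = {"Authorization": f"Token {token}"}
    data = None
    if payload is not None:
        # Requests with a body are sent as JSON via POST in this sketch.
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    method = "POST" if payload is not None else "GET"
    return urllib.request.Request(url, data=data, headers=headers, method=method)
```

Once the endpoint and token are confirmed, the returned object can be passed to urllib.request.urlopen to perform the call.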
Included templates for labeling data in Data Studio
Data Studio includes a variety of templates to help you label your data, or you can create your own using a specifically designed configuration language. The most common templates cover the following use cases:
Set up machine learning models with Data Studio
Connect your favorite machine learning model using the Data Studio Machine Learning SDK. Follow these steps:
- Start your own machine learning backend server. See more detailed instructions.
- Connect Data Studio to the server on the model page found in project settings.
This lets you:
- Pre-label your data using model predictions.
- Do online learning and retrain your model while new annotations are being created.
- Do active learning by labeling only the most complex examples in your data.
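One way to read "the most complex examples" is as the tasks the model is least sure about. The sketch below shows least-confidence sampling, a common active learning heuristic, in plain Python. It is an illustration of the idea only, not the Data Studio Machine Learning SDK: the function name least_confident and the input shape (a mapping from task id to class probabilities) are assumptions made for this example.

```python
def least_confident(predictions, k):
    """Return the ids of the k tasks whose top class probability is lowest.

    `predictions` maps a task id to a list of class probabilities
    produced by a model for that task.
    """
    # A low maximum probability means the model has no clear favorite class,
    # so a human label for that task is likely to be most informative.
    ranked = sorted(predictions.items(), key=lambda item: max(item[1]))
    return [task_id for task_id, _ in ranked[:k]]


# Usage: pick the two tasks to send to annotators first.
scores = {
    "task-1": [0.90, 0.10],
    "task-2": [0.55, 0.45],
    "task-3": [0.70, 0.30],
}
print(least_confident(scores, 2))  # ['task-2', 'task-3']
```

Other selection criteria, such as prediction entropy or the margin between the top two classes, fit the same structure by changing the sort key.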
Integrate Data Studio with your existing tools
You can use Data Studio as an independent part of your machine learning workflow or integrate the frontend or backend into your existing tools.
- Use the Data Studio Frontend as a separate React library. See more in the Frontend Library documentation.
Ecosystem
Project | Description |
---|---|
data-studio | Server, distributed as a pip package |
data-studio-frontend | React and JavaScript frontend that can run standalone in a web browser or be embedded into your application. |
data-manager | React and JavaScript frontend for managing data. Includes the Data Studio Frontend. Relies on the data-studio server or a custom backend with the expected API methods. |
data-studio-converter | Encode labels in the format of your favorite machine learning library |
data-studio-transformers | Transformers library connected and configured for use with Data Studio |
Roadmap
Want to use The Coolest Feature X but Data Studio doesn't support it? Check out our public roadmap!
Citation
@misc{DataStudio,
  title={{Data Studio}: Data labeling software},
  url={https://github.com/heartexlabs/data-studio},
  note={Open source software available from https://github.com/heartexlabs/data-studio},
  author={
    Maxim Tkachenko and
    Mikhail Malyuk and
    Andrey Holmanyuk and
    Nikolai Liubimov},
  year={2020-2022},
}
License
This software is licensed under the Apache 2.0 LICENSE © Heartex. 2020-2022
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
File details
Details for the file Data-Studio-1.309.tar.gz.
File metadata
- Download URL: Data-Studio-1.309.tar.gz
- Upload date:
- Size: 119.3 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.12
File hashes
Algorithm | Hash digest |
---|---|
SHA256 | 744f95e0cd22b68e1b1ab9ed09470edf9546ae6e10a299f114f518c224a73849 |
MD5 | bc2ed3084a155eca2520d34bc05fdd15 |
BLAKE2b-256 | 0eeea43fffb9d890c766357fba012f594f8a79fb1193dee7ff68aa0f7b0d64d0 |
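After downloading the source distribution, you may want to compare its SHA256 digest with the published value before installing. The sketch below uses only the Python standard library; the function names are illustrative, and the file is read in chunks so that an archive of this size does not have to fit in memory.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Return True if the file's SHA256 digest equals `expected_hex`."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For the file listed above, you would call matches_published_hash with the path to the downloaded Data-Studio-1.309.tar.gz and the SHA256 value from the table.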