
Open-source tool for exploring, labeling, and monitoring data for NLP projects.


Argilla


Open-source framework for data-centric NLP

Data Labeling + Data Curation + Inference Store

Designed for MLOps & Feedback Loops

https://user-images.githubusercontent.com/25269220/200496945-7efb11b8-19f3-4793-bb1d-d42132009cbb.mp4



Documentation | Key Features | Quickstart | Principles | Migration from Rubrix | FAQ

Key Features

Advanced NLP labeling

Monitoring

Team workspaces

  • Bring different users and roles into the NLP data and model lifecycles
  • Organize data collection, review and monitoring into different workspaces
  • Manage workspace access for different users

Quickstart

Argilla is composed of a Python Server with Elasticsearch as the database layer, and a Python Client to create and manage datasets.

To get started, install both the client and the server with pip:

pip install "argilla[server]"

Then you need to run Elasticsearch (ES).

The simplest way is to use Docker by running:

docker run -d --name es-for-argilla \
  -p 9200:9200 -p 9300:9300 \
  -e "ES_JAVA_OPTS=-Xms512m -Xmx512m" \
  -e "discovery.type=single-node" \
  docker.elastic.co/elasticsearch/elasticsearch-oss:7.10.2
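Before launching the server, you can verify that Elasticsearch is up (the container takes a few seconds to boot):

```shell
# should return a JSON blob with the cluster name and version
curl http://localhost:9200
```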

:information_source: Check the docs for further options and configurations for Elasticsearch.

Finally you can launch the server:

python -m argilla

:information_source: The most common error message after this step is related to the Elasticsearch instance not running. Make sure your Elasticsearch instance is running on http://localhost:9200/. If you already have an Elasticsearch instance or cluster, you can point the server to its URL using environment variables.
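A minimal sketch of pointing the server at an existing cluster, assuming a 1.x release where the relevant environment variable is named `ARGILLA_ELASTICSEARCH` (check the docs for the variable names in your version; the host below is a placeholder):

```shell
# point the server at an existing Elasticsearch cluster (placeholder host)
export ARGILLA_ELASTICSEARCH=http://my-es-host:9200
python -m argilla
```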

🎉 You can now access the Argilla UI by pointing your browser at http://localhost:6900/.

The default username and password are argilla and 1234.

Your workspace will contain no datasets yet, so let's use the datasets library to create our first ones!

First, you need to install datasets:

pip install datasets

Then go to your Python IDE of choice and run:

import pandas as pd
import argilla as rg
from datasets import load_dataset

# load dataset from the hub
dataset = load_dataset("argilla/gutenberg_spacy-ner", split="train")

# read in the dataset, assuming it's a dataset for token classification
dataset_rg = rg.read_datasets(dataset, task="TokenClassification")

# log the dataset to the Argilla web app
rg.log(dataset_rg, "gutenberg_spacy-ner")

# load dataset from json
my_dataframe = pd.read_json(
    "https://raw.githubusercontent.com/recognai/datasets/main/sst-sentimentclassification.json")

# convert pandas dataframe to DatasetForTextClassification
dataset_rg = rg.DatasetForTextClassification.from_pandas(my_dataframe)

# log the dataset to the Argilla web app
rg.log(dataset_rg, name="sst-sentimentclassification")

This will create two datasets which you can use to do a quick tour of the core features of Argilla.
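You can also build records by hand instead of converting an existing dataset. A minimal sketch, assuming the server from the steps above is running locally with the default credentials (the dataset name and labels are illustrative):

```python
import argilla as rg

# create a single text classification record by hand,
# with an illustrative model prediction attached
record = rg.TextClassificationRecord(
    text="I love this movie!",
    prediction=[("positive", 0.9), ("negative", 0.1)],
)

# log it to a new dataset in the running Argilla instance
rg.log(record, name="my-first-dataset")
```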

🚒 If you find issues, get direct support from the team and other community members on the Slack Community

For getting started with your own use cases, go to the docs.

Principles

  • Open: Argilla is free, open-source, and 100% compatible with major NLP libraries (Hugging Face transformers, spaCy, Stanford Stanza, Flair, etc.). In fact, you can use and combine your preferred libraries without implementing any specific interface.

  • End-to-end: Most annotation tools treat data collection as a one-off activity at the beginning of each project. In real-world projects, data collection is a key activity of the iterative process of ML model development. Once a model goes into production, you want to monitor and analyze its predictions, and collect more data to improve your model over time. Argilla is designed to close this gap, enabling you to iterate as much as you need.

  • User and Developer Experience: The key to sustainable NLP solutions is to make it easier for everyone to contribute to projects. Domain experts should feel comfortable interpreting and annotating data. Data scientists should feel free to experiment and iterate. Engineers should feel in control of data pipelines. Argilla optimizes the experience for these core users to make your teams more productive.

  • Beyond hand-labeling: Classical hand labeling workflows are costly and inefficient, but having humans-in-the-loop is essential. Easily combine hand-labeling with active learning, bulk-labeling, zero-shot models, and weak-supervision in novel data annotation workflows.

FAQ

What is Argilla?

Argilla is an open-source MLOps tool for building and managing data for Natural Language Processing.

What can I use Argilla for?

Argilla is useful if you want to:

  • create a dataset for training a model.

  • evaluate and improve an existing model.

  • monitor an existing model to improve it over time and gather more training data.

What do I need to start using Argilla?

You need to have a running instance of Elasticsearch and install the Argilla Python library. The library is used to read and write data into Argilla.

How can I "upload" data into Argilla?

Currently, the only way to upload data into Argilla is by using the Python library.

This is based on the assumption that there's rarely a perfectly prepared dataset in the format expected by the data annotation tool.

Argilla is designed to enable fast iteration for users that are closer to data and models, namely data scientists and NLP/ML/Data engineers.

If you are familiar with libraries like Weights & Biases or MLFlow, you'll find Argilla's log and load methods intuitive.

That said, Argilla gives you different shortcuts and utils to make loading data into Argilla a breeze, such as the ability to read datasets directly from the Hugging Face Hub.

In summary, the recommended process for uploading data into Argilla is the following:

  1. Install the Argilla Python library,

  2. Open a Jupyter Notebook,

  3. Make sure you have an Argilla server instance up and running,

  4. Read your source dataset using Pandas, Hugging Face datasets, or any other library,

  5. Do any data preparation, pre-processing, or pre-annotation with a pretrained model, and

  6. Transform your dataset rows/records into Argilla records and log them into a dataset using rg.log. If your dataset is already loaded as a Hugging Face dataset, check the read_datasets method to make this process even simpler.

How can I train a model?

The training datasets created with Argilla are model agnostic.

You can choose one of many amazing frameworks to train your model, like transformers, spaCy, flair or sklearn.

Check out our deep dives and our tutorials on how Argilla integrates with these frameworks.

If you want to train a Hugging Face transformer or spaCy NER model, we provide a neat shortcut to prepare your dataset for training.
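A sketch of that shortcut, assuming the `gutenberg_spacy-ner` dataset logged in the Quickstart above and a 1.x release that exposes `prepare_for_training` (check the docs for the options in your version):

```python
import argilla as rg

# load the annotated dataset back from the Argilla server
dataset_rg = rg.load("gutenberg_spacy-ner")

# convert it into a dataset ready for fine-tuning
train_ds = dataset_rg.prepare_for_training()
```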

Can Argilla share the Elasticsearch instance/cluster?

Yes, you can use the same Elasticsearch instance/cluster for Argilla and other applications. You only need to perform some configuration; check the Advanced installation guide in the docs.

How do I solve an exceeded flood-stage watermark error in Elasticsearch?

By default, Elasticsearch is quite conservative regarding the disk space it is allowed to use.

If less than 5% of your disk is free, Elasticsearch can enforce a read-only block on every index, and as a consequence, Argilla stops working.

To solve this, you can simply increase the watermark by executing the following command in your terminal:

curl -X PUT "localhost:9200/_cluster/settings?pretty" \
  -H 'Content-Type: application/json' \
  -d '{"persistent": {"cluster.routing.allocation.disk.watermark.flood_stage": "99%"}}'

