
Language quality evaluation tool.



Introduction

Dingo is a data quality assessment tool that automatically detects quality issues in your datasets. It provides a variety of built-in detection rules and model-based methods, and also supports custom detection methods. It handles commonly used NLP and multimodal datasets, including pre-training, fine-tuning, and evaluation datasets. In addition, Dingo offers several interfaces, including a local CLI, an SDK, and a RESTful API, making it easy to integrate into evaluation platforms such as OpenCompass and simple-evals.

Architecture of Dingo


QuickStart

Install dingo.

pip install dingo

Try the following SDK demo code:

from dingo.model import Model
from dingo.io import RawInputModel
from dingo.exec import Executor

input_data = {
    "eval_models": ["sft"],            # quality-evaluation model(s) to apply
    "input_path": "tatsu-lab/alpaca",  # loaded from Hugging Face by default
    "data_format": "plaintext",
}

raw_input = RawInputModel(**input_data)
Model.apply_config(raw_input.custom_config_path)  # apply a custom config, if provided
executor = Executor.exec_map["local"](raw_input)  # select the local executor
result = executor.evaluate()
print(result)

You can also try the CLI:

python -m dingo.run.cli --input_path tatsu-lab/alpaca -e sft --data_format plaintext

Tutorials

Config

Execute

Dingo can be run locally or on a Spark cluster.

Local Mode

In addition to the aforementioned SDK calls, you can also run data evaluation locally with the CLI:

python -m dingo.run.cli

The CLI parameters are as follows.

Parameter                  Description
-e or --eval_models        The model(s) used to evaluate data quality.
-i or --input_path         The path of the data; it can be a file or a directory.
--output_path              The path where result data is written.
--data_format              The format of the data: json, jsonl, plaintext, or list json.
--dataset                  The platform the data runs on: huggingface, local, or spark.
--datasource               The source of the data: huggingface, local, or s3.
--huggingface_split        The split of the Hugging Face dataset.
--column_id                The name of the id column in the data.
--column_prompt            The name of the prompt column in the data.
--column_content           The name of the content column in the data.
--custom_config_path       The path of a custom config file.
--spark_master_url         The URL of the Spark master.
--spark_summary_save_path  The path where the summary is saved when running on Spark.
--s3_ak                    The S3 access key (AK).
--s3_sk                    The S3 secret key (SK).
--s3_endpoint_url          The S3 endpoint URL.
--s3_addressing_style      The S3 addressing style.
--s3_bucket                The S3 bucket.

More information can be obtained by running the following command: python -m dingo.run.cli --help.
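For example, a local run over a JSONL file might look like the following; the input path, column name, and output directory here are placeholders:

python -m dingo.run.cli \
    -e sft \
    -i data/sample.jsonl \
    --data_format jsonl \
    --dataset local \
    --datasource local \
    --column_content content \
    --output_path outputs/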

Spark Mode

If the scale of the data is very large, you can run the project on Spark.

First, create a SparkExecutor object and set the actual SparkSession and DataFrame instances.

from dingo.exec.spark import SparkExecutor

spark_exec = SparkExecutor()
spark_exec.set_spark(spark_session)
spark_exec.set_input_df(spark_data_frame)

Then, convert the data and execute the rule list.

spark_exec.convert_data(column_id=['data_id'], column_prompt=['prompt'], column_content=['content'])
spark_exec.execute(["CommonSpecialCharacter", "CommonColonEnd"])

Finally, summarize and get the result data.

spark_exec.summarize()
output_df = spark_exec.get_output_df()
summary = spark_exec.get_summary()
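
Putting the steps above together, here is a minimal end-to-end sketch; the session settings and input path are illustrative and should be adapted to your cluster:

from pyspark.sql import SparkSession

from dingo.exec.spark import SparkExecutor

# Illustrative session; set the master URL and resources for your cluster.
spark_session = SparkSession.builder.master("local[*]").appName("dingo-eval").getOrCreate()
spark_data_frame = spark_session.read.json("data/sample.jsonl")  # placeholder input

spark_exec = SparkExecutor()
spark_exec.set_spark(spark_session)
spark_exec.set_input_df(spark_data_frame)
spark_exec.convert_data(column_id=['data_id'], column_prompt=['prompt'], column_content=['content'])
spark_exec.execute(["CommonSpecialCharacter", "CommonColonEnd"])
spark_exec.summarize()

print(spark_exec.get_summary())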

Evaluation Results

Summary

The summary.json file contains overall information about the evaluation results. Here is an example:

{
    "dataset_id": "20240816_175052",
    "input_model": "default",
    "input_path": "test/data/test_local_json.json",
    "output_path": "test/outputs/20240816_175052",
    "score": 0.0,
    "num_good": 0,
    "num_bad": 2,
    "total": 2,
    "error_type_ratio": {
        "QUALITY_INEFFECTIVENESS": 0.0,
        "QUALITY_INCOMPLETENESS": 0.0,
        "QUALITY_DISUNDERSTANDABILITY": 0.0,
        "QUALITY_DISSIMILARITY": 0.0,
        "QUALITY_DISFLUENCY": 0.0,
        "QUALITY_IRRELEVANCE": 1.0,
        "QUALITY_INSECURITY": 0.0
    },
    "error_name_ratio": {
        "QUALITY_IRRELEVANCE-CommonSpecialCharacter": 1.0
    }
}

The error_type_ratio field shows data quality signals across seven aspects: EFFECTIVENESS, COMPLETENESS, UNDERSTANDABILITY, SIMILARITY, FLUENCY, RELEVANCE, and SECURITY.
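
As a quick sanity check, you can load summary.json and recompute the headline numbers from its fields. A minimal sketch, assuming the output layout from the example above:

import json

# Path taken from the example above; point it at your own output directory.
with open("test/outputs/20240816_175052/summary.json") as f:
    summary = json.load(f)

# The overall score is the share of good items.
assert summary["total"] == summary["num_good"] + summary["num_bad"]
print(f"good ratio: {summary['num_good'] / summary['total']:.2%}")

# List only the quality signals that actually fired.
for error_type, ratio in summary["error_type_ratio"].items():
    if ratio > 0:
        print(error_type, ratio)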

Detailed Results

For more detailed information on the issues found in individual data items, Dingo creates files in directories named after the quality signals mentioned above. For example, CommonSpecialCharacter.json in the QUALITY_IRRELEVANCE directory contains:

{"data_id": "0", "prompt": "", "content": "�I am 8 years old. ^I love apple because: fuck you", "error_type": ["QUALITY_IRRELEVANCE"], "error_name": ["QUALITY_IRRELEVANCE-CommonSpecialCharacter"], "error_reason": ["�"]}
{"data_id": "1", "prompt": "", "content": "�[I like blue best. Because blue is the color of the sky. ", "error_type": ["QUALITY_IRRELEVANCE"], "error_name": ["QUALITY_IRRELEVANCE-CommonSpecialCharacter"], "error_reason": ["�"]}

We used Dingo to evaluate the quality of the following three datasets.

Dataset          Dataset Type  EFFECTIVENESS  COMPLETENESS  UNDERSTANDABILITY  SIMILARITY  FLUENCY   RELEVANCE  SECURITY
SlimPajama-627B  Pretrain      0.016860       0.000175      0.002062           0.003563    0.000302  0.003767   0
Stanford_alpaca  SFT           0.001442       0.000538      0.000481           0.000231    0         0          0
MMLU             Benchmark     0.011759       0.007349      0                  0           0         0          0

Rule List

Contributing

We appreciate all contributions to Dingo. Please refer to CONTRIBUTING.md for the contributing guidelines.

License

This project is released under the Apache 2.0 license.
