
Language quality evaluation tool.



Introduction

Dingo is a data quality assessment tool that helps you automatically detect data quality issues in your datasets. It provides a variety of built-in detection rules and model-based methods, and also supports custom detection methods. Dingo works with commonly used NLP and multimodal datasets, including pre-training, fine-tuning, and evaluation datasets. In addition, it supports several interfaces, including a local CLI, an SDK, and a RESTful API, making it easy to integrate into evaluation platforms such as OpenCompass and simple-evals.

Architecture of Dingo

(Figure: architecture diagram of Dingo.)

QuickStart

Install dingo.

pip install dingo

Try the following SDK demo code:

from dingo.model import Model
from dingo.io import RawInputModel
from dingo.exec import Executor

input_data = {
    "eval_models": ["sft"],            # the model used to evaluate data quality
    "input_path": "tatsu-lab/alpaca",  # dataset path; defaults to the Hugging Face Hub
    "data_format": "plaintext",
}

raw_input = RawInputModel(**input_data)
Model.apply_config(raw_input.custom_config_path)  # apply an optional custom config
executor = Executor.exec_map["local"](raw_input)  # select the local execution mode
result = executor.evaluate()
print(result)

You can also try the CLI:

python -m dingo.run.cli --input_path tatsu-lab/alpaca -e sft --data_format plaintext

Tutorials

Config

Dingo lets users personalize their data quality inspection methods, which can include heuristic rules, third-party quality inspection tools or services, and large models. All of these can be enabled through configuration. Specifically, users can pass a parameter named custom_config_path that points to a configuration file. A template for this configuration file is provided in template.json.

Rules

Heuristic rules are a common method for data processing and quality inspection. Dingo implements a series of heuristic rules and groups them into rule groups, such as zh-all and en-all, which are the heuristic quality inspection rule sets for Chinese and English respectively. In the configuration file template, the two items related to heuristic rules are custom_rule_list and rule_config, which specify the set of rules to run and the configuration parameters for individual rules, respectively. Below is a configuration example:

{
  "custom_rule_list": [],
  "rule_config": {}
}
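
For illustration, a filled-in sketch that selects two of the built-in rules from the rule list below and overrides a parameter of one of them (the threshold field name is an assumption; template.json is the authoritative schema):

{
  "custom_rule_list": ["CommonColonEnd", "CommonWordNumber"],
  "rule_config": {
    "CommonWordNumber": {
      "threshold": [20, 100000]
    }
  }
}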

Large Models

Dingo supports data quality inspection using large models. Before use, users need to configure llm_config. For OpenAI models:

{
  "key": "YOUR_API_KEY"
}

For Hugging Face models (currently, only locally downloaded models are supported):

{
  "path": "your local model path"
}
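
Putting the pieces together, a custom config file that enables both rules and an LLM-based check might look like the following sketch (treating llm_config as a top-level field is an assumption; consult template.json):

{
  "custom_rule_list": ["CommonColonEnd"],
  "rule_config": {},
  "llm_config": {
    "key": "YOUR_API_KEY"
  }
}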

Execute

Dingo can be run locally or on a Spark cluster.

Local Mode

In addition to the aforementioned SDK calls, you can also run data evaluation locally with the CLI:

python -m dingo.run.cli

The CLI parameters are as follows.

| Parameter | Description |
| --- | --- |
| -i or --input_path | The path of the data. It can be a file or a directory. |
| -e or --eval_models | The model used to evaluate data quality. |
| --dataset_id | The id of the input data. |
| --data_format | The format of the data. It can be json, jsonl, plaintext, or list json. |
| --output_path | The path for the result data. |
| --column_id | The column name of the id in the data. |
| --column_prompt | The column name of the prompt in the data. |
| --column_content | The column name of the content in the data. |
| --custom_config_path | The path of the custom config file. |

More information can be obtained by running the following command: python -m dingo.run.cli --help.
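
For example, a run that combines several of these options might look like the following (the paths are illustrative):

python -m dingo.run.cli -i data/inputs/test_data1.json -e sft --data_format plaintext --output_path data/outputs --custom_config_path config.json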

Spark Mode

If the scale of the data is very large, you can use Spark to run the project.

First, create a SparkExecutor object and set the actual SparkSession and DataFrame instances.

from dingo.exec.spark import SparkExecutor

spark_exec = SparkExecutor()
spark_exec.set_spark(spark_session)
spark_exec.set_input_df(spark_data_frame)

Then, convert the data and execute the rule list.

spark_exec.convert_data(column_id=['data_id'], column_prompt=['prompt'], column_content=['content'])
spark_exec.execute(["CommonSpecialCharacter", "CommonColonEnd"])

Finally, summarize and get the result data.

spark_exec.summarize()
output_df = spark_exec.get_output_df()
summary = spark_exec.get_summary()
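
For reference, here is a minimal end-to-end sketch, assuming a local SparkSession and a JSON input file (the session settings and file path are illustrative):

from pyspark.sql import SparkSession
from dingo.exec.spark import SparkExecutor

# Build a local SparkSession and load the input data (both illustrative).
spark_session = SparkSession.builder.appName("dingo-eval").getOrCreate()
spark_data_frame = spark_session.read.json("data/inputs/test_data1.json")

spark_exec = SparkExecutor()
spark_exec.set_spark(spark_session)
spark_exec.set_input_df(spark_data_frame)
spark_exec.convert_data(column_id=['data_id'], column_prompt=['prompt'], column_content=['content'])
spark_exec.execute(["CommonSpecialCharacter", "CommonColonEnd"])
spark_exec.summarize()
print(spark_exec.get_summary())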

Evaluation Results

Summary

The summary.json file contains overall information about the evaluation results. Here is an example:

{
    "dataset_id": "20240618",
    "input_model": "default",
    "input_path": "data/inputs/test_data1.json",
    "output_path": "data/outputs/20240625_134409",
    "score": 90.0,
    "num_good": 90,
    "num_bad": 10,
    "total": 100,
    "error_ratio": {...}
}

The error_ratio field shows data quality signals in seven different aspects: EFFECTIVENESS, COMPLETENESS, UNDERSTANDABILITY, SIMILARITY, FLUENCY, RELEVANCE and SECURITY.
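
For programmatic post-processing, here is a minimal sketch that loads summary.json and prints the per-signal error ratios (the output path is illustrative, and the key names inside error_ratio are assumed to match the signals above):

import json

# Load the evaluation summary produced by Dingo (path is illustrative).
with open("data/outputs/20240625_134409/summary.json") as f:
    summary = json.load(f)

print(f"score: {summary['score']} ({summary['num_good']}/{summary['total']} good)")

# error_ratio maps each quality signal to the fraction of flagged items.
for signal, ratio in summary["error_ratio"].items():
    print(signal, ratio)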

Detailed Results

For more detail on the issues found in individual data items, Dingo creates files in directories named after the quality signals mentioned above. For example, CommonColonEnd.json in the QUALITY_SIGNAL_COMPLETENESS directory looks like this:

{
    "name": "CommonColonEnd", # rule name
    "count": 1,
    "ratio": 0.5,
    "detail": [
        {
            "data_id": "0",
            "prompt": "",
            "content": "I am 8 years old. ^I love apple because:",
            "error_reason": "Ends with a colon."
        }
    ]
}
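
Each per-signal directory holds one JSON report per triggered rule, so collecting every flagged item across signals is a short loop. A sketch, assuming the layout shown above (the output root is illustrative):

import json
from pathlib import Path

# Walk the per-signal directories (e.g. QUALITY_SIGNAL_COMPLETENESS)
# and print every flagged item.
output_root = Path("data/outputs/20240625_134409")
for rule_file in output_root.glob("QUALITY_SIGNAL_*/*.json"):
    report = json.loads(rule_file.read_text())
    for item in report["detail"]:
        print(rule_file.stem, item["data_id"], item["error_reason"])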

We evaluated the quality of three datasets with Dingo:

| Dataset | Dataset Type | EFFECTIVENESS | COMPLETENESS | UNDERSTANDABILITY | SIMILARITY | FLUENCY | RELEVANCE | SECURITY |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SlimPajama-627B | Pretrain | 0 | 0.001797 | 0.011547 | 0.003563 | 0 | 0 | 0 |
| Stanford_alpaca | SFT | 0.0008 | 0.0004 | 0.0013 | 0.0002 | 0 | 0 | 0 |
| MMLU | Benchmark | 0.0064 | 0.0005 | 0.0113 | 0 | 0 | 0 | 0 |

Rule List

| Function Name | Type | Description | DataSet |
| --- | --- | --- | --- |
| CommonColonEnd | COMPLETENESS | check whether the last char is ':' | |
| CommonContentNull | EFFECTIVENESS | check whether content is null | |
| CommonDocRepeat | SIMILARITY | check whether content repeats | Redpajama, MAP-en, FineWeb, Gopher |
| CommonHtmlEntity | RELEVANCE | check whether content has HTML entities | |
| CommonIDCard | SECURITY | check whether content contains an ID card | |
| CommonNoPunc | FLUENCY | check whether content has a paragraph without punctuation | |
| CommonSpecialCharacter | RELEVANCE | check whether content has special characters | |
| CommonWatermark | RELEVANCE | check whether content has watermarks | |
| CommonWordNumber | EFFECTIVENESS | check whether the number of words is in [20, 100000] | Redpajama, MAP-en, Gopher, Dolma, ROOTS-en |
| CommonMeanWordLength | EFFECTIVENESS | check whether the mean word length is in [3, 10] | Redpajama, MAP-en, Gopher, Dolma |
| CommonSymbolWordRatio | EFFECTIVENESS | check whether the symbol/word ratio is > 0.1 | Redpajama, Gopher, Dolma |
| CommonAlphaWords | EFFECTIVENESS | check whether the ratio of words containing at least one alphabetic character is > 0.6 | Redpajama, MAP-en, Gopher, Dolma |
| CommonStopWord | EFFECTIVENESS | check whether the ratio of stop words is > 2 | Redpajama, MAP-en, Gopher, Dolma |
| CommonSentenceNumber | COMPLETENESS | check whether the number of sentences is >= 3 | Redpajama, MAP-en, FineWeb, C4 |
| CommonCurlyBracket | UNDERSTANDABILITY | check whether content contains curly brackets: { or } | Redpajama, C4 |
| CommonCapitalWords | UNDERSTANDABILITY | check whether the capital-word ratio is > 0.3 | Redpajama, MAP-en |
| CommonLoremIpsum | EFFECTIVENESS | check whether the ratio of "lorem ipsum" is < 3e-08 | Redpajama, MAP-en, FineWeb, C4 |
| CommonUniqueWords | UNDERSTANDABILITY | check whether the ratio of unique words is > 0.1 | Redpajama, MAP-en |
| CommonCharNumber | EFFECTIVENESS | check whether the number of chars is > 100 | MAP-en, Slimpajama |
| CommonLineStartWithBulletpoint | UNDERSTANDABILITY | check whether lines start with bullet points | Redpajama, MAP-en, Gopher, Dolma |
| CommonLineEndWithEllipsis | COMPLETENESS | check whether lines end with an ellipsis | Redpajama, MAP-en, Gopher, Dolma |
| CommonLineEndWithTerminal | COMPLETENESS | check whether lines end with a terminal punctuation mark | Redpajama, FineWeb, C4 |
| CommonLineWithJavascript | EFFECTIVENESS | check whether a line contains the word "Javascript" | Redpajama, FineWeb, C4 |

Contributing

We appreciate all contributions to Dingo. Please refer to CONTRIBUTING.md for the contributing guidelines.

License

This project is released under the Apache 2.0 license.
