Language quality evaluation tool.
Introduction
Dingo is a data quality assessment tool that helps you automatically detect data quality issues in your datasets. It provides a variety of built-in detection rules and model-based methods, and also supports custom detection methods. It supports commonly used NLP and multimodal datasets, including pre-training, fine-tuning, and evaluation datasets. In addition, Dingo offers several interfaces, including a local CLI, an SDK, and a RESTful API, making it easy to integrate into evaluation platforms such as OpenCompass and simple-evals.
Architecture of Dingo
QuickStart
Install dingo:
pip install dingo
Try the following SDK demo code:
from dingo.model import Model
from dingo.io import RawInputModel
from dingo.exec import Executor

# Evaluate the alpaca dataset (fetched from Hugging Face by default)
# with the "sft" model group.
input_data = {
    "eval_models": ["sft"],
    "input_path": "tatsu-lab/alpaca",  # default from huggingface
    "data_format": "plaintext",
}
raw_input = RawInputModel(**input_data)
Model.apply_config(raw_input.custom_config_path)  # load any custom rule/model config

# Run the evaluation locally and print the result.
executor = Executor.exec_map["local"](raw_input)
result = executor.evaluate()
print(result)
You can also try the CLI:
python -m dingo.run.cli --input_path tatsu-lab/alpaca -e sft --data_format plaintext
Tutorials
Config
Dingo enables users to personalize their data quality inspection methods, which can include heuristic rules, third-party quality inspection tools or services, and large models. These are set up through configuration: users can pass a parameter named custom_config_path, which points to a configuration file. Below is a template for this configuration file: template.json
Rules
Heuristic rules are a common method for data processing and quality inspection. Dingo implements a series of heuristic rules and groups them into rule groups, such as zh-all and en-all, which represent the heuristic quality inspection rule sets for Chinese and English respectively. In the configuration file template, the two items related to heuristic rules are custom_rule_list and rule_config, which specify the rule set and the configuration parameters for a specific rule, respectively. Below is a configuration example:
{
    "custom_rule_list": [],
    "rule_config": {}
}
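As a concrete illustration, custom_rule_list can name rules from the Rule List table at the end of this document; rule_config is left empty here because its per-rule schema is defined by template.json:

{
    "custom_rule_list": ["CommonColonEnd", "CommonContentNull", "CommonWordNumber"],
    "rule_config": {}
}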
Large Models
Dingo supports data quality inspection using large models. Before use, users need to configure llm_config. For OpenAI models:
{
    "key": "YOUR_API_KEY"
}
For HuggingFace models (currently, only locally downloaded models are supported):
{
    "path": "your local model path"
}
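Once written, the configuration file can be passed to the SDK through custom_config_path. The QuickStart example reads raw_input.custom_config_path, which suggests RawInputModel accepts this field; the file path below is hypothetical:

input_data = {
    "eval_models": ["sft"],
    "input_path": "tatsu-lab/alpaca",
    "data_format": "plaintext",
    "custom_config_path": "config/my_config.json",  # hypothetical path to the JSON above
}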
Execute
Dingo can be run locally or on a Spark cluster.
Local Mode
In addition to the aforementioned SDK calls, you can also run data evaluation locally with the CLI:
python -m dingo.run.cli
The CLI parameters are as follows:

| Parameter | Description |
|---|---|
| -i or --input_path | The path of the data. It can be a file or a directory. |
| -e or --eval_models | The model used to evaluate data quality. |
| --dataset_id | The id of the input data. |
| --data_type | The type of data: json, jsonl, plaintext, or list json. |
| --output_path | The path of the result data. |
| --column_id | The column name of the id in the data. |
| --column_prompt | The column name of the prompt in the data. |
| --column_content | The column name of the content in the data. |
| --custom_config_path | The path of the custom config file. |
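For instance, here is a sketch of an invocation that evaluates a local jsonl file and maps its columns (the file name and column names are assumptions for illustration):

python -m dingo.run.cli \
    --input_path data/inputs/test_data1.jsonl \
    --eval_models sft \
    --data_type jsonl \
    --column_id data_id \
    --column_prompt prompt \
    --column_content content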
More information can be obtained by running: python -m dingo.run.cli --help
Spark Mode
If the scale of the data is very large, you can use Spark to run the project.
First, create a SparkExecutor object, and set the actual SparkSession and DataFrame instances.
from dingo.exec.spark import SparkExecutor
spark_exec = SparkExecutor()
spark_exec.set_spark(spark_session)
spark_exec.set_input_df(spark_data_frame)
Then, convert the data and execute the rule list.
spark_exec.convert_data(column_id=['data_id'], column_prompt=['prompt'], column_content=['content'])
spark_exec.execute(["CommonSpecialCharacter", "CommonColonEnd"])
Finally, summarize and get the result data.
spark_exec.summarize()
output_df = spark_exec.get_output_df()
summary = spark_exec.get_summary()
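Putting the Spark steps together, here is a minimal end-to-end sketch; the local SparkSession, the input file, and the column names are assumptions for illustration:

from pyspark.sql import SparkSession
from dingo.exec.spark import SparkExecutor

# Assumed setup: a local Spark session and a JSON source with
# data_id / prompt / content columns.
spark = SparkSession.builder.master("local[*]").appName("dingo-eval").getOrCreate()
df = spark.read.json("data/inputs/test_data1.json")

spark_exec = SparkExecutor()
spark_exec.set_spark(spark)
spark_exec.set_input_df(df)

# Convert the data, run two heuristic rules, and summarize.
spark_exec.convert_data(column_id=["data_id"], column_prompt=["prompt"], column_content=["content"])
spark_exec.execute(["CommonSpecialCharacter", "CommonColonEnd"])
spark_exec.summarize()

print(spark_exec.get_summary())
spark_exec.get_output_df().show()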
Evaluation Results
Summary
The summary.json file contains overall information about the evaluation results. Here is an example:
{
    "dataset_id": "20240618",
    "input_model": "default",
    "input_path": "data/inputs/test_data1.json",
    "output_path": "data/outputs/20240625_134409",
    "score": 90.0,
    "num_good": 90,
    "num_bad": 10,
    "total": 100,
    "error_ratio": {...}
}
The error_ratio field shows data quality signals in seven different aspects: EFFECTIVENESS, COMPLETENESS, UNDERSTANDABILITY, SIMILARITY, FLUENCY, RELEVANCE, and SECURITY.
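Since summary.json is plain JSON, the results are easy to post-process. Below is a small sketch that loads a summary and sanity-checks the counts; the output directory is the hypothetical one from the example above, and the relation score = 100 * num_good / total matches the example numbers but is an assumption:

import json
from pathlib import Path

out_dir = Path("data/outputs/20240625_134409")  # hypothetical run directory
summary = json.loads((out_dir / "summary.json").read_text())

# Good and bad items should account for the whole dataset.
assert summary["num_good"] + summary["num_bad"] == summary["total"]
print(f"score: {summary['score']}")            # 90.0 in the example above
print(f"error ratios: {summary['error_ratio']}")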
Detailed Results
For the detailed issues found in individual data items, Dingo creates files in directories named after the quality signals mentioned above. For example, CommonColonEnd.json in the QUALITY_SIGNAL_COMPLETENESS directory looks like this:
{
    "name": "CommonColonEnd",  # rule name
    "count": 1,
    "ratio": 0.5,
    "detail": [
        {
            "data_id": "0",
            "prompt": "",
            "content": "I am 8 years old. ^I love apple because:",
            "error_reason": "Ends with a colon."
        }
    ]
}
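The per-rule files share this shape, so a short loop can aggregate them across a quality-signal directory (the paths below are hypothetical, following the layout described above):

import json
from pathlib import Path

# Print the error count and ratio reported by each rule in one signal directory.
signal_dir = Path("data/outputs/20240625_134409/QUALITY_SIGNAL_COMPLETENESS")
for rule_file in sorted(signal_dir.glob("*.json")):
    report = json.loads(rule_file.read_text())
    print(f"{report['name']}: count={report['count']}, ratio={report['ratio']}")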
We evaluated the quality of the following three datasets with Dingo.
| Dataset | Dataset Type | EFFECTIVENESS | COMPLETENESS | UNDERSTANDABILITY | SIMILARITY | FLUENCY | RELEVANCE | SECURITY |
|---|---|---|---|---|---|---|---|---|
| SlimPajama-627B | Pretrain | 0 | 0.001797 | 0.011547 | 0.003563 | 0 | 0 | 0 |
| Stanford_alpaca | SFT | 0.0008 | 0.0004 | 0.0013 | 0.0002 | 0 | 0 | 0 |
| MMLU | Benchmark | 0.0064 | 0.0005 | 0.0113 | 0 | 0 | 0 | 0 |
Rule List
| Function Name | Type | Description | DataSet |
|---|---|---|---|
| CommonColonEnd | COMPLETENESS | Check whether the last character is ':'. | |
| CommonContentNull | EFFECTIVENESS | Check whether the content is null. | |
| CommonDocRepeat | SIMILARITY | Check whether the content repeats. | Redpajama, MAP-en, FineWeb, Gopher |
| CommonHtmlEntity | RELEVANCE | Check whether the content has HTML entities. | |
| CommonIDCard | SECURITY | Check whether the content contains an ID card number. | |
| CommonNoPunc | FLUENCY | Check whether the content has a paragraph without punctuation. | |
| CommonSpecialCharacter | RELEVANCE | Check whether the content has special characters. | |
| CommonWatermark | RELEVANCE | Check whether the content has watermarks. | |
| CommonWordNumber | EFFECTIVENESS | Check whether the number of words is in [20, 100000]. | Redpajama, MAP-en, Gopher, Dolma, ROOTS-en |
| CommonMeanWordLength | EFFECTIVENESS | Check whether the mean word length is in [3, 10]. | Redpajama, MAP-en, Gopher, Dolma |
| CommonSymbolWordRatio | EFFECTIVENESS | Check whether the symbol-to-word ratio is > 0.1. | Redpajama, Gopher, Dolma |
| CommonAlphaWords | EFFECTIVENESS | Check whether the ratio of words containing at least one alphabetic character is > 0.6. | Redpajama, MAP-en, Gopher, Dolma |
| CommonStopWord | EFFECTIVENESS | Check whether the ratio of stop words is > 2. | Redpajama, MAP-en, Gopher, Dolma |
| CommonSentenceNumber | COMPLETENESS | Check whether the number of sentences is >= 3. | Redpajama, MAP-en, FineWeb, C4 |
| CommonCurlyBracket | UNDERSTANDABILITY | Check whether the content contains a curly bracket: { or }. | Redpajama, C4 |
| CommonCapitalWords | UNDERSTANDABILITY | Check whether the ratio of capitalized words is > 0.3. | Redpajama, MAP-en |
| CommonLoremIpsum | EFFECTIVENESS | Check whether the ratio of "lorem ipsum" is < 3e-08. | Redpajama, MAP-en, FineWeb, C4 |
| CommonUniqueWords | UNDERSTANDABILITY | Check whether the ratio of unique words is > 0.1. | Redpajama, MAP-en |
| CommonCharNumber | EFFECTIVENESS | Check whether the number of characters is > 100. | MAP-en, Slimpajama |
| CommonLineStartWithBulletpoint | UNDERSTANDABILITY | Check whether lines start with bullet points. | Redpajama, MAP-en, Gopher, Dolma |
| CommonLineEndWithEllipsis | COMPLETENESS | Check whether lines end with an ellipsis. | Redpajama, MAP-en, Gopher, Dolma |
| CommonLineEndWithTerminal | COMPLETENESS | Check whether lines end with a terminal punctuation mark. | Redpajama, FineWeb, C4 |
| CommonLineWithJavascript | EFFECTIVENESS | Check whether lines contain the word "Javascript". | Redpajama, FineWeb, C4 |
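To make the rule descriptions concrete, here is an illustrative stand-alone re-implementation of the CommonColonEnd check. It mirrors the description in the table but is not Dingo's actual code; handling of the full-width colon is an assumption:

def ends_with_colon(content: str) -> bool:
    # Flag content whose last non-whitespace character is a colon,
    # as described for CommonColonEnd above.
    stripped = content.rstrip()
    return stripped.endswith(":") or stripped.endswith("：")

print(ends_with_colon("I love apple because:"))  # True
print(ends_with_colon("I love apples."))         # False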
Contributing
We appreciate all contributions to Dingo. Please refer to CONTRIBUTING.md for the contributing guidelines.
License
This project is released under the Apache 2.0 license.