assess unstructured data quality

Reason this release was yanked:

unintentional files

Project description

A Python library for assessing the quality of unstructured data. It provides tools to evaluate unstructured documents, including checks for consistency, completeness, accuracy, and PII contamination. It can analyze documents such as PDFs, text files, and Markdown files.

Installation

pip install lightudq

Configure your LLM key

Set the key once – either export it in your shell or place it in a .env file (auto‑loaded).

Provider    Example model_name       Required env variable
OpenAI      openai:gpt-4o            OPENAI_API_KEY
Anthropic   anthropic:claude-3       ANTHROPIC_API_KEY
Cohere      cohere:command-r         COHERE_API_KEY
Mistral     mistral:mixtral-8x22b    MISTRAL_API_KEY

# Option A – shell export
export OPENAI_API_KEY="sk-…"

# Option B – .env file in project root
echo "OPENAI_API_KEY=sk-…" > .env
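
Before running any checks, it can help to confirm that the variable matching the chosen model is actually set. The sketch below is not part of lightudq: `required_env_var` and `key_is_configured` are hypothetical helpers, and the mapping is copied from the provider table above. It only inspects the process environment, so a key that lives solely in a .env file will not be seen until something has loaded that file.

```python
import os

# Copied from the provider table above: model_name prefix -> env variable.
REQUIRED_ENV = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "cohere": "COHERE_API_KEY",
    "mistral": "MISTRAL_API_KEY",
}


def required_env_var(model_name):
    """Return the env variable name for a 'provider:model' string."""
    provider, sep, _ = model_name.partition(":")
    if not sep or provider not in REQUIRED_ENV:
        raise ValueError(f"unrecognised model_name: {model_name!r}")
    return REQUIRED_ENV[provider]


def key_is_configured(model_name, environ=None):
    """True if the required variable is present and non-empty."""
    env = os.environ if environ is None else environ
    return bool(env.get(required_env_var(model_name)))
```

For example, `key_is_configured("openai:gpt-4o")` returns False when OPENAI_API_KEY has not been exported in the current shell.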

Usage

Quality check of a document

from lightudq.document_quality import DocumentQuality
dq = DocumentQuality('tests/doc_samples/corrupt_description.txt')
res = dq.run()
# profile contains auto-generated QnA pairs answered by the document, along with a document summary
print(res.profile)
"""
{'title': 'corrupt_description.txt', 'wordCount': 310, 'qnaPairs': {'qna_pairs': [{'question': 'What is Fict.AI known for in the tech industry?',...}"""
# inconsistency checks whether the answers to the auto-generated QnA pairs contradict each other
print(res.inconsistency)
"""
{'inconsistent_facts': 2, 'metadata': [{'original': 'Fict.AI is headquartered in Austin, ....}
"""
# pii checks if the document contains any personally identifiable information
print(res.pii)
"""
{'present': True, 'metadata': ['Name: James Smith', 'Date of Birth: September 23, 1970'], 'count': 2}
"""
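
The printed results above can be reduced to a short list of findings, for instance to gate a document before it enters a pipeline. This is a minimal sketch and not part of lightudq: `quality_issues` is a hypothetical helper that works on plain dicts shaped like the printed `pii` and `inconsistency` outputs. How the actual result objects convert to dicts is not shown here and is left as an assumption.

```python
def quality_issues(pii, inconsistency):
    """Summarise PII and inconsistency findings as human-readable strings.

    Both arguments are plain dicts shaped like the printed outputs above.
    """
    issues = []
    if pii.get("present"):
        issues.append(f"PII found: {pii.get('count', 0)} item(s)")
    # Treat a missing or None count as zero inconsistent facts.
    n_inconsistent = inconsistency.get("inconsistent_facts") or 0
    if n_inconsistent > 0:
        issues.append(f"{n_inconsistent} inconsistent fact(s)")
    return issues


# Sample values copied from the outputs shown above (metadata omitted).
sample_pii = {
    "present": True,
    "metadata": ["Name: James Smith", "Date of Birth: September 23, 1970"],
    "count": 2,
}
sample_inconsistency = {"inconsistent_facts": 2}
print(quality_issues(sample_pii, sample_inconsistency))
```

An empty list means neither check reported a problem.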

Add custom metrics to document quality checks

Custom metrics can be added to the document quality checks to evaluate specific aspects of a document.

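from typing import Optional

from pydantic import BaseModel

# CustomMetric is provided by lightudq; its import path is not shown in the original example.

class CustomMetricOutput(BaseModel):
    result: Optional[int] = None

revenue_metric = CustomMetric(name="revenue", prompt="what is the revenue?", outputModel=CustomMetricOutput)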
dq.add_custom_metric(revenue_metric)
res = dq.run()
print(res.custom_metrics)
"""
{'revenue': {'result': 120000}}
"""

Edit the auto-generated profile before running quality checks

The auto-generated profile can be edited before running the quality checks. This is useful when the auto-generated QnA pairs are insufficient or need to be modified.

dq = DocumentQuality('tests/doc_samples/corrupt_description.txt')
dq.get_document_profile()
print(dq.profile.qnaPairs)
"""
qna_pairs=[QnAPair(question='Where is Fict.AI headquartered?', answer='Fict.AI is headquartered in the vibrant city of Austin.'), QnAPair(question='How much revenue does Fict.AI currently generate?', answer='Fict.AI currently generates an impressive revenue of $120,000.'), QnAPair(question='Who is the CFO of Fict.AI and since when has he been in that position?', answer='The CFO of Fict.AI is James Smith, who has been in the position since 2015.'), QnAPair(question="What factor contributes to Fict.AI's ability to form collaborations and partnerships?", answer="Fict.AI's strategic location in Austin provides easy access to numerous tech firms and talent, fostering an environment conducive to collaborations and partnerships."), QnAPair(question="What significant role does James Smith have in Fict.AI's success?", answer='James Smith, the CFO of Fict.AI, has played a crucial role in financial decision-making and has successfully guided the company to its current financial stability.')]
"""
# edit the profile before running quality checks
dq.profile.qnaPairs = QnAPairs(qna_pairs=[
    QnAPair(question='What is Fict.AI known for in the tech industry?', answer='AI solutions'),
    QnAPair(question='Where is Fict.AI located?', answer='Austin, Texas'),
])
res = dq.run()
# no inconsistency with new qna pairs
print(res.inconsistency)
"""
reasoning=None inconsistent_facts=0 metadata=None
"""

Compare documents or versions of the same document

A document can be compared against a reference profile to check completeness and accuracy. This is useful for evaluating different versions of the same document or for checking a document against a known reference.

reference_dq = DocumentQuality(file_path='tests/doc_samples/base_description.pdf')
reference_profile = reference_dq.get_document_profile()
dq = DocumentQuality(file_path='tests/doc_samples/corrupt_description.txt')
res = dq.compare(reference_profile=reference_profile)
# questions from the reference profile that are not answered in the current document
print(res.incompleteness)
"""
{'questions': ["What is Fict.AI's net income for the fiscal year?"], ...}
"""
# facts that are inconsistent with the reference profile
print(res.inaccuracy)
"""
{'inconsistent_facts': 2, 'metadata': [{'original': 'Fict.AI is headquartered in Austin, Texas and ....}
"""
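
The list of unanswered questions can be turned into a single completeness figure for tracking across document versions. The sketch below is an illustration only: `completeness_ratio` is a hypothetical helper, not a lightudq function, and it assumes you can obtain the reference questions and the unanswered questions as plain lists of strings.

```python
def completeness_ratio(reference_questions, unanswered_questions):
    """Fraction of distinct reference questions that the document answers.

    Returns 1.0 when there are no reference questions to answer.
    """
    reference = set(reference_questions)
    if not reference:
        return 1.0
    # Only count unanswered questions that belong to the reference set.
    missing = reference & set(unanswered_questions)
    return (len(reference) - len(missing)) / len(reference)


# Hypothetical reference questions; the last one is reported as unanswered above.
reference = [
    "Where is Fict.AI headquartered?",
    "Who is the CFO of Fict.AI?",
    "How much revenue does Fict.AI currently generate?",
    "What is Fict.AI's net income for the fiscal year?",
]
unanswered = ["What is Fict.AI's net income for the fiscal year?"]
print(completeness_ratio(reference, unanswered))
```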

API documentation

For more detailed information on the API, please refer to the API documentation.

License

This project is licensed under the MIT License.

Download files

Source Distribution

lightudq-0.1.3.tar.gz (240.1 kB view details)

Uploaded Source

Built Distribution

lightudq-0.1.3-py3-none-any.whl (9.9 kB view details)

Uploaded Python 3

File details

Details for the file lightudq-0.1.3.tar.gz.

File metadata

  • Download URL: lightudq-0.1.3.tar.gz
  • Upload date:
  • Size: 240.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.9.19

File hashes

Hashes for lightudq-0.1.3.tar.gz
Algorithm Hash digest
SHA256 03d9ba81a2b54ec0931f4e320c6f5eec20247a5588d335dd1afe989b6d738abf
MD5 55414616e98fb7d48061aa9fb2c3c32c
BLAKE2b-256 a1c75b7bcf49da4dcc49ce5f2226a5bd83a01e6073c50d73c6a43b606e0c0d4d

File details

Details for the file lightudq-0.1.3-py3-none-any.whl.

File metadata

  • Download URL: lightudq-0.1.3-py3-none-any.whl
  • Upload date:
  • Size: 9.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.9.19

File hashes

Hashes for lightudq-0.1.3-py3-none-any.whl
Algorithm Hash digest
SHA256 ada1653ea4963789a8624a8a411d68aa9f376c92ce70e55d800ee37d9e35a91f
MD5 1874ca4476c421196bd5069d9b8291d9
BLAKE2b-256 6d11f876b83bb229e9f099dd1d154116d89db3faa62bc77a6191cb35fb92e8ed
