
A library for calculating consensus entropy between multiple strings, particularly useful for OCR result analysis

Project description

Consensus Entropy | 共识熵


A Python library for calculating consensus entropy between multiple strings, particularly useful for OCR result analysis. It uses Levenshtein distance to measure the differences between strings.

This library is the official implementation of our paper: Consensus Entropy: Harnessing Multi-VLM Agreement for Self-Verifying and Self-Improving OCR

Citation

If you use this library in your research, please cite our paper:

@misc{zhang2025consensusentropyharnessingmultivlm,
      title={Consensus Entropy: Harnessing Multi-VLM Agreement for Self-Verifying and Self-Improving OCR}, 
      author={Yulong Zhang and Tianyi Liang and Xinyue Huang and Erfei Cui and Xu Guo and Pei Chu and Chenhui Li and Ru Zhang and Wenhai Wang and Gongshen Liu},
      year={2025},
      eprint={2504.11101},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.11101}
}

Installation

pip install consensus-entropy

Usage

Basic Usage

from consensus_entropy import calculate_consensus_entropy

# Calculate consensus entropy for multiple OCR results
ocr_results = [
    "Hello World",
    "Hello Wrld",
    "Hallo World"
]

# Calculate entropy values for each result
entropy_values = calculate_consensus_entropy(ocr_results, task_type="ocr")
print(entropy_values)  # [0.1667, 0.3333, 0.3333]
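Conceptually, each candidate's score reflects how far it is, on average, from the other candidates. The following is a minimal pure-Python sketch of that idea using a hand-rolled Levenshtein distance; the library's exact normalization may differ, so only the ranking (not the numbers) should be expected to match:

```python
from typing import List

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def mean_pairwise_distance(candidates: List[str]) -> List[float]:
    """For each candidate, average its normalized edit distance to all others."""
    scores = []
    for i, s in enumerate(candidates):
        others = [t for j, t in enumerate(candidates) if j != i]
        dists = [levenshtein(s, t) / max(len(s), len(t), 1) for t in others]
        scores.append(sum(dists) / len(dists))
    return scores

scores = mean_pairwise_distance(["Hello World", "Hello Wrld", "Hallo World"])
# The first candidate agrees most with the others, so it scores lowest.
```

The candidate that the other outputs cluster around gets the lowest score, which is why a low value signals a trustworthy OCR result.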

Get Best OCR Result

from consensus_entropy import get_best_ocr_result

# Get the OCR result with lowest entropy
ocr_results = ["Test1", "Test2", "Text2"]
best_result, best_entropy = get_best_ocr_result(ocr_results, task_type="ocr")
print(f"Best result: {best_result}")
print(f"Entropy: {best_entropy:.4f}")
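The same selection rule can be approximated with the standard library alone; in this sketch difflib's similarity ratio stands in for a normalized Levenshtein distance, and `pick_most_agreeing` is a hypothetical helper, not part of the package:

```python
from difflib import SequenceMatcher
from typing import List, Tuple

def pick_most_agreeing(candidates: List[str]) -> Tuple[str, float]:
    """Return the candidate with the lowest mean dissimilarity to the others."""
    if len(candidates) < 2:
        raise ValueError("need at least two candidates")
    best, best_score = "", float("inf")
    for i, s in enumerate(candidates):
        others = [t for j, t in enumerate(candidates) if j != i]
        # 1 - ratio() is a rough stand-in for a normalized edit distance.
        score = sum(1 - SequenceMatcher(None, s, t).ratio() for t in others) / len(others)
        if score < best_score:
            best, best_score = s, score
    return best, best_score

best, score = pick_most_agreeing(["Test1", "Test2", "Text2"])
# "Test2" shares the most with both neighbours, so it is selected.
```

Picking the argmin of the per-candidate scores is exactly the "lowest entropy wins" rule that `get_best_ocr_result` exposes.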

Features

  • Calculate normalized Levenshtein distance
  • Compute consensus entropy for multiple strings
  • Get the best OCR result with lowest entropy
  • Support for both English and Chinese text
  • Type hints for better IDE support
  • Optimized for OCR tasks

Requirements

  • Python 3.7+
  • numpy
  • python-Levenshtein

Notes

  • Currently only supports OCR task type
  • Input string list must contain at least two elements
  • All inputs will be converted to string type
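These constraints can also be enforced before calling the library; the guard below is a small illustrative sketch (`normalize_inputs` is a hypothetical name, not part of the package):

```python
from typing import Any, List

def normalize_inputs(results: List[Any]) -> List[str]:
    """Coerce every element to str and enforce the two-element minimum."""
    coerced = [str(r) for r in results]
    if len(coerced) < 2:
        raise ValueError("consensus entropy needs at least two candidate strings")
    return coerced

print(normalize_inputs(["Hello", 123]))  # non-string inputs are coerced to str
```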

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.


Chinese Text Examples

The same API handles Chinese strings:

from consensus_entropy import calculate_consensus_entropy

# Calculate consensus entropy for multiple OCR results
ocr_results = [
    "人工智能",
    "人工智障",
    "人工智能",
    "人工智惠"
]

# Calculate entropy values for each result
entropy_values = calculate_consensus_entropy(ocr_results, task_type="ocr")
print(entropy_values)  # [0.1667, 0.2500, 0.1667, 0.2500]

Selecting the best result works the same way:

from consensus_entropy import get_best_ocr_result

# Get the OCR result with lowest entropy
ocr_results = ["测试文本1", "测试文本2", "文本2"]
best_result, best_entropy = get_best_ocr_result(ocr_results, task_type="ocr")
print(f"Best result: {best_result}")
print(f"Entropy: {best_entropy:.4f}")



Download files

Download the file for your platform.

Source Distribution

consensus_entropy-0.1.0.tar.gz (5.4 kB)

Uploaded Source

Built Distribution


consensus_entropy-0.1.0-py3-none-any.whl (5.7 kB)

Uploaded Python 3

File details

Details for the file consensus_entropy-0.1.0.tar.gz.

File metadata

  • Download URL: consensus_entropy-0.1.0.tar.gz
  • Upload date:
  • Size: 5.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.10

File hashes

Hashes for consensus_entropy-0.1.0.tar.gz

  • SHA256: 94af2fe79af37f08f290f5a3dbc756a8584a2ec3bfb0571629f1039c204ec925
  • MD5: ea5cb8bba1b6b4c357eeb767e66e923a
  • BLAKE2b-256: ee757d6d6c84ed715644adb755cb1af4291daca1d79f8bed6c05137dd35427ce


File details

Details for the file consensus_entropy-0.1.0-py3-none-any.whl.

File metadata

File hashes

Hashes for consensus_entropy-0.1.0-py3-none-any.whl

  • SHA256: 2c659a13b2b38afb44ecbcf410d6865f2f28681bf23f9c9dd8a547f92f55304d
  • MD5: 55245f2d07e6764f544d9e70202c9279
  • BLAKE2b-256: 7ec7c0a179b56038094276e2401520a3a572f4d407fbe1fbf9e933452957518b

