Overview

Clearedge is a Python package that simplifies extracting raw text and metadata from documents. Alongside the text, it returns metadata such as titles, subheadings, page numbers, file names, bounding box (bbox) coordinates, and chunk types. Whether you're working on document analysis, data extraction, or a RAG application built on an LLM, Clearedge provides a straightforward and efficient solution.

Features

  • Text Extraction: Extract raw text from documents (currently PDF only; other file types are coming soon).
  • Metadata Retrieval: Obtain metadata such as subheadings, page numbers, file names, bounding boxes and more.
  • Bounding Box Coordinates: Access bbox coordinates for text chunks, enabling spatial analysis of text placement within documents.
  • Chunk Type Identification: Identify the type of each text chunk (e.g., table or text) for advanced content analysis.
  • Support for Multiple Formats (coming soon): Planned compatibility with other popular document formats.
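
As a rough illustration of how chunk types can feed content analysis, the sketch below groups metadata dictionaries by one of their keys. It assumes each record is a plain dict like the one returned by `metadata.to_dict()` in the Quick Start. The key names `chunk_type` and `page_no` are hypothetical placeholders, not confirmed Clearedge field names, so inspect the real output before relying on them.

```python
from collections import defaultdict


def group_by_metadata_key(records, key):
    """Group metadata dicts by the value stored under `key`."""
    groups = defaultdict(list)
    for record in records:
        # Records that lack the key are collected under None rather than dropped.
        groups[record.get(key)].append(record)
    return dict(groups)


# Hand-written stand-ins for real metadata; the key names are assumptions.
sample = [
    {"chunk_type": "text", "page_no": 1},
    {"chunk_type": "table", "page_no": 1},
    {"chunk_type": "text", "page_no": 2},
]

grouped = group_by_metadata_key(sample, "chunk_type")
print({kind: len(items) for kind, items in grouped.items()})
```

Passing the key as a parameter keeps the helper usable whatever the actual field is called.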

Installation

Prerequisites

To install clearedge, you will need Python 3.8 or later.

Clearedge uses Tesseract, which must be installed separately as a system dependency.

On macOS, run:

brew install tesseract

On Ubuntu, run:

sudo apt install tesseract-ocr
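
To confirm that the Tesseract binary is reachable before going further, a small check like the following may help. It is a minimal sketch that uses only the Python standard library and assumes the executable is named `tesseract` and is on your `PATH`. The helper name `tesseract_version` is made up for this example and is not part of Clearedge.

```python
import shutil
import subprocess


def tesseract_version():
    """Return the first line of `tesseract --version`, or None if it is not on PATH."""
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    result = subprocess.run(
        [exe, "--version"], capture_output=True, text=True, check=False
    )
    # Some Tesseract builds print the version to stderr instead of stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else ""


version = tesseract_version()
if version is None:
    print("Tesseract was not found on PATH; install it first.")
else:
    print(f"Found Tesseract: {version}")
```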

Latest release

You can then install the latest release from PyPI:

pip install clearedge

Quick Start

Here's a simple example to get you started with clearedge:

from clearedge.reader.pdf import process_pdf

# Call the extractor with the path to your document.
# use_ocr=True gives more accurate output; omit it for faster processing.
chunks = process_pdf('/path/to/your/document.pdf', use_ocr=True)

# Extract text and metadata
for chunk in chunks:
    text, metadata = chunk.text, chunk.metadata
    print(text) # Accessing extracted text
    print(metadata.to_dict()) # Accessing metadata
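
For a RAG pipeline, it is often convenient to store each chunk as one JSON record. The sketch below shows one possible way to do that, assuming only what the example above shows: each chunk has a `.text` attribute and a `.metadata` object with a `to_dict()` method. The helper `chunks_to_jsonl` and the stand-in objects are hypothetical and exist only so the snippet runs without a PDF; in real use you would pass the result of `process_pdf`.

```python
import json
from types import SimpleNamespace


def chunks_to_jsonl(chunks):
    """Serialize chunks to JSON Lines: one {"text", "metadata"} object per line."""
    lines = []
    for chunk in chunks:
        record = {"text": chunk.text, "metadata": chunk.metadata.to_dict()}
        # default=str is a fallback in case a metadata value is not JSON-serializable.
        lines.append(json.dumps(record, ensure_ascii=False, default=str))
    return "\n".join(lines)


# Stand-in objects that mimic the shape used above; not real Clearedge output.
fake_metadata = SimpleNamespace(to_dict=lambda: {"example_key": "example value"})
fake_chunks = [SimpleNamespace(text="Hello, world.", metadata=fake_metadata)]

jsonl = chunks_to_jsonl(fake_chunks)
print(jsonl)
```

Keeping the metadata dictionary intact in each record means page numbers and bounding boxes stay available to downstream code without assuming specific field names here.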

Documentation

For more detailed information on all the features and functionalities of Clearedge, please refer to the official documentation (coming soon).

Contributing

Contributions to Clearedge are welcome! If you have suggestions for improvements or bug fixes, please feel free to:

  • Open an issue to discuss what you would like to change.
  • Submit a pull request for us to review.

Citation

If you wish to cite this project, feel free to use this BibTeX reference:

@misc{clearedge2024,
    title={clearedge: RAG preprocessor},
    author={Clearedge AI},
    year={2024},
    publisher={GitHub},
    howpublished={\url{https://github.com/Clearedge-AI/clearedge}}
}

License

Clearedge is released under the Apache 2.0 License. See the LICENSE file for more details.

Acknowledgments

This project was inspired by the need for a simple yet comprehensive tool for document analysis and metadata extraction. We thank all contributors and users for their support and feedback. Clearedge aims to be a valuable tool for developers, researchers, and anyone who processes and analyzes document content. We hope it simplifies your projects and helps you achieve your goals more efficiently.
