A package to extract Bengali text from PDFs using OCR

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

Bangla PDF OCR

Bangla PDF OCR extracts Bengali text from PDF files using OCR. It is designed for simplicity and works on Windows, macOS, and Linux. A one-time setup command installs the system dependencies, so no manual downloads or configuration are normally needed.

Key Features

- Extracts Bengali text from PDFs quickly and accurately
- Works on Windows, macOS, and Linux
- Easy to use from both command line and Python scripts
- Installs all necessary components automatically
- Supports other languages besides Bengali

Quick Start

Install the package:
```
pip install bangla-pdf-ocr
```
Run the setup command to install dependencies:
```
bangla-pdf-ocr-setup
```

Start using it right away!

From the command line:
```
bangla-pdf-ocr your_file.pdf
```
In your Python script:
```
from bangla_pdf_ocr import process_pdf
text = process_pdf("your_file.pdf")
print(text)
```

That's it! No additional downloads or configurations needed.

Features

- Extract Bengali text from PDF files
- Support for other languages through Tesseract OCR
- Easy-to-use command-line interface
- Automatic installation of dependencies (OS-specific)
- Multi-threaded processing for improved performance
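
The README does not describe how the multi-threaded processing is implemented. As a rough illustration of the general idea only, the sketch below runs a per-page OCR function over a thread pool while keeping the pages in order. The names `ocr_pages_in_parallel` and `fake_ocr` are hypothetical and are not part of the package.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

Page = TypeVar("Page")


def ocr_pages_in_parallel(
    pages: Sequence[Page],
    ocr_page: Callable[[Page], str],
    max_workers: int = 4,
) -> List[str]:
    """Run ocr_page on every page in a thread pool and keep page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order,
        # even when pages finish out of order.
        return list(pool.map(ocr_page, pages))


def fake_ocr(page: str) -> str:
    # Hypothetical stand-in for a real OCR call; it only tags its input.
    return "text of " + page


if __name__ == "__main__":
    for text in ocr_pages_in_parallel(["page-1", "page-2", "page-3"], fake_ocr):
        print(text)
```

Threads help here when the per-page work is done outside the interpreter, for example in an external OCR process, which is an assumption about the workload rather than a statement about this package.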

Prerequisites

- Python 3.6 or higher
- pip (Python package installer)

Installation

Install the package from PyPI:
```
pip install bangla-pdf-ocr
```
Set up system dependencies:
```
bangla-pdf-ocr-setup
```
This command installs necessary dependencies based on your operating system:
- Linux: Installs tesseract-ocr, poppler-utils, and tesseract-ocr-ben
- macOS: Installs tesseract, poppler, and tesseract-lang via Homebrew
- Windows: Downloads and installs Tesseract OCR and Poppler, adding them to the system PATH
Note: On Windows, you may need to run the command prompt as administrator.
Verify the installation:
```
bangla-pdf-ocr-verify
```
This command checks if all required dependencies are properly installed and accessible.
Try a sample PDF extraction:
```
bangla-pdf-ocr
```
This command processes a sample Bengali PDF file included with the package, demonstrating the text extraction capabilities.
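
If you want to check the system dependencies by hand, a small script can confirm that the executables are on your PATH. This is a minimal sketch and not the logic of `bangla-pdf-ocr-verify`. The executable names `tesseract` and `pdftoppm` (one of the Poppler tools) are assumptions about what needs to be reachable, and the helper names are made up for this example.

```python
import shutil
from typing import Dict, List, Optional, Sequence

# Assumed executable names: `tesseract` (Tesseract OCR) and `pdftoppm` (Poppler).
DEFAULT_TOOLS = ("tesseract", "pdftoppm")


def find_tools(tools: Sequence[str] = DEFAULT_TOOLS) -> Dict[str, Optional[str]]:
    """Map each tool name to its full path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


def missing_tools(found: Dict[str, Optional[str]]) -> List[str]:
    """Return the names of the tools that could not be located."""
    return [tool for tool, path in found.items() if path is None]


if __name__ == "__main__":
    found = find_tools()
    for tool, path in found.items():
        print(tool, "->", path or "NOT FOUND")
    if missing_tools(found):
        print("Missing:", ", ".join(missing_tools(found)))
```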

Usage

Command-line Interface

Basic usage:
```
bangla-pdf-ocr [input_pdf] [-o output_file] [-l language]
```

Options:

- input_pdf: Path to the input PDF file (optional, uses a sample PDF if not provided)
- -o, --output: Specify the output file path (default: input filename with .txt extension)
- -l, --language: Specify the OCR language (default: 'ben' for Bengali)
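
When calling the CLI from another program, it can help to build the argument list in code. The sketch below uses only the options listed above. `build_command` and `default_output_path` are hypothetical helper names, and `default_output_path` assumes the default output file is written next to the input file, which the option description does not state explicitly.

```python
from pathlib import Path
from typing import List, Optional


def default_output_path(input_pdf: str) -> Path:
    """Input filename with a .txt extension (assumed to sit beside the input)."""
    return Path(input_pdf).with_suffix(".txt")


def build_command(
    input_pdf: Optional[str] = None,
    output: Optional[str] = None,
    language: Optional[str] = None,
) -> List[str]:
    """Build an argument list for the bangla-pdf-ocr command."""
    command = ["bangla-pdf-ocr"]
    if input_pdf:
        command.append(input_pdf)
    if output:
        command += ["-o", output]
    if language:
        command += ["-l", language]
    return command


if __name__ == "__main__":
    print(build_command("path/to/my_document.pdf", language="ben"))
```

The resulting list can be passed to `subprocess.run(command, check=True)` once the package is installed.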

Examples:

Process the default sample PDF:
```
bangla-pdf-ocr
```
Process a specific PDF:
```
bangla-pdf-ocr path/to/my_document.pdf
```

Specify an output file:
```
bangla-pdf-ocr path/to/my_document.pdf -o path/to/extracted_text.txt
```

Using as a Python Module

You can also use Bangla PDF OCR as a module in your Python scripts. Here's an example:

```
from bangla_pdf_ocr import process_pdf

# Process a PDF file
input_pdf = "path/to/your/document.pdf"
output_file = "path/to/output/extracted_text.txt"
language = "ben"  # Use "ben" for Bengali or other language codes as needed

extracted_text = process_pdf(input_pdf, output_file, language)

# The extracted text is now in the 'extracted_text' variable
# and has also been saved to the output file

print(f"Text extracted and saved to: {output_file}")
```

This allows you to integrate Bangla PDF OCR functionality directly into your Python projects, giving you more control over the OCR process and enabling you to use the extracted text in your applications.
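
As one possible integration, the sketch below processes every PDF in a folder. It assumes `process_pdf` takes the input path, output path, and language positionally and returns the extracted text, as in the example above. `process_folder` is a hypothetical helper, and the `process` parameter exists only so the loop can be tried with a stand-in function.

```python
from pathlib import Path
from typing import Callable, Dict, Optional


def process_folder(
    folder: str,
    process: Optional[Callable[[str, str, str], str]] = None,
    language: str = "ben",
) -> Dict[str, str]:
    """OCR every *.pdf in folder and return {pdf filename: extracted text}."""
    if process is None:
        # Imported lazily so the helper can be exercised without the package.
        from bangla_pdf_ocr import process_pdf as process

    results: Dict[str, str] = {}
    for pdf in sorted(Path(folder).glob("*.pdf")):
        # Write each result next to its source PDF with a .txt extension.
        output_file = pdf.with_suffix(".txt")
        results[pdf.name] = process(str(pdf), str(output_file), language)
    return results


if __name__ == "__main__":
    for name, text in process_folder("path/to/pdf_folder").items():
        print(name, len(text), "characters")
```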

Troubleshooting

If you encounter any issues:

Run the verification command:
```
bangla-pdf-ocr-verify
```
For Windows users:
- Run the setup and verify commands from a command prompt opened as administrator if you encounter permission issues.
- Restart your command prompt or IDE after installation to ensure PATH changes take effect.
Check the console output and logs for any error messages.
If automatic installation fails, refer to the manual installation instructions provided by the setup command.
Ensure you have the latest version of the package:
```
pip install --upgrade bangla-pdf-ocr
```
If problems persist, please open an issue on our GitHub repository with detailed information about the error and your system configuration.

Reporting Issues

If you encounter any problems or have suggestions for Bangla PDF OCR:

Check existing issues to see if your issue has already been reported.
If not, create a new issue on our GitHub repository.
Provide detailed information about the problem, including steps to reproduce it.

We appreciate your feedback to help improve Bangla PDF OCR!

Happy OCR processing!

Release history

- 0.1.1 (this version): Oct 12, 2024
- 0.1.0: Oct 11, 2024

Download files

Source Distribution

bangla_pdf_ocr-0.1.1.tar.gz (81.9 kB view details)

Uploaded Oct 12, 2024 Source

Built Distribution

bangla_pdf_ocr-0.1.1-py3-none-any.whl (78.3 kB view details)

Uploaded Oct 12, 2024 Python 3

File details

Details for the file bangla_pdf_ocr-0.1.1.tar.gz.

File metadata

Download URL: bangla_pdf_ocr-0.1.1.tar.gz
Upload date: Oct 12, 2024
Size: 81.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.10.13

File hashes

Hashes for bangla_pdf_ocr-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`f3e1fb6691d2217ccc31009cc2613c9f8b29445cd1e384ce5becc18e42107f65`
MD5	`6aab3ee2934bbb3ad741326324eb87d6`
BLAKE2b-256	`b0243d6a0e0fefbbf0222b76e510c1397a4b1d34fbe5b1f5e9e5edfdaabe9035`
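
To compare a downloaded file against the SHA256 digest listed above, the standard library is enough. This is a minimal sketch: `sha256_of` is a hypothetical helper name, and the local file path in the usage section is an assumption about where you saved the download.

```python
import hashlib

# SHA256 digest for bangla_pdf_ocr-0.1.1.tar.gz, copied from the table above.
EXPECTED_SHA256 = "f3e1fb6691d2217ccc31009cc2613c9f8b29445cd1e384ce5becc18e42107f65"


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Assumed local path of the downloaded source distribution.
    actual = sha256_of("bangla_pdf_ocr-0.1.1.tar.gz")
    print("match" if actual == EXPECTED_SHA256 else "MISMATCH: " + actual)
```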

File details

Details for the file bangla_pdf_ocr-0.1.1-py3-none-any.whl.

File metadata

Download URL: bangla_pdf_ocr-0.1.1-py3-none-any.whl
Upload date: Oct 12, 2024
Size: 78.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.10.13

File hashes

Hashes for bangla_pdf_ocr-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`95dd886ca747898c27cf72b8bad620e931c457ee2cf6d5c59688a32104595cde`
MD5	`d1c7e2e40cd8dd36f088a5fc7ebbb4d0`
BLAKE2b-256	`f692e0e2788966a0e0c15b647fa6816eb99f54b487232a1278dc3a76d40101d5`
