DataXtractor is a versatile Python library designed to simplify the extraction of valuable data from a variety of sources, including images and PDF documents. Whether you need to extract text, tables, or structured content, DataXtractor provides powerful and intuitive tools to streamline the process.

DataXtractor Library

DataXtractor is a versatile library designed to extract text from PDF documents, with the ability to handle images and multi-column layouts. This README file provides an overview of the library's capabilities and how to use it effectively.

Features

The DataXtractor library offers the following key features:

1. Image to Text Extraction

DataXtractor is equipped to handle PDFs containing images. It utilizes Optical Character Recognition (OCR) to convert images embedded in PDF files into machine-readable text. This allows you to access and manipulate the textual content within images present in your PDF documents.

2. Multi-Column Text Extraction

If your PDF contains text arranged in multiple columns, DataXtractor can separate and extract the content of each column independently, so the extracted text stays structured and organized instead of mixing lines from adjacent columns.
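To make the idea of column-wise extraction concrete, here is a minimal sketch of how recognized text blocks could be put into column order. It is not DataXtractor's implementation: `TextBlock`, `read_in_column_order`, and the `split_x` boundary are hypothetical names introduced for illustration, and the sketch assumes a simple two-column page with a known split position.

```python
from typing import NamedTuple


class TextBlock(NamedTuple):
    """A piece of recognized text and the position of its top-left corner."""
    x: float
    y: float
    text: str


def read_in_column_order(blocks: list[TextBlock], split_x: float) -> str:
    """Return the text with the left column first, then the right column.

    Blocks whose x is below split_x belong to the left column; the rest
    belong to the right column. Each column is sorted top to bottom.
    """
    left = sorted((b for b in blocks if b.x < split_x), key=lambda b: b.y)
    right = sorted((b for b in blocks if b.x >= split_x), key=lambda b: b.y)
    return "\n".join(b.text for b in left + right)


# Hypothetical blocks in the order a naive row-by-row scan might find them.
blocks = [
    TextBlock(310, 10, "Right column, line 1"),
    TextBlock(20, 10, "Left column, line 1"),
    TextBlock(20, 40, "Left column, line 2"),
    TextBlock(310, 40, "Right column, line 2"),
]
print(read_in_column_order(blocks, split_x=300))
```

Sorting each column separately keeps a line from the right column from being interleaved with lines from the left column.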

3. Language Support

DataXtractor supports multiple languages for OCR operations. Specify the language as a code string through the lang parameter; if no language is specified, the library defaults to English (eng). You can also specify multiple languages for a more comprehensive text extraction process. The supported language codes are:

supported_language_codes = [ "ara", "aze", "aze_cyrl", "bel", "ben", "bod", "bos", "bul", "cat", "ceb", "ces", "chi_sim", "chi_sim_vert", "chi_tra", "chi_tra_vert", "chr", "cym", "dan", "deu", "deu-frak", "ell", "eng", "enm", "epo", "est", "eus", "fas", "fil", "fin", "fra", "frk", "frm", "fry", "gle", "glg", "grc", "guj", "hat", "heb", "hin", "hrv", "hun", "hye", "iku", "ind", "isl", "ita", "ita-old", "jav", "jpn", "jpn_vert", "kan", "kat", "kat-old", "kaz", "khm", "kir", "kor", "kor_vert", "lao", "lat", "lav", "lit", "ltz", "mal", "mar", "mkd", "mlt", "mon", "mri", "msa", "mya", "nep", "nld", "nor", "oci", "ori", "osd", "pan", "pol", "por", "pus", "ron", "rus", "san", "sin", "slk", "slv", "snd", "spa", "spa_old", "sqi", "srp", "srp_latn", "sun", "swa", "swe", "syr", "tam", "tat", "tel", "tgk", "tha", "tir", "ton", "tur", "uig", "ukr", "urd", "uzb", "uzb_cyrl", "vie", "yid", "yor" ]

Multi-language support is especially useful when working with PDFs that contain text in various languages.
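As a rough illustration of working with several language codes, the sketch below validates a list of codes and joins them with `+`, which is the form the Tesseract command line uses for multiple languages (for example `eng+fra`). Whether DataXtractor's lang parameter accepts this combined form is an assumption; `KNOWN_CODES` and `build_lang_string` are hypothetical names, and the set holds only a small subset of the codes listed above.

```python
# A small subset of the supported codes, kept short for illustration.
KNOWN_CODES = {"ara", "deu", "eng", "fra", "hin", "spa"}


def build_lang_string(codes: list[str]) -> str:
    """Validate language codes and join them Tesseract-style."""
    unknown = [code for code in codes if code not in KNOWN_CODES]
    if unknown:
        raise ValueError(f"Unsupported language code(s): {', '.join(unknown)}")
    # Drop duplicates while preserving the caller's order.
    unique = list(dict.fromkeys(codes))
    # Fall back to English when no language is given, matching the default.
    return "+".join(unique) if unique else "eng"


print(build_lang_string(["eng", "fra", "eng"]))
```

Validating codes before running OCR surfaces a typo such as "en" early, instead of failing later inside the OCR engine.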

4. PDF Text Extraction

DataXtractor is not limited to image-based PDFs. It can also extract text directly from PDF documents that contain text content. This feature allows you to process PDF files, whether they contain text alone or a combination of text and images.

Getting Started

To get started with the DataXtractor library, follow these steps:

  1. Installation: Install the DataXtractor library with a package manager (for example, pip install dataxtractor), or manually include it in your project.

  2. Library Initialization: Initialize the DataXtractor library in your code, specifying the language(s) to use for OCR, as well as any other required parameters.

  3. PDF Processing: Load your PDF document and apply the appropriate extraction functions based on your needs. For image-based PDFs, use OCR to convert images to text. For text-based PDFs, extract text directly.

  4. Output Handling: Receive the extracted text and use it as needed for further processing or analysis within your application.
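Step 3 above distinguishes text-based from image-based PDFs. One common way to make that choice is to try direct extraction first and fall back to OCR only when almost no text comes back. The sketch below shows that heuristic in isolation; `needs_ocr`, `extract_page`, the `min_chars` threshold, and the `run_ocr` callback are all hypothetical and are not part of DataXtractor's API.

```python
from typing import Callable


def needs_ocr(extracted_text: str, min_chars: int = 20) -> bool:
    """Guess whether a page is image-based from its directly extracted text.

    Only letters and digits are counted, so a page that yields just
    whitespace or stray symbols is still treated as needing OCR.
    """
    meaningful = [ch for ch in extracted_text if ch.isalnum()]
    return len(meaningful) < min_chars


def extract_page(direct_text: str, run_ocr: Callable[[], str]) -> str:
    """Return the direct text when it looks usable, otherwise the OCR result."""
    if needs_ocr(direct_text):
        return run_ocr()
    return direct_text


# The lambda stands in for a real OCR call, which is the slow path.
print(extract_page("  \n ", run_ocr=lambda: "text recovered by OCR"))
print(extract_page("This page already has a usable text layer.", run_ocr=lambda: ""))
```

Passing the OCR step as a callback means the expensive work runs only for pages that need it.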

You can convert a PDF into an image and then perform OCR on that image using two different languages. Additionally, you can crop the image into two parts for separate OCR processing.

Example

from pdf2image_dataextraction import data_extraction2parts


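# PDF to convert to an image before OCR
path = "sample.pdf"
# Sizes of the two parts the image is cropped into, passed as strings
left_partition = "40"
right_partition = "60"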
# OCR language code for each part (codes as listed under Language Support)
lang_part_first = "eng"
lang_part_second = "eng"
data = data_extraction2parts.DATA_EXTRACTION_2_PARTS(
    path, left_partition, right_partition, lang_part_first, lang_part_second
)
print(data)
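The README does not say how the partition values are interpreted. Assuming that "40" and "60" are percentages of the page width, the sketch below shows how such a split could be turned into two crop boxes. `partition_boxes` is a hypothetical helper, not a DataXtractor function; the boxes use the (left, upper, right, lower) order that Pillow's `Image.crop` expects.

```python
Box = tuple[int, int, int, int]


def partition_boxes(
    width: int, height: int, left_percent: str, right_percent: str
) -> tuple[Box, Box]:
    """Split a page of the given pixel size into a left and a right crop box."""
    left = float(left_percent)
    right = float(right_percent)
    if left <= 0 or right <= 0 or abs(left + right - 100) > 1e-9:
        raise ValueError("partitions must be positive and add up to 100")
    # x coordinate where the left part ends and the right part begins
    split_x = round(width * left / 100)
    return (0, 0, split_x, height), (split_x, 0, width, height)


left_box, right_box = partition_boxes(1000, 1400, "40", "60")
print(left_box)   # (0, 0, 400, 1400)
print(right_box)  # (400, 0, 1000, 1400)
```

Each box could then be cropped and passed to OCR with its own language code, which is what the two language arguments in the example above suggest.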

Extract tables from an XLS file. This also works when the XLS file contains images.

This feature requires Python 3.10.

from pdf_dataextraction import data_extraction_pdf


path = "sample.xls"
data = data_extraction_pdf.extract_table_from_xls(
    path,
    "/home/rahul.katoch/Desktop/Test/",
)
print(data)

You can also extract text from a PDF

from pdf_dataextraction import data_extraction_pdf


path = "sample.pdf"
data = data_extraction_pdf.extract_text_from_pdf(path)
print(data)

Extract tables from a PDF

from pdf_dataextraction import data_extraction_pdf


path = "sample.pdf"
data = data_extraction_pdf.extract_tables_dynamic_pdf(path)
print(data)

Extract links from a PDF

from pdf_dataextraction import data_extraction_pdf


path = "sample.pdf"
data = data_extraction_pdf.extract_links_with_text(path)
print(data)

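You can also extract data from images

First install Tesseract and its language packs system-wide (Debian/Ubuntu, requires root privileges):

sudo apt install tesseract-ocr-all

Then run the extraction:

from image_dataextraction import data_imageextraction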


path = "sample.jpeg"

data = data_imageextraction.Image_extraction(path)
print(data)

Extract images from an XLS file

from image_dataextraction import data_imageextraction


sheet = "sample.xls"
save_dir = "./output"

image_list = data_imageextraction.extract_images_from_sheet(sheet, save_dir)
print(image_list)

Contribute

If you find any issues or want to contribute to the DataXtractor library, please check the project's repository for information on how to get involved.

License

This library is released under the MIT License to encourage collaboration and use in various applications.


DataXtractor is a powerful library for extracting text from PDF documents, whether they contain images, multi-column layouts, or plain text. It supports multiple languages and can be a valuable tool for text extraction and data analysis in a wide range of applications.
