The DecisionFacts Extraction Library extracts content from PDF, PPTX, DOCX, PNG, and JPG files and converts it to plain text or structured JSON data.

DF Extract Lib

Requirements

Python 3.10+

Installation

# Using pip
$ python -m pip install df-extract

# Manual install
$ python -m pip install .
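After installing, it can help to confirm that the interpreter meets the Python 3.10+ requirement and that the package is actually present. The following is a minimal sketch using only the standard library. The helper names `installed_version` and `python_is_supported` are hypothetical, and the distribution name `df-extract` is assumed from the pip command above.

```python
from __future__ import annotations

import sys
from importlib import metadata


def installed_version(dist_name: str = "df-extract") -> str | None:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def python_is_supported(minimum: tuple[int, int] = (3, 10)) -> bool:
    """Check the running interpreter against the Python 3.10+ requirement."""
    return sys.version_info[:2] >= minimum
```

Calling `installed_version()` returns a version string when the install succeeded and `None` otherwise.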

1. To extract content from PDF

from df_extract.pdf import ExtractPDF


path = "/home/test/ABC.pdf"

extract_pdf = ExtractPDF(file_path=path)

# By default, the output is written as text
extract_pdf.extract()  # Output is written to `/home/test/ABC.pdf.txt`

# Write the output as JSON
extract_pdf.extract(as_json=True)  # Output is written to `/home/test/ABC.pdf.json`
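Once the JSON file has been written, it can be read back with the standard `json` module. The sketch below assumes the output sits next to the source file with `.json` appended, as the comments above indicate, and that it is UTF-8 encoded. The helper name `load_extracted_json` is hypothetical, and the structure of the JSON is not documented here, so the result should be inspected before relying on specific keys.

```python
import json
from pathlib import Path


def load_extracted_json(source_path):
    """Load `<source_path>.json`, the default output location shown above."""
    output_path = Path(str(source_path) + ".json")
    # Assumption: the library writes UTF-8 encoded JSON.
    with output_path.open(encoding="utf-8") as handle:
        return json.load(handle)
```

For example, `load_extracted_json("/home/test/ABC.pdf")` would read `/home/test/ABC.pdf.json` after `extract(as_json=True)` has run.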

To change the output directory, pass the output_dir parameter:

from df_extract.pdf import ExtractPDF


path = "/home/test/ABC.pdf"

extract_pdf = ExtractPDF(file_path=path, output_dir="/home/test/output")
extract_pdf.extract()
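The `output_dir` parameter is convenient when processing many files at once. The sketch below is one possible way to do that, not part of the library: `extract_folder` is a hypothetical helper that takes the extractor class as an argument and uses only the `file_path`, `output_dir`, and `extract()` names shown above. Creating the output directory in advance is an assumption, since the README does not say whether the library creates it.

```python
from pathlib import Path


def extract_folder(extractor_cls, input_dir, output_dir, pattern="*.pdf"):
    """Run one extractor per matching file, sending all results to output_dir."""
    out = Path(output_dir)
    # Assumption: pre-creating the directory is harmless if the library
    # would also create it.
    out.mkdir(parents=True, exist_ok=True)
    processed = []
    for source in sorted(Path(input_dir).glob(pattern)):
        extractor = extractor_cls(file_path=str(source), output_dir=str(out))
        extractor.extract()
        processed.append(source.name)
    return processed
```

With the import from the example above, a call would look like `extract_folder(ExtractPDF, "/home/test", "/home/test/output")`.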

Extract content from PDF with image data

This requires the easyocr package to be installed.

from df_extract.base import ImageExtract
from df_extract.pdf import ExtractPDF


path = "/home/test/ABC.pdf"

image_extract = ImageExtract(model_download_enabled=True)
extract_pdf = ExtractPDF(file_path=path, image_extract=image_extract)
extract_pdf.extract()
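Because easyocr is an extra requirement, a script may want to check for it before enabling image extraction. The following sketch uses `importlib.util.find_spec` from the standard library. The helpers `module_available` and `pdf_extractor_kwargs` are hypothetical, and only the `ImageExtract(model_download_enabled=True)` and `image_extract=` names come from the example above.

```python
from importlib import util


def module_available(name):
    """Return True if a top-level module can be imported in this environment."""
    return util.find_spec(name) is not None


def pdf_extractor_kwargs(path):
    """Build keyword arguments for ExtractPDF, adding OCR only when possible."""
    kwargs = {"file_path": path}
    if module_available("easyocr"):
        # Imported lazily so this module loads even without df_extract.
        from df_extract.base import ImageExtract

        kwargs["image_extract"] = ImageExtract(model_download_enabled=True)
    return kwargs
```

The result can then be unpacked into the constructor, for example `ExtractPDF(**pdf_extractor_kwargs(path))`.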

2. To extract content from PPT and PPTX

from df_extract.pptx import ExtractPPTx


path = "/home/test/DEF.pptx"

extract_pptx = ExtractPPTx(file_path=path)

# By default, the output is written as text
extract_pptx.extract()  # Output is written to `/home/test/DEF.pptx.txt`

# Write the output as JSON
extract_pptx.extract(as_json=True)  # Output is written to `/home/test/DEF.pptx.json`

3. To extract content from DOC and DOCX

from df_extract.docx import ExtractDocx


path = "/home/test/GHI.docx"

extract_docx = ExtractDocx(file_path=path)

# By default, the output is written as text
extract_docx.extract()  # Output is written to `/home/test/GHI.docx.txt`

# Write the output as JSON
extract_docx.extract(as_json=True)  # Output is written to `/home/test/GHI.docx.json`

4. To extract content from PNG, JPEG and JPG

from df_extract.image import ExtractImage


path = "/home/test/JKL.png"

extract_png = ExtractImage(file_path=path)

# By default, the output is written as text
extract_png.extract()  # Output is written to `/home/test/JKL.png.txt`

# Write the output as JSON
extract_png.extract(as_json=True)  # Output is written to `/home/test/JKL.png.json`
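The four extractors above share the same `file_path` constructor argument and `extract()` method, so a caller can pick one based on the file extension. The sketch below is one way to do that and is not part of the library: `EXTRACTORS`, `resolve_extractor`, and `extract_any` are hypothetical names. The module and class names are taken from the examples above, and the `.ppt` and `.doc` entries are assumed from the section headings rather than from a tested example.

```python
from importlib import import_module
from pathlib import Path

# Extension -> (module, class), using the names from the examples above.
EXTRACTORS = {
    ".pdf": ("df_extract.pdf", "ExtractPDF"),
    ".ppt": ("df_extract.pptx", "ExtractPPTx"),   # assumed from the heading
    ".pptx": ("df_extract.pptx", "ExtractPPTx"),
    ".doc": ("df_extract.docx", "ExtractDocx"),   # assumed from the heading
    ".docx": ("df_extract.docx", "ExtractDocx"),
    ".png": ("df_extract.image", "ExtractImage"),
    ".jpg": ("df_extract.image", "ExtractImage"),
    ".jpeg": ("df_extract.image", "ExtractImage"),
}


def resolve_extractor(path):
    """Return the (module, class) pair for a file, based on its extension."""
    suffix = Path(path).suffix.lower()
    try:
        return EXTRACTORS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix!r}") from None


def extract_any(path, as_json=False):
    """Import the matching extractor lazily and run it on the file."""
    module_name, class_name = resolve_extractor(path)
    extractor_cls = getattr(import_module(module_name), class_name)
    extractor_cls(file_path=str(path)).extract(as_json=as_json)
```

The lazy import keeps the lookup table usable even where df_extract is not installed, and an unsupported extension fails early with a `ValueError`.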
