DecisionFacts Extraction Library extracts content from PDF, PPTX, DOCX, PNG, and JPG files and converts it to structured JSON data.

DF Extract Lib

Requirements

Python 3.10+
asyncio

Installation

# Using pip
$ python -m pip install df-extract

# Manual install
$ python -m pip install .

1. To extract content from PDF

from df_extract.pdf import ExtractPDF


path = "/home/test/ABC.pdf"

extract_pdf = ExtractPDF(file_path=path)

# By default, the output is plain text
await extract_pdf.extract()  # Output is written to `/home/test/ABC.pdf.txt`

# Output as JSON
await extract_pdf.extract(as_json=True)  # Output is written to `/home/test/ABC.pdf.json`
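
The snippets in this README use a bare `await`, which works in an async-aware shell (for example `python -m asyncio`) or a notebook, but not at the top level of a regular script. The sketch below shows one way to drive an awaitable `extract()` from a script with `asyncio.run`. `StubExtractor` is a hypothetical stand-in so the example runs without the library or a real PDF; it is not part of df-extract, and its return value is an assumption made only for illustration. In real code you would construct `ExtractPDF(file_path=path)` instead.

```python
import asyncio


class StubExtractor:
    """Hypothetical stand-in for ExtractPDF; NOT part of df-extract."""

    def __init__(self, file_path: str) -> None:
        self.file_path = file_path

    async def extract(self, as_json: bool = False) -> str:
        # Yield to the event loop once to simulate asynchronous work,
        # then report the path the output would be written to.
        await asyncio.sleep(0)
        suffix = ".json" if as_json else ".txt"
        return self.file_path + suffix


async def main(path: str) -> list[str]:
    extractor = StubExtractor(file_path=path)
    # Both calls are awaited inside a coroutine, mirroring the README usage.
    text_out = await extractor.extract()
    json_out = await extractor.extract(as_json=True)
    return [text_out, json_out]


def run(path: str) -> list[str]:
    # asyncio.run starts an event loop, runs main() to completion,
    # and closes the loop afterwards.
    return asyncio.run(main(path))


if __name__ == "__main__":
    print(run("/home/test/ABC.pdf"))
```

Keeping all `await` calls inside one `main()` coroutine means the script needs only a single `asyncio.run` call, however many files it processes.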

To change the output directory, pass the output_dir parameter:

from df_extract.pdf import ExtractPDF


path = "/home/test/ABC.pdf"

extract_pdf = ExtractPDF(file_path=path, output_dir="/home/test/output")
await extract_pdf.extract()
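
Judging by the comments in the examples above, the output file name appears to be the input file name with `.txt` or `.json` appended. A small helper along these lines could predict that location, for instance to check whether a file has already been processed. `expected_output_path` is a hypothetical name, and the behaviour with `output_dir` (same file name, placed in that directory) is an assumption, since this README does not state it.

```python
from pathlib import PurePosixPath
from typing import Optional


def expected_output_path(
    file_path: str, as_json: bool = False, output_dir: Optional[str] = None
) -> str:
    """Guess where the extracted output lands (hypothetical helper)."""
    source = PurePosixPath(file_path)
    suffix = ".json" if as_json else ".txt"
    # ASSUMPTION: with output_dir set, the file name is kept and only the
    # directory changes; the README does not confirm this.
    directory = PurePosixPath(output_dir) if output_dir else source.parent
    # The suffix is appended to the full file name, e.g. ABC.pdf -> ABC.pdf.txt
    return str(directory / (source.name + suffix))
```

If the library turns out to name files differently inside `output_dir`, only the line that builds `directory` and the final join need to change.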

Extract content from PDF with image data

This requires easyocr to be installed.

from df_extract.base import ImageExtract
from df_extract.pdf import ExtractPDF


path = "/home/test/ABC.pdf"

image_extract = ImageExtract(model_download_enabled=True)
extract_pdf = ExtractPDF(file_path=path, image_extract=image_extract)
await extract_pdf.extract()
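
Because image extraction depends on an optional package, it can help to check for it before building an `ImageExtract`. The sketch below uses the standard library's `importlib.util.find_spec`; `ocr_available` is a hypothetical helper name, not part of df-extract.

```python
import importlib.util


def ocr_available(module_name: str = "easyocr") -> bool:
    """Return True if the named top-level module can be found for import."""
    # find_spec returns None when a top-level module is not installed.
    return importlib.util.find_spec(module_name) is not None
```

When `ocr_available()` returns `False`, one option is to fall back to `ExtractPDF(file_path=path)` without the `image_extract` argument, as in the first example.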

2. To extract content from PPT and PPTx

from df_extract.pptx import ExtractPPTx


path = "/home/test/DEF.pptx"

extract_pptx = ExtractPPTx(file_path=path)

# By default, the output is plain text
await extract_pptx.extract()  # Output is written to `/home/test/DEF.pptx.txt`

# Output as JSON
await extract_pptx.extract(as_json=True)  # Output is written to `/home/test/DEF.pptx.json`

3. To extract content from Doc and Docx

from df_extract.docx import ExtractDocx


path = "/home/test/GHI.docx"

extract_docx = ExtractDocx(file_path=path)

# By default, the output is plain text
await extract_docx.extract()  # Output is written to `/home/test/GHI.docx.txt`

# Output as JSON
await extract_docx.extract(as_json=True)  # Output is written to `/home/test/GHI.docx.json`

4. To extract content from PNG, JPEG and JPG

from df_extract.image import ExtractImage


path = "/home/test/JKL.png"

extract_png = ExtractImage(file_path=path)

# By default, the output is plain text
await extract_png.extract()  # Output is written to `/home/test/JKL.png.txt`

# Output as JSON
await extract_png.extract(as_json=True)  # Output is written to `/home/test/JKL.png.json`
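
The four sections above each use a different class, so a script that handles mixed inputs needs to pick one per file. A minimal sketch of a lookup by file extension follows. The module and class names come from the examples above; the extension mapping is an assumption drawn from the section titles, and `EXTRACTORS` and `extractor_for` are hypothetical names. The helper returns names rather than importing anything, so it runs without the library installed.

```python
from pathlib import PurePosixPath

# Module and class names are taken from the README examples; which
# extensions each class accepts is an assumption based on section titles.
EXTRACTORS = {
    ".pdf": ("df_extract.pdf", "ExtractPDF"),
    ".ppt": ("df_extract.pptx", "ExtractPPTx"),
    ".pptx": ("df_extract.pptx", "ExtractPPTx"),
    ".doc": ("df_extract.docx", "ExtractDocx"),
    ".docx": ("df_extract.docx", "ExtractDocx"),
    ".png": ("df_extract.image", "ExtractImage"),
    ".jpg": ("df_extract.image", "ExtractImage"),
    ".jpeg": ("df_extract.image", "ExtractImage"),
}


def extractor_for(file_path: str) -> tuple[str, str]:
    """Return (module name, class name) for a path; hypothetical helper."""
    # Compare extensions case-insensitively so "JKL.PNG" matches ".png".
    ext = PurePosixPath(file_path).suffix.lower()
    try:
        return EXTRACTORS[ext]
    except KeyError:
        raise ValueError(f"unsupported file type: {ext or file_path!r}") from None
```

With the library installed, `getattr(importlib.import_module(module), cls)(file_path=path)` would turn the returned pair into an extractor instance.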

Download files

Source Distribution

df_extract-0.0.2.1.tar.gz (11.7 kB)

Built Distribution

df_extract-0.0.2.1-py3-none-any.whl (14.0 kB)
