DecisionFacts Extraction Library extracts content from PDF, PPTX, DOCX, PNG, and JPG files and converts it to structured JSON data.

DF Extract Lib

Requirements

  • Python 3.10+
  • asyncio

Installation

# Using pip
$ python -m pip install df-extract

# Manual install, from the root of a source checkout
$ python -m pip install .

1. To extract content from PDF

from df_extract.pdf import ExtractPDF


path = "/home/test/ABC.pdf"

extract_pdf = ExtractPDF(file_path=path)

# `await` must run inside an async function (or a REPL/notebook that supports top-level await)
# By default, the output is plain text
await extract_pdf.extract()  # Output is written to `/home/test/ABC.pdf.txt`

# Output as JSON
await extract_pdf.extract(as_json=True)  # Output is written to `/home/test/ABC.pdf.json`
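
Because extract() is a coroutine, a plain script needs an event loop to drive it. The following is a minimal sketch of one way to do that with asyncio.run and asyncio.gather. The helper extract_all and the FakeExtractor stand-in are hypothetical names for illustration and are not part of df_extract; the value returned by the stand-in is also an assumption, since this README does not say what extract() returns.

```python
import asyncio


async def extract_all(extractors, as_json=False):
    """Await `extract()` on every extractor concurrently and collect the results."""
    return await asyncio.gather(
        *(extractor.extract(as_json=as_json) for extractor in extractors)
    )


class FakeExtractor:
    """Hypothetical stand-in exposing the same `extract(as_json=...)` call as ExtractPDF."""

    def __init__(self, file_path):
        self.file_path = file_path

    async def extract(self, as_json=False):
        await asyncio.sleep(0)  # yield to the event loop, as real I/O would
        # Assumption: return the output path purely so the sketch has a visible result.
        return f"{self.file_path}.{'json' if as_json else 'txt'}"


if __name__ == "__main__":
    paths = ["/home/test/ABC.pdf", "/home/test/DEF.pdf"]
    print(asyncio.run(extract_all([FakeExtractor(p) for p in paths])))
```

With the real library installed, the same helper should accept ExtractPDF instances in place of FakeExtractor, since it only relies on the extract(as_json=...) call shown above.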

To change the output directory, pass the output_dir parameter:

from df_extract.pdf import ExtractPDF


path = "/home/test/ABC.pdf"

extract_pdf = ExtractPDF(file_path=path, output_dir="/home/test/output")
await extract_pdf.extract()
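
The naming rule in the comments above (the input path plus `.txt` or `.json`) can be sketched as a small helper, which is handy for locating the result after extraction. The function expected_output_path is hypothetical and not part of df_extract. The file name used when output_dir is set is an assumption (same name, placed in the new directory); this README does not state it.

```python
from pathlib import PurePosixPath


def expected_output_path(file_path, as_json=False, output_dir=None):
    """Guess where the extracted output lands, following the README's examples."""
    source = PurePosixPath(file_path)
    suffix = ".json" if as_json else ".txt"
    # Assumption: with output_dir set, the file keeps its name and only the directory changes.
    target_dir = PurePosixPath(output_dir) if output_dir else source.parent
    return target_dir / (source.name + suffix)


if __name__ == "__main__":
    print(expected_output_path("/home/test/ABC.pdf"))
    print(expected_output_path("/home/test/ABC.pdf", as_json=True, output_dir="/home/test/output"))
```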

Extract content from PDF with image data

This requires the easyocr package to be installed.

from df_extract.base import ImageExtract
from df_extract.pdf import ExtractPDF


path = "/home/test/ABC.pdf"

image_extract = ImageExtract(model_download_enabled=True)
extract_pdf = ExtractPDF(file_path=path, image_extract=image_extract)
await extract_pdf.extract()
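
Since easyocr is only needed for image data, a script may want to fall back to text-only extraction when it is missing. The sketch below checks for the module with importlib.util.find_spec before constructing ImageExtract. The helpers ocr_available and build_pdf_extractor are hypothetical names; the df_extract calls inside them are the ones shown in this README.

```python
import importlib.util


def ocr_available(module_name="easyocr"):
    """Return True if the named top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


def build_pdf_extractor(path):
    """Create an ExtractPDF, enabling image extraction only when easyocr is present."""
    # Imported lazily so ocr_available() can be used without df_extract installed.
    from df_extract.pdf import ExtractPDF

    if ocr_available():
        from df_extract.base import ImageExtract

        image_extract = ImageExtract(model_download_enabled=True)
        return ExtractPDF(file_path=path, image_extract=image_extract)
    return ExtractPDF(file_path=path)
```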

2. To extract content from PPT and PPTx

from df_extract.pptx import ExtractPPTx


path = "/home/test/DEF.pptx"

extract_pptx = ExtractPPTx(file_path=path)

# By default, the output is plain text
await extract_pptx.extract()  # Output is written to `/home/test/DEF.pptx.txt`

# Output as JSON
await extract_pptx.extract(as_json=True)  # Output is written to `/home/test/DEF.pptx.json`

3. To extract content from Doc and Docx

from df_extract.docx import ExtractDocx


path = "/home/test/GHI.docx"

extract_docx = ExtractDocx(file_path=path)

# By default, the output is plain text
await extract_docx.extract()  # Output is written to `/home/test/GHI.docx.txt`

# Output as JSON
await extract_docx.extract(as_json=True)  # Output is written to `/home/test/GHI.docx.json`

4. To extract content from PNG, JPEG and JPG

from df_extract.image import ExtractImage


path = "/home/test/JKL.png"

extract_png = ExtractImage(file_path=path)

# By default, the output is plain text
await extract_png.extract()  # Output is written to `/home/test/JKL.png.txt`

# Output as JSON
await extract_png.extract(as_json=True)  # Output is written to `/home/test/JKL.png.json`
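
The four extractors above share the same constructor and extract() call, so picking one by file extension is straightforward. The following is a sketch only: the table EXTRACTORS and the helpers extractor_spec and make_extractor are hypothetical and not provided by df_extract. The module and class names in the table are the ones used in the sections above.

```python
import importlib
from pathlib import PurePosixPath

# Extension -> (module, class), taken from the four sections of this README.
EXTRACTORS = {
    ".pdf": ("df_extract.pdf", "ExtractPDF"),
    ".ppt": ("df_extract.pptx", "ExtractPPTx"),
    ".pptx": ("df_extract.pptx", "ExtractPPTx"),
    ".doc": ("df_extract.docx", "ExtractDocx"),
    ".docx": ("df_extract.docx", "ExtractDocx"),
    ".png": ("df_extract.image", "ExtractImage"),
    ".jpeg": ("df_extract.image", "ExtractImage"),
    ".jpg": ("df_extract.image", "ExtractImage"),
}


def extractor_spec(file_path):
    """Return the (module, class) pair for a path, based on its extension."""
    suffix = PurePosixPath(file_path).suffix.lower()
    try:
        return EXTRACTORS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix or file_path!r}") from None


def make_extractor(file_path, **kwargs):
    """Import the matching class on demand and instantiate it (requires df_extract)."""
    module_name, class_name = extractor_spec(file_path)
    module = importlib.import_module(module_name)
    return getattr(module, class_name)(file_path=file_path, **kwargs)
```

Inside an async function, usage would then look like `await make_extractor("/home/test/JKL.png").extract(as_json=True)`.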

Download files

Source Distribution

df_extract-0.0.2.1.tar.gz (11.7 kB)

Uploaded Source

Built Distribution

df_extract-0.0.2.1-py3-none-any.whl (14.0 kB)

Uploaded Python 3

File details

Details for the file df_extract-0.0.2.1.tar.gz.

File metadata

  • Download URL: df_extract-0.0.2.1.tar.gz
  • Upload date:
  • Size: 11.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.18

File hashes

Hashes for df_extract-0.0.2.1.tar.gz
Algorithm Hash digest
SHA256 c162dd85881866b8479a969135925ad7d0b881ade7e34ada88ff95b531bd632c
MD5 3621f1f975ac6a1946868b950c8cf59b
BLAKE2b-256 94b1203a533b073939f66c08b151b02f55a5be4298fd96fc9685175c343df86d

File details

Details for the file df_extract-0.0.2.1-py3-none-any.whl.

File metadata

  • Download URL: df_extract-0.0.2.1-py3-none-any.whl
  • Upload date:
  • Size: 14.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.18

File hashes

Hashes for df_extract-0.0.2.1-py3-none-any.whl
Algorithm Hash digest
SHA256 d46d15d8e08dce13904b9ac88ec25d28a40ddac3c8ccb3c39fb5189eb0743004
MD5 c84e49f3e542e12da9882eaab94b9072
BLAKE2b-256 af6dd33a87d3de22eccba3495ffa53057f27a027f0eda89d0c1736c18c03aabd
