This library extracts text from searchable PDF files while keeping the layout intact.

Project description

pdflayoutxt

pdflayoutxt is a Python library for extracting text from searchable (non-scanned) PDFs. It makes sure the extracted text keeps the same layout as the document.

Installation

Use the package manager pip to install pdflayoutxt.

pip install pdflayoutxt

Usage

# import the library
import pdflayoutxt

# create a pdfextracter object
pdfobj = pdflayoutxt.pdfextracter()

# get_pdf_text returns a list with one entry per page, in page order,
# so len(text) equals the number of pages in the document
pdf_path = "./abc.pdf"
text = pdfobj.get_pdf_text(pdf_path=pdf_path)

# output
print(text)
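
Because the returned list has one entry per page, a natural next step is to handle the pages one at a time. The sketch below is illustrative only: `page_to_str` and `label_pages` are hypothetical helpers, not part of pdflayoutxt. The method description mentions a list of lists, so the sketch assumes each entry may be either a string or a list of strings and accepts both. The sample data is a stand-in, not real output from the library.

```python
from typing import List, Sequence, Union

PageEntry = Union[str, Sequence[str]]


def page_to_str(entry: PageEntry) -> str:
    """Return one page's text as a single string."""
    if isinstance(entry, str):
        return entry
    # Assumption: a non-string entry is a sequence of text lines.
    return "\n".join(str(line) for line in entry)


def label_pages(pages: Sequence[PageEntry]) -> List[str]:
    """Prefix each page's text with a 1-based page header."""
    return [
        f"--- page {number} ---\n{page_to_str(entry)}"
        for number, entry in enumerate(pages, start=1)
    ]


# Stand-in for the value returned by get_pdf_text (not real output).
sample_pages = ["first page text", ["line one", "line two"]]
labelled = label_pages(sample_pages)
print("\n\n".join(labelled))
```

In real use, `sample_pages` would be replaced by the `text` variable from the usage snippet above.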

Method

.get_pdf_text(pdf_path, pdf_password="", pages=[], left_most_x=0, left_most_y=0, right_most_x=1, right_most_y=1)

Description

Returns a list of lists of text, with one entry for each page in the document.

  • pdf_path: path to the PDF file.
  • pdf_password: a string. If the PDF is encrypted with a password, pass the password here.
  • pages: a list of pages, or a single int for one page, from which to extract text. The default extracts text from all pages.
  • left_most_x: the starting point of text extraction on the x axis (width), as a value in [0, 1]. For example, to extract only the rightmost 25 percent of the page width, pass 0.75.
  • left_most_y: the starting point of text extraction on the y axis (height), as a value in [0, 1]. For example, to extract only the bottom 25 percent of the page height, pass 0.75.
  • right_most_x: the end point of text extraction on the x axis (width), as a value in [0, 1].
  • right_most_y: the end point of text extraction on the y axis (height), as a value in [0, 1].

By default, left_most_x, left_most_y, right_most_x and right_most_y cover the complete page, so nothing is cropped. Set them when text is needed from only a particular area of the page.
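
The four crop parameters are easy to mix up, so a small wrapper can help. The sketch below is a hedged illustration: `crop_kwargs` and `extract_region` are hypothetical helpers, not part of pdflayoutxt. Following the examples in the description (0.75 selects the bottom quarter), it assumes the y axis is measured from the top of the page, which is why the helper names its arguments `top` and `bottom`. The range and ordering checks are the sketch's own; the library may validate differently.

```python
from typing import Dict


def crop_kwargs(left: float = 0.0, top: float = 0.0,
                right: float = 1.0, bottom: float = 1.0) -> Dict[str, float]:
    """Build the four crop arguments as fractions of page width/height."""
    for name, value in (("left", left), ("top", top),
                        ("right", right), ("bottom", bottom)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    if left >= right or top >= bottom:
        raise ValueError("start of the region must come before its end")
    return {
        "left_most_x": left,
        "left_most_y": top,
        "right_most_x": right,
        "right_most_y": bottom,
    }


def extract_region(pdf_path: str, **crop: float):
    """Run get_pdf_text on a cropped region (requires pdflayoutxt)."""
    import pdflayoutxt  # imported lazily so crop_kwargs works without it

    extractor = pdflayoutxt.pdfextracter()
    return extractor.get_pdf_text(pdf_path=pdf_path, **crop_kwargs(**crop))


right_quarter = crop_kwargs(left=0.75)   # rightmost 25% of the page width
bottom_quarter = crop_kwargs(top=0.75)   # bottom 25% of the page height
print(right_quarter)
print(bottom_quarter)
```

With pdflayoutxt installed, a call such as `extract_region("./abc.pdf", left=0.75)` would then request only the right-hand quarter of each page.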

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

License

MIT

Download files

Source Distribution

pdflayoutxt-0.0.10.tar.gz (4.4 kB view details)

Uploaded Source

Built Distribution

pdflayoutxt-0.0.10-py3-none-any.whl (4.9 kB view details)

Uploaded Python 3

File details

Details for the file pdflayoutxt-0.0.10.tar.gz.

File metadata

  • Download URL: pdflayoutxt-0.0.10.tar.gz
  • Size: 4.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.8.5

File hashes

Hashes for pdflayoutxt-0.0.10.tar.gz
Algorithm Hash digest
SHA256 2fbf2be836e70dcfa5e78250165e3fed81de71b41f9697969f06a4851a17d470
MD5 b992ad4d51d6eedf2083d5bef3f3e01a
BLAKE2b-256 8bc4b4fd4022b8c768bd978fb9a26864d51ae5c2fd7de3083e027440c5115edc

File details

Details for the file pdflayoutxt-0.0.10-py3-none-any.whl.

File metadata

  • Download URL: pdflayoutxt-0.0.10-py3-none-any.whl
  • Size: 4.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.8.5

File hashes

Hashes for pdflayoutxt-0.0.10-py3-none-any.whl
Algorithm Hash digest
SHA256 4e8d15c38ff2a0780fbee3a6df04d97892df33cfb147dab41fd386d16b407c5b
MD5 41cd278b1c1abd754f0b21c29a155763
BLAKE2b-256 d513501afec18f526623255571cc182d9d1c20614a5aa2d94377ed55957c06e7
