This library extracts text from searchable PDF files while keeping the layout intact.

Project description

pdflayoutxt

pdflayoutxt is a Python library for extracting text from searchable (non-scanned) PDFs. It ensures that the extracted text keeps the same layout as the document.

Installation

Use the package manager pip to install pdflayoutxt.

pip install pdflayoutxt

Usage

# import the library
import pdflayoutxt

# create a pdfextracter object
pdfobj = pdflayoutxt.pdfextracter()

# returns a list with one entry per page, holding the text extracted from that page,
# so len(text) equals the number of pages in the document
pdf_path = "./abc.pdf"
text = pdfobj.get_pdf_text(pdf_path=pdf_path)

# output
print(text)
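Because the result has one entry per page, it can help to label each page before printing or saving the text. The sketch below is not part of pdflayoutxt: `format_pages` is a hypothetical helper, the sample data stands in for a real return value, and the page numbers (starting at 1) are for display only. Since the description mentions both a list and a list of lists, the helper accepts either a string or a list of strings for each page.

```python
def format_pages(pages):
    """Join per-page text into one string with a header before each page."""
    chunks = []
    for number, page in enumerate(pages, start=1):
        # Accept either a single string or a list of strings per page.
        body = page if isinstance(page, str) else "\n".join(page)
        chunks.append(f"--- page {number} ---\n{body}")
    return "\n\n".join(chunks)


# Stand-in data shaped like the documented return value (one entry per page).
sample = ["Invoice 001\nTotal: 10", ["Terms", "Net 30"]]
print(format_pages(sample))
```

With the real library, the same helper would be called as `format_pages(text)` on the list returned by `get_pdf_text`.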

Method Description

`.get_pdf_text(pdf_path, pdf_password="", pages=[], left_most_x=0, left_most_y=0, right_most_x=1, right_most_y=1)`

Returns a list of lists of text, one entry for each page in the document.

`pdf_password` takes a string. If the PDF is encrypted, pass its password here.

`pages` takes a list of page numbers, or a single int, naming the pages to extract text from. The default extracts text from all pages.

`left_most_x` sets where text extraction starts on the x axis (width), as a fraction in [0,1]. For example, to extract only the right-most 25 percent of the page width, pass .75.

`left_most_y` sets where text extraction starts on the y axis (height), as a fraction in [0,1]. For example, to extract only the bottom 25 percent of the page height, pass .75.

`right_most_x` sets where text extraction ends on the x axis (width), as a fraction in [0,1].

`right_most_y` sets where text extraction ends on the y axis (height), as a fraction in [0,1].

By default, `left_most_x`, `left_most_y`, `right_most_x` and `right_most_y` cover the complete page, with no cropping. Change them when text is needed from a particular area of the page only.
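The four cropping parameters are easy to mix up, so a small wrapper can name the region and check the bounds before the call. This is a minimal sketch, not part of pdflayoutxt: `crop_box` and its argument names are hypothetical, and it assumes, as the bottom-of-page example suggests, that y is measured from the top of the page. Whether the library itself validates these values is not stated in the description.

```python
def crop_box(left=0.0, top=0.0, right=1.0, bottom=1.0):
    """Validate fractional bounds and map them onto get_pdf_text keyword names."""
    bounds = (("left", left), ("top", top), ("right", right), ("bottom", bottom))
    for name, value in bounds:
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    if left >= right or top >= bottom:
        raise ValueError("start bounds must be smaller than end bounds")
    return {
        "left_most_x": left,
        "left_most_y": top,      # assumption: y grows downward from the top
        "right_most_x": right,
        "right_most_y": bottom,
    }


# Right-most quarter of the page width, full height.
right_quarter = crop_box(left=0.75)
# Bottom quarter of the page height, full width.
bottom_quarter = crop_box(top=0.75)

print(right_quarter)
print(bottom_quarter)
```

The resulting dictionary can be unpacked into the documented call, for example `pdfobj.get_pdf_text(pdf_path="./abc.pdf", **right_quarter)`.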

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

License

MIT

Project details

Release history

Version	Release date
0.0.10 (this version)	Nov 8, 2022
0.0.9	Nov 8, 2022
0.0.8	Nov 8, 2022
0.0.7.post2	Nov 8, 2022
0.0.7.post1	Nov 8, 2022
0.0.7	Nov 8, 2022
0.0.6	Nov 8, 2022
0.0.6.dev0 (pre-release)	Nov 8, 2022
0.0.5	Nov 8, 2022

Download files

Source Distribution

pdflayoutxt-0.0.10.tar.gz (4.4 kB)

Uploaded Nov 8, 2022 Source

Built Distribution

pdflayoutxt-0.0.10-py3-none-any.whl (4.9 kB)

Uploaded Nov 8, 2022 Python 3

File details

Details for the file pdflayoutxt-0.0.10.tar.gz.

File metadata

Download URL: pdflayoutxt-0.0.10.tar.gz
Upload date: Nov 8, 2022
Size: 4.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.1 CPython/3.8.5

File hashes

Hashes for pdflayoutxt-0.0.10.tar.gz
Algorithm	Hash digest
SHA256	`2fbf2be836e70dcfa5e78250165e3fed81de71b41f9697969f06a4851a17d470`
MD5	`b992ad4d51d6eedf2083d5bef3f3e01a`
BLAKE2b-256	`8bc4b4fd4022b8c768bd978fb9a26864d51ae5c2fd7de3083e027440c5115edc`

File details

Details for the file pdflayoutxt-0.0.10-py3-none-any.whl.

File metadata

Download URL: pdflayoutxt-0.0.10-py3-none-any.whl
Upload date: Nov 8, 2022
Size: 4.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.1 CPython/3.8.5

File hashes

Hashes for pdflayoutxt-0.0.10-py3-none-any.whl
Algorithm	Hash digest
SHA256	`4e8d15c38ff2a0780fbee3a6df04d97892df33cfb147dab41fd386d16b407c5b`
MD5	`41cd278b1c1abd754f0b21c29a155763`
BLAKE2b-256	`d513501afec18f526623255571cc182d9d1c20614a5aa2d94377ed55957c06e7`
