For scraping data from PDF
Project description
PDFMaster Package
This is the PDFTableMaster package.
Parameters to adjust:

    pdfTable.set_parameters({'upperBoundry':10, 'lowerBoundry':10, 'margin':3})

--> upperBoundry and lowerBoundry state the upper and lower boundaries on the vertical axis used to identify rows
--> These values should be modified to fit the PDF table you're about to scrape
--> margin defines the horizontal boundaries of the table (used to identify columns)
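To illustrate why the vertical boundaries need tuning per table, here is a minimal sketch of tolerance-based row grouping. It is not the package's implementation: the function name `group_into_rows`, the `(x, y, text)` fragment format, and the way the two tolerances are applied are all assumptions made for illustration, and the horizontal `margin` is not modelled.

```python
def group_into_rows(fragments, upper=10, lower=10):
    """Group (x, y, text) fragments into rows by vertical proximity.

    Hypothetical helper: a fragment joins an existing row when its y
    coordinate lies within [anchor_y - lower, anchor_y + upper].
    """
    rows = []  # each entry is [anchor_y, [(x, text), ...]]
    for x, y, text in sorted(fragments, key=lambda f: f[1]):
        for row in rows:
            anchor_y = row[0]
            if anchor_y - lower <= y <= anchor_y + upper:
                row[1].append((x, text))
                break
        else:
            # No existing row is close enough, so start a new one.
            rows.append([y, [(x, text)]])
    # Order the cells of every row from left to right and keep only the text.
    return [
        [text for _, text in sorted(cells, key=lambda c: c[0])]
        for _, cells in rows
    ]


# Made-up coordinates: two visual rows whose cells are slightly misaligned.
fragments = [
    (50, 100, "LKA000000001"),
    (200, 103, "Colombo"),
    (50, 130, "LKA000000002"),
    (200, 128, "Kandy"),
]
print(group_into_rows(fragments))
```

With tolerances that are too small for the table, slightly misaligned cells end up in separate rows; with tolerances that are too large, neighbouring rows merge.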
The project will provide you with an unstructured table (lists inside a list).
--> The user should implement the CleanMaster class that comes with the package to define how the cleaning should be done
--> Refer to example.py for a clear understanding of how you can use this class
--> cleanListMaster(), a method of the CleanMaster class, defines this functionality
    class clean(CleanMaster):
        def cleanListMaster(self, rows):
            # You have to implement this method with rules to filter out rows
            finalPageList = []
            for row in rows:
                # Keep only rows with exactly six cells
                if len(row) == 6:
                    # The first cell must be a 12-character code starting with "LKA"
                    if row[0].strip().startswith("LKA") and len(row[0].strip()) == 12:
                        finalPageList.append(clean.removeComma(row))
            return finalPageList
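The filtering rule above can be tried out without the package installed. The sketch below is only an approximation: the `CleanMaster` defined here is a stand-in for the package's base class, the behaviour of `removeComma` (stripping commas and surrounding whitespace from every cell) is an assumption, and the subclass name `LkaCleaner` and the sample rows are made up for illustration.

```python
class CleanMaster:
    """Stand-in for the package's base class (assumption, illustration only)."""

    @staticmethod
    def removeComma(row):
        # Assumed behaviour: drop commas and surrounding whitespace in each cell.
        return [cell.replace(",", "").strip() for cell in row]

    def cleanListMaster(self, rows):
        raise NotImplementedError("Subclasses define the filtering rules")


class LkaCleaner(CleanMaster):
    """Hypothetical cleaner: keeps six-cell rows whose first cell is an LKA code."""

    def cleanListMaster(self, rows):
        kept = []
        for row in rows:
            if len(row) != 6:
                continue  # headers, footers, and broken rows have other lengths
            code = row[0].strip()
            if code.startswith("LKA") and len(code) == 12:
                kept.append(self.removeComma(row))
        return kept


# Made-up raw rows, shaped like an unstructured list of lists.
raw_rows = [
    ["Code", "Name", "Qty", "Price", "Value", "Date"],
    [" LKA123456789 ", "Sample Bond", "1,000", "101.50", "101,500.00", "2020-01-01"],
    ["LKA123", "Short code", "1", "2", "3", "4"],
    ["Page 1 of 3"],
]
cleaned = LkaCleaner().cleanListMaster(raw_rows)
print(cleaned)
```

Only the second row passes both checks: the header and the short code fail the `LKA` test, and the page footer does not have six cells.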
Hashes for PDFMaster-0.0.1-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 0ac030eefdf202dc802e1befdfbd8653d2ceacdb4eb7f0feadbe65482131ce3b
MD5 | 85c1e1a96a6e7b4c9ab6392d750aeae8
BLAKE2b-256 | 68608123e0b64b57920c247f52e5436bfe19f1e8f4444aaf0b53b19ccb65ccbb