Skip to main content

For scraping data from DPF

Project description

PDFMaster Package

This is a PDFTableMaster package. You can use Github-flavored Markdown

Parmeters to adjust #pdfTable.set_parameters({'upperBoundry':10, 'lowerBoundry':10 , 'margin':3})

-->upperBoundry and lowerBoundry states the upper and lower boundries in the vertical axis to identify rows -->These values should be modified to fit the PDF table you're about the scrape -->Margin defines the horizontal bountries of the table ( use to identify columns)

Project will provide you with a unstrucured table structure (Lists inside a list) -->User shoud implement the CleanMaster Class that comes with the package to define how the cleaning should be done -->Refer the example.py to get a clear understanding on how you ca use this class -->cleanListMaster() comes under CleanMaster class will define this functionality

class clean(CleanMaster): def cleanListMaster(self , rows): #you have to implement this method with rules to filter out rows finalPageList = [] for row in rows: if(len(row) >= 6 and len(row) <= 6): if(row[0].strip().startswith("LKA") and len(row[0].strip()) == 12 ): finalPageList.append(clean.removeComma(row) )

        return finalPageList

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

PDFMaster-0.0.1.tar.gz (8.0 kB view hashes)

Uploaded Source

Built Distribution

PDFMaster-0.0.1-py3-none-any.whl (8.4 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page