A simplified PDF table scraping and parsing tool
Scraparser
A generic PDF table scraper and parser for data analysis.
Originally written for scraping and parsing Hong Kong government COVID-19 related public data, it has since been generalized in the hope that it will serve other research purposes as well.
The package is available on PyPI. Development happens on GitLab, where you are welcome to submit issues and merge requests. Should you want to contribute, please read the Development section.
Prerequisites
To use scraparser, you need Python 3 installed on your system. You will also need to know how to use terminal commands on your system.
The instructions below assume that Python 3 is available to you through the command python3. If it is only available as python or under another name, simply substitute that name for python3 in the commands described below.
Install
The recommended way is to install the PyPI package with the pip module:
python3 -m pip install --upgrade scraparser
Example Use
Basic Scraping
To scrape the latest local situation report:
python3 -m scraparser scrap "https://www.chp.gov.hk/files/pdf/local_situation_covid19_tc.pdf" \
| python3 -m scraparser parse-pdf-to-csv --headers="個案編號,報告日期,發病日期,性別,年齡,入住醫院名稱,住院/出院/死亡,香港/非香港居民,個案分類,確診/疑似個案"
The downloaded PDF file and the parsed CSV file will be stored in:
./data/local_situation_covid19_tc.<time-string>.pdf
./data/local_situation_covid19_tc.<time-string>.csv
The time-string will be formatted as YYYY-MM-DD-HHmmss.
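For reference, a timestamp in that shape can be produced in Python like this (a sketch of the format only, not scraparser's actual implementation):

```python
from datetime import datetime

# Format the current time as YYYY-MM-DD-HHmmss (e.g. 2021-02-01-154502).
time_string = datetime.now().strftime("%Y-%m-%d-%H%M%S")
print(time_string)
```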
Parse Previously Downloaded PDF Report
To parse a pre-existing PDF file on your local computer:
python3 -m scraparser scrap-location-situation-pdf --file=path/to/somename.pdf
The parsed CSV file will be stored at "path/to/somename.csv".
Utility to Fix or Modify Parsed CSV
It is highly difficult to correctly read tables from PDF files. Common errors include:
- Column underflow / overflow: the content of a cell spills over into the previous or next cell.
- Row overflow: the content of a cell (usually one whose text is wrapped into multiple lines) spills over to create a phantom row with only one content-filled cell.
To fix these issues, use the following subcommands:
sort
The command takes CSV filenames either from arguments or from STDIN (one filename per line):
python3 -m scraparser sort --column=0 --sort-as-number --in-place ./data/local_situation_covid19_tc.<time-string>.csv
This command will:
- Read the file and parse the 1st column (the --column parameter accepts a column index starting at 0, as in Python list indexing).
- Sort all rows by the 1st column.
- Save the fixed result back to the input file.
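Conceptually, --sort-as-number sorts the column numerically rather than lexicographically. A minimal sketch of that distinction (illustrative data, not the tool's actual code):

```python
# Lexicographic order puts "10" before "2"; numeric order does not.
rows = [["3", "c"], ["10", "a"], ["2", "b"]]
as_strings = sorted(rows, key=lambda r: r[0])          # "10" < "2" < "3"
as_numbers = sorted(rows, key=lambda r: float(r[0]))   # 2 < 3 < 10
print(as_numbers)
```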
fix column-underflow
The command takes CSV filenames either from arguments or from STDIN (one filename per line):
python3 -m scraparser fix column-underflow --column=5 --in-place ./data/local_situation_covid19_tc.<time-string>.csv
This command will:
- Automatically read all the valid contents in the 6th column (the --column parameter accepts a column index starting at 0, as in Python list indexing).
- Read every row and check whether the cell in that column is empty (math.isnan()).
- If so, check the column before it (the 5th column in this case) to see whether it is suffixed by any valid content found in step (1).
- Split the content correctly between the 5th and 6th columns.
- Save the fixed result back to the input file.
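The core idea can be sketched as follows. This is illustrative only: the function name, the sample values, and the split logic are made up for this example, not scraparser's actual code.

```python
# Suppose the valid values already seen in the target column are these
# (hypothetical) hospital names:
valid_values = {"Queen Mary Hospital", "Princess Margaret Hospital"}

def split_underflow(prev_cell, cell):
    """If `cell` is empty and `prev_cell` ends with a known valid value,
    split that value off into `cell`."""
    if cell:
        return prev_cell, cell
    for value in valid_values:
        if prev_cell.endswith(value):
            return prev_cell[: -len(value)].strip(), value
    return prev_cell, cell

print(split_underflow("45Queen Mary Hospital", ""))
# -> ("45", "Queen Mary Hospital")
```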
fix date-column-underflow
The command takes CSV filenames either from arguments or from STDIN (one filename per line):
python3 -m scraparser fix date-column-underflow --column=1 --format=DD/MM/YYYY --in-place ./data/local_situation_covid19_tc.<time-string>.csv
This command will:
- Read every row and check whether the cell in the 2nd column is empty (math.isnan()).
- If so, check the column before it (the 1st column in this case) to see whether it is suffixed by a string that matches the specified date format.
- Split the content correctly between the 1st and 2nd columns.
- Save the fixed result back to the input file.
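The date-suffix detection for --format=DD/MM/YYYY can be sketched with a regular expression. Again, this is an illustration of the idea, not scraparser's actual code:

```python
import re

# A DD/MM/YYYY date anchored at the end of the cell.
DATE_RE = re.compile(r"(\d{2}/\d{2}/\d{4})$")

def split_date_suffix(prev_cell, cell):
    """If `cell` is empty and `prev_cell` ends with a DD/MM/YYYY date,
    move that date into `cell`."""
    match = DATE_RE.search(prev_cell)
    if not cell and match:
        return prev_cell[: match.start()].strip(), match.group(1)
    return prev_cell, cell

print(split_date_suffix("123401/02/2021", ""))
# -> ("1234", "01/02/2021")
```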
fix date-column-overflow
The command takes CSV filenames either from arguments or from STDIN (one filename per line):
python3 -m scraparser fix date-column-overflow --column=1 --format=DD/MM/YYYY --in-place ./data/local_situation_covid19_tc.<time-string>.csv
This command will:
- Read every row and check whether the cell in the column before the target column (the 1st column in this case) is empty (math.isnan()).
- If so, check the target column to see whether it is a string suffixed by a date in the specified format.
- Split the content correctly between the 1st and 2nd columns.
- Save the fixed result back to the input file.
fix empty-rows
The command takes CSV filenames either from arguments or from STDIN (one filename per line):
python3 -m scraparser fix empty-rows --in-place ./data/local_situation_covid19_tc.<time-string>.csv
This command will:
- Read every row and find all rows with all but one cell empty (math.isnan()).
- For each such "phantom row", append the content of its single filled cell to the cell directly above it.
- Drop all phantom rows found in step (1).
- Save the fixed result back to the input file.
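The merge-and-drop logic can be sketched like this. The function name and sample rows are hypothetical, chosen only to show a wrapped line being rejoined:

```python
# Sketch of the empty-rows fix (illustrative, not scraparser's actual code).
def merge_phantom_rows(rows):
    """Merge each row that has only one filled cell into the row above,
    then drop it."""
    fixed = []
    for row in rows:
        filled = [(i, cell) for i, cell in enumerate(row) if cell]
        if fixed and len(filled) == 1:
            i, cell = filled[0]
            fixed[-1][i] += cell  # append to the cell directly above
        else:
            fixed.append(list(row))
    return fixed

rows = [
    ["1", "01/02/2021", "Queen Mary Hos"],
    ["", "", "pital"],  # phantom row created by a wrapped line
    ["2", "02/02/2021", "Tuen Mun Hospital"],
]
print(merge_phantom_rows(rows))
```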
Advanced Piping Usage
Parse and Show Result Data
To correctly fix all the issues in the CSV file parsed from the local situation report:
Linux
python3 -m scraparser scrap "https://www.chp.gov.hk/files/pdf/local_situation_covid19_tc.pdf" \
| python3 -m scraparser parse-pdf-to-csv --headers="個案編號,報告日期,發病日期,性別,年齡,入住醫院名稱,住院/出院/死亡,香港/非香港居民,個案分類,確診/疑似個案" \
| python3 -m scraparser fix date-column-underflow --column=1 --in-place \
| python3 -m scraparser fix date-column-overflow --column=1 --in-place \
| python3 -m scraparser fix column-underflow --column=6 --in-place \
| python3 -m scraparser fix column-underflow --column=5 --in-place \
| python3 -m scraparser fix empty-rows --in-place \
| python3 -m scraparser sort --in-place \
| xargs -I{} xdg-open "{}"
macOS
python3 -m scraparser scrap "https://www.chp.gov.hk/files/pdf/local_situation_covid19_tc.pdf" \
| python3 -m scraparser parse-pdf-to-csv --headers="個案編號,報告日期,發病日期,性別,年齡,入住醫院名稱,住院/出院/死亡,香港/非香港居民,個案分類,確診/疑似個案" \
| python3 -m scraparser fix date-column-underflow --column=1 --in-place \
| python3 -m scraparser fix date-column-overflow --column=1 --in-place \
| python3 -m scraparser fix column-underflow --column=6 --in-place \
| python3 -m scraparser fix column-underflow --column=5 --in-place \
| python3 -m scraparser fix empty-rows --in-place \
| python3 -m scraparser sort --in-place \
| xargs -I{} open "{}"
Parse Data then Update Google Sheet
This will overwrite the current data specified in the range. If there are not enough rows in the Google Sheet, the file will be expanded automatically.
Assuming you have defined the shell variable $GOOGLE_SHEET_ID and the target sheet
'CHP/DH Local Situation Input' exists:
python3 -m scraparser scrap "https://www.chp.gov.hk/files/pdf/local_situation_covid19_tc.pdf" \
| python3 -m scraparser parse-pdf-to-csv --headers="個案編號,報告日期,發病日期,性別,年齡,入住醫院名稱,住院/出院/死亡,香港/非香港居民,個案分類,確診/疑似個案" \
| python3 -m scraparser fix date-column-underflow --column=1 --in-place \
| python3 -m scraparser fix date-column-overflow --column=1 --in-place \
| python3 -m scraparser fix column-underflow --column=6 --in-place \
| python3 -m scraparser fix column-underflow --column=5 --in-place \
| python3 -m scraparser fix empty-rows --in-place \
| python3 -m scraparser sort --in-place \
| python3 -m scraparser googlesheet "$GOOGLE_SHEET_ID" update --range="'CHP/DH Local Situation Input'!A2:Z"
Development
First clone this repository by:
git clone https://gitlab.com/yookoala/scraparser.git
cd scraparser
It is recommended to use venv for the development environment.
First, initialize venv and install all the packages specified in requirements.txt:
python3 -m venv .venv
. .venv/bin/activate
pip install -r requirements.txt
Once this is done, you are ready to run the package in the repository folder as if the module was installed locally:
python3 -m scraparser <command>
You can modify the code in the scraparser folder of this repository, and this command will run your modified version.
Build and Submit
Should you want to fork and create your own scraparser package on the Python Package Index,
you may build and release your package (requires make) with the following commands.
Building
To build the package for upload, you need to rename the package to something other
than scraparser. Suppose you suffix the package name with YOURNAME:
PYPI_PKG_NAME=scraparser-YOURNAME make clean dist
The default version is determined from git. If that fails to work, you may force a version string:
PYPI_PKG_VERSION=0.5.0 PYPI_PKG_NAME=scraparser-YOURNAME make clean dist
Please note that the version string MUST follow the PEP 440 convention or it cannot be submitted.
Submitting to test.pypi.org
PYPI_TEST_PASSWORD=<your-pypi-test-token> make upload-test
Submitting to pypi.org
PYPI_PASSWORD=<your-pypi-token> make upload
License
Licensed under the MIT License. The full license text is available in this repository.
File details
Details for the file scraparser-0.3.1.tar.gz.
File metadata
- Download URL: scraparser-0.3.1.tar.gz
- Upload date:
- Size: 15.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.1.3 requests-toolbelt/0.9.1 tqdm/4.45.0 CPython/3.8.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 70de239adb8740316516799a26fffcae40396c261ff0a0311196c9f1da18b494 |
| MD5 | d320e5aeb83a4f316f6289349dd4eee1 |
| BLAKE2b-256 | b7e01a8961d17917c205d57b638c8b870eee77c0e3ec5cee97cea2c2b653d57b |
File details
Details for the file scraparser-0.3.1-py3-none-any.whl.
File metadata
- Download URL: scraparser-0.3.1-py3-none-any.whl
- Upload date:
- Size: 15.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.1.3 requests-toolbelt/0.9.1 tqdm/4.45.0 CPython/3.8.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ce7422c0febd8825aca4f7fa1bd4cf2bd9a146c0371482fa8be2df6c179a9f3d |
| MD5 | fad1a89c6d3188a93f844a9cfc1f362b |
| BLAKE2b-256 | c1c3de9fd1aa0250172bc6545942c52195596d01f0d59695a6a972efeb42cc9d |