A simplified PDF table scraping and parsing tool

Scraparser

A generic PDF table scraper and parser for data analysis.

Originally written for scraping and parsing Hong Kong government COVID-19 related public data. It has since been generalized, in the hope that it is useful for other research purposes as well.

The package is available on pypi.org. Development takes place on GitLab. You are welcome to submit issues and merge requests. Should you want to contribute, please read the Development section.

Prerequisites

To use scraparser, you need Python 3 installed on your system. You will also need to know how to use terminal commands on your system.

The instructions below assume that Python 3 is available to you through the command python3. If it is only available as python or under another name, simply substitute that name wherever python3 appears in the commands below.
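
For example, to check which command name works on your system, print the interpreter version:

python3 --version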

Install

The recommended way is to install the PyPI package with the pip module:

python3 -m pip install --upgrade scraparser
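
To check that the installation succeeded, you can ask pip for the installed package metadata:

python3 -m pip show scraparser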

Example Use

Basic Scraping

To scrape the latest local situation report:

python3 -m scraparser scrap "https://www.chp.gov.hk/files/pdf/local_situation_covid19_tc.pdf" \
| python3 -m scraparser parse-pdf-to-csv --headers="個案編號,報告日期,發病日期,性別,年齡,入住醫院名稱,住院/出院/死亡,香港/非香港居民,個案分類,確診/疑似個案"

The downloaded PDF file and the parsed CSV file will be stored in:

./data/local_situation_covid19_tc.<time-string>.pdf
./data/local_situation_covid19_tc.<time-string>.csv

The time-string will be formatted as YYYY-MM-DD-HHmmss.
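
For example (hypothetical timestamp, for illustration only), a report scraped on 15 January 2021 at 09:30:00 would be stored as ./data/local_situation_covid19_tc.2021-01-15-093000.pdf.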

Parse Previously Downloaded PDF Report

To parse a pre-existing PDF file on your local computer:

python3 -m scraparser scrap-location-situation-pdf --file=path/to/somename.pdf

The parsed CSV file will be stored as "path/to/somename.csv".

Utility to Fix or Modify Parsed CSV

It is highly difficult to read tables from PDF files correctly. Common errors include:

  • Column underflow / overflow

    The content of a cell spills over into the previous or the next cell.

  • Row overflow

    The content of a cell (usually one that wraps onto multiple lines) spills over and creates a phantom row with only one content-filled cell, as in the example below.
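
For example (hypothetical rows, for illustration only), a hospital name such as 瑪嘉烈醫院 that wraps onto a second line in the PDF may come out of the parser as a phantom row:

1234,01/02/2021,28/01/2021,男,34,瑪嘉烈,住院,香港居民,本地個案,確診
,,,,,醫院,,,,

The second row carries only the tail of the hospital name and should be merged back into the row above.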

To fix these issues, use the following subcommands:

sort

The command takes CSV filenames either from arguments or from STDIN (one filename per line):

python3 -m scraparser sort --column=0 --sort-as-number --in-place ./data/local_situation_covid19_tc.<time-string>.csv

This command will:

  1. Read the file and parse the 1st column (the --column parameter takes a 0-based column index, as in Python list indexing).
  2. Sort all rows by the 1st column.
  3. Save the fixed result back to the input file (see the sketch after this list).
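
Roughly equivalent pandas sketch (illustrative only; not the actual scraparser implementation — the real subcommand also accepts filenames from STDIN and handles --in-place, and the filename below is a placeholder):

import pandas as pd

# Roughly what `sort --column=0 --sort-as-number --in-place` does (sketch only).
path = "./data/local_situation_covid19_tc.example.csv"   # placeholder filename
df = pd.read_csv(path)

# --sort-as-number: treat the 1st column as numbers so that "10" sorts after "9".
first = df.columns[0]
df[first] = pd.to_numeric(df[first], errors="coerce")

# Sort all rows by that column and write the result back in place.
df.sort_values(by=first).to_csv(path, index=False)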

fix-column-underflow

The command takes CSV filenames either from arguments or from STDIN (one filename per line):

python3 -m scraparser fix-column-underflow --column=5 --in-place ./data/local_situation_covid19_tc.<time-string>.csv

This command will:

  1. Automatically read all the valid contents in the 6th column (the --column parameter takes a 0-based column index, as in Python list indexing).
  2. Read every row and check if the cell in that column is empty (math.isnan()).
  3. If so, check the column before it (the 5th column in our case) and see if it is suffixed by any valid content found in step (1).
  4. Split the content correctly between the 5th and 6th columns.
  5. Save the fixed result back to the input file (see the sketch after this list).
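
A rough pandas sketch of this logic (illustrative only; it assumes the CSV has already been loaded into a DataFrame, and is not the actual scraparser implementation):

import math
import pandas as pd

def fix_column_underflow(df: pd.DataFrame, col: int) -> pd.DataFrame:
    # Sketch of the underflow fix for a 0-based column index `col` (e.g. 5).
    target = df.columns[col]
    before = df.columns[col - 1]

    # Step 1: collect the valid values already present in the target column.
    valid = {str(v) for v in df[target] if not (isinstance(v, float) and math.isnan(v))}

    for i, row in df.iterrows():
        cell = row[target]
        # Step 2: only rows whose target cell is empty need fixing.
        if isinstance(cell, float) and math.isnan(cell):
            prev = str(row[before])
            # Step 3: does the previous cell end with a known valid value?
            for v in valid:
                if v and prev.endswith(v) and len(prev) > len(v):
                    # Step 4: split the content back into the two columns.
                    df.at[i, before] = prev[: -len(v)]
                    df.at[i, target] = v
                    break
    return df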

fix-date-column-underflow

The command takes CSV filenames either from arguments or from STDIN (one filename per line):

python3 -m scraparser fix-date-column-underflow --column=1 --format=DD/MM/YYYY --in-place ./data/local_situation_covid19_tc.<time-string>.csv

This command will:

  1. Read every row and check if the cell in the 2nd column is empty (math.isnan()).
  2. If so, check the column before it (the 1st column in our case) and see if it is suffixed by a string that matches the specified date format.
  3. Split the content correctly between the 1st and 2nd columns.
  4. Save the fixed result back to the input file (see the sketch after this list).
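
Again as a rough pandas sketch (illustrative only; a simple regular expression stands in for the --format=DD/MM/YYYY handling, and this is not the actual scraparser implementation):

import math
import re
import pandas as pd

DATE_SUFFIX = re.compile(r"(\d{2}/\d{2}/\d{4})$")   # stands in for DD/MM/YYYY

def fix_date_column_underflow(df: pd.DataFrame, col: int = 1) -> pd.DataFrame:
    target = df.columns[col]
    before = df.columns[col - 1]
    for i, row in df.iterrows():
        cell = row[target]
        if isinstance(cell, float) and math.isnan(cell):
            prev = str(row[before])
            match = DATE_SUFFIX.search(prev)
            if match:
                # Move the trailing date back into its own column.
                df.at[i, before] = prev[: match.start()]
                df.at[i, target] = match.group(1)
    return df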

fix-empty-rows

The command takes CSV filenames either from arguments or from STDIN (one filename per line):

python3 -m scraparser fix-empty-rows --in-place ./data/local_situation_covid19_tc.<time-string>.csv

This command will:

  1. Read every row and find all rows with all but 1 cell empty (math.isnan()).
  2. For each such row, append the content of that single cell to the cell directly above it.
  3. Drop all "phantom rows" found in step (1).
  4. Save the fixed result back to the input file (see the sketch after this list).
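
A rough pandas sketch of the same idea (illustrative only; not the actual scraparser implementation):

import pandas as pd

def fix_empty_rows(df: pd.DataFrame) -> pd.DataFrame:
    phantom = []
    for i in range(1, len(df)):
        filled = df.iloc[i].notna()
        if filled.sum() == 1:
            j = df.columns.get_loc(filled.idxmax())   # position of the only filled cell
            # Append the stray content to the cell directly above it...
            df.iat[i - 1, j] = str(df.iat[i - 1, j]) + str(df.iat[i, j])
            # ...and remember this phantom row so it can be dropped.
            phantom.append(df.index[i])
    return df.drop(index=phantom).reset_index(drop=True)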

Advanced Piping Usage

Parse and Show the Resulting Data

To fix all the issues in the parsed CSV of the local situation report and open the result (each command writes the resulting filename to STDOUT, which the next command in the pipe reads from STDIN):

Linux

python3 -m scraparser scrap "https://www.chp.gov.hk/files/pdf/local_situation_covid19_tc.pdf" \
| python3 -m scraparser parse-pdf-to-csv --headers="個案編號,報告日期,發病日期,性別,年齡,入住醫院名稱,住院/出院/死亡,香港/非香港居民,個案分類,確診/疑似個案" \
| python3 -m scraparser fix-date-column-underflow --column=1 --in-place \
| python3 -m scraparser fix-column-underflow --column=6 --in-place \
| python3 -m scraparser fix-column-underflow --column=5 --in-place \
| python3 -m scraparser fix-empty-rows --in-place \
| python3 -m scraparser sort --in-place \
| xargs -i xdg-open "{}"

macOS

python3 -m scraparser scrap "https://www.chp.gov.hk/files/pdf/local_situation_covid19_tc.pdf" \
| python3 -m scraparser parse-pdf-to-csv --headers="個案編號,報告日期,發病日期,性別,年齡,入住醫院名稱,住院/出院/死亡,香港/非香港居民,個案分類,確診/疑似個案" \
| python3 -m scraparser fix-date-column-underflow --column=1 --in-place \
| python3 -m scraparser fix-column-underflow --column=6 --in-place \
| python3 -m scraparser fix-column-underflow --column=5 --in-place \
| python3 -m scraparser fix-empty-rows --in-place \
| python3 -m scraparser sort --in-place \
| xargs -I{} open "{}"

Parse Data then Update Google Sheet

This will overwrite the data currently in the specified range. If there are not enough rows in the Google Sheet, the sheet will be expanded automatically.

Assuming you have defined the variable $GOOGLE_SHEET_ID and the target sheet 'CHP/DH Local Situation Input' exists:

python3 -m scraparser scrap "https://www.chp.gov.hk/files/pdf/local_situation_covid19_tc.pdf" \
| python3 -m scraparser parse-pdf-to-csv --headers="個案編號,報告日期,發病日期,性別,年齡,入住醫院名稱,住院/出院/死亡,香港/非香港居民,個案分類,確診/疑似個案" \
| python3 -m scraparser fix-date-column-underflow --column=1 --in-place \
| python3 -m scraparser fix-column-underflow --column=6 --in-place \
| python3 -m scraparser fix-column-underflow --column=5 --in-place \
| python3 -m scraparser fix-empty-rows --in-place \
| python3 -m scraparser sort --in-place \
| python3 -m scraparser googlesheet "$GOOGLE_SHEET_ID" update --range="'CHP/DH Local Situation Input'!A2:Z" 

Development

First, clone this repository:

git clone https://gitlab.com/yookoala/scraparser.git
cd scraparser

It is recommended to use venv for the development environment.

First, initialize the venv and install the packages specified in requirements.txt:

python3 -m venv .venv
. .venv/bin/activate
pip install -r requirements.txt

Once this is done, you are ready to run the package like this:

python3 ./scraparser <command>

You can modify the code in the scraparser folder of this repository, and this command will run with your changes.

Build and Submit

Should you want to fork and create your own scraparser package on the Python Package Index, you may build and release your package (requires make) with the commands below.

Building

To build the package for upload, you need to rename the package to something other than scraparser. Let's say you suffix the package name with YOURNAME:

PYPI_PKG_NAME=scraparser-YOURNAME make clean dist

The default version is derived from git. If that does not work, you may force a version string:

PYPI_PKG_VERSION=0.5.0 PYPI_PKG_NAME=scraparser-YOURNAME make clean dist

Please note that the version string MUST follow the PEP 440 convention or it cannot be submitted.
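For example, 0.5.0, 0.5.0rc1 and 0.5.0.dev1 are all valid PEP 440 version strings.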

Submitting to test.pypi.org

PYPI_TEST_PASSWORD=<your-pypi-test-token> make upload-test

Submitting to pypi.org

PYPI_PASSWORD=<your-pypi-token> make upload

License

Licensed under the MIT License. A copy of the license is included in this repository.
