A simplified PDF table scraping and parsing tool
Scraparser
A generic PDF table scraper and parser for data analysis.
Originally written for scraping and parsing Hong Kong government COVID-19 related public data, it has since been generalized in the hope that it will serve other research purposes as well.
The package is available on PyPI. Development happens on GitLab, where you are welcome to submit issues and merge requests. Should you want to contribute, please read the Development section.
Prerequisites
To use scraparser, you need Python 3 installed on your system. You will also need to know how to use terminal commands on your system.
The instructions below assume that Python 3 is available to you through the command python3. If it is only available as python or under another name, simply substitute that name for python3 in the commands described below.
Install
The recommended way is to install the PyPI package with the pip module:
python3 -m pip install --upgrade scraparser
Example Use
Basic Scraping
To scrape the latest local situation report:
python3 -m scraparser scrap "https://www.chp.gov.hk/files/pdf/local_situation_covid19_tc.pdf" \
| python3 -m scraparser parse-pdf-to-csv --headers="個案編號,報告日期,發病日期,性別,年齡,入住醫院名稱,住院/出院/死亡,香港/非香港居民,個案分類,確診/疑似個案"
The downloaded PDF file and the parsed CSV file will be stored in:
./data/local_situation_covid19_tc.<time-string>.pdf
./data/local_situation_covid19_tc.<time-string>.csv
The time-string will be formatted as YYYY-MM-DD-HHmmss.
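For reference, a timestamp in that shape can be produced in Python like this (a sketch of the format only, not scraparser's actual implementation):

```python
from datetime import datetime

# Format the current time as YYYY-MM-DD-HHmmss (e.g. 2021-02-01-154502).
time_string = datetime.now().strftime("%Y-%m-%d-%H%M%S")
print(time_string)
```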
Parse Previously Downloaded PDF Report
To parse a pre-existing PDF file on your local computer:
python3 -m scraparser scrap-location-situation-pdf --file=path/to/somename.pdf
The parsed CSV file will be stored at "path/to/somename.csv".
Utility to Fix or Modify Parsed CSV
It is highly difficult to correctly read tables from PDF files. Common errors include:
- Column underflow / overflow: the content of a cell spills over into the previous or next cell.
- Row overflow: the content of a cell (usually one whose text is wrapped into multiple lines) spills over to create a phantom row with only one content-filled cell.
To fix these issues, use the following subcommands:
sort
The command takes CSV filenames either from arguments or from STDIN (one filename per line):
python3 -m scraparser sort --column=0 --sort-as-number --in-place ./data/local_situation_covid19_tc.<time-string>.csv
This command will:
- Read the file and parse the 1st column (the --column parameter accepts a column index starting at 0, as in Python list indexing).
- Sort all rows by the 1st column.
- Save the fixed result back to the input file.
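Conceptually, --sort-as-number sorts the column numerically rather than lexicographically. A minimal sketch of that distinction (illustrative data, not the tool's actual code):

```python
# Lexicographic order puts "10" before "2"; numeric order does not.
rows = [["3", "c"], ["10", "a"], ["2", "b"]]
as_strings = sorted(rows, key=lambda r: r[0])          # "10" < "2" < "3"
as_numbers = sorted(rows, key=lambda r: float(r[0]))   # 2 < 3 < 10
print(as_numbers)
```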
fix column-underflow
The command takes CSV filenames either from arguments or from STDIN (one filename per line):
python3 -m scraparser fix column-underflow --column=5 --in-place ./data/local_situation_covid19_tc.<time-string>.csv
This command will:
- Automatically read all the valid contents in the 6th column (the --column parameter accepts a column index starting at 0, as in Python list indexing).
- Read every row and check whether the cell in that column is empty (math.isnan()).
- If so, check the column before it (the 5th column in this case) to see whether it is suffixed by any valid content found in step (1).
- Split the content correctly between the 5th and 6th columns.
- Save the fixed result back to the input file.
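The core idea can be sketched as follows. This is illustrative only: the function name, the sample values, and the split logic are made up for this example, not scraparser's actual code.

```python
# Suppose the valid values already seen in the target column are these
# (hypothetical) hospital names:
valid_values = {"Queen Mary Hospital", "Princess Margaret Hospital"}

def split_underflow(prev_cell, cell):
    """If `cell` is empty and `prev_cell` ends with a known valid value,
    split that value off into `cell`."""
    if cell:
        return prev_cell, cell
    for value in valid_values:
        if prev_cell.endswith(value):
            return prev_cell[: -len(value)].strip(), value
    return prev_cell, cell

print(split_underflow("45Queen Mary Hospital", ""))
# -> ("45", "Queen Mary Hospital")
```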
fix date-column-underflow
The command takes CSV filenames either from arguments or from STDIN (one filename per line):
python3 -m scraparser fix date-column-underflow --column=1 --format=DD/MM/YYYY --in-place ./data/local_situation_covid19_tc.<time-string>.csv
This command will:
- Read every row and check whether the cell in the 2nd column is empty (math.isnan()).
- If so, check the column before it (the 1st column in this case) to see whether it is suffixed by a string that matches the specified date format.
- Split the content correctly between the 1st and 2nd columns.
- Save the fixed result back to the input file.
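The date-suffix detection for --format=DD/MM/YYYY can be sketched with a regular expression. Again, this is an illustration of the idea, not scraparser's actual code:

```python
import re

# A DD/MM/YYYY date anchored at the end of the cell.
DATE_RE = re.compile(r"(\d{2}/\d{2}/\d{4})$")

def split_date_suffix(prev_cell, cell):
    """If `cell` is empty and `prev_cell` ends with a DD/MM/YYYY date,
    move that date into `cell`."""
    match = DATE_RE.search(prev_cell)
    if not cell and match:
        return prev_cell[: match.start()].strip(), match.group(1)
    return prev_cell, cell

print(split_date_suffix("123401/02/2021", ""))
# -> ("1234", "01/02/2021")
```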
fix date-column-overflow
The command takes CSV filenames either from arguments or from STDIN (one filename per line):
python3 -m scraparser fix date-column-overflow --column=1 --format=DD/MM/YYYY --in-place ./data/local_situation_covid19_tc.<time-string>.csv
This command will:
- Read every row and check whether the cell in the column before the target column (the 1st column in this case) is empty (math.isnan()).
- If so, check the target column to see whether it is a string suffixed by a date in the specified format.
- Split the content correctly between the 1st and 2nd columns.
- Save the fixed result back to the input file.
fix empty-rows
The command takes CSV filenames either from arguments or from STDIN (one filename per line):
python3 -m scraparser fix empty-rows --in-place ./data/local_situation_covid19_tc.<time-string>.csv
This command will:
- Read every row and find all rows with all but one cell empty (math.isnan()).
- For each such "phantom row", append the content of its single filled cell to the cell directly above it.
- Drop all phantom rows found in step (1).
- Save the fixed result back to the input file.
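The merge-and-drop logic can be sketched like this. The function name and sample rows are hypothetical, chosen only to show a wrapped line being rejoined:

```python
# Sketch of the empty-rows fix (illustrative, not scraparser's actual code).
def merge_phantom_rows(rows):
    """Merge each row that has only one filled cell into the row above,
    then drop it."""
    fixed = []
    for row in rows:
        filled = [(i, cell) for i, cell in enumerate(row) if cell]
        if fixed and len(filled) == 1:
            i, cell = filled[0]
            fixed[-1][i] += cell  # append to the cell directly above
        else:
            fixed.append(list(row))
    return fixed

rows = [
    ["1", "01/02/2021", "Queen Mary Hos"],
    ["", "", "pital"],  # phantom row created by a wrapped line
    ["2", "02/02/2021", "Tuen Mun Hospital"],
]
print(merge_phantom_rows(rows))
```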
Advanced Piping Usage
Parse and Show Result Data
To correctly fix all the issues in the CSV file parsed from the local situation report:
Linux
python3 -m scraparser scrap "https://www.chp.gov.hk/files/pdf/local_situation_covid19_tc.pdf" \
| python3 -m scraparser parse-pdf-to-csv --headers="個案編號,報告日期,發病日期,性別,年齡,入住醫院名稱,住院/出院/死亡,香港/非香港居民,個案分類,確診/疑似個案" \
| python3 -m scraparser fix date-column-underflow --column=1 --in-place \
| python3 -m scraparser fix date-column-overflow --column=1 --in-place \
| python3 -m scraparser fix column-underflow --column=6 --in-place \
| python3 -m scraparser fix column-underflow --column=5 --in-place \
| python3 -m scraparser fix empty-rows --in-place \
| python3 -m scraparser sort --in-place \
| xargs -I{} xdg-open "{}"
macOS
python3 -m scraparser scrap "https://www.chp.gov.hk/files/pdf/local_situation_covid19_tc.pdf" \
| python3 -m scraparser parse-pdf-to-csv --headers="個案編號,報告日期,發病日期,性別,年齡,入住醫院名稱,住院/出院/死亡,香港/非香港居民,個案分類,確診/疑似個案" \
| python3 -m scraparser fix date-column-underflow --column=1 --in-place \
| python3 -m scraparser fix date-column-overflow --column=1 --in-place \
| python3 -m scraparser fix column-underflow --column=6 --in-place \
| python3 -m scraparser fix column-underflow --column=5 --in-place \
| python3 -m scraparser fix empty-rows --in-place \
| python3 -m scraparser sort --in-place \
| xargs -I{} open "{}"
Parse Data then Update Google Sheet
This will overwrite the current data specified in the range. If there are not enough rows in the Google Sheet, the file will be expanded automatically.
Assuming you have defined the shell variable $GOOGLE_SHEET_ID and the target sheet
'CHP/DH Local Situation Input' exists:
python3 -m scraparser scrap "https://www.chp.gov.hk/files/pdf/local_situation_covid19_tc.pdf" \
| python3 -m scraparser parse-pdf-to-csv --headers="個案編號,報告日期,發病日期,性別,年齡,入住醫院名稱,住院/出院/死亡,香港/非香港居民,個案分類,確診/疑似個案" \
| python3 -m scraparser fix date-column-underflow --column=1 --in-place \
| python3 -m scraparser fix date-column-overflow --column=1 --in-place \
| python3 -m scraparser fix column-underflow --column=6 --in-place \
| python3 -m scraparser fix column-underflow --column=5 --in-place \
| python3 -m scraparser fix empty-rows --in-place \
| python3 -m scraparser sort --in-place \
| python3 -m scraparser googlesheet "$GOOGLE_SHEET_ID" update --range="'CHP/DH Local Situation Input'!A2:Z"
Development
First clone this repository by:
git clone https://gitlab.com/yookoala/scraparser.git
cd scraparser
It is recommended to use venv for the development environment.
First, initialize venv and install all the packages specified in requirements.txt:
python3 -m venv .venv
. .venv/bin/activate
pip install -r requirements.txt
Once this is done, you are ready to run the package in the repository folder as if the module was installed locally:
python3 -m scraparser <command>
You can modify the code in the scraparser folder of this repository, and this command will run your modified version.
Build and Submit
Should you want to fork and create your own scraparser package on the Python Package Index,
you may build and release your package (requires make) with the following commands.
Building
To build the package for upload, you need to rename the package to something other
than scraparser. Suppose you suffix the package name with YOURNAME:
PYPI_PKG_NAME=scraparser-YOURNAME make clean dist
The default version is determined from git. If that fails to work, you may force a version string:
PYPI_PKG_VERSION=0.5.0 PYPI_PKG_NAME=scraparser-YOURNAME make clean dist
Please note that the version string MUST follow the PEP 440 convention or it cannot be submitted.
Submitting to test.pypi.org
PYPI_TEST_PASSWORD=<your-pypi-test-token> make upload-test
Submitting to pypi.org
PYPI_PASSWORD=<your-pypi-token> make upload
License
Licensed under the MIT License. The full license text is available in this repository.
File details
Details for the file scraparser-0.3.1.tar.gz.
File metadata
- Download URL: scraparser-0.3.1.tar.gz
- Upload date:
- Size: 15.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.1.3 requests-toolbelt/0.9.1 tqdm/4.45.0 CPython/3.8.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 70de239adb8740316516799a26fffcae40396c261ff0a0311196c9f1da18b494 |
| MD5 | d320e5aeb83a4f316f6289349dd4eee1 |
| BLAKE2b-256 | b7e01a8961d17917c205d57b638c8b870eee77c0e3ec5cee97cea2c2b653d57b |
File details
Details for the file scraparser-0.3.1-py3-none-any.whl.
File metadata
- Download URL: scraparser-0.3.1-py3-none-any.whl
- Upload date:
- Size: 15.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.1.3 requests-toolbelt/0.9.1 tqdm/4.45.0 CPython/3.8.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ce7422c0febd8825aca4f7fa1bd4cf2bd9a146c0371482fa8be2df6c179a9f3d |
| MD5 | fad1a89c6d3188a93f844a9cfc1f362b |
| BLAKE2b-256 | c1c3de9fd1aa0250172bc6545942c52195596d01f0d59695a6a972efeb42cc9d |