Blackstone's Criminal Practice 2022 PDF Scraper
IMPORTANT: This tool scrapes PDF files of Blackstone's Criminal Practice 2022 that have been downloaded from LexisLibrary. It will NOT work unless you already have the PDF files.
Introduction
This is a mini project and my first attempt at writing a package and pushing it to GitHub and PyPI. Aside from giving my friend a quicker way to copy from the law document, I used it as an opportunity to practice version control, sharpen my understanding of regex and pattern recognition, learn how to write and publish packages on PyPI, and practice good documentation habits.
This package is designed to extract the sections and subsections of Blackstone's Criminal Practice 2022 documents that are listed in a JSON file. The text is then formatted and exported to a Word document (.docx) following the structure described in that JSON file.
TL;DR - A tool to extract the required subsections of Blackstone's Criminal Practice 2022 (from Lexis Library) into a Word document.
NOTE: This program was designed purely for Part D of the Blackstone's Criminal Practice 2022 Document from Lexis Library.
Update 06/10/22: The program seems to work fine on Parts D, E, F and R from Blackstone's Criminal Practice 2022 from Lexis Library.
How to use the tool
- Use pip to install the package as follows.
    pip install bcpscraper
- Create a folder called data that contains all the PDF files. Name each file after its section (e.g. D5.pdf or E14.pdf). For more information, refer to the "PDF File Naming Convention" section below.
- Create a JSON file with the structure below so the program knows which sections and subsections to extract and how to organise them in the Word document. This is the instruction file that is read by the program.
    // All text in angle brackets <> is a variable and can be named according to preference.
    // All other text is a constant used as a key throughout the program.
    {
        "doc_title": "",    // This is the title of the .docx file that will be created.
        "doc_data": {       // This is the data that the program should look for.
            // Start of a Topic
            "<topic_name>": {   // This is the start of a topic. There can be as many topics as you want within this JSON file.
                "title": "",    // The title of this topic.
                "sections": {   // The sections and subsections that the program should look for
                    // The keys here are actually variables but I've displayed them as text as an example situation.
                    "D5": [1,2,3,4,5],      // Use a list for the subsections within that particular section
                    "D9": [2,3,4,5,6,7,8]   // Example: D5.1 - D5.5 and D9.2 - D9.8
                    . . .
                }
            }
            // End of a Topic
            . . .
        }
    }
- Import bcpscraper into your project.
    import bcpscraper as bcp
- Specify the path of the JSON file when creating the bcp.DocxWriter object.
    writer = bcp.DocxWriter('example.json')
- Use the function createDocument(folder) to create the document. The parameter folder is the directory that the Word file will be exported to.
    writer.createDocument('output')  # This stores it in the output folder
An overall example of how this looks in your code:
    import bcpscraper as bcp
    path = 'example.json'
    writer = bcp.DocxWriter(path)
    code = writer.createDocument('output')
This is shown in example.py and example.json.
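The instruction file described in the steps above can also be generated from Python rather than written by hand. The following is a minimal sketch, not part of bcpscraper: the helper names build_instructions and write_instructions, the topic name topic_1 and the titles are all hypothetical. It only reproduces the key layout documented above ("doc_title", "doc_data", "title", "sections").

```python
import json
from pathlib import Path


def build_instructions(doc_title, topics):
    """Build a dict in the instruction-file layout described in this README.

    `topics` maps a topic name to a (title, sections) pair, where `sections`
    maps a section code such as "D5" to an iterable of subsection numbers.
    This helper is illustrative and is not provided by bcpscraper.
    """
    doc_data = {}
    for topic_name, (title, sections) in topics.items():
        doc_data[topic_name] = {
            "title": title,
            # Store each subsection selection as a plain list, e.g. [1, 2, 3].
            "sections": {code: list(subs) for code, subs in sections.items()},
        }
    return {"doc_title": doc_title, "doc_data": doc_data}


def write_instructions(path, instructions):
    """Write the instruction dict to `path` as indented JSON."""
    Path(path).write_text(json.dumps(instructions, indent=4), encoding="utf-8")


# Hypothetical topic selecting D5.1 - D5.5 and D9.2 - D9.8.
instructions = build_instructions(
    "Example notes",
    {"topic_1": ("Preliminary proceedings", {"D5": range(1, 6), "D9": range(2, 9)})},
)
print(json.dumps(instructions, indent=4))
```

Calling write_instructions('example.json', instructions) would save the file for use with DocxWriter. Note that Python's standard json module does not accept // comments, so a file generated this way contains only the keys and values.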
PDF File Naming Convention
Name the PDF file based on the Part and Section that it belongs to. For example:
Part D5 - Starting a Prosecution and Preliminary Proceedings in Magistrates' Court should be named as D5.pdf.
The file should be saved in a folder called data.
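Before running the scraper it can help to check that the files follow this convention. The sketch below is not part of bcpscraper, and the helper names are hypothetical. It assumes a file name is one uppercase Part letter, the section number and a lowercase .pdf extension, which matches the D5.pdf and E14.pdf examples but may be stricter than what the package itself accepts.

```python
import re
from pathlib import Path

# Assumed pattern: uppercase Part letter + section number + ".pdf" (e.g. D5.pdf).
PDF_NAME = re.compile(r"[A-Z]\d+\.pdf")


def is_valid_pdf_name(name):
    """Return True if `name` matches the assumed naming convention."""
    return PDF_NAME.fullmatch(name) is not None


def missing_pdfs(section_codes, data_dir="data"):
    """Return the section codes that have no matching PDF in `data_dir`."""
    folder = Path(data_dir)
    return [code for code in section_codes if not (folder / f"{code}.pdf").is_file()]


for candidate in ["D5.pdf", "E14.pdf", "Part D5.pdf"]:
    print(candidate, is_valid_pdf_name(candidate))
```

Passing the section keys from the instruction file to missing_pdfs gives a quick list of PDFs that still need to be downloaded and placed in the data folder.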
Future Work
- Different log category classifications.
- Introduce tests in the code.
Documentation
Check out the code documentation wiki page for the official documentation.