# Scraper Tutorial
This is a tutorial on how to use the SCALES Scraper tool to download data from PACER.
The scraper has three modules:
- Query Scraper: downloads case query results
- Docket Scraper: downloads case dockets
- Document Scraper: downloads case documents
This tutorial steps through the basics of getting set up with the scraping tool and using each module. For more details, see the documentation.
The tutorial uses the PACER demo site located here: https://dcecf.psc.uscourts.gov/cgi-bin/ShowIndex.pl

This is a demo version of PACER with demo credentials that can be used for free. It contains a sample of cases from the Western District of New York (`nywd`). However, because it runs on its own domain under `psc` (PACER Service Center) rather than on the `nywd` PACER site, we will use `psc` as the court abbreviation for this tutorial.
## 1. Getting Started
- Install the package:

```sh
pip install pacer-tools
```

- Make sure you have a recent version of Firefox installed (80.0+), along with GeckoDriver for Firefox.
**Download folder**

For this tutorial we are going to use the resources in the `/demo` directory of this repo and will put our data into `demo/pacer`. The scraper separates data out by district, so it's best to have a subdirectory for each district, named by court abbreviation (e.g. `demo/pacer/ilnd` for the Northern District of Illinois). When the scraper runs, it will build the structure it needs inside that subdirectory to download and house the data from PACER.

Since we are using the PACER demo, we will use its court abbreviation, `psc` (PACER Service Center). The scraper takes an `inpath` argument, to which we will pass `demo/pacer/psc`.
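The layout described above can be set up front with a couple of lines of Python. This is just a convenience sketch using the tutorial's paths; the scraper will also create what it needs on its own:

```python
from pathlib import Path

# One subdirectory per district, named by court abbreviation;
# the scraper builds any further structure it needs inside.
base = Path("demo/pacer")
for court in ("psc",):  # add e.g. "ilnd", "nywd" for real districts
    (base / court).mkdir(parents=True, exist_ok=True)
```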
## 2. PACER credentials

For most uses you will need to put your PACER login details into a JSON file. For this tutorial we'll be using the PACER training site, with the login details contained in `demo/auth.json`. When you run the scraper with your own credentials, you can use that file as a template.
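For orientation, a credentials file of this kind typically holds just a username and password. The key names below are illustrative, not authoritative; copy the real `demo/auth.json` as your template rather than this sketch:

```json
{
    "user": "your-pacer-username",
    "pass": "your-pacer-password"
}
```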
## 3. Query Scraper

The first thing we'll do with the scraper is download some query results. There is a demo query located at `demo/query_conf.json`. This is a JSON file that maps search criteria to fields in the PACER query form. To create your own query later, you can use the query builder (see the documentation).
Throughout this tutorial we will be using the `scraper` command from the PACER-tools command-line utility. Run `pacer-tools scraper` to see the full set of arguments.
**Running the script**

To use the Query Scraper, we just need to run the following:

```sh
pacer-tools scraper --override-time --query-conf demo/query_conf.json demo/pacer/psc
```
- The `--override-time` flag overrides the scraper's time restriction (it is designed to be run overnight).
- The `--query-conf` option points the scraper to a JSON config file with the parameters for our query.
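As a rough illustration of the idea, a config of this kind maps search criteria to PACER query-form fields. The keys below are purely hypothetical; the real schema is whatever `demo/query_conf.json` and the query builder documentation define:

```json
{
    "filed_from": "1/1/2007",
    "filed_to": "12/31/2007",
    "case_type": "cv"
}
```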
The user will be prompted for the following:
- Mode: for this step we want to choose `query`
- Court: for the demo site, the court abbreviation to enter is `psc`
- Auth path: the relative path to our PACER login credentials, which for this tutorial is `demo/auth.json`
- Case limit: limits the number of cases downloaded in a single session, to prevent accidental overspending on PACER. For this example, let's enter 50.
Note: all of the parameters the user was prompted for can instead be given as arguments to the script (explained in full in the documentation). To avoid the prompts, you can run:

```sh
pacer-tools scraper --override-time --query-conf demo/query_conf.json -m query -c psc -a demo/auth.json -cl 50 demo/pacer/psc
```
**Result**

Once these values have all been entered, the Scraper should launch and download the query results. You should see the following message in the terminal:

```
Query results saved to <path_to_psc>/psc/queries
```

If you navigate to the `psc` folder, you will see firstly that a few subfolders have been created to house the data, and secondly that the `queries` folder contains a .html file with the query results.
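If you want to peek at the results programmatically, the query file is plain HTML, so the standard library is enough to pull out case links. The markup assumed below (anchor tags whose `href` contains `DktRpt.pl`, the usual PACER docket-report endpoint) is an assumption about the file's structure; adjust to what you actually see in your download:

```python
from html.parser import HTMLParser

class CaseLinkParser(HTMLParser):
    """Collect hrefs from <a> tags that look like PACER docket links."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            if "DktRpt.pl" in href:  # assumption: docket links use DktRpt.pl
                self.links.append(href)

parser = CaseLinkParser()
parser.feed('<a href="/cgi-bin/DktRpt.pl?12345">1:07-cv-00001</a>')
print(parser.links)  # ['/cgi-bin/DktRpt.pl?12345']
```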
## 4. Docket Scraper
Next we will take that query results file and download all of the dockets for the listed cases. The Docket Scraper module can take a .html query file, which we have just downloaded, as its input.
**Running the script**

To use the Docket Scraper, we will run the following:

```sh
pacer-tools scraper -m docket --docket-input demo/pacer/psc/queries/<query_file>.html -c psc -a demo/auth.json -cl 50 --override-time demo/pacer/psc
```
- The `--docket-input` option takes the path to the query file. The actual name of the query file (`<query_file>`) will vary, as it includes a timestamp.
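Since the filename is timestamped, one way to grab the most recent query file without typing it out is to sort by modification time. This is a convenience sketch, not part of pacer-tools itself:

```python
import glob
import os

def latest_query_file(queries_dir):
    """Return the most recently modified .html file in the queries folder."""
    files = glob.glob(os.path.join(queries_dir, "*.html"))
    return max(files, key=os.path.getmtime) if files else None

# e.g. latest_query_file("demo/pacer/psc/queries")
```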
The Docket Scraper (as well as the Document Scraper, which we will look at next) runs asynchronously across multiple Firefox instances, two by default. The number of instances (workers) can be adjusted with the `n-workers` option (see the documentation).
Note: the scraper only keeps civil and criminal cases; to download a specific case type, you can use the `--case-type` option.
**Result**

Once both browsers have finished and closed, all of the cases from the query results file should be downloaded and can be found in `demo/pacer/psc/html`.
## 5. Document Scraper
Lastly, we will get the actual documents associated with the docket lines of the cases. The Document Scraper can take a few different types of input, including a list of specific cases, but for this tutorial we will give it the directory of docket .html files as input, so that documents for all cases will be downloaded. By default, all documents and attachments are downloaded for each case.
**Running the script**

To use the Document Scraper, we run the following:

```sh
pacer-tools scraper -m document -c psc -a demo/auth.json -cl 50 --override-time --document-input demo/document_input.csv demo/pacer/psc
```
- There is a default limit of 1000 documents per case; any case with more than 1000 documents will be skipped. This limit can be changed with the `--document-limit` option.
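The case and document limits exist because every download on the real PACER (unlike the demo site) costs money: at the time of writing, fees are $0.10 per page, capped at $3.00 per document. A quick back-of-the-envelope estimator, not part of pacer-tools:

```python
def pacer_cost(pages_per_doc):
    """Estimate PACER fees: $0.10 per page, capped at $3.00 per document."""
    return sum(min(0.10 * pages, 3.00) for pages in pages_per_doc)

# Three documents of 5, 12, and 80 pages:
print(round(pacer_cost([5, 12, 80]), 2))  # 4.7
```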
**Result**

The Document Scraper will usually take significantly longer to run than the Docket Scraper, given the volume of documents in most cases. Once the documents have finished downloading, they can be found in the `demo/pacer/psc/docs` folder.
**Attachments and specific documents**

- To skip docket line attachments, you can use the `--no-document-att` flag.
- To get specific documents from specific cases, you can use the `--document-input` option to pass a .csv file with case ids and the specific documents to retrieve; see the documentation for more.
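As a sketch, such an input file could be generated with the `csv` module. The column names below (`ucid`, `documents`) and the case-id format are hypothetical; check the documentation for the exact format pacer-tools expects:

```python
import csv

# Hypothetical rows: a case id plus the docket-line documents to fetch
rows = [
    {"ucid": "psc;;1:07-cv-00001", "documents": "1,3,7"},
    {"ucid": "psc;;1:07-cv-00002", "documents": "2"},
]

with open("document_input.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["ucid", "documents"])
    writer.writeheader()
    writer.writerows(rows)
```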
To see more specifics, options and use cases check out the detailed documentation here.
# Parser Tutorial
This short section explains how to use the SCALES Parser tool to read HTMLs downloaded from PACER and convert them into JSON format. The parser takes as its input the results of running the Docket Scraper, namely a folder of HTMLs.
**Running the script**

To use the parser on the HTMLs from the Docket Scraper in the previous tutorial, we will simply run the following:

```sh
pacer-tools parser demo/pacer/psc/html
```
**Result**

Once the parser has finished, all the parsed versions of the HTML files can be found in `demo/pacer/psc/json`.
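Each output file is ordinary JSON, so inspecting a parsed case from Python is straightforward. The helper below is a generic sketch; the field names inside each file depend on the SCALES schema (see the documentation):

```python
import json

def case_fields(path):
    """Return the sorted top-level field names of a parsed case file."""
    with open(path) as f:
        return sorted(json.load(f).keys())

# e.g. case_fields("demo/pacer/psc/json/<case_file>.json")
```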
To see more specifics, options, and details on the JSON schema, check out the detailed documentation here.