An API to scrape American court websites for metadata.
What is This?
Juriscraper is a scraper library started many years ago that gathers judicial opinions, oral arguments, and PACER data in the American court system. It is currently able to scrape:
a variety of pages and reports within the PACER system
opinions from all major federal appellate courts
opinions from all state courts of last resort (typically their “Supreme Court”) except for Georgia
oral arguments from all appellate federal courts that offer them
Juriscraper is one half of a two-part system. The other half is your code, which calls Juriscraper. Your code is responsible for calling a scraper, then downloading and saving its results. A reference implementation of the caller has been developed and is in use at CourtListener.com. The code for that caller can be found here. There is also a basic sample caller included in Juriscraper that can be used for testing or as a starting point when developing your own.
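The division of labor described above can be sketched as follows. This is a minimal illustration, not the reference caller: `collect_items` and `FakeSite` are hypothetical names, and the sketch assumes that a scraper object exposes `parse()` and can then be iterated to yield one dict per case, with keys such as `case_names` and `download_urls`.

```python
from typing import Any, Dict, List


def collect_items(site: Any) -> List[Dict[str, Any]]:
    """Run one scraper and return its results as a list of dicts.

    Assumption: `site` has a `parse()` method and is iterable afterwards.
    """
    site.parse()  # fetch and parse the court's page
    return [dict(item) for item in site]


class FakeSite:
    """Offline stand-in for a real scraper (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self._items: List[Dict[str, Any]] = []

    def parse(self) -> "FakeSite":
        # A real scraper would hit the network here.
        self._items = [
            {
                "case_names": "Doe v. Roe",
                "download_urls": "https://example.com/1.pdf",
            },
        ]
        return self

    def __iter__(self):
        return iter(self._items)


if __name__ == "__main__":
    for item in collect_items(FakeSite()):
        # The caller's job: download item["download_urls"] and save it.
        print(item["case_names"], item["download_urls"])
```

With Juriscraper installed, the stand-in could presumably be swapped for a real scraper, for example a `Site()` from one of the modules under `juriscraper.opinions.united_states`; downloading and storage remain the caller's responsibility.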
Some of the design goals for this project are:
extensibility to support video, oral argument audio, etc.
extensibility to support geographies (US, Cuba, Mexico, California)
MIME type identification through magic numbers
generalized architecture with minimal code repetition
XPath-based scraping powered by lxml’s HTML parser
return all metadata available on court websites (caller can pick what it needs)
no need for a database
clear log levels (DEBUG, INFO, WARN, CRITICAL)
as friendly as possible to court websites
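To illustrate the "magic numbers" goal in the list above: a file's type can often be recognized from its first few bytes rather than from its extension or the server's headers. The sketch below is a generic, standard-library-only illustration of that technique; the signature table and the `sniff_mime_type` name are assumptions for the example and are not taken from Juriscraper's code.

```python
from typing import Optional

# Small illustrative signature table; real-world detection covers many more.
MAGIC_NUMBERS = [
    (b"%PDF-", "application/pdf"),
    (b"PK\x03\x04", "application/zip"),  # also the container used by .docx
    (b"ID3", "audio/mpeg"),  # MP3 file that starts with an ID3v2 tag
]


def sniff_mime_type(content: bytes) -> Optional[str]:
    """Return a MIME type based on the leading bytes, or None if unknown."""
    for signature, mime in MAGIC_NUMBERS:
        if content.startswith(signature):
            return mime
    return None


if __name__ == "__main__":
    print(sniff_mime_type(b"%PDF-1.7\n"))
    print(sniff_mime_type(b"<html></html>"))
```

Checking content rather than trusting a file extension matters for court websites, which sometimes serve documents under misleading names or generic content types.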
Installation & Dependencies
First step: Install Python 3.9+, then:
Install the dependencies
On Ubuntu-based distributions/Debian Linux:
sudo apt-get install libxml2-dev libxslt-dev libyaml-dev
On Arch-based distributions:
sudo pacman -S libxml2 libxslt libyaml
On macOS with Homebrew <https://brew.sh>:
brew install libyaml
Then install the code
pip install juriscraper
You can set an environment variable for where you want to stash your logs (this can be skipped, and /var/log/juriscraper/debug.log will be used as the default if it exists on the filesystem):
export JURISCRAPER_LOG=/path/to/your/log.txt
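If your own caller wants to log to the same place, it could resolve the path the same way. The following is a sketch under stated assumptions: `resolve_log_path` is a hypothetical helper, and it only mirrors the "environment variable, else default path" rule described above. It does not model what Juriscraper does when the default file is missing.

```python
import os
from typing import Mapping, Optional

# Default location named in the text above.
DEFAULT_LOG = "/var/log/juriscraper/debug.log"


def resolve_log_path(env: Optional[Mapping[str, str]] = None) -> str:
    """Return JURISCRAPER_LOG if set, otherwise the default path."""
    env = os.environ if env is None else env
    return env.get("JURISCRAPER_LOG", DEFAULT_LOG)


if __name__ == "__main__":
    print(resolve_log_path())
```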
Finally, set up your WebDriver
Some websites are too difficult to crawl without some sort of automated WebDriver. For these, Juriscraper either uses a locally installed copy of geckodriver or can be configured to connect to a remote webdriver. If you prefer the local installation, you can download Selenium Firefox Geckodriver:
# choose OS compatible package from:
# https://github.com/mozilla/geckodriver/releases/tag/v0.26.0
# un-tar/zip your download
sudo mv geckodriver /usr/local/bin
If you prefer to use a remote webdriver, like Selenium’s docker image, you can configure it with the following variables:
WEBDRIVER_CONN: Use this to set the connection string to your remote webdriver. By default, this is 'local', meaning it will look for a local installation of geckodriver. Instead, you can set this to something like 'http://YOUR_DOCKER_IP:4444/wd/hub', which will switch it to using a remote driver and connect it to that location.
SELENIUM_VISIBLE: Set this to any value to disable headless mode in your selenium driver, if it supports it. Otherwise, it defaults to headless.
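The two variables above boil down to two decisions: local or remote driver, and headless or visible. The sketch below shows one way to read them; it is not Juriscraper's implementation. `WebDriverSettings` and `read_webdriver_settings` are hypothetical names, and treating an empty `SELENIUM_VISIBLE` as unset is an assumption of this example.

```python
import os
from typing import Mapping, NamedTuple, Optional


class WebDriverSettings(NamedTuple):
    conn: str       # 'local' or a remote URL
    remote: bool    # True when conn points at a remote webdriver
    headless: bool  # False when SELENIUM_VISIBLE is set


def read_webdriver_settings(
    env: Optional[Mapping[str, str]] = None,
) -> WebDriverSettings:
    """Interpret WEBDRIVER_CONN and SELENIUM_VISIBLE as described above."""
    env = os.environ if env is None else env
    conn = env.get("WEBDRIVER_CONN", "local")
    # Assumption: any non-empty value disables headless mode.
    visible = bool(env.get("SELENIUM_VISIBLE"))
    return WebDriverSettings(
        conn=conn, remote=(conn != "local"), headless=not visible
    )


if __name__ == "__main__":
    print(read_webdriver_settings())
```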
For example, if you want to watch a headless browser run, you can do so by starting selenium with:
docker run \
-p 4444:4444 \
-p 5900:5900 \
-v /dev/shm:/dev/shm \
selenium/standalone-firefox-debug
That’ll launch it on your local machine with two open ports. 4444 is the default on the image for accessing the webdriver. 5900 can be used to connect via a VNC viewer, and can be used to watch progress if the SELENIUM_VISIBLE variable is set.
Once you have selenium running like that, you can do a test like:
WEBDRIVER_CONN='http://localhost:4444/wd/hub' \
SELENIUM_VISIBLE=yes \
python sample_caller.py -c juriscraper.opinions.united_states.state.kan_p
Kansas’s precedential scraper uses a webdriver. If you do this and watch selenium, you should see it in action.
Contributing
We welcome contributions! If you’d like to get involved, please take a look at our CONTRIBUTING.md guide for instructions on setting up your environment, running tests, and more.
License
Juriscraper is licensed under the permissive BSD license.
File details
Details for the file juriscraper-2.7.5.tar.gz.
File metadata
- Download URL: juriscraper-2.7.5.tar.gz
- Upload date:
- Size: 375.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6d0768945c8b10408894f710f30d5cfab56e6a89b7b97ba4eea41c40ae5fa915 |
| MD5 | 719c416d0d86d0e9c46520e478fdc09b |
| BLAKE2b-256 | cd04aae3dde2abf5e7d95cd288e3fde8fd28edf87f8e7ea7ca1adcdc4a0a14f8 |
File details
Details for the file juriscraper-2.7.5-py3-none-any.whl.
File metadata
- Download URL: juriscraper-2.7.5-py3-none-any.whl
- Upload date:
- Size: 599.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | eb1150fea3107244277b884b8fdc53764a44d18a5ece6b9306d089ba9542206a |
| MD5 | 7d618486580aee25f877cd7db09318ed |
| BLAKE2b-256 | dafb57d1cc125bc01092dd9702ce4ae2432cde06622842fdc3b7a92e36fe83f1 |