
An API to scrape American court websites for metadata.

What is This?

Juriscraper is a scraper library started many years ago that gathers judicial opinions, oral arguments, and PACER data in the American court system. It is currently able to scrape:

  • a variety of pages and reports within the PACER system

  • opinions from all major federal appellate courts

  • opinions from all state courts of last resort (typically the state's “Supreme Court”), except Georgia's

  • oral arguments from all appellate federal courts that offer them

Juriscraper is one half of a two-part system. The other half is your code, which calls Juriscraper. Your code is responsible for calling a scraper, then downloading and saving its results. A reference implementation of the caller has been developed and is in use at CourtListener.com; its code is available in the CourtListener source repository. There is also a basic sample caller included in Juriscraper that can be used for testing or as a starting point when developing your own.
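The division of labor described above can be sketched as follows. This is a minimal illustration and not the CourtListener caller or the bundled sample caller: save_results, fake_site, and fake_fetch are hypothetical names, and the "download_urls" key is an assumption about how a scraper's items are keyed. A stand-in list takes the place of a parsed scraper so the sketch runs without network access.

```python
import tempfile
from pathlib import Path
from typing import Callable, Iterable, Mapping


def save_results(
    site: Iterable[Mapping[str, object]],
    out_dir: Path,
    fetch: Callable[[str], bytes],
) -> list:
    """Download every item a scraper reports and save it under out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for index, item in enumerate(site):
        # "download_urls" is an assumed key name for the document's URL.
        url = str(item["download_urls"])
        target = out_dir / f"{index:04d}.bin"
        target.write_bytes(fetch(url))
        saved.append(target)
    return saved


# Stand-in for a parsed scraper: any iterable of dict-like items works.
fake_site = [
    {"case_names": "Doe v. Roe", "download_urls": "https://example.com/a.pdf"},
    {"case_names": "In re Example", "download_urls": "https://example.com/b.pdf"},
]


def fake_fetch(url: str) -> bytes:
    """Hypothetical downloader; a real caller would issue an HTTP request."""
    return f"contents of {url}".encode()


with tempfile.TemporaryDirectory() as tmp:
    paths = save_results(fake_site, Path(tmp), fake_fetch)
    print([p.name for p in paths])
```

Passing the downloader in as an argument keeps storage and networking decisions in the caller, which is the separation the two-part design is after.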

Some of the design goals for this project are:

  • extensibility to support video, oral argument audio, etc.

  • extensibility to support geographies (US, Cuba, Mexico, California)

  • MIME type identification through magic numbers

  • generalized architecture with minimal code repetition

  • XPath-based scraping powered by lxml’s HTML parser

  • return all metadata available on court websites (caller can pick what it needs)

  • no need for a database

  • clear log levels (DEBUG, INFO, WARN, CRITICAL)

  • as friendly as possible to court websites
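To illustrate the "MIME type identification through magic numbers" goal, here is a minimal sketch of the general technique: compare the first bytes of a downloaded file against known signatures instead of trusting the URL's extension. It is not Juriscraper's own implementation; MAGIC_NUMBERS and sniff_mime_type are hypothetical names, and the signature table is deliberately tiny.

```python
from typing import Optional

# A few well-known file signatures (leading "magic" bytes).
MAGIC_NUMBERS = [
    (b"%PDF-", "application/pdf"),
    (b"PK\x03\x04", "application/zip"),  # ZIP local file header
    (b"ID3", "audio/mpeg"),  # MP3 that starts with an ID3v2 tag
]


def sniff_mime_type(data: bytes) -> Optional[str]:
    """Return a MIME type based on the leading bytes, or None if unknown."""
    for signature, mime_type in MAGIC_NUMBERS:
        if data.startswith(signature):
            return mime_type
    return None


print(sniff_mime_type(b"%PDF-1.7 rest of file"))
print(sniff_mime_type(b"<html><body>not a document</body></html>"))
```

A check like this matters for court websites because a link ending in .pdf can return an HTML error page, and the leading bytes reveal the difference.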

Installation & Dependencies

First step: Install Python 3.9+, then:

Install the dependencies

On Debian- and Ubuntu-based distributions:

sudo apt-get install libxml2-dev libxslt-dev libyaml-dev

On Arch-based distributions:

sudo pacman -S libxml2 libxslt libyaml

On macOS with Homebrew <https://brew.sh>:

brew install libyaml

Then install the code

pip install juriscraper

Optionally, set an environment variable to choose where logs are written. If you skip this step, /var/log/juriscraper/debug.log is used as the default, provided it exists on the filesystem:

export JURISCRAPER_LOG=/path/to/your/log.txt
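The lookup order described above (the JURISCRAPER_LOG variable first, then the default path only if it already exists) can be sketched like this. It is an illustration of the rule, not Juriscraper's own logging code: resolve_log_path is a hypothetical helper, and returning None when neither location is usable is an assumption.

```python
import os
from typing import Mapping, Optional

DEFAULT_LOG_PATH = "/var/log/juriscraper/debug.log"


def resolve_log_path(
    environ: Mapping[str, str] = os.environ,
    default: str = DEFAULT_LOG_PATH,
) -> Optional[str]:
    """Return the log file to use, or None if no usable path is found."""
    configured = environ.get("JURISCRAPER_LOG")
    if configured:
        return configured
    # Fall back to the default only when it already exists on disk.
    return default if os.path.exists(default) else None


print(resolve_log_path({"JURISCRAPER_LOG": "/path/to/your/log.txt"}))
```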

Finally, set up your WebDriver

Some websites are too difficult to crawl without an automated WebDriver. For these, Juriscraper either uses a locally installed copy of geckodriver or can be configured to connect to a remote webdriver. If you prefer the local installation, download geckodriver, the WebDriver implementation for Firefox:

# choose OS compatible package from:
#   https://github.com/mozilla/geckodriver/releases/tag/v0.26.0
# un-tar/zip your download
sudo mv geckodriver /usr/local/bin

If you prefer to use a remote webdriver, like Selenium’s docker image, you can configure it with the following variables:

WEBDRIVER_CONN: Use this to set the connection string for your remote webdriver. The default is local, meaning Juriscraper looks for a local installation of geckodriver. Set it instead to something like 'http://YOUR_DOCKER_IP:4444/wd/hub' to switch to a remote driver at that location.

SELENIUM_VISIBLE: Set this to any value to disable headless mode in your Selenium driver, if the driver supports it. Otherwise, the driver runs headless.

For example, if you want to watch a headless browser run, you can do so by starting selenium with:

docker run \
    -p 4444:4444 \
    -p 5900:5900 \
    -v /dev/shm:/dev/shm \
    selenium/standalone-firefox-debug

That launches Selenium on your local machine with two open ports. Port 4444 is the image's default for accessing the webdriver. Port 5900 accepts connections from a VNC viewer, which lets you watch progress if the SELENIUM_VISIBLE variable is set.

Once you have selenium running like that, you can do a test like:

WEBDRIVER_CONN='http://localhost:4444/wd/hub' \
    SELENIUM_VISIBLE=yes \
    python sample_caller.py -c juriscraper.opinions.united_states.state.kan_p

Kansas’s precedential scraper uses a webdriver. If you do this and watch selenium, you should see it in action.
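How the two variables combine can be sketched as a small settings reader. This is an illustration of the documented behavior and not Juriscraper's internal code: WebDriverSettings and read_webdriver_settings are hypothetical names, and treating an empty SELENIUM_VISIBLE value as unset is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class WebDriverSettings:
    connection: str  # "local" or the URL of a remote webdriver
    remote: bool  # True when a remote webdriver should be used
    headless: bool  # True unless SELENIUM_VISIBLE is set


def read_webdriver_settings(
    environ: Mapping[str, str] = os.environ,
) -> WebDriverSettings:
    """Derive webdriver settings from WEBDRIVER_CONN and SELENIUM_VISIBLE."""
    connection = environ.get("WEBDRIVER_CONN", "local")
    return WebDriverSettings(
        connection=connection,
        remote=connection != "local",
        # Any non-empty value for SELENIUM_VISIBLE turns headless mode off.
        headless=not environ.get("SELENIUM_VISIBLE"),
    )


print(read_webdriver_settings({}))
print(
    read_webdriver_settings(
        {
            "WEBDRIVER_CONN": "http://localhost:4444/wd/hub",
            "SELENIUM_VISIBLE": "yes",
        }
    )
)
```

With no variables set, the reader reports a local, headless driver; with the values from the test command above, it reports a remote, visible one.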

Contributing

We welcome contributions! If you’d like to get involved, please take a look at our CONTRIBUTING.md guide for instructions on setting up your environment, running tests, and more.

License

Juriscraper is licensed under the permissive BSD license.
