
An API to scrape American court websites for metadata.


What is This?

Juriscraper is a scraper library, started many years ago, that gathers judicial opinions, oral arguments, and PACER data from the American court system. It is currently able to scrape:

  • a variety of pages and reports within the PACER system

  • opinions from all major federal appellate courts

  • opinions from all state courts of last resort except Georgia (typically called the state’s “Supreme Court”)

  • oral arguments from all federal appellate courts that offer them

Juriscraper is part of a two-part system. The second part is your code, which calls Juriscraper. Your code is responsible for calling a scraper, then downloading and saving its results. A reference implementation of the caller has been developed and is in use at CourtListener.com. The code for that caller can be found here. There is also a basic sample caller included in Juriscraper that can be used for testing or as a starting point when developing your own.
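The division of labor can be sketched with a toy caller. `DummySite` below is a hypothetical stand-in for a real scraper class; its method names and dictionary keys are illustrative assumptions, not Juriscraper’s actual API surface.

```python
# A toy caller illustrating the two-part design: Juriscraper scrapes,
# your code downloads and saves. DummySite is a hypothetical stand-in
# for a real scraper class; its attributes are illustrative only.
from typing import Iterator


class DummySite:
    """Stand-in for a Juriscraper scraper site (illustrative)."""

    def parse(self) -> "DummySite":
        # A real scraper would fetch and parse a court website here.
        self._items = [
            {"case_name": "Smith v. Jones",
             "download_url": "https://example.com/opinion.pdf"},
        ]
        return self

    def __iter__(self) -> Iterator[dict]:
        return iter(self._items)


def run_caller(site: DummySite) -> list[dict]:
    """The caller's responsibilities: trigger the scrape, then
    download and persist each result (persistence elided here)."""
    site.parse()
    saved = []
    for item in site:
        # A real caller would fetch item["download_url"] and save the file.
        saved.append(item)
    return saved


print(run_caller(DummySite())[0]["case_name"])
```

The key point is the separation of concerns: the scraper only produces metadata, and everything about storage is the caller’s decision.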

Some of the design goals for this project are:

  • extensibility to support video, oral argument audio, etc.

  • extensibility to support other geographies (US, Cuba, Mexico, California)

  • MIME type identification through magic numbers

  • generalized architecture with minimal code repetition

  • XPath-based scraping powered by lxml’s html parser

  • return of all metadata available on court websites (the caller can pick what it needs)

  • no need for a database

  • clear log levels (DEBUG, INFO, WARN, CRITICAL)

  • being as friendly as possible to court websites
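As a concrete illustration of the magic-number goal above: a file’s type can be identified from its leading byte signature rather than by trusting file extensions or server headers. The sniffer below is a minimal, hypothetical sketch, not Juriscraper’s actual implementation.

```python
# Minimal MIME-type sniffing via magic numbers (leading byte signatures).
# Illustrative only; not Juriscraper's implementation.
SIGNATURES = {
    b"%PDF": "application/pdf",         # PDF opinions
    b"\x89PNG\r\n\x1a\n": "image/png",  # PNG images
    b"PK\x03\x04": "application/zip",   # ZIP-based formats (e.g. docx)
    b"{\\rtf": "application/rtf",       # RTF documents
}


def sniff_mime(data: bytes) -> str:
    """Return a MIME type guessed from the file's first bytes."""
    for magic, mime in SIGNATURES.items():
        if data.startswith(magic):
            return mime
    return "application/octet-stream"


print(sniff_mime(b"%PDF-1.7\n..."))  # application/pdf
```

This matters for court websites because a URL ending in `.pdf` may actually serve HTML (an error page, say), and the content type a server reports is not always accurate.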

Installation & Dependencies

First step: Install Python 3.9+, then:

Install the dependencies

On Debian or Ubuntu-based distributions:

sudo apt-get install libxml2-dev libxslt-dev libyaml-dev

On Arch-based distributions:

sudo pacman -S libxml2 libxslt libyaml

On macOS with Homebrew <https://brew.sh>:

brew install libyaml

Then install the code

pip install juriscraper

You can set an environment variable to choose where your logs are stashed (this can be skipped; /var/log/juriscraper/debug.log is used as the default if it exists on the filesystem):

export JURISCRAPER_LOG=/path/to/your/log.txt

Finally, set up your WebDriver

Some websites are too difficult to crawl without an automated WebDriver. For these, Juriscraper either uses a locally installed copy of geckodriver or can be configured to connect to a remote WebDriver. If you prefer the local installation, you can download Selenium’s Firefox geckodriver:

# choose OS compatible package from:
#   https://github.com/mozilla/geckodriver/releases/tag/v0.26.0
# un-tar/zip your download
sudo mv geckodriver /usr/local/bin

If you prefer to use a remote webdriver, like Selenium’s docker image, you can configure it with the following variables:

WEBDRIVER_CONN: Use this to set the connection string to your remote WebDriver. By default, this is local, meaning Juriscraper will look for a local installation of geckodriver. Instead, you can set it to something like 'http://YOUR_DOCKER_IP:4444/wd/hub', which switches it to a remote driver and connects to that location.

SELENIUM_VISIBLE: Set this to any value to disable headless mode in your selenium driver, if it supports it. Otherwise, it defaults to headless.

For example, if you want to watch the browser run instead of letting it run headless, start selenium with:

docker run \
    -p 4444:4444 \
    -p 5900:5900 \
    -v /dev/shm:/dev/shm \
    selenium/standalone-firefox-debug

That’ll launch Selenium on your local machine with two open ports. Port 4444 is the image’s default for accessing the WebDriver; port 5900 lets you connect a VNC viewer to watch progress when the SELENIUM_VISIBLE variable is set.

Once you have selenium running like that, you can do a test like:

WEBDRIVER_CONN='http://localhost:4444/wd/hub' \
    SELENIUM_VISIBLE=yes \
    python sample_caller.py -c juriscraper.opinions.united_states.state.kan_p

Kansas’s precedential scraper uses a webdriver. If you do this and watch selenium, you should see it in action.

Contributing

We welcome contributions! If you’d like to get involved, please take a look at our CONTRIBUTING.md guide for instructions on setting up your environment, running tests, and more.

License

Juriscraper is licensed under the permissive BSD license.

