
An API to scrape American court websites for metadata.


What is This?

Juriscraper is a scraper library started many years ago that gathers judicial opinions, oral arguments, and PACER data in the American court system. It is currently able to scrape:

  • a variety of pages and reports within the PACER system

  • opinions from all major federal appellate courts

  • opinions from the courts of last resort of all states except Georgia (typically called the state’s “Supreme Court”)

  • oral arguments from all appellate federal courts that offer them

Juriscraper is part of a two-part system. The second part is your code, which calls Juriscraper. Your code is responsible for calling a scraper and for downloading and saving its results. A reference implementation of the caller has been developed and is in use at CourtListener.com. The code for that caller can be found here. There is also a basic sample caller included in Juriscraper that can be used for testing or as a starting point when developing your own.
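The division of labor can be sketched as follows. The `StubSite` class below is a hypothetical stand-in for a real scraper module (the real ones live under paths like `juriscraper.opinions.united_states.federal_appellate`); its attribute names are illustrative assumptions, not a verified API — see `sample_caller.py` in the package for the real pattern.

```python
# Sketch of a minimal caller. StubSite stands in for a real Juriscraper
# Site object; case_names/download_urls are assumed attribute names.

class StubSite:
    """Hypothetical stand-in for a Juriscraper Site object."""
    def parse(self):
        # A real Site would fetch and parse the court's website here.
        self.case_names = ["Smith v. Jones"]
        self.download_urls = ["https://example.com/opinion.pdf"]
        return self

def run_caller(site):
    """Your code: call the scraper, then download and save its results."""
    site.parse()
    results = list(zip(site.case_names, site.download_urls))
    for name, url in results:
        # In real use you would fetch `url` and persist the document.
        print(f"{name}: {url}")
    return results

run_caller(StubSite())
```

The point of the split is that Juriscraper only knows how to extract data from court sites; what to do with that data (store it, index it, deduplicate it) is entirely the caller's decision.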

Some of the design goals for this project are:

  • extensibility to support video, oral argument audio, etc.

  • extensibility to support geographies (US, Cuba, Mexico, California)

  • Mime type identification through magic numbers

  • Generalized architecture with minimal code repetition

  • XPath-based scraping powered by lxml’s html parser

  • return all metadata available on court websites (the caller can pick what it needs)

  • no need for a database

  • clear log levels (DEBUG, INFO, WARN, CRITICAL)

  • as friendly as possible to court websites
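The magic-number goal above means identifying a file's MIME type from its leading bytes rather than trusting the URL or the server's Content-Type header. A self-contained sketch of the technique (the signatures are well-known file magics; this is not Juriscraper's actual implementation):

```python
# Illustrative magic-number MIME detection: match the first bytes of a
# payload against known file signatures.

MAGIC_NUMBERS = {
    b"%PDF": "application/pdf",
    b"\xd0\xcf\x11\xe0": "application/msword",  # OLE2 container (legacy .doc)
    b"PK\x03\x04": "application/zip",           # ZIP; also OOXML formats like .docx
}

def detect_mime(payload: bytes, default: str = "application/octet-stream") -> str:
    """Return the MIME type implied by the payload's magic number."""
    for magic, mime in MAGIC_NUMBERS.items():
        if payload.startswith(magic):
            return mime
    return default

print(detect_mime(b"%PDF-1.7 ..."))  # application/pdf
```

This matters for court scraping because many sites serve PDFs from URLs with no extension, or mislabel them in response headers.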

Installation & Dependencies

First step: Install Python 3.9+, then:

Install the dependencies

On Debian or Ubuntu-based distributions:

sudo apt-get install libxml2-dev libxslt-dev libyaml-dev

On Arch-based distributions:

sudo pacman -S libxml2 libxslt libyaml

On macOS with Homebrew <https://brew.sh>:

brew install libyaml

Then install the code

pip install juriscraper

You can set an environment variable to control where your logs are stashed. This step can be skipped; if the path exists on the filesystem, /var/log/juriscraper/debug.log is used as the default:

export JURISCRAPER_LOG=/path/to/your/log.txt
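The lookup described above can be sketched with the standard library. Juriscraper resolves this internally; this is an illustration of the precedence, not its exact code:

```python
# Prefer the JURISCRAPER_LOG environment variable; otherwise fall back
# to the documented default path.
import os

DEFAULT_LOG = "/var/log/juriscraper/debug.log"

def resolve_log_path() -> str:
    """Return the log path from the environment, or the default."""
    return os.environ.get("JURISCRAPER_LOG", DEFAULT_LOG)

os.environ["JURISCRAPER_LOG"] = "/tmp/juriscraper.log"
print(resolve_log_path())  # /tmp/juriscraper.log
```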

Finally, set up your WebDriver

Some websites are too difficult to crawl without an automated WebDriver. For these, Juriscraper either uses a locally installed copy of geckodriver or can be configured to connect to a remote WebDriver. If you prefer the local installation, you can download the Firefox geckodriver:

# choose OS compatible package from:
#   https://github.com/mozilla/geckodriver/releases/tag/v0.26.0
# un-tar/zip your download
sudo mv geckodriver /usr/local/bin

If you prefer to use a remote WebDriver, such as Selenium’s docker image, you can configure it with the following variables:

WEBDRIVER_CONN: Use this to set the connection string to your remote WebDriver. By default, this is local, meaning Juriscraper will look for a local installation of geckodriver. Instead, you can set this to something like 'http://YOUR_DOCKER_IP:4444/wd/hub', which will switch it to using a remote driver and connect it to that location.

SELENIUM_VISIBLE: Set this to any value to disable headless mode in your selenium driver, if it supports it. Otherwise, it defaults to headless.
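How the two variables interact can be sketched with the standard library. The function below returns a plain description of the driver that would be built (the real code constructs a Selenium driver instead); the shape of the returned dict is an assumption for illustration only:

```python
# Sketch of the WEBDRIVER_CONN / SELENIUM_VISIBLE precedence:
# "local" means a local geckodriver; any other value is treated as a
# remote connection string. Headless is the default unless
# SELENIUM_VISIBLE is set to anything at all.
import os

def webdriver_config(env=os.environ) -> dict:
    conn = env.get("WEBDRIVER_CONN", "local")
    return {
        "remote": conn != "local",
        "connection": conn,
        "headless": "SELENIUM_VISIBLE" not in env,
    }

cfg = webdriver_config({"WEBDRIVER_CONN": "http://localhost:4444/wd/hub",
                        "SELENIUM_VISIBLE": "yes"})
print(cfg)
```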

For example, if you want to watch the browser run rather than having it hidden, you can start selenium with:

docker run \
    -p 4444:4444 \
    -p 5900:5900 \
    -v /dev/shm:/dev/shm \
    selenium/standalone-firefox-debug

That’ll launch it on your local machine with two open ports. Port 4444 is the image’s default for accessing the WebDriver. Port 5900 accepts connections from a VNC viewer, which lets you watch progress when the SELENIUM_VISIBLE variable is set.

Once you have selenium running like that, you can do a test like:

WEBDRIVER_CONN='http://localhost:4444/wd/hub' \
    SELENIUM_VISIBLE=yes \
    python sample_caller.py -c juriscraper.opinions.united_states.state.kan_p

Kansas’s precedential scraper uses a WebDriver. If you run this and watch selenium, you should see it in action.

Contributing

We welcome contributions! If you’d like to get involved, please take a look at our CONTRIBUTING.md guide for instructions on setting up your environment, running tests, and more.

License

Juriscraper is licensed under the permissive BSD license.

