An API to scrape American court websites for metadata.
Juriscraper is a scraper library, started several years ago, that gathers judicial opinions and oral arguments from courts across the American court system, at both the federal and state level.
Juriscraper is part of a two-part system. The second part is your code, which calls Juriscraper. Your code is responsible for calling a scraper and for downloading and saving its results. A reference implementation of the caller has been developed and is in use at CourtListener.com. The code for that caller can be found here. There is also a basic sample caller included in Juriscraper that can be used for testing or as a starting point when developing your own.
Some of the design goals for this project are:
First step: Install Python 2.7.x, then:
# -- Install the dependencies
# On Ubuntu/Debian Linux:
sudo apt-get install libxml2-dev libxslt-dev libyaml-dev
# On macOS with Homebrew <https://brew.sh>:
brew install libyaml

# -- Install PhantomJS
# On Ubuntu/Debian Linux:
wget https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-1.9.7-linux-x86_64.tar.bz2
tar -x -f phantomjs-1.9.7-linux-x86_64.tar.bz2
sudo mkdir -p /usr/local/phantomjs
sudo mv phantomjs-1.9.7-linux-x86_64/bin/phantomjs /usr/local/phantomjs
rm -r phantomjs-1.9.7*  # Cleanup
# On macOS with Homebrew:
brew install phantomjs

# -- Finally, install the code.
pip install juriscraper

# Create a directory for logs (this can be skipped, and no logs will be created)
sudo mkdir -p /var/log/juriscraper
We use a few tools pretty frequently while building these scrapers. The first is a sister project called xpath-tester that helps debug XPath queries. xpath-tester can be installed locally in a few minutes.
We also generally use IntelliJ with PyCharm installed. These tools are useful because they provide syntax highlighting, code inspection, and PyLint integration. A license for IntelliJ is available to interested and proven contributors.
For scrapers to be merged:
When you’re ready to develop a scraper, get in touch, and we’ll find you a scraper that makes sense and that nobody else is working on. We have a wiki list of courts that you can browse yourself. There are templates for new scrapers here (for opinions) and here (for oral arguments).
When you’re done with your scraper, fork this repository, push your changes into your fork, and then send a pull request with your changes. Remember to update the __init__.py file as well, since it contains a list of completed scrapers.
Before we can accept any changes from any contributor, we need a signed and completed Contributor License Agreement. You can find this agreement in the root of the repository. While an annoying bit of paperwork, this license is for your protection as a Contributor as well as the protection of Free Law Project and our users; it does not change your rights to use your own Contributions for any other purpose.
To get set up as a developer of Juriscraper, you’ll want to install the code from git. To do that, install the dependencies and phantomjs as described above. Instead of installing Juriscraper via pip, do the following:
git clone https://github.com/freelawproject/juriscraper.git .
pip install -r requirements.txt
python setup.py test
The scrapers are written in Python, and can scrape a court as follows:
from juriscraper.opinions.united_states.federal_appellate import ca1

# Create a site object
site = ca1.Site()

# Populate it with data, downloading the page if necessary
site.parse()

# Print out the object
print str(site)

# Print it out as JSON
print site.to_json()

# Iterate over the items
for opinion in site:
    print opinion
That will print out all the current metadata for a site, including links to the objects you wish to download (typically opinions or oral arguments). If you download those objects, we also recommend running the _cleanup_content() method against the items that you download (PDFs, HTML, etc.). See sample_caller.py for an example of how it is used and for an explanation of what _cleanup_content() does.
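As a rough illustration of the caller's responsibilities described above, here is a self-contained sketch. StubSite, run_caller, and the whitespace-stripping cleanup are all invented for this example so it runs offline; a real caller would use an actual Site object (see sample_caller.py), and the real _cleanup_content() does more than trim whitespace.

```python
# A minimal, self-contained sketch of a caller: iterate over a site's
# items, download each one, clean it up, and keep the result.
class StubSite:
    """Stand-in for a real juriscraper Site object (hypothetical)."""
    def __init__(self):
        self._items = [
            {"name": "Doe v. Roe", "download_url": "http://example.com/op.pdf"},
        ]

    def __iter__(self):
        return iter(self._items)

    def _cleanup_content(self, content):
        # The real method normalizes downloaded content; stripping
        # whitespace here is only a placeholder.
        return content.strip()


def run_caller(site, fetch):
    """Download each item, clean it, and return the saved results."""
    saved = {}
    for item in site:
        raw = fetch(item["download_url"])       # the caller downloads the binary
        saved[item["name"]] = site._cleanup_content(raw)
    return saved


# Fake fetcher so the sketch runs without network access
results = run_caller(StubSite(), lambda url: "  %PDF-1.4 ...  ")
print(results["Doe v. Roe"])  # → "%PDF-1.4 ..."
```

The split of duties matches the two-part design: Juriscraper produces metadata and download URLs, and your code owns fetching, cleanup, and storage.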
It’s also possible to iterate over all courts in a Python package, even if they’re not known before starting the scraper. For example:
# Start with an import path. This will do all federal courts.
court_id = 'juriscraper.opinions.united_states.federal'

# Import all the scrapers
scrapers = __import__(
    court_id,
    globals(),
    locals(),
    ['*']
).__all__

for scraper in scrapers:
    mod = __import__(
        '%s.%s' % (court_id, scraper),
        globals(),
        locals(),
        [scraper]
    )

    # Create a Site instance, then get the contents
    site = mod.Site()
    site.parse()
    print str(site)
This can be useful if you wish to create a command line scraper that iterates over all courts in a given jurisdiction. See lib/importer.py for an example that's used in the sample caller.
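The same enumerate-and-import pattern can also be written with the standard library's importlib and pkgutil modules. The sketch below demonstrates it against the stdlib json package purely so it runs without Juriscraper installed; substituting a path like 'juriscraper.opinions.united_states.federal' gives the equivalent of the loop above.

```python
import importlib
import pkgutil

# Enumerate the submodules of a package, then import each one.
# json is used here only so the example runs anywhere; with Juriscraper
# installed you would pass its court package path instead.
package = importlib.import_module('json')
names = [m.name for m in pkgutil.iter_modules(package.__path__)]
modules = [importlib.import_module('json.' + n) for n in names]
print(names)
```

pkgutil.iter_modules discovers submodules from the package's filesystem path, so it does not depend on the package defining __all__ the way the __import__ approach above does.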
We got that! You can (and should) run the tests with tox. This will run python setup.py test for all supported Python runtimes, iterating over all of the *_example* files and running the scrapers against them.
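A minimal tox configuration for this workflow might look like the following sketch (the project's actual tox.ini may list more interpreters and options):

```ini
[tox]
envlist = py27

[testenv]
deps = -rrequirements.txt
commands = python setup.py test
```

Each environment in envlist gets its own virtualenv, so adding a runtime later is a one-line change.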
In addition, we use Travis-CI to automatically run the tests whenever code is committed to the repository or whenever a pull request is created. You can make sure that your pull request is good to go by waiting for the automated tests to complete.
The current status of Travis CI on our master branch is:
Beyond:
- Support video, additional oral argument audio, and transcripts wherever available
- Add other countries, starting with courts issuing opinions in English
Deployment to PyPI should happen automatically via Travis CI whenever a new tag is created in GitHub on the master branch. It will fail if the version has not been updated or if the Travis CI build failed.
If you wish to create a new version manually, the process is:
Generate a distribution
python setup.py bdist_wheel
Upload the distribution
twine upload dist/* -r pypi (or pypitest)
Juriscraper is licensed under the permissive BSD license.