Tool for extracting legal citations from text strings.

eyecite

eyecite is an open source tool for extracting legal citations from text. It is used, among other things, to process millions of legal documents in the collections of CourtListener and Harvard’s Caselaw Access Project, and has been developed in collaboration with both projects.

eyecite recognizes a wide variety of citations commonly appearing in American legal decisions, including:

  • full case: Bush v. Gore, 531 U.S. 98, 99-100 (2000)

  • reference: In Gore, the Supreme Court...

  • short case: 531 U.S., at 99

  • statutory: Mass. Gen. Laws ch. 1, § 2

  • law journal: 1 Minn. L. Rev. 1

  • supra: Bush, supra, at 100

  • id.: Id., at 101

All contributors, corrections, and additions are welcome!

If you use eyecite for your research, please consider citing our paper:

@article{eyecite,
    title = {eyecite: A Tool for Parsing Legal Citations},
    author = {Cushman, Jack and Dahl, Matthew and Lissner, Michael},
    year = {2021},
    journal = {Journal of Open Source Software},
    volume = {6},
    number = {66},
    pages = {3617},
    url = {https://doi.org/10.21105/joss.03617},
}

Functionality

eyecite offers four core functions:

  • Extraction: Recognize and extract citations from text, using a database that has been trained on over 55 million existing citations (see reporters_db for all of the citation patterns eyecite looks for).

  • Aggregation: Aggregate citations with common references (e.g., supra and id. citations) based on their logical antecedents.

  • Annotation: Annotate citation-laden text with custom markup surrounding each citation, using a fast diffing algorithm.

  • Cleaning: Clean and pre-process text for easy use with eyecite.

Read on below for how to get started quickly or for a short tutorial in using eyecite.

Contributions & Support

Please see the issues list on GitHub for things we need, or start a conversation if you have questions or need support.

If you are fixing bugs or adding features, before you make your first contribution, we’ll need a signed contributor license agreement. See the template in the root of the repo for how to get that taken care of.

API

The API documentation is located here:

https://freelawproject.github.io/eyecite/

It is autogenerated whenever we release a new version. For now, we do not host documentation for old versions of the API, but it can be browsed in the gh-pages branch if needed.

Quickstart

Install eyecite:

pip install eyecite

Here’s a short example of extracting citations and their metadata from text using eyecite’s main get_citations() function:

from eyecite import get_citations

text = """
    Mass. Gen. Laws ch. 1, § 2 (West 1999) (barring ...).
    Foo v. Bar, 1 U.S. 2, 3-4 (1999) (overruling ...).
    Id. at 3.
    Foo, supra, at 5.
"""

get_citations(text)

# returns:
[
    FullLawCitation(
        'Mass. Gen. Laws ch. 1, § 2',
        groups={'reporter': 'Mass. Gen. Laws', 'chapter': '1', 'section': '2'},
        metadata=Metadata(parenthetical='barring ...', pin_cite=None, year='1999', publisher='West', ...)
    ),
    FullCaseCitation(
        '1 U.S. 2',
        groups={'volume': '1', 'reporter': 'U.S.', 'page': '2'},
        metadata=Metadata(parenthetical='overruling ...', pin_cite='3-4', year='1999', court='scotus', plaintiff='Foo', defendant='Bar,', ...)
    ),
    IdCitation(
        'Id.',
        metadata=Metadata(pin_cite='at 3')
    ),
    SupraCitation(
        'supra,',
        metadata=Metadata(antecedent_guess='Foo', pin_cite='at 5', ...)
    )
]

Tutorial

For a more full-featured walkthrough of how to use all of eyecite’s functionality, please see the tutorial.

Documentation

eyecite’s full API is documented here, but here are details regarding its four core functions, its tokenization logic, and its debugging tools.

Extracting Citations

get_citations(), the main executable function, takes five parameters.

  1. plain_text ==> str, default '': The text to parse. If the text has markup, it’s better to use the markup_text argument to get enhanced extraction. One of plain_text or markup_text must be passed as input.

  2. remove_ambiguous ==> bool, default False: Whether to remove citations that might refer to more than one reporter and can’t be narrowed down by date.

  3. tokenizer ==> Tokenizer, default eyecite.tokenizers.default_tokenizer: An instance of a Tokenizer object (see “Tokenizers” below).

  4. markup_text ==> str, default '': Optional XML or HTML source text that will be used to extract ReferenceCitations or help identify case names using markup tags.

  5. clean_steps ==> list, default None: A list of callables or name strings of functions in clean.py, used to clean the input text.

Resolving Reference Citations

eyecite supports a two-step process for extracting and resolving reference citations. This improves handling of citations that refer to a previously cited case without repeating its full case name or citation.

Reference citations, such as “Theatre Enterprises at 552”, can be difficult to extract accurately because they lack a full case name, for example when a judge is citing Theatre Enterprises, Inc. v. Paramount Film Distributing Corp., 346 U. S. 537, 541 (1954). To address this, eyecite allows for an initial citation extraction followed by a secondary reference resolution step. If you have an external database (e.g., CourtListener) that provides resolved case names, you can use this feature to enhance citation finding:

from eyecite import get_citations
from eyecite.find import extract_reference_citations
from eyecite.helpers import filter_citations

plain_text = (
    "quoting Theatre Enterprises, Inc. v. Paramount Film Distributing Corp., 346 U. S. 537, 541 (1954); "
    "alterations in original. Thus, the District Court understood that allegations of "
    "parallel business conduct, taken alone, do not state a claim under § 1; "
    "plaintiffs must allege additional facts that “ten to exclude independent "
    "self-interested conduct as an As Theatre Enterprises at 552 held, parallel"
    )


# Step 1: Extract full citations
citations = get_citations(plain_text)

# Step 2: Resolve the case name from an external database or prior knowledge
citations[0].metadata.resolved_case_name_short = "Theatre Enterprises"

# Step 3: Extract reference citations using the resolved name
references = extract_reference_citations(citations[0], plain_text)

# Step 4: Filter and merge citations
new_citations = filter_citations(citations + references)

Keep in mind that this feature requires an external database or heuristic method to resolve the short case name before extracting reference citations a second time.
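
To see why the resolved short name matters, here is a rough sketch that uses only the standard library, not eyecite. It searches the text for a known short name followed by "at <page>". The function name find_reference_mentions and the pattern are assumptions made for this illustration; eyecite's own matching is more involved.

```python
import re


def find_reference_mentions(short_name: str, text: str) -> list[tuple[int, int, str]]:
    """Return (start, end, page) for each '<short_name> at <page>' in text.

    Hypothetical helper for illustration only; it is not part of eyecite.
    """
    pattern = re.compile(re.escape(short_name) + r",? at (\d+)")
    return [(m.start(), m.end(), m.group(1)) for m in pattern.finditer(text)]


text = (
    "quoting Theatre Enterprises, Inc. v. Paramount Film Distributing Corp., "
    "346 U. S. 537, 541 (1954). As Theatre Enterprises at 552 held, parallel"
)

# Without the short name there is nothing to anchor the search on.
mentions = find_reference_mentions("Theatre Enterprises", text)
for start, end, page in mentions:
    print(text[start:end], "-> page", page)
```

The full citation at the start of the text is not matched, because the short name there is followed by ", Inc." rather than a page reference.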

Cleaning Input Text

For a given citation text such as “… 1 Baldwin’s Rep. 1 …”, you can either clean the text yourself and pass it in the plain_text argument without clean_steps, or pass the unprocessed text along with a list of cleaners in clean_steps. Either way, the text that citations are extracted from should be clean, meaning:

  • Spaces will be single space characters, not multiple spaces or other whitespace.

  • Quotes and hyphens will be standard quote and hyphen characters.

  • No junk such as HTML tags inside the citation.

The cleanup is done via clean_text:

from eyecite import clean_text, get_citations

source_text = '<p>foo   1  U.S.  1   </p>'
# my_func stands for any custom cleaner: a function taking a string and returning a string
plain_text = clean_text(source_text, ['html', 'inline_whitespace', my_func])
found_citations = get_citations(plain_text)

See the Annotating Citations section for how to insert links into the original text using citations extracted from the cleaned text.

clean_text currently accepts these values as cleaners:

  1. inline_whitespace: replace all runs of tab and space characters with a single space character

  2. all_whitespace: replace all runs of any whitespace character with a single space character

  3. underscores: remove two or more underscores, a common error in text extracted from PDFs

  4. html: remove non-visible HTML content using the lxml library

  5. Custom function: any function taking a string and returning a string.
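
The cleaners described above can be approximated with a few regular expressions. The sketch below is a standard-library re-implementation of the documented behavior, not eyecite's own code. apply_cleaners and strip_page_markers are hypothetical names introduced here, and the html cleaner is omitted because it relies on lxml.

```python
import re
from typing import Callable, Iterable


def inline_whitespace(text: str) -> str:
    """Replace runs of tabs and spaces with a single space."""
    return re.sub(r"[ \t]+", " ", text)


def all_whitespace(text: str) -> str:
    """Replace runs of any whitespace, including newlines, with a single space."""
    return re.sub(r"\s+", " ", text)


def underscores(text: str) -> str:
    """Remove runs of two or more underscores."""
    return re.sub(r"__+", "", text)


def strip_page_markers(text: str) -> str:
    """Hypothetical custom cleaner: drop page markers such as '[*12]'."""
    return re.sub(r"\[\*\d+\]", "", text)


def apply_cleaners(text: str, steps: Iterable[Callable[[str], str]]) -> str:
    """Apply each cleaner in order, feeding the output of one into the next."""
    for step in steps:
        text = step(text)
    return text


raw = "foo \t 1  U.S.\n 1 ____ [*12]bar"
cleaned = apply_cleaners(raw, [underscores, strip_page_markers, all_whitespace])
print(cleaned)
```

Order matters in a pipeline like this: running the whitespace cleaner last collapses the gaps left behind by the earlier removals.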

Annotating Citations

For simple plain text, you can insert links to citations using the annotate_citations function:

from eyecite import get_citations, annotate_citations

plain_text = 'bob lissner v. test 1 U.S. 12, 347-348 (4th Cir. 1982)'
citations = get_citations(plain_text)
linked_text = annotate_citations(plain_text, [[c.span(), "<a>", "</a>"] for c in citations])

returns:
'bob lissner v. test <a>1 U.S. 12</a>, 347-348 (4th Cir. 1982)'

Each citation returned by get_citations keeps track of where it was found in the source text. As a result, annotate_citations must be called with the same cleaned text used by get_citations to extract citations. If you do not, the offsets returned by the citation’s span method will not align with the text, and your annotations will be in the wrong place.

If you want to clean text and then insert annotations into the original text, you can pass the original text in as source_text:

from eyecite import get_citations, annotate_citations, clean_text

source_text = '<p>bob lissner v. <i>test   1 U.S.</i> 12,   347-348 (4th Cir. 1982)</p>'
plain_text = clean_text(source_text, ['html', 'inline_whitespace'])
citations = get_citations(plain_text)
linked_text = annotate_citations(plain_text, [[c.span(), "<a>", "</a>"] for c in citations], source_text=source_text)

returns:
'<p>bob lissner v. <i>test   <a>1 U.S.</i> 12</a>,   347-348 (4th Cir. 1982)</p>'

The above example extracts citations from plain_text and applies them to source_text, using a diffing algorithm to insert annotations in the correct locations in the original text.
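
To give a feel for how a diff can carry offsets from cleaned text back to the original, here is a minimal sketch using difflib from the standard library. It is not eyecite's implementation, which may use a different diffing strategy; map_offset and annotate_source are hypothetical names for this illustration.

```python
from difflib import SequenceMatcher


def map_offset(plain: str, source: str, offset: int) -> int:
    """Map the index of a character in `plain` to its index in `source`."""
    matcher = SequenceMatcher(None, plain, source, autojunk=False)
    for block in matcher.get_matching_blocks():
        # block.a indexes into plain, block.b into source
        if block.a <= offset < block.a + block.size:
            return block.b + (offset - block.a)
    raise ValueError(f"offset {offset} has no counterpart in source")


def annotate_source(plain, source, span, before, after):
    """Insert `before`/`after` around the part of `source` matching plain[span]."""
    start, end = span
    src_start = map_offset(plain, source, start)
    # `end` is exclusive, so map the last character of the span and step past it.
    src_end = map_offset(plain, source, end - 1) + 1
    return source[:src_start] + before + source[src_start:src_end] + after + source[src_end:]


source = "<p>foo   1 U.S. 12</p>"
plain = "foo 1 U.S. 12"  # the cleaned text: tags removed, whitespace collapsed

cite = "1 U.S. 12"
start = plain.index(cite)
span = (start, start + len(cite))

result = annotate_source(plain, source, span, "<a>", "</a>")
print(result)
```

The span is computed against the cleaned text only; the diff is what places the markup in the original.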

There is also a full_span attribute that can be used to get the indexes of the full citation, including the pre- and post-citation attributes.

Wrapping HTML Tags

Note that the above example includes mismatched HTML tags: “<a>1 U.S.</i> 12</a>”. To specify handling for unbalanced tags, use the unbalanced_tags parameter:

  • unbalanced_tags="skip": annotations that would result in unbalanced tags will not be inserted. A simple correction for style tags is attempted. This is a common case when finding ReferenceCitations or IdCitations. See utils.maybe_balance_style_tags

  • unbalanced_tags="wrap": unbalanced tags will be wrapped, resulting in <a>1 U.S.</a></i><a> 12</a>

Important: unbalanced_tags="wrap" uses a simple regular expression and will only work for HTML where angle brackets are properly escaped, such as the HTML emitted by lxml.html.tostring. It is intended for regularly formatted documents such as case text published by courts. It may have unpredictable results for deliberately-constructed challenging inputs such as citations containing partial HTML comments or <pre> tags.
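
The idea behind "wrap" can be sketched in a few lines: close the annotation before every tag inside the span and reopen it afterwards. This is a simplified standard-library illustration under the same assumption that angle brackets are properly escaped. It is not the regular expression eyecite uses, and wrap_annotation is a hypothetical name.

```python
import re

# Matches any tag-like run; assumes literal angle brackets in text are escaped.
TAG = re.compile(r"<[^>]+>")


def wrap_annotation(span_text: str, before: str, after: str) -> str:
    """Close the annotation before each tag inside the span and reopen it after."""
    wrapped = TAG.sub(lambda m: after + m.group(0) + before, span_text)
    return before + wrapped + after


wrapped = wrap_annotation("1 U.S.</i> 12", "<a>", "</a>")
print(wrapped)
```

Applied to the span from the example above, this reproduces the wrapped form shown in the list.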

Customizing Annotation

If inserting text before and after isn’t sufficient, supply a callable under the annotator parameter that takes (before, span_text, after) and returns the annotated text:

def annotator(before, span_text, after):
    return before + span_text.lower() + after
linked_text = annotate_citations(plain_text, [[c.span(), "<a>", "</a>"] for c in citations], annotator=annotator)

returns:
'bob lissner v. test <a>1 u.s. 12</a>, 347-348 (4th Cir. 1982)'

Resolving Citations

Once you have extracted citations from a document, you may wish to resolve them to their common references. To do so, just pass the results of get_citations() into resolve_citations(). This function will do its best to resolve each “full,” “short form,” “supra,” and “id” citation to a common Resource object, returning a dictionary that maps resources to lists of associated citations:

from eyecite import get_citations, resolve_citations

text = 'first citation: 1 U.S. 12. second citation: 2 F.3d 2. third citation: Id.'
found_citations = get_citations(text)
resolved_citations = resolve_citations(found_citations)

returns (pseudo):
{
    <Resource object>: [FullCaseCitation('1 U.S. 12')],
    <Resource object>: [FullCaseCitation('2 F.3d 2'), IdCitation('Id.')]
}
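
The shape of that result can be mimicked with a toy model in plain Python. It is a deliberately simplified illustration, not eyecite's resolution logic: each full citation opens its own group, and an "id" citation joins the group of the citation immediately before it. The cluster function and the (kind, text) tuples are assumptions made for this sketch; supra and short form citations are left out.

```python
def cluster(citations):
    """Group (kind, text) pairs, given in document order, by the full citation they point to."""
    resolved = {}
    last_key = None  # the most recently resolved full citation
    for kind, text in citations:
        if kind == "full":
            last_key = text
            resolved.setdefault(text, []).append((kind, text))
        elif kind == "id" and last_key is not None:
            # "Id." refers back to whatever was cited immediately before it.
            resolved[last_key].append((kind, text))
    return resolved


found = [("full", "1 U.S. 12"), ("full", "2 F.3d 2"), ("id", "Id.")]
groups = cluster(found)
for key, members in groups.items():
    print(key, "->", [text for _, text in members])
```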

Importantly, eyecite performs these resolutions using only its immanent knowledge about each citation’s textual representation. If you want to perform more sophisticated resolution (e.g., by augmenting each citation with information from a third-party API), simply pass custom resolve_id_citation(), resolve_supra_citation(), resolve_shortcase_citation(), and resolve_full_citation() functions to resolve_citations() as keyword arguments. You can also configure those functions to return a more complex resource object (such as a Django model), so long as that object inherits the eyecite.models.ResourceType type (which simply requires hashability). For example, you might implement a custom full citation resolution function as follows, using the default resolution logic as a fallback:

from eyecite import resolve_citations
from eyecite.resolve import resolve_full_citation

def my_resolve(full_cite):
    # special handling for resolution of known cases in our database
    # (MyOpinion stands in for your own model and lookup)
    resource = MyOpinion.objects.get(full_cite)
    if resource:
        return resource
    # allow normal clustering of other citations
    return resolve_full_citation(full_cite)

resolve_citations(citations, resolve_full_citation=my_resolve)

returns (pseudo):
{
    <MyOpinion object>: [<full_cite>, <short_cite>, <id_cite>],
    <Resource object>: [<full cite>, <short cite>],
}

Tokenizers

Internally, eyecite works by applying a list of regular expressions to the source text to convert it to a list of tokens:

In [1]: from eyecite.tokenizers import default_tokenizer

In [2]: list(default_tokenizer.tokenize("Foo v. Bar, 123 U.S. 456 (2016). Id. at 457."))
Out[2]:
['Foo',
 StopWordToken(data='v.', ...),
 'Bar,',
 CitationToken(data='123 U.S. 456', volume='123', reporter='U.S.', page='456', ...),
 '(2016).',
 IdToken(data='Id.', ...),
 'at',
 '457.']

Tokens are then scanned to determine values like the citation year or case name for citation resolution.
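
The regex-to-token step can be illustrated with a toy tokenizer built on the standard library. The three patterns and the Token class below are assumptions made for this sketch and are far simpler than the patterns eyecite derives from reporters_db.

```python
import re
from dataclasses import dataclass


@dataclass
class Token:
    kind: str
    data: str


# Toy patterns for illustration only.
PATTERNS = [
    ("citation", re.compile(r"\d+ U\.S\. \d+")),
    ("id", re.compile(r"\bId\.")),
    ("stop_word", re.compile(r"\bv\.")),
]


def tokenize(text: str) -> list:
    """Return a list mixing plain words (str) and Token objects, in text order."""
    matches = []
    for kind, pattern in PATTERNS:
        for m in pattern.finditer(text):
            matches.append((m.start(), m.end(), kind))
    matches.sort()

    tokens, pos = [], 0
    for start, end, kind in matches:
        if start < pos:
            continue  # skip a match that overlaps one already consumed
        tokens.extend(text[pos:start].split())  # plain words between matches
        tokens.append(Token(kind, text[start:end]))
        pos = end
    tokens.extend(text[pos:].split())
    return tokens


tokens = tokenize("Foo v. Bar, 123 U.S. 456 (2016). Id. at 457.")
for token in tokens:
    print(token)
```

Words that match no pattern stay as plain strings, mirroring the mix of strings and token objects in the output above.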

Alternate tokenizers can be substituted by providing a tokenizer instance to get_citations():

from eyecite import get_citations
from eyecite.tokenizers import HyperscanTokenizer
hyperscan_tokenizer = HyperscanTokenizer(cache_dir='.hyperscan')
cites = get_citations(text, tokenizer=hyperscan_tokenizer)

test_FindTest.py includes a simplified example of using a custom tokenizer that uses modified regular expressions to extract citations with OCR errors.

eyecite ships with two tokenizers:

AhocorasickTokenizer (default)

The default tokenizer uses the pyahocorasick library to filter down eyecite’s list of extractor regexes. It then performs extraction using the builtin re library.

HyperscanTokenizer

The alternate HyperscanTokenizer compiles all extraction regexes into a hyperscan database so they can be extracted in a single pass. This is far faster than the default tokenizer (exactly how much faster depends on how many citation formats are included in the target text), but requires the optional dependency hyperscan, which you can install with pip:

pip install hyperscan

Compiling the hyperscan database takes several seconds, so short-running scripts may want to provide a cache directory where the database can be stored. The directory should be writeable only by the user:

hyperscan_tokenizer = HyperscanTokenizer(cache_dir='.hyperscan')
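
One way to prepare such a directory on a POSIX system is sketched below with the standard library. make_private_cache_dir is a hypothetical helper, and the temporary base directory is used only to keep the example self-contained; a real script would use a stable location such as '.hyperscan'.

```python
import os
import stat
import tempfile


def make_private_cache_dir(path: str) -> str:
    """Create `path` if needed and restrict it to the owning user (mode 0700)."""
    os.makedirs(path, mode=0o700, exist_ok=True)
    # makedirs honors the umask and leaves existing directories untouched,
    # so set the permissions explicitly as well.
    os.chmod(path, 0o700)
    return path


base = tempfile.mkdtemp()  # stand-in for the project directory
cache_dir = make_private_cache_dir(os.path.join(base, ".hyperscan"))

mode = stat.S_IMODE(os.stat(cache_dir).st_mode)
print(cache_dir, oct(mode))
```

The resulting path can then be passed as cache_dir when constructing the tokenizer, as in the line above.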

Debugging

If you want to see what metadata eyecite is able to extract for each citation, you can use dump_citations. This is primarily useful for developing eyecite, but may also be useful for exploring what data is available to you:

In [1]: from eyecite import dump_citations, get_citations

In [2]: text="Mass. Gen. Laws ch. 1, § 2. Foo v. Bar, 1 U.S. 2, 3-4 (1999). Id. at 3. Foo, supra, at 5."

In [3]: cites=get_citations(text)

In [4]: print(dump_citations(get_citations(text), text))
FullLawCitation: Mass. Gen. Laws ch. 1, § 2. Foo v. Bar, 1 U.S. 2, 3-4 (1
  * groups
    * reporter='Mass. Gen. Laws'
    * chapter='1'
    * section='2'
FullCaseCitation: Laws ch. 1, § 2. Foo v. Bar, 1 U.S. 2, 3-4 (1999). Id. at 3. Foo, s
  * groups
    * volume='1'
    * reporter='U.S.'
    * page='2'
  * metadata
    * pin_cite='3-4'
    * year='1999'
    * court='scotus'
    * plaintiff='Foo'
    * defendant='Bar,'
  * year=1999
IdCitation: v. Bar, 1 U.S. 2, 3-4 (1999). Id. at 3. Foo, supra, at 5.
  * metadata
    * pin_cite='at 3'
SupraCitation: 2, 3-4 (1999). Id. at 3. Foo, supra, at 5.
  * metadata
    * antecedent_guess='Foo'
    * pin_cite='at 5'

In the real terminal, the span() of each extracted citation will be highlighted. You can use the context_chars=30 parameter to control how much text is shown before and after.

Installation

With Pip:

$ pip install eyecite

Or, to install the latest in-development version from GitHub:

pip install https://github.com/freelawproject/eyecite/archive/main.zip#egg=eyecite

Deployment

  1. Update CHANGES.md.

  2. Update version info in pyproject.toml by running uv version --bump [major|minor|patch].

  3. Commit and make a pull request.

  4. Tag the merged commit with the new version number in the format vx.y.z:

    $ git tag -a v1.2.3 -m v1.2.3

  5. Push the tag:

    $ git push origin v1.2.3

The automated deployment process will then take care of the rest, publishing the new version to PyPI and building the documentation.

Testing

eyecite comes with a robust test suite of different citation strings that it is equipped to handle. Run these tests as follows:

python3 -m unittest discover -s tests -p 'test_*.py'

If you would like to create mock citation objects to assist you in writing your own local tests, import and use the following functions for convenience:

from eyecite.test_factories import (
    case_citation,
    id_citation,
    supra_citation,
    unknown_citation,
)

Development

When a pull request is opened with changes to eyecite, a GitHub workflow will automatically trigger. The workflow, benchmark.yml, will test improvements in accuracy and speed against the current main branch.

The results are committed to an artifacts branch, and a continually updated comment on the PR reports the output.

License

This repository is available under the permissive BSD license, making it easy and safe to incorporate in your own libraries.

Pull and feature requests welcome. Online editing in GitHub is possible (and easy!).

