Tool for extracting legal citations from text strings.
eyecite
eyecite is an open source tool for extracting legal citations from text strings. Originally built for use with Courtlistener.com, it is now a freestanding package.
Its main purpose is to facilitate the conversion of raw text into structured citation entities. It includes mechanisms to recognize and extract “full” citation references (e.g., Bush v. Gore, 531 U.S. 98), “short form” references (e.g., 531 U.S., at 99), “supra” references (e.g., Bush, supra, at 100), “id.” references (e.g., Id., at 101), and “ibid.” references (e.g., Ibid.).
Further development is intended and all contributors, corrections, and additions are welcome.
Background
This project is the culmination of years of work to build a citator within Courtlistener.com. It represents the next step in that development: decoupling the parsing logic and exposing it for third-party use as a standalone Python package.
Quickstart
Feed in a raw string of text (or HTML) and receive a list of structured citation objects, in the order they appear in the text.
from eyecite import get_citations

text = 'bob lissner v. test 1 U.S. 12, 347-348 (4th Cir. 1982)'
found_citations = get_citations(text)

returns:
[FullCaseCitation(plaintiff='lissner', defendant='test', volume=1, reporter='U.S.', page='12', year=1982, extra='347-348', court='ca4', canonical_reporter='U.S.', lookup_index=0, token_index=5, reporter_found='U.S.')]
Options
get_citations(), the main executable function, takes several parameters:
do_post_citation ==> bool; whether additional, post-citation information should be extracted (e.g., the court, year, and/or date range of the citation)
do_defendant ==> bool; whether the pre-citation defendant (and possibly plaintiff) reference should be extracted
disambiguate ==> bool; whether each citation’s (possibly ambiguous) reporter should be resolved to its (unambiguous) form
tokenizer ==> Tokenizer; an instance of a Tokenizer object (see “Tokenizers” below)
Cleaning Input Text
For a given citation text such as “… 1 Baldwin’s Rep. 1 …”, eyecite expects that the text will be “clean” before being passed to get_citations. This means:
Spaces will be single space characters, not multiple spaces or other whitespace.
Quotes and hyphens will be standard quote and hyphen characters.
No junk such as HTML tags inside the citation.
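You can use clean_text to help with this:
from eyecite import clean_text, get_citations

def my_func(text):
    # example custom cleaner: any function taking a string and returning a string
    return text.strip()

source_text = '<p>foo 1 U.S. 1 </p>'
plain_text = clean_text(source_text, ['html', 'inline_whitespace', my_func])
found_citations = get_citations(plain_text)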
See the Annotating Citations section for how to insert links into the original text using citations extracted from the cleaned text.
clean_text currently accepts these values as cleaners:
inline_whitespace: replace all runs of tab and space characters with a single space character
all_whitespace: replace all runs of any whitespace character with a single space character
underscores: remove two or more underscores, a common error in text extracted from PDFs
html: remove non-visible HTML content using the lxml library
Custom function: any function taking a string and returning a string.
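As a rough illustration of what the whitespace and underscore cleaners described above do, here is a minimal sketch using only the standard library. It is not eyecite's implementation: the function bodies and the apply_cleaners helper are assumptions written for this example, and the html cleaner is left out because it depends on lxml.

```python
import re

def inline_whitespace(text: str) -> str:
    """Collapse runs of tabs and spaces into a single space."""
    return re.sub(r"[ \t]+", " ", text)

def all_whitespace(text: str) -> str:
    """Collapse runs of any whitespace character into a single space."""
    return re.sub(r"\s+", " ", text)

def underscores(text: str) -> str:
    """Remove runs of two or more underscores."""
    return re.sub(r"__+", "", text)

def apply_cleaners(text, cleaners):
    """Hypothetical helper: apply each str -> str cleaner in order."""
    for cleaner in cleaners:
        text = cleaner(text)
    return text

cleaned = apply_cleaners("1  U.S.\t\t12 ____ (1982)\n", [underscores, all_whitespace])
print(repr(cleaned))  # '1 U.S. 12 (1982) '
```

Order matters in this sketch: removing the underscores first leaves two adjacent spaces, which the whitespace cleaner then collapses.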
Annotating Citations
For simple plain text, you can insert links to citations using the annotate function:
from eyecite import get_citations, annotate

plain_text = 'bob lissner v. test 1 U.S. 12, 347-348 (4th Cir. 1982)'
citations = get_citations(plain_text)
linked_text = annotate(plain_text, [[c.span(), "<a>", "</a>"] for c in citations])

returns:
'bob lissner v. test <a>1 U.S. 12</a>, 347-348 (4th Cir. 1982)'
Each citation returned by get_citations keeps track of where it was found in the source text. As a result, annotate must be called with the same cleaned text used by get_citations to extract citations. If you do not, the offsets returned by the citation’s span method will not align with the text, and your annotations will be in the wrong place.
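To make the role of the offsets concrete, here is a minimal plain-Python sketch of span-based annotation. It is not eyecite's code: annotate_spans is a hypothetical helper, and the span is computed with str.index instead of coming from a citation's span method.

```python
def annotate_spans(text, annotations):
    """Wrap each (start, end) span of text in its before/after strings."""
    pieces = []
    cursor = 0
    # Process spans left to right so the offsets still refer to the original text.
    for (start, end), before, after in sorted(annotations, key=lambda a: a[0]):
        pieces.append(text[cursor:start])
        pieces.append(before + text[start:end] + after)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)

plain_text = "bob lissner v. test 1 U.S. 12, 347-348 (4th Cir. 1982)"
start = plain_text.index("1 U.S. 12")
span = (start, start + len("1 U.S. 12"))

linked = annotate_spans(plain_text, [[span, "<a>", "</a>"]])
print(linked)  # bob lissner v. test <a>1 U.S. 12</a>, 347-348 (4th Cir. 1982)

# The same span applied to text that differs by one extra space lands off by one.
dirty_text = "bob  lissner v. test 1 U.S. 12, 347-348 (4th Cir. 1982)"
misplaced = annotate_spans(dirty_text, [[span, "<a>", "</a>"]])
```

The second call shows why the text passed to the annotation step has to be the exact string the offsets were computed from.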
If you want to clean text and then insert annotations into the original text, you can pass the original text in as source_text:
from eyecite import get_citations, annotate, clean_text

source_text = '<p>bob lissner v. <i>test 1 U.S.</i> 12, 347-348 (4th Cir. 1982)</p>'
plain_text = clean_text(source_text, ['html', 'inline_whitespace'])
citations = get_citations(plain_text)
linked_text = annotate(plain_text, [[c.span(), "<a>", "</a>"] for c in citations], source_text=source_text)

returns:
'<p>bob lissner v. <i>test <a>1 U.S.</i> 12</a>, 347-348 (4th Cir. 1982)</p>'
The above example extracts citations from plain_text and applies them to source_text, using a diffing algorithm to insert annotations in the correct locations in the original text.
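The idea of translating offsets through a diff can be sketched with the standard library. This is an illustration only: it uses difflib.SequenceMatcher as a stand-in for whatever diffing eyecite performs, and map_offset and annotate_source are hypothetical helpers.

```python
from difflib import SequenceMatcher

def map_offset(blocks, offset):
    """Translate an index in the cleaned text to the matching index in the source."""
    for a, b, size in blocks:
        if a <= offset < a + size:
            return b + (offset - a)
    raise ValueError(f"offset {offset} has no counterpart in the source text")

def annotate_source(plain_text, source_text, span, before, after):
    """Insert before/after into source_text around a span measured in plain_text."""
    matcher = SequenceMatcher(None, plain_text, source_text, autojunk=False)
    blocks = matcher.get_matching_blocks()
    start, end = span
    src_start = map_offset(blocks, start)
    # end is exclusive, so map the last covered character and step past it.
    src_end = map_offset(blocks, end - 1) + 1
    return (source_text[:src_start] + before
            + source_text[src_start:src_end] + after
            + source_text[src_end:])

source_text = "<p>bob lissner v. <i>test 1 U.S.</i> 12, 347-348 (4th Cir. 1982)</p>"
plain_text = "bob lissner v. test 1 U.S. 12, 347-348 (4th Cir. 1982)"
start = plain_text.index("1 U.S. 12")
linked = annotate_source(plain_text, source_text, (start, start + 9), "<a>", "</a>")
print(linked)
```

As in the example above, the inserted tags can interleave with markup already present in the source, so the result is not guaranteed to be well-nested HTML.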
Customizing Annotation
If inserting text before and after isn’t sufficient, supply a callable under the annotator parameter that takes (before, span_text, after) and returns the annotated text:
def annotator(before, span_text, after):
    return before + span_text.lower() + after

linked_text = annotate(plain_text, [[c.span(), "<a>", "</a>"] for c in citations], annotator=annotator)

returns:
'bob lissner v. test <a>1 u.s. 12</a>, 347-348 (4th Cir. 1982)'
Resolving Citations
Once you have extracted citations from a document, you may wish to resolve them to their common references. To do so, just pass the results of get_citations() into resolve_citations(). This function will do its best to resolve each “full,” “short form,” “supra,” and “id” citation to a common Resource object, returning a dictionary that maps resources to lists of associated citations:
from eyecite import get_citations, resolve_citations

text = 'first citation: 1 U.S. 12. second citation: 2 F.3d 2. third citation: Id.'
found_citations = get_citations(text)
resolved_citations = resolve_citations(found_citations)

returns (pseudo):
{
    <Resource object>: [FullCaseCitation('1 U.S. 12')],
    <Resource object>: [FullCaseCitation('2 F.3d 2'), IdCitation('Id.')]
}
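The shape of that result can be illustrated with a deliberately simplified sketch. It is not eyecite's resolution logic: citations are modeled as hypothetical (kind, text) tuples, a full citation's text stands in for its resource, and an “id” citation is attached to whichever resource was resolved immediately before it.

```python
from collections import defaultdict

def resolve(citations):
    """Group (kind, text) citations under the resource each one refers to."""
    resolutions = defaultdict(list)
    last_resource = None
    for kind, text in citations:
        if kind == "full":
            resource = text  # assumption: the citation text identifies the resource
        elif kind == "id" and last_resource is not None:
            resource = last_resource  # "Id." points back at the previous resource
        else:
            continue  # nothing to attach this citation to
        resolutions[resource].append((kind, text))
        last_resource = resource
    return dict(resolutions)

found = [("full", "1 U.S. 12"), ("full", "2 F.3d 2"), ("id", "Id.")]
resolved = resolve(found)
print(resolved)
```

Here the “Id.” citation ends up in the same list as '2 F.3d 2', mirroring the pseudo-output above.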
Importantly, eyecite performs these resolutions using only its immanent knowledge about each citation’s textual representation. If you want to perform more sophisticated resolution (e.g., by augmenting each citation with information from a third-party API), simply pass custom resolve_id_citation(), resolve_supra_citation(), resolve_shortcase_citation(), and resolve_full_citation() functions to resolve_citations() as keyword arguments. You can also configure those functions to return a more complex resource object (such as a Django model), so long as that object inherits the eyecite.models.ResourceType type (which simply requires hashability). For example, you might implement a custom full citation resolution function as follows, using the default resolution logic as a fallback:
def my_resolve(full_cite):
    # special handling for resolution of known cases in our database
    resource = MyOpinion.objects.get(full_cite)
    if resource:
        return resource
    # allow normal clustering of other citations
    return resolve_full_citation(full_cite)

resolve_citations(citations, resolve_full_citation=my_resolve)

returns (pseudo):
{
    <MyOpinion object>: [<full_cite>, <short_cite>, <id_cite>],
    <Resource object>: [<full cite>, <short cite>],
}
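The hashability requirement exists because resources are used as dictionary keys. The following standalone sketch shows that pattern together with a fallback resolver. Everything in it is hypothetical: Opinion stands in for a database model, KNOWN_CASES for a database lookup, and default_resolve for the library's default logic.

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen dataclasses are hashable, so they can be dict keys
class Opinion:
    case_name: str

# Hypothetical lookup table standing in for a database query.
KNOWN_CASES = {"531 U.S. 98": Opinion("Bush v. Gore")}

def default_resolve(cite_text):
    """Stand-in for default resolution: one generic resource per citation text."""
    return ("resource", cite_text)

def my_resolve(cite_text):
    """Prefer a known case; otherwise fall back to the default behaviour."""
    return KNOWN_CASES.get(cite_text) or default_resolve(cite_text)

resolutions = {}
for cite in ["531 U.S. 98", "1 U.S. 12", "531 U.S. 98"]:
    resolutions.setdefault(my_resolve(cite), []).append(cite)

print(resolutions)
```

Both occurrences of '531 U.S. 98' land under the same Opinion key because equal frozen dataclass instances hash identically.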
Tokenizers
Internally, eyecite works by applying a list of regular expressions to the source text to convert it to a list of tokens:
In [1]: from eyecite.tokenizers import default_tokenizer

In [2]: list(default_tokenizer.tokenize("Foo v. Bar, 123 U.S. 456 (2016). Id. at 457."))
Out[2]:
['Foo',
 StopWordToken(data='v.', stop_word='v'),
 'Bar,',
 CitationToken(data='123 U.S. 456', volume='123', reporter='U.S.', page='456' ...),
 '(2016).',
 IdToken(data='Id.'),
 'at',
 '457.']
Tokens are then scanned to determine values like the citation year or case name for citation resolution.
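A toy version of this regex-driven tokenization, using only the standard library, might look like the sketch below. It is an assumption-laden illustration rather than eyecite's tokenizer: the pattern knows just two reporters, and tokens are plain tuples instead of token objects.

```python
import re

# One alternation per token type; real extractors cover far more formats.
TOKEN_RE = re.compile(
    r"(?P<cite>(?P<volume>\d+) (?P<reporter>U\.S\.|F\.3d) (?P<page>\d+))"
    r"|(?P<id>\bId\.)"
    r"|(?P<stop>\bv\.)"
)

def tokenize(text):
    """Split text into plain words and tagged tuples for recognized tokens."""
    tokens = []
    cursor = 0
    for match in TOKEN_RE.finditer(text):
        # Text between recognized tokens is split on whitespace into plain words.
        tokens.extend(text[cursor:match.start()].split())
        if match.group("cite"):
            tokens.append(("citation", match.group("volume"),
                           match.group("reporter"), match.group("page")))
        elif match.group("id"):
            tokens.append(("id", match.group("id")))
        else:
            tokens.append(("stop_word", match.group("stop")))
        cursor = match.end()
    tokens.extend(text[cursor:].split())
    return tokens

tokens = tokenize("Foo v. Bar, 123 U.S. 456 (2016). Id. at 457.")
print(tokens)
```

Words such as '(2016).' and '457.' stay as plain strings; a later pass over the token list can then pick up the year or pin cite around each citation token.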
Alternate tokenizers can be substituted by providing a tokenizer instance to get_citations():
from eyecite import get_citations
from eyecite.tokenizers import HyperscanTokenizer

hyperscan_tokenizer = HyperscanTokenizer(cache_dir='.hyperscan')
cites = get_citations(text, tokenizer=hyperscan_tokenizer)
test_FindTest.py includes a simplified example of using a custom tokenizer that uses modified regular expressions to extract citations with OCR errors.
eyecite ships with two tokenizers:
AhocorasickTokenizer (default)
The default tokenizer uses the pyahocorasick library to filter down eyecite’s list of extractor regexes. It then performs extraction using the builtin re library.
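The filter-then-extract idea can be sketched without any third-party dependency. In this illustration, plain substring checks stand in for the pyahocorasick automaton, and the reporter list and find_citations helper are hypothetical.

```python
import re

# Tiny hypothetical reporter list; the real list is far larger.
REPORTERS = ["U.S.", "F.3d", "Cal. App. 4th"]

def find_citations(text):
    """Run a regex only for reporters whose literal string occurs in the text."""
    # Step 1: cheap literal filter (stand-in for an Aho-Corasick search).
    candidates = [reporter for reporter in REPORTERS if reporter in text]
    # Step 2: run the more expensive regexes for the surviving reporters only.
    found = []
    for reporter in candidates:
        pattern = re.compile(r"(\d+) " + re.escape(reporter) + r" (\d+)")
        found.extend((m.start(), m.group(0)) for m in pattern.finditer(text))
    # Return matches in the order they appear in the text.
    return [cite for _, cite in sorted(found)]

print(find_citations("See 1 U.S. 12 and 2 F.3d 2."))  # ['1 U.S. 12', '2 F.3d 2']
```

The saving comes from step 1: a text that mentions only a few reporters never pays for the regexes of all the others.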
HyperscanTokenizer
The alternate HyperscanTokenizer compiles all extraction regexes into a hyperscan database so they can be extracted in a single pass. This is far faster than the default tokenizer (exactly how much faster depends on how many citation formats are included in the target text), but requires the optional hyperscan dependency that is limited to the x86 platform.
Compiling the hyperscan database takes several seconds, so short-running scripts may want to provide a cache directory where the database can be stored. The directory should be writeable only by the user:
hyperscan_tokenizer = HyperscanTokenizer(cache_dir='.hyperscan')
Installation
Installing eyecite is easy.
poetry add eyecite
Or via pip:
pip install eyecite
Or install the latest dev version from GitHub:
pip install git+https://github.com/freelawproject/eyecite.git@master
Deployment
Update version info in pyproject.toml.
For an automated deployment, tag the commit with vx.y.z, and push it to master. An automated deploy will be completed for you.
For a manual deployment, run:
poetry publish --build
Testing
eyecite comes with a robust test suite of different citation strings that it is equipped to handle. Run these tests as follows:
python3 -m unittest discover -s tests -p 'test_*.py'
If you would like to create mock citation objects to assist you in writing your own local tests, import and use the following functions for convenience:
from eyecite.test_factories import (
    case_citation,
    id_citation,
    nonopinion_citation,
    supra_citation,
)
License
This repository is available under the permissive BSD license, making it easy and safe to incorporate in your own libraries.
Pull and feature requests welcome. Online editing in GitHub is possible (and easy!).