# corpus-preprocess

Utility functions to preprocess Phil. legalese in weasel-based flows:

- lexcat-proj; and
- lexcat-multi
> [!IMPORTANT]
> Relies on a private `corpus-assets` repository which must be cloned locally.
```yaml
- corpus-assets: # folder should have the following structure:
  - data: # used as data folder in tokenization
    - single_tokens.json
    - report_publishers.json
  - ents: # collected in `setup_span_ruler.py`
    - casenames.txt # each line is a clean case
    - clean_statute_titles.txt # each line is a clean title
  - concepts: # collected in `setup_span_ruler.py`
    - political: # main subject category
      - bill_of_rights: # sub-topic
        - patterns.json # contains matcher files
        - q.txt # contains lines which can be used to query the database
  - metas: # collected in `setup_span_ruler.py`
    - artifacts:
      - axiom:
        - patterns.json # same
        - q.txt # same
```
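Since the toolkit assumes this exact layout, it can help to sanity-check a fresh clone before building the pipeline. A minimal sketch using only `pathlib` (the directory names come from the tree above; the helper itself is hypothetical and not part of the package):

```python
from pathlib import Path


def check_assets(root: Path) -> list[str]:
    """Return a list of problems found in a corpus-assets checkout."""
    problems = []
    # top-level folders expected by the tree above
    for sub in ("data", "ents", "concepts", "metas"):
        if not (root / sub).is_dir():
            problems.append(f"missing folder: {sub}")
    # every sub-topic under concepts/ should carry a q.txt and patterns.json
    for topic in sorted((root / "concepts").glob("*/*")):
        if topic.is_dir():
            for required in ("q.txt", "patterns.json"):
                if not (topic / required).exists():
                    problems.append(f"{topic.name}: missing {required}")
    return problems
```

Running `check_assets(Path("corpus-assets"))` after cloning should return an empty list when the layout matches.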
## Custom tokenizer / span ruler
```python
import spacy
from spacy.language import Language
from spacy.util import filter_spans

from .setup_span_ruler import set_patterns_from_assets
from .setup_tokenizer import customize_tokenizer
from .tokens_single import import_data_tokens
from .utils import validated_path


# limit number of spans returned; "ruler" is the default spans key
@Language.component(name="filter_added_spans")
def filter_added_spans(doc):
    doc.spans["ruler"] = filter_spans(doc.spans["ruler"])
    return doc


# initialize model; get special rules for tokenization, here: tokens_dir = /corpus_assets/data
rules_file = validated_path(tokens_dir)
special_rules = import_data_tokens(data_path=rules_file)
nlp = spacy.load("en_core_web_sm", exclude=("ner", "senter"))
nlp.tokenizer = customize_tokenizer(nlp, special_rules)

# prepare patterns for the span ruler, here: assets_dir = /corpus_assets
span_patterns = set_patterns_from_assets(path=validated_path(assets_dir))
ruler = nlp.add_pipe("span_ruler", config={"phrase_matcher_attr": "LOWER"})
ruler.add_patterns(span_patterns)
nlp.add_pipe("filter_added_spans")
nlp.to_disk("models/")  # saves the entire directory, including the pipeline
```
> [!NOTE]
> Loading the model can take a while if many patterns are included via `set_patterns_from_assets()`, e.g. 130k pattern files take about 90 seconds.
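For reference, the entries fed to `ruler.add_patterns()` follow spaCy's SpanRuler pattern schema; since the ruler above is configured with `phrase_matcher_attr="LOWER"`, plain-string patterns are matched case-insensitively. A hypothetical `patterns.json` fragment (the `topic/subtopic` id convention is inferred from the categorizer code further below; the phrases are invented):

```python
import json

# hypothetical patterns.json content: string patterns go through the
# phrase matcher, list-of-dicts patterns through the token matcher
patterns = [
    {"label": "concept", "id": "political/bill_of_rights", "pattern": "due process"},
    {
        "label": "concept",
        "id": "political/bill_of_rights",
        "pattern": [{"LOWER": "equal"}, {"LOWER": "protection"}],
    },
]
print(json.dumps(patterns, indent=2))
```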
## Processes

### Generate queries
The `q.txt` lines will be used as criteria to fetch relevant segments from the database. The db file should have an `opinion_segments` table with full-text search (fts) enabled on the `text` column. `/scripts/extract.py` utilizes `table.search()`. See code:
```python
def extract_txt_from_db(
    source_db_file: str,
    path: Path,
    max_segments: int,
    min_char_segment: int = 100,
    max_char_segment: int = 3000,
    is_unique_txt: bool = True,
):
    """An fts expression is auto-generated by `q.txt` files found in the `path`. This
    expression is used to generate strings of text that match the aggregated query."""
    db = Database(source_db_file)
    tbl = db["opinion_segments"]
    rows = tbl.search(  # type: ignore
        q=create_fts_expr(path),
        where="category='ruling' and char_count > :min_char and char_count < :max_char",
        where_args={"min_char": min_char_segment, "max_char": max_char_segment},
        limit=max_segments,
        columns=["text", "id"],
    )
    if is_unique_txt:
        rows = filter_unique_texts(rows)
    return rows
```
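The `Database` and `table.search()` calls above come from `sqlite-utils`. The underlying requirement, an `opinion_segments` table with full-text search on `text`, can be sketched with the standard library's `sqlite3` and FTS5 (table and column names are taken from the snippet above; the sample rows are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE opinion_segments "
    "(id TEXT PRIMARY KEY, text TEXT, category TEXT, char_count INTEGER)"
)
# a companion FTS5 index over the text column, roughly what
# sqlite-utils' enable_fts() sets up
conn.execute("CREATE VIRTUAL TABLE opinion_segments_fts USING fts5(text)")

rows = [
    ("1", "The writ of habeas corpus extends to cases of illegal confinement.",
     "ruling", 66),
    ("2", "Costs are awarded to the prevailing party.", "ruling", 42),
]
conn.executemany("INSERT INTO opinion_segments VALUES (?, ?, ?, ?)", rows)
conn.execute(
    "INSERT INTO opinion_segments_fts(rowid, text) "
    "SELECT rowid, text FROM opinion_segments"
)

# phrase query, analogous to the aggregated fts expression built from q.txt
hits = conn.execute(
    "SELECT s.id, s.text FROM opinion_segments_fts f "
    "JOIN opinion_segments s ON s.rowid = f.rowid "
    "WHERE opinion_segments_fts MATCH ? AND s.category = 'ruling'",
    ('"habeas corpus"',),
).fetchall()
```

Only the first row matches the phrase `"habeas corpus"`; the real script layers the `char_count` bounds and `limit` on top of this.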
### Create matcher patterns

A SpanRuler component will be based on `patterns.json` (with `q.txt` lines as phrases). These patterns are aggregated via `set_patterns_from_assets()`. See code:
```python
def set_patterns_from_assets(path: Path):
    axioms = axiom.collect_patterns(path.joinpath("meta"))
    concepts = create_concept_patterns(path.joinpath("concepts"))
    ents = extract_ents(path.joinpath("ents"))
    return axioms + concepts + ents
```
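A rough approximation of what `create_concept_patterns()` might do: walk `concepts/`, read each sub-topic's `q.txt`, and emit SpanRuler-style phrase patterns whose `id` is `topic/subtopic`. The helper name comes from the snippet above, but its body here is guesswork for illustration only:

```python
from pathlib import Path


def create_concept_patterns_sketch(concepts_dir: Path) -> list[dict]:
    """Hypothetical: turn each concepts/<topic>/<subtopic>/q.txt line
    into a phrase pattern keyed by "topic/subtopic"."""
    patterns = []
    for q_file in sorted(concepts_dir.glob("*/*/q.txt")):
        subtopic = q_file.parent  # e.g. concepts/political/bill_of_rights
        span_id = f"{subtopic.parent.name}/{subtopic.name}"
        for line in q_file.read_text().splitlines():
            if line.strip():
                patterns.append(
                    {"label": "concept", "id": span_id, "pattern": line.strip()}
                )
    return patterns
```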
### Categorize queried segments via patterns found

A TextCategorizer component can be trained using the results of the span ruler; see sample code:
```python
from collections import Counter

from spacy.language import Language
from spacy.tokens import Doc


@Language.factory(name="add_cats_from_spans")
class AddTextCatComponent:
    def __init__(self, nlp: Language, name: str, path: str):
        self.nlp = nlp
        options = list({p["id"].split("/")[0] for p in create_patterns(path)})  # type: ignore
        if len(options) == 1:
            options.append(f"not_{options[0]}")
        self.options = options

    def __call__(self, doc) -> Doc:
        default = {op: 0.0 for op in self.options}
        cats = [self.nlp.vocab.strings[s.id].split("/")[0] for s in doc.spans["sc"]]
        doc.cats = default | {k: 1.0 for k, _ in Counter(cats).items()}
        return doc
```
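The span-to-category step in `__call__` above is plain Python and can be illustrated without spaCy: collapse each span's `topic/subtopic` id to its top-level topic, then flip the present topics to 1.0 (the labels here are invented):

```python
from collections import Counter

options = ["political", "labor", "criminal"]
span_ids = ["political/bill_of_rights", "political/review", "labor/strikes"]

# every option starts at 0.0; Counter keys are the topics actually seen
default = {op: 0.0 for op in options}
cats = [span_id.split("/")[0] for span_id in span_ids]
doc_cats = default | {k: 1.0 for k in Counter(cats)}
print(doc_cats)  # political and labor are hit; criminal stays 0.0
```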
> [!NOTE]
> If `textcat` is in the pipeline and only one label is found, training will error out, hence the need to add a `not_*` option. If `textcat_multilabel` is used, a single category is fine.
## Prerequisites to lexcat-*

| item | desc | project.yml declaration |
|---|---|---|
| db | sqlite database to fetch segments[^1] | `db_file` |
| corpus-assets | A folder to retrieve q.txt and patterns.json files | `patterns_dir` |
| corpus-preprocess | This toolkit | see usage in /scripts/build.py and /scripts/extract.py |
[^1]: Although it might be better to allow segment access via lawsql's API.
## Installation of lexcat-*

Clone the above repos and activate a virtual env with `requirements.txt`:

```sh
python -m venv .venv && \
source .venv/bin/activate && \
python -m pip install -U pip && \
python -W ignore -m pip install -r requirements.txt && \
weasel run init
```
### lexcat-proj

- Results in a model trained on a specific concept category.
- Adjust project.yml's `name`, `topic_dir`, and `total_segments` variables (`vars`).
- Running `weasel run all` produces `packages/en_lex_{name}_{total_segments}-0.0.0/dist`.
- The output is based on q.txt and patterns.json files sourced from e.g. `../patterns/{topic_dir}`.
- Alternatively, CLI arguments can be overridden, e.g. `weasel run all . --vars.topic_dir <value> --vars.name <value>`.
#### broad implementation

| topic | name | status |
|---|---|---|
| political | pol | ok |
| labor | labor | ok |
| criminal | crim | ok |
| civil | civ | ok |
| remedial | rem | - |
| commercial | com | - |
| ethics | eths | - |
Example use on the command line (note the `.`)[^2]:

```sh
weasel run all . \
  --vars.topic_dir criminal \
  --vars.name crim \
  --vars.total_segments 5000
```
[^2]: Overriding weasel project variables (`vars`) on the command line requires tinkering.
#### granular implementation

| topic | name | status |
|---|---|---|
| political/review | pol_rev | ok |
| political/sovereignty | pol_sov | ok |
| political/bill_of_rights | pol_bill | ok |
| political/administrative | pol_adm | ok |
Example use on the command line (note the `.`):

```sh
weasel run all . \
  --vars.topic_dir political/administrative \
  --vars.name pol_adm \
  --vars.total_segments 250
```
### lexcat-multi

- Results in a model trained on all concept categories.
- Each category's example files are found in `assets`.
- Running `weasel run all` produces `packages/en_lexcat-0.0.0/dist`.
- The output is based on q.txt and patterns.json files sourced from e.g. `../patterns` (the parent directory).
## Models

There are two models to consider; both will be created under `/training`.

### rule-based, weak supervision via keywords
- The first model is a rule-based temporary model.
- The basic pipeline makes use of a tokenizer and SpanRuler to make adjustments to `doc.spans`.
- The pipeline is applied to segments fetched from the database.
- The model is built via `scripts/build.py`.
- The config of this model can be seen in `/training/{name}_ruler/config.cfg`.
- The purpose of this model is seen in `weasel run bin`, which outputs `corpus/train.spacy`.
### lexcat, generate model to test on prodigy

- The second model is the statistical "training" `lexcat`.
- Utilizes a separate `lexcat_proj/config.cfg` with output `corpus/train.spacy` (from the rule-based model).
- The purpose of this model, found in `training/lexcat/model-best` after `weasel run train`, is to package it for later use.
- This model becomes a weak supervision model that can be checked by human annotators later.
## Packaged models

Install via filepath, e.g.:

```sh
pip install ../lexcat-proj/packages/en_lex_labor_5000-0.0.0/dist/en_lex_labor_5000-0.0.0.tar.gz  # or poetry add
```

This will enable:

```python
nlp = spacy.load("en_lex_labor_5000")
```
## Gotchas

### weasel

- See CLI overrides in weasel, previously spacy projects.
- There are too many warnings, hence the `-W ignore` option when running Python command-line scripts.
- Do not name a script/module `tokenize.py` (the name shadows the standard library's `tokenize` module); this results in `AttributeError: partially initialized module 'inspect' has no attribute 'getmro' (most likely due to a circular import)`.
- Although `project.yml` produces output with Markdown formatting, it will not respect full markdown formatting (e.g. headers, tables, enumerations) within `project.yml` fields like `description`, hence the need for this NOTES.md file as a supplement.
- When creating `Language.factory` configs, using `Path` as a type annotation results in `Fatal Python error: Segmentation fault`.