
Building blocks for spacy custom tokenization and Matcher patterns


corpus-preprocess


Utility functions to preprocess Philippine legalese in weasel-based flows:

  1. lexcat-proj; and
  2. lexcat-multi

[!IMPORTANT] Relies on a private corpus-assets repository that must be cloned locally.

The corpus-assets folder should have the following structure:

- data: # used as data folder in tokenization
  - single_tokens.json
  - report_publishers.json
- ents: # collected in `setup_span_ruler.py`
  - casenames.txt # each line is a clean case
  - clean_statute_titles.txt # each line is a clean title
- concepts: # collected in `setup_span_ruler.py`
  - political: # main subject category
      - bill_of_rights: # sub-topic
          - patterns.json # contains matcher files
          - q.txt # contains lines which can be used to query the database
- metas: # collected in `setup_span_ruler.py`
  - artifacts:
    - axiom:
      - patterns.json # same
      - q.txt # same
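Since the assets repo is private, the exact file contents are not shown here. The sketch below is a hypothetical illustration of what one concepts sub-topic might contain; the actual schema is defined by corpus-assets:

import json
from pathlib import Path

# Hypothetical contents for corpus-assets/concepts/political/bill_of_rights (illustrative only).
sample_patterns = [
    {
        "id": "political/bill_of_rights",  # "category/sub-topic", reused later to derive textcat labels
        "label": "concept",
        "pattern": [{"LOWER": "due"}, {"LOWER": "process"}],  # token-level Matcher pattern
    },
    {
        "id": "political/bill_of_rights",
        "label": "concept",
        "pattern": "unreasonable searches and seizures",  # phrase pattern
    },
]
sample_queries = ["due process", "unreasonable search"]  # one query phrase per line of q.txt

folder = Path("corpus-assets/concepts/political/bill_of_rights")
folder.mkdir(parents=True, exist_ok=True)
(folder / "patterns.json").write_text(json.dumps(sample_patterns, indent=2))
(folder / "q.txt").write_text("\n".join(sample_queries))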

Custom tokenizer

import spacy

# import_data_tokens, validated_path, and customize_tokenizer (used below)
# are helper functions from this toolkit.

@spacy.registry.tokenizers("lex.tokenizer.v1")  # type: ignore
def lex_tokenize(data_folder: str):
    """
    The tokenizer:

    1. Removes dashes from infixes
    2. Adds prefix/suffix rules for parenthesis/brackets
    3. Adds special exceptions via the `data_folder`
    """
    def modify_tokenizer(nlp):
        data = import_data_tokens(validated_path(data_folder))
        nlp.tokenizer = customize_tokenizer(data)
        return nlp.tokenizer

    return modify_tokenizer


def create_base_nlp(base_model: str, data_folder: str):
    """
    Declare a new empty model to get custom tokenization, then plug the
    pipeline with the required parts of a pre-trained model. The data
    folder modifies the tokenizer via `nlp.tokenizer.add_special_rules()`.
    """
    nlp = spacy.blank(
        name="en",
        config={
            "nlp": {
                "tokenizer": {
                    "@tokenizers": "lex.tokenizer.v1",
                    "data_folder": data_folder, # add special rules from third-party source
                }
            }
        },
    )
    source_nlp = spacy.load(base_model, exclude="ner,senter")
    nlp.vocab.vectors = source_nlp.vocab.vectors
    for name in source_nlp.pipe_names:
        nlp.add_pipe(name, source=source_nlp)
    return nlp
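A minimal usage sketch; the base model name and data path below are assumptions, and any installed pre-trained pipeline with vectors should work:

nlp = create_base_nlp(base_model="en_core_web_md", data_folder="corpus-assets/data")
doc = nlp("Sec. 1, Art. III, 1987 Constitution (Bill of Rights).")
print([t.text for t in doc])  # tokenization reflects the special rules from data_folder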

SpanRuler from assets

Use in tandem with the tokenizer and ensure that only the longest spans are kept:

from spacy.language import Language
from spacy.util import filter_spans
from preprocess import set_patterns_from_assets
import spacy

@Language.component(name="filter_added_spans")
def filter_added_spans(doc):
    doc.spans["ruler"] = filter_spans(doc.spans["ruler"])
    return doc

ruler = nlp.add_pipe("span_ruler", config={"phrase_matcher_attr": "LOWER"}, validate=True) # defaults to 'ruler' key
patterns = set_patterns_from_assets(folder)  # `folder` is the path to the local corpus-assets clone
ruler.add_patterns(patterns)
nlp.add_pipe("filter_added_spans") # ensures only longest spans are included
nlp.to_disk("models/")  # will save entire directory which includes the pipeline
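After both components are added, matches can be inspected in doc.spans["ruler"]; the sample sentence is illustrative:

doc = nlp("No person shall be deprived of life, liberty or property without due process of law.")
for span in doc.spans["ruler"]:
    print(span.text, span.label_, span.id_)  # id_ carries the pattern's id, if one was set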

Processes

Generate queries

The q.txt lines will be used as criteria to fetch relevant segments from the database.

The db file should have an "opinion_segments" table with full-text search (fts) enabled on the "text" column. /scripts/extract.py utilizes table.search().
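A minimal sketch of the expected database shape, assuming sqlite-utils; only the columns referenced by the query below ("id", "text", "category", "char_count") are shown:

from sqlite_utils import Database

db = Database("segments.db")  # hypothetical filename
db["opinion_segments"].insert_all(
    [{"id": 1, "text": "The Court ruled that due process requires...", "category": "ruling", "char_count": 150}],
    pk="id",
)
db["opinion_segments"].enable_fts(["text"])  # makes tbl.search() possible on the "text" column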

See code:

from pathlib import Path

from sqlite_utils import Database  # the db file is assumed to be accessed via sqlite-utils

# create_fts_expr and filter_unique_texts are helper functions from this toolkit.


def extract_txt_from_db(
    source_db_file: str,
    path: Path,
    max_segments: int,
    min_char_segment: int = 100,
    max_char_segment: int = 3000,
    is_unique_txt: bool = True,
):
    """An fts expression is auto-generated by `q.txt` files found in the `path`. This
    expression is used to generate strings of text that match the aggregated query."""
    db = Database(source_db_file)
    tbl = db["opinion_segments"]
    rows = tbl.search(  # type: ignore
        q=create_fts_expr(path),
        where="category='ruling' and char_count > :min_char and char_count < :max_char ",
        where_args={"min_char": min_char_segment, "max_char": max_char_segment},
        limit=max_segments,
        columns=["text", "id"],
    )
    if is_unique_txt:
        rows = filter_unique_texts(rows)
    return rows
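An illustrative call; the database filename, assets path, and segment limit are assumptions:

rows = extract_txt_from_db(
    source_db_file="segments.db",
    path=Path("corpus-assets/concepts/political"),
    max_segments=5000,
)
print(len(rows), rows[0]["text"][:80])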

Create matcher patterns

A SpanRuler component will be based on patterns.json (with q.txt as phrases). These patterns are aggregated via set_patterns_from_assets(). See code:

def set_patterns_from_assets(path: Path):
    axioms = axiom.collect_patterns(path.joinpath("meta"))
    concepts = create_concept_patterns(path.joinpath("concepts"))
    ents = extract_ents(path.joinpath("ents"))
    return axioms + concepts + ents
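The exact pattern schema lives in the private assets repo; the hypothetical sketch below only illustrates the convention relied on later, where each pattern's id takes the form category/sub-topic:

from pathlib import Path

patterns = set_patterns_from_assets(Path("corpus-assets"))
print(patterns[0])
# e.g. {"id": "political/bill_of_rights", "label": "concept",
#       "pattern": [{"LOWER": "due"}, {"LOWER": "process"}]}
# AddTextCatComponent (below) splits the "id" on "/" to derive textcat labels.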

Categorize queried segments via patterns found

A TextCategorizer component can be trained using the results of the span ruler; see sample code:

from collections import Counter

from spacy.language import Language
from spacy.tokens import Doc

# create_patterns is a helper function from this toolkit.


@Language.factory(name="add_cats_from_spans")
class AddTextCatComponent:
    def __init__(self, nlp: Language, name: str, path: str):
        self.nlp = nlp
        options = list({p["id"].split("/")[0] for p in create_patterns(path)})  # type: ignore
        if len(options) == 1:
            options.append(f"not_{options[0]}")
        self.options = options

    def __call__(self, doc) -> Doc:
        default = {op: 0.0 for op in self.options}
        cats = [self.nlp.vocab.strings[s.id].split("/")[0] for s in doc.spans["sc"]]
        doc.cats = default | {k: 1.0 for k, _ in Counter(cats).items()}
        return doc

[!NOTE] If textcat is in the pipeline and only one label is found, training will error out; hence the need for a not_ option. If textcat_multilabel is used, a single category is fine.
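A hedged usage sketch; the path value is an assumption, and the span ruler must write to the "sc" key (e.g. via its spans_key config) for the component's doc.spans["sc"] lookup to find anything:

nlp.add_pipe("add_cats_from_spans", config={"path": "corpus-assets/concepts/political"})
doc = nlp("No person shall be deprived of life, liberty or property without due process of law.")
print(doc.cats)  # e.g. {"political": 1.0, "not_political": 0.0}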

Prerequisites to lexcat-*

| item | desc | project.yml declaration |
| --- | --- | --- |
| db | sqlite database to fetch segments[^1] | db_file |
| corpus-assets | A folder to retrieve q.txt and patterns.json files | patterns_dir |
| corpus-preprocess | This toolkit | see usage in /scripts/build.py and /scripts/extract.py |

[^1]: Although it might be better to allow segment access via lawsql's API.

Installation of lexcat-*

Clone the above repos, then create and activate a virtual environment and install requirements.txt:

python -m venv .venv && \
source .venv/bin/activate && \
python -m pip install -U pip && \
python -W ignore -m pip install -r requirements.txt && \
weasel run init

lexcat-proj

  • Results in a model trained on a specific concept category
  • Need to adjust project.yml's name, topic_dir, and total_segments variables (vars).
  • Running weasel run all produces packages/en_lex_name_total_segments-0.0.0/dist
  • The output is based on q.txt and patterns.json files sourced from e.g. ../patterns/topic_dir.
  • Alternatively, can override CLI arguments, e.g. weasel run all . --vars.topic_dir <value> --vars.name <value>

broad implementation

| topic | name | status |
| --- | --- | --- |
| political | pol | ok |
| labor | labor | ok |
| criminal | crim | ok |
| civil | civ | ok |
| remedial | rem | - |
| commercial | com | - |
| ethics | eths | - |

Example use on command line (note .)[^2]:

weasel run all . \
    --vars.topic_dir criminal \
    --vars.name crim \
    --vars.total_segments 5000

[^2]: Overriding weasel project variables (vars) on the command line requires some tinkering.

granular implementation

| topic | name | status |
| --- | --- | --- |
| political/review | pol_rev | ok |
| political/sovereignty | pol_sov | ok |
| political/bill_of_rights | pol_bill | ok |
| political/administrative | pol_adm | ok |

Example use on command line (note .):

weasel run all . \
    --vars.topic_dir political/administrative \
    --vars.name pol_adm \
    --vars.total_segments 250

lexcat-multi

  • Results in a model trained on all concept categories
  • Each category's example files are found in assets
  • Running weasel run all produces packages/en_lexcat-0.0.0/dist
  • The output is based on q.txt and patterns.json files sourced from e.g. ../patterns (the parent directory)

Models

There are two models to consider; both will be created under /training.

rule-based, weak supervision via keywords

  1. The first model is a rule-based temporary model.
  2. Basic pipeline makes use of a tokenizer and SpanRuler to make adjustments to doc.spans.
  3. The pipeline is applied to segments fetched from the database.
  4. The model is built via scripts/build.py.
  5. The config of this model can be seen in /training/{name}_ruler/config.cfg
  6. The purpose of this model, as seen in the weasel run bin step, is to output a corpus/train.spacy (see the sketch after this list).
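The actual logic lives in scripts/build.py; a minimal sketch of the final step, turning processed segments into corpus/train.spacy, might look like this:

from spacy.tokens import DocBin

doc_bin = DocBin()  # doc.cats assigned by the span-based component are serialized with each doc
for doc in nlp.pipe(row["text"] for row in rows):
    doc_bin.add(doc)
doc_bin.to_disk("corpus/train.spacy")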

lexcat, generate model to test on prodigy

  1. The second model is the statistical "training" lexcat.
  2. Utilizes a separate lexcat_proj/config.cfg with output corpus/train.spacy (from rule-based model).
  3. The purpose of this model, found in training/lexcat/model-best after weasel run train, is to package it for later use.
  4. This model becomes a weak supervision model that can be checked by human annotators later.

Packaged models

Install via filepath, e.g.

pip install ../lexcat-proj/packages/en_lex_labor_5000-0.0.0/dist/en_lex_labor_5000-0.0.0.tar.gz  # or poetry add

This will enable:

nlp = spacy.load('en_lex_labor_5000')
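Once installed, the package loads like any other spaCy pipeline; the sentence and scores below are illustrative:

import spacy

nlp = spacy.load("en_lex_labor_5000")
doc = nlp("The employee was dismissed without just cause.")
print(doc.cats)  # e.g. {"labor": 0.97, "not_labor": 0.03}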

Gotchas

spacy

  1. Cannot override a pre-trained model's tokenization; custom tokenization can only be created with an empty model.

weasel

  1. See CLI overrides in weasel (previously spacy projects).
  2. There are too many warnings, hence the -W ignore option used when running python command-line scripts.
  3. Do not name a script file tokenize.py; this results in AttributeError: partially initialized module 'inspect' has no attribute 'getmro' (most likely due to a circular import).
  4. Although project.yml output is rendered as Markdown, it will not respect full Markdown formatting (e.g. headers, tables, enumerations) within project.yml fields like description; hence the need for this NOTES.md file as a supplement.
  5. In creating Language.factories configs, using Path as a type results in Fatal Python error: Segmentation fault.
