Legalese tokenization

lextok

Rule-based tokenizer and pattern matching for basic Philippine entities using spacy.

Important: should be used in tandem with doclex.

Quickstart

poetry env use 3.11.6 # 3.12 not yet supported
poetry install
poetry shell
python -m spacy download en_core_web_sm # base model

Rationale

Before

import spacy

nlp = spacy.load("en_core_web_sm")  # no modifications to the model
doc1 = nlp("Sec. 36(b)(21)")
for token in doc1:
    print(f"{token.text=} {token.pos_=} {token.ent_type_=}, {token.i=}")
"""
token.text='Sec' token.pos_='PROPN' token.ent_type_='ORG' token.i=0
token.text='.' token.pos_='PUNCT' token.ent_type_='' token.i=1
token.text='36(b)(21' token.pos_='NUM' token.ent_type_='CARDINAL' token.i=2
token.text=')' token.pos_='PUNCT' token.ent_type_='' token.i=3
"""

After

from lextok import lextok

lex = lextok()  # adds a custom tokenizer, attribute ruler, and entity ruler
doc2 = lex("Sec. 36(b)(21)")
for token in doc2:
    print(f"{token.text=} {token.pos_=} {token.ent_type_=} {token.i=}")
"""
token.text='Sec.' token.pos_='NOUN' token.ent_type_='ProvisionNum' token.i=0
token.text='36(b)(21)' token.pos_='NUM' token.ent_type_='ProvisionNum' token.i=1
"""

Token entities can be merged:

from lextok import lextok

lex = lextok(finalize_entities=True)
doc2 = lex("Sec. 36(b)(21)")
for token in doc2:
    print(f"{token.text=} {token.pos_=} {token.ent_type_=} {token.i=}")
"""
token.text='Sec. 36(b)(21)' token.pos_='NUM' token.ent_type_='ProvisionNum' token.i=0
"""

Pattern creation

A pattern consists of a list of tokens, and small tokenization details matter, e.g. is there a space between the word, the dot, and the number? This pattern treats the dot as a separate token:

[
    {"ORTH": {"IN": ["Tit", "Bk", "Ch", "Sub-Chap", "Art", "Sec", "Par", "Sub-Par"]}},
    {"ORTH": "."},  # with dot
    {"POS": "NUM"},
]

This is another pattern where the dot is connected to the word:

[
    {
        "ORTH": {
            "IN": [
                "Tit.",
                "Bk.",
                "Ch.",
                "Sub-Chap.",
                "Art.",
                "Sec.",
                "Par.",
                "Sub-Par.",
            ]
        }
    },
    {"POS": "NUM"},
]  # no separate dot

There are many variations. It becomes possible to generate a list of patterns algorithmically and save them to a *.jsonl file, e.g.:

from lextok.entity_rules_citeable import statutory_provisions

print(statutory_provisions.patterns)  # view patterns
statutory_provisions.create_file()  # located in /lextok/rules/ if path not specified
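
The generated *.jsonl can also be loaded into a plain spaCy pipeline. A minimal sketch, assuming the file follows spaCy's standard pattern JSONL format ({"label": ..., "pattern": [...]}) and using a hypothetical filename; note that matching still depends on tokenization, which lextok customizes, so results with the stock tokenizer may differ:

import spacy

nlp = spacy.load("en_core_web_sm")
ruler = nlp.add_pipe("entity_ruler")
# Hypothetical path: point this at the file actually written by create_file()
ruler.from_disk("lextok/rules/statutory_provisions.jsonl")
doc = nlp("Sec. 36(b)(21)")
print(doc.ents)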

Rules and Labels

Each Rule may consist of many patterns, and this collection of patterns can be associated with a Label.

In spaCy parlance, the label represents the ENT_TYPE, but for this library's purposes it is also adopted for non-entities to cater to SpanRuler patterns.

To distinguish labels meant strictly for entities from labels for non-entities, a collection of labels is defined in SPAN_RULER_LABELS. If a Rule's label is not included in this list, its patterns are governed by the EntityRuler; otherwise, by the SpanRuler.

Considering the number of Rules declared (or to be declared), instead of importing each instance individually, they can be extracted dynamically with Rule.extract_from_files().
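
As a rough check of this split (a minimal sketch, assuming each Rule exposes its label attribute as in the constructor shown further below), the bundled rules can be grouped by which ruler governs them:

from lextok import ENTITY_RULES, SPAN_RULES

# ENTITY_RULES feed the EntityRuler; SPAN_RULES feed the SpanRuler.
print("entity labels:", {rule.label.name for rule in ENTITY_RULES})
print("span labels:", {rule.label.name for rule in SPAN_RULES})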

Existing data structures

from lextok import Label, ENTITY_RULES, SPAN_RULES

for label in Label:
    print(label.name)  # pattern labels
for e in ENTITY_RULES:
    print(e)
for s in SPAN_RULES:
    print(s)

Add more entity rules

Create a list of Rule objects, e.g.:

from lextok import lextok, Rule, ENTITY_RULES, Label

added_rules = [
    Rule(
        id="ministry-labor",
        label=Label.GovtDivision,
        patterns=[
            [
                {"LOWER": "the", "OP": "?"},
                {"LOWER": "ministry"},
                {"LOWER": "of"},
                {"LOWER": "labor"},
            ]
        ],
    ),
    Rule(
        id="intermediate-scrutiny",
        label=Label.Doctrine,
        patterns=[
            [
                {"LOWER": "test", "OP": "?"},
                {"LOWER": "of", "OP": "?"},
                {"LOWER": "intermediate"},
                {"LOWER": "scrutiny"},
                {"LEMMA": {"IN": ["test", "approach"]}, "OP": "?"},
            ]
        ],
    ),
]

# Include new rules in lextok language
nlp = lextok(finalize_entities=True, entity_rules=ENTITY_RULES + added_rules)

# Test detection
doc = nlp(
    "Lorem ipsum, sample text. The Ministry of Labor is a govt division. Hello world. The test of intermediate scrutiny is a constitutional law concept."
)
doc.ents  # (The Ministry of Labor, test of intermediate scrutiny)
