Legalese tokenization
lextok
Rule-based tokenizer and pattern matching for basic Philippine entities using spacy.
Important: should be used in tandem with doclex.
Quickstart
poetry env use 3.11.6 # 3.12 not yet supported
poetry install
poetry shell
python -m spacy download en_core_web_sm # base model
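To verify the installation, the pipeline can be loaded and its components listed. This is only a smoke test (not part of the package's documented quickstart); pipe_names is standard spacy API:
from lextok import lextok
nlp = lextok()
print(nlp.pipe_names)  # lists the pipeline components registered by lextok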
Rationale
Before
import spacy
nlp = spacy.load("en_core_web_sm") # no modifications to the model
doc1 = nlp("Sec. 36(b)(21)")
for token in doc1:
    print(f"{token.text=} {token.pos_=} {token.ent_type_=} {token.i=}")
"""
token.text='Sec' token.pos_='PROPN' token.ent_type_='ORG' token.i=0
token.text='.' token.pos_='PUNCT' token.ent_type_='' token.i=1
token.text='36(b)(21' token.pos_='NUM' token.ent_type_='CARDINAL' token.i=2
token.text=')' token.pos_='PUNCT' token.ent_type_='' token.i=3
"""
After
from lextok import lextok
lex = lextok() # inclusion of custom tokenizer, attribute and entity ruler
doc2 = lex("Sec. 36(b)(21)")
for token in doc2:
    print(f"{token.text=} {token.pos_=} {token.ent_type_=} {token.i=}")
"""
token.text='Sec.' token.pos_='NOUN' token.ent_type_='ProvisionNum' token.i=0
token.text='36(b)(21)' token.pos_='NUM' token.ent_type_='ProvisionNum' token.i=1
"""
Token entities can be merged:
from lextok import lextok
lex = lextok(finalize_entities=True)
doc2 = lex("Sec. 36(b)(21)")
for token in doc2:
    print(f"{token.text=} {token.pos_=} {token.ent_type_=} {token.i=}")
"""
token.text='Sec. 36(b)(21)' token.pos_='NUM' token.ent_type_='ProvisionNum' token.i=0
"""
Pattern creation
A pattern consists of a list of tokens, and small variations matter, e.g. is the dot a separate token from the word? This pattern expects the word, a separate dot, and a number:
[
    {"ORTH": {"IN": ["Tit", "Bk", "Ch", "Sub-Chap", "Art", "Sec", "Par", "Sub-Par"]}},
    {"ORTH": "."},  # with dot
    {"POS": "NUM"},
]
This is another pattern where the dot is connected to the word:
[
    {
        "ORTH": {
            "IN": [
                "Tit.",
                "Bk.",
                "Ch.",
                "Sub-Chap.",
                "Art.",
                "Sec.",
                "Par.",
                "Sub-Par.",
            ]
        }
    },
    {"POS": "NUM"},
]  # no separate dot
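To see how these token patterns behave outside of lextok, they can be tried directly with spacy's Matcher. This is a minimal sketch using the base model only; "Provision" is just an arbitrary match key here, and which variant fires depends on how the base tokenizer happens to split the abbreviation and its dot:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")  # base model, no lextok customizations
matcher = Matcher(nlp.vocab)
matcher.add(
    "Provision",
    [
        [  # word, separate dot, number
            {"ORTH": {"IN": ["Tit", "Bk", "Ch", "Sub-Chap", "Art", "Sec", "Par", "Sub-Par"]}},
            {"ORTH": "."},
            {"POS": "NUM"},
        ],
        [  # word with the dot attached, then number
            {"ORTH": {"IN": ["Tit.", "Bk.", "Ch.", "Sub-Chap.", "Art.", "Sec.", "Par.", "Sub-Par."]}},
            {"POS": "NUM"},
        ],
    ],
)
doc = nlp("See Art. 2 in relation to Bk 5.")
for _, start, end in matcher(doc):
    print(doc[start:end].text)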
There are many variations. It becomes possible to generate a list of patterns algorithmically and save them to a *.jsonl file, e.g.:
from lextok.entity_rules_citeable import statutory_provisions
print(statutory_provisions.patterns) # view patterns
statutory_provisions.create_file() # located in /lextok/rules/ if path not specified
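For illustration only (this is not lextok's own Rule machinery), the same kind of variant generation can be done by hand and written out in spacy's EntityRuler *.jsonl format, one pattern per line; the file name and label below are arbitrary choices for this sketch:
import srsly  # serialization helper that ships as a spacy dependency

words = ["Tit", "Bk", "Ch", "Sub-Chap", "Art", "Sec", "Par", "Sub-Par"]
lines = []
for w in words:
    # variant with a separate dot token
    lines.append({"label": "ProvisionNum", "pattern": [{"ORTH": w}, {"ORTH": "."}, {"POS": "NUM"}]})
    # variant with the dot attached to the word
    lines.append({"label": "ProvisionNum", "pattern": [{"ORTH": f"{w}."}, {"POS": "NUM"}]})
srsly.write_jsonl("provision_patterns.jsonl", lines)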
Rules and Labels
Each Rule may consist of many patterns, and this collection of patterns can be associated with a Label. In spacy parlance, the label represents the ENT_TYPE, but for this library's purpose it's also adopted for non-entities to cater to SpanRuler patterns.
To distinguish labels strictly for entities from labels for non-entities, a collection of labels is defined in SPAN_RULER_LABELS. If a Rule's label is not included in this list, the implication is that its patterns ought to be governed by the EntityRuler; otherwise, the SpanRuler (see the sketch below).
Considering the number of Rules declared (or to be declared), instead of importing each instance individually, these can be extracted dynamically with Rule.extract_from_files().
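As a rough sketch of that routing logic (a hypothetical helper, not lextok API, and assuming SPAN_RULER_LABELS is importable from the package root, which may not be the case):
from lextok import Rule, SPAN_RULER_LABELS  # import location assumed for this sketch

def governing_ruler(rule: Rule) -> str:
    # Labels listed in SPAN_RULER_LABELS go to the SpanRuler; everything else
    # is treated as an entity label and handled by the EntityRuler.
    return "span_ruler" if rule.label in SPAN_RULER_LABELS else "entity_ruler"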
Existing data structures
from lextok import Label, ENTITY_RULES, SPAN_RULES
for label in Label:
    print(label.name)  # pattern labels

for e in ENTITY_RULES:
    print(e)

for s in SPAN_RULES:
    print(s)
Add more entity rules
Create a list of Rule
objects, e.g.:
from lextok import lextok, Rule, ENTITY_RULES, Label
added_rules = [
    Rule(
        id="ministry-labor",
        label=Label.GovtDivision,
        patterns=[
            [
                {"LOWER": "the", "OP": "?"},
                {"LOWER": "ministry"},
                {"LOWER": "of"},
                {"LOWER": "labor"},
            ]
        ],
    ),
    Rule(
        id="intermediate-scrutiny",
        label=Label.Doctrine,
        patterns=[
            [
                {"LOWER": "test", "OP": "?"},
                {"LOWER": "of", "OP": "?"},
                {"LOWER": "intermediate"},
                {"LOWER": "scrutiny"},
                {"LEMMA": {"IN": ["test", "approach"]}, "OP": "?"},
            ]
        ],
    ),
]
# Include new rules in lextok language
nlp = lextok(finalize_entities=True, entity_rules=ENTITY_RULES + added_rules)
# Test detection
doc = nlp(
    "Lorem ipsum, sample text. The Ministry of Labor is a govt division. Hello world. The test of intermediate scrutiny is a constitutional law concept."
)
doc.ents # (The Ministry of Labor, test of intermediate scrutiny)
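The detected spans and their labels can be inspected with standard spacy attributes; the expected labels below follow from the rules defined above:
for ent in doc.ents:
    print(ent.text, ent.label_)
# expected, e.g.: "The Ministry of Labor GovtDivision" and
# "test of intermediate scrutiny Doctrine"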