# corpus-preprocess
Utility functions to preprocess Phil. legalese in weasel-based flows:

- `lexcat-proj`; and
- `lexcat-multi`
> [!IMPORTANT]
> Requires the private `corpus-assets` folder and the sqlite3 db in `citelaws-data` to be cloned locally.
```
- corpus-assets: # folder structure
  - concept: # must be two-level nested; each leaf holds patterns.json + q.txt
  - artifact: # single folder: patterns.json + q.txt
  - text: # each file is a .txt
```
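As a minimal sketch of how this layout might be consumed, the two-level `concept/` tree can be walked with stdlib `pathlib`. The loader name `iter_concepts` is hypothetical; only the folder shape comes from the README:

```python
from pathlib import Path


def iter_concepts(concept_dir: Path):
    """Yield (main_topic, subtopic, patterns_file, queries_file) tuples
    from a two-level concept/ tree, e.g. concept/political/bill_of_rights/.
    Hypothetical helper; not part of the corpus-preprocess API."""
    for main in sorted(p for p in concept_dir.iterdir() if p.is_dir()):
        for sub in sorted(p for p in main.iterdir() if p.is_dir()):
            patterns = sub / "patterns.json"
            queries = sub / "q.txt"
            if patterns.exists() and queries.exists():
                yield main.name, sub.name, patterns, queries
```

Leaves missing either `patterns.json` or `q.txt` are skipped, which keeps a partially cloned assets folder from breaking the pipeline.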
## Language customization

Assuming familiarity with spaCy:
```python
nlp.tokenizer = customize_tokenizer(nlp, special_token_rules)  # custom tokenizer
ruler = nlp.add_pipe(
    "span_ruler",
    config={
        "spans_key": "ruler",
        "phrase_matcher_attr": "LOWER",
        "spans_filter": {"@misc": "spacy.first_longest_spans_filter.v1"},  # longest spans only
    },
)
ruler.add_patterns(patterns)  # patterns created from this library and corpus-assets
```
> [!NOTE]
> Loading the model with ~130k pattern lines takes ~2 min.
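The `spacy.first_longest_spans_filter.v1` filter referenced above resolves overlaps by keeping the longest matches. A pure-Python sketch of that idea (not spaCy's actual implementation), with spans as `(start, end, label)` tuples:

```python
def first_longest(spans):
    """Keep only non-overlapping spans, preferring longer ones; on equal
    length, the span that starts first wins. Pure-Python sketch of the
    behaviour of spacy.first_longest_spans_filter.v1."""
    result = []
    claimed = set()  # token indices already covered by a kept span
    for start, end, label in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        if not any(i in claimed for i in range(start, end)):
            result.append((start, end, label))
            claimed.update(range(start, end))
    return sorted(result)
```

With both `"republic act"` and `"republic act no. 386"` matching the same tokens, only the longer span survives, which is why the README opts into this filter for the `span_ruler`.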
## Training data

### Concept spans
```python
for folder in get_concepts(asset_dir.joinpath("concept")):
    bn = DocBin()
    # use q.txt lines as queries to the db; max_segments caps the
    # number of segments fetched per q.txt
    docs = apply_concept_q_filter(nlp, db_file, filter_path=folder, max_segments=500)
    for doc in docs:
        bn.add(doc)
    bn.to_disk(asset_dir.joinpath(f"train/{folder.stem}.spacy"))
```
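`apply_concept_q_filter` belongs to this library; only its database side is sketched below with stdlib `sqlite3`. The table name `segments` and column `text` are assumptions for illustration, not the real `citelaws-data` schema:

```python
import sqlite3


def fetch_segments(db_file: str, queries: list[str], max_segments: int) -> list[str]:
    """Fetch up to max_segments rows whose text matches any q.txt line.
    Sketch only: the 'segments' table and 'text' column are hypothetical."""
    con = sqlite3.connect(db_file)
    try:
        rows: list[str] = []
        for q in queries:
            remaining = max_segments - len(rows)
            if remaining <= 0:
                break
            cursor = con.execute(
                "SELECT text FROM segments WHERE text LIKE ? LIMIT ?",
                (f"%{q}%", remaining),
            )
            rows.extend(text for (text,) in cursor)
        return rows
    finally:
        con.close()
```

Capping per call (rather than per query line) keeps any one prolific `q.txt` entry from flooding the training set.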
Each `concept_dir` contains subtopics:
```
- corpus-assets: # folder structure
  - concept: # must be two-level nested
    - political: # main subject category
      - bill_of_rights: # sub-topic
        - patterns.json # contains matcher files
        - q.txt # contains lines which can be used to query the database
```
Because of this structure, it's possible to train a `textcat_multilabel` component:
```python
from spacy.language import Language
from spacy.tokens import Doc

textcat_options = [concept["id"].split("/")[0] for concept in concept_patterns]


@Language.factory(name="add_cats_from_spans")
class AddTextCatComponent:
    def __init__(self, nlp: Language, name: str, options: list[str]):
        self.nlp = nlp
        self.options = options

    def __call__(self, doc) -> Doc:
        doc.cats = {op: 0.0 for op in self.options}
        for span in doc.spans["sc"]:
            if span.id:  # some spans won't have an id
                value = self.nlp.vocab.strings[span.id]
                if "/" in value:  # e.g. political/bill_of_rights
                    main_topic = value.split("/")[0]  # just "political"
                    if main_topic in self.options:
                        doc.cats[main_topic] = 1.0
        return doc
```
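The span-id-to-category mapping above is independent of spaCy and can be restated (and unit-tested) as a plain function; `cats_from_span_ids` is our name for the sketch, not part of the library:

```python
def cats_from_span_ids(span_ids: list[str], options: list[str]) -> dict[str, float]:
    """Map span ids like 'political/bill_of_rights' to one-hot text
    categories keyed by the main topic before the slash. Plain-Python
    restatement of AddTextCatComponent's logic."""
    cats = {op: 0.0 for op in options}
    for value in span_ids:
        if "/" in value:
            main_topic = value.split("/")[0]
            if main_topic in options:
                cats[main_topic] = 1.0
    return cats
```

Every option starts at `0.0`, so a doc with no qualifying spans still yields a complete (all-negative) `cats` dict, which `textcat_multilabel` training expects.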
### Non-concept spans
Although patterns from `set_patterns()` are included in the constructed `nlp` object, `apply_label_filter()` can ensure that a certain number of rows (`filter_count`) fetched from the database have spans labeled `title`, `serial`, etc.:
```python
for label in {"unit", "ref", "serial", "title", "axiom", "date", "juridical"}:
    bn = DocBin()
    docs = apply_label_filter(nlp, db_file, filter_labels={label}, filter_count=1500)
    for doc in docs:
        bn.add(doc)
    bn.to_disk(asset_dir.joinpath(f"train/{label}.spacy"))
```
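A rough, library-agnostic sketch of what a label filter like `apply_label_filter` presumably does: keep up to `filter_count` rows containing at least one span with a wanted label. The `(text, labels)` row shape here is an assumption for illustration:

```python
def filter_rows(rows, filter_labels: set[str], filter_count: int) -> list[str]:
    """Keep up to filter_count texts whose span labels intersect
    filter_labels. Rows are (text, labels) pairs -- a stand-in for
    the real db schema, not the library's actual types."""
    picked: list[str] = []
    for text, labels in rows:
        if filter_labels & set(labels):
            picked.append(text)
            if len(picked) == filter_count:
                break
    return picked
```

Running this once per label, as the loop above does, yields balanced per-label training files (`train/serial.spacy`, `train/title.spacy`, ...) rather than one skewed corpus.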
### Hashes for corpus_preprocess-0.0.7-py3-none-any.whl

| Algorithm | Hash digest |
|---|---|
| SHA256 | 5d83ebe3e7892ed7759b94877a02e07229641cfcb2a6d4976e7f9e3d136da60d |
| MD5 | c9acde61a643cc93bf8ff3c1d6504608 |
| BLAKE2b-256 | 627988d8f73fe64988d87e5b43972636d6c542c80002bb7c179b9dc773c83979 |