# corpus-preprocess

Utility functions to preprocess Philippine legalese in weasel-based flows:

- `lexcat-proj`; and
- `lexcat-multi`
> [!IMPORTANT]
> Requires the private `corpus-assets` folder and the sqlite3 db in `citelaws-data` to be cloned locally.
```yml
- corpus-assets: # folder structure
  - concept: # must be two-level nested: patterns.json + q.txt
  - artifact: # single folder: patterns.json + q.txt
  - text: # each file is a .txt
```
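Before building the pipeline, a quick sanity check along these lines can confirm the clones are in place (a minimal sketch; `asset_dir` and the database filename are assumptions, not fixed by this library):

```python
from pathlib import Path

asset_dir = Path("corpus-assets")          # assumed local clone location
db_file = Path("citelaws-data/db.sqlite")  # assumed filename of the sqlite3 db

for sub in ("concept", "artifact", "text"):
    assert asset_dir.joinpath(sub).is_dir(), f"missing {sub}/ in corpus-assets"
assert db_file.is_file(), "missing sqlite3 db from citelaws-data"
```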
## Language customization

Assuming familiarity with spaCy:
```python
nlp.tokenizer = customize_tokenizer(nlp, special_token_rules)  # custom tokenizer
ruler = nlp.add_pipe(
    "span_ruler",
    config={
        "spans_key": "ruler",
        "phrase_matcher_attr": "LOWER",
        "spans_filter": {"@misc": "spacy.first_longest_spans_filter.v1"},  # longest spans only
    },
)
ruler.add_patterns(patterns)  # patterns created from this library and corpus-assets
```
> [!NOTE]
> Loading a model with 130k pattern lines takes ~2 min.
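Once the patterns are loaded, matched spans land under the configured `spans_key`. A quick smoke test (the sample sentence is made up):

```python
doc = nlp("The Bill of Rights is found in Article III of the 1987 Constitution.")
for span in doc.spans["ruler"]:
    print(span.label_, span.text)
```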
## Training data

### Concept spans
```python
from spacy.tokens import DocBin

for folder in get_concepts(asset_dir.joinpath("concept")):
    bn = DocBin()
    # use each folder's q.txt as queries to the db;
    # `max_segments` caps the number of segments fetched per q.txt
    docs = apply_concept_q_filter(nlp, db_file, filter_path=folder, max_segments=500)
    for doc in docs:
        bn.add(doc)
    bn.to_disk(asset_dir.joinpath(f"train/{folder.stem}.spacy"))
```
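Each serialized file can be reloaded against the same vocabulary to inspect what was collected, e.g. (the `political.spacy` filename assumes a `political` concept folder):

```python
from spacy.tokens import DocBin

bn = DocBin().from_disk(asset_dir.joinpath("train/political.spacy"))
docs = list(bn.get_docs(nlp.vocab))
print(len(docs), docs[0].spans["sc"])
```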
Each concept directory contains subtopics:
```yml
- corpus-assets: # folder structure
  - concept: # must be two-level nested
    - political: # main subject category
      - bill_of_rights: # sub-topic
        - patterns.json # contains matcher files
        - q.txt # contains lines which can be used to query the database
```
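For illustration only (not an actual file from the private repo), a `patterns.json` entry would follow spaCy's span ruler schema, with the `id` encoding `main_topic/sub_topic`, which the `add_cats_from_spans` component below relies on:

```python
# Hypothetical entry; real patterns live in the private corpus-assets repo.
sample_pattern = {
    "id": "political/bill_of_rights",  # main subject category / sub-topic
    "label": "concept",                # assumed label name
    "pattern": [{"LOWER": "due"}, {"LOWER": "process"}],  # made-up token pattern
}
```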
Because of this structure, it's possible to train a `textcat_multilabel` component:
```python
from spacy.language import Language
from spacy.tokens import Doc

# main topics, e.g. "political" from a pattern id like "political/bill_of_rights";
# `concept_patterns` are the patterns loaded from corpus-assets
textcat_options = [concept["id"].split("/")[0] for concept in concept_patterns]

@Language.factory(name="add_cats_from_spans")
class AddTextCatComponent:
    def __init__(self, nlp: Language, name: str, options: list[str]):
        self.nlp = nlp
        self.options = options

    def __call__(self, doc: Doc) -> Doc:
        doc.cats = {op: 0.0 for op in self.options}
        for span in doc.spans["sc"]:
            if span.id:  # some spans won't have an id
                value = self.nlp.vocab.strings[span.id]
                if "/" in value:  # e.g. political/bill_of_rights
                    main_topic = value.split("/")[0]  # just political
                    if main_topic in self.options and doc.cats[main_topic] == 0.0:
                        doc.cats[main_topic] = 1.0
        return doc
```
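With the factory registered, the component can be added to the pipeline and handed its options through the config. A sketch, assuming an upstream step has already populated `doc.spans["sc"]`:

```python
add_cats = nlp.add_pipe("add_cats_from_spans", config={"options": textcat_options})

doc = add_cats(doc)  # doc must already carry labeled spans under "sc"
print(doc.cats)      # e.g. {"political": 1.0, ...}
```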
### Non-concept spans
Although patterns from `set_patterns()` are included in the constructed `nlp` object, `apply_label_filter()` can ensure that a certain number of rows (`filter_count`) are fetched from the database which contain spans labeled `title`, `serial`, etc.:
```python
for label in {"unit", "ref", "serial", "title", "axiom", "date", "juridical"}:
    bn = DocBin()
    docs = apply_label_filter(nlp, db_file, filter_labels={label}, filter_count=1500)
    for doc in docs:
        bn.add(doc)
    bn.to_disk(asset_dir.joinpath(f"train/{label}.spacy"))
```
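The per-label files can then be combined into a single corpus with `DocBin.merge`; a minimal sketch (the `train.spacy` output path is an assumption):

```python
from spacy.tokens import DocBin

merged = DocBin()
for f in asset_dir.joinpath("train").glob("*.spacy"):
    merged.merge(DocBin().from_disk(f))
merged.to_disk(asset_dir.joinpath("train.spacy"))
```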