Utility functions to preprocess Phil. legalese in weasel-based flows.
corpus-preprocess
Utility functions to preprocess Phil. legalese in weasel-based flows:
- lexcat-proj; and
- lexcat-multi
[!IMPORTANT] Requires private corpus-assets folder and sqlite3 db in citelaws-data to be cloned locally.
- corpus-assets: # folder structure
  - concept: # must be two-level nested patterns.json + q.txt
  - artifact: # single folder patterns.json + q.txt
  - text: # each file is a .txt
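The layout above can be checked before any pipeline is built. The following is a minimal sketch using only the standard library; `check_assets` is a hypothetical helper that is not part of corpus-preprocess, and reading "single folder" as `artifact/<name>/` one level deep is an assumption.

```python
from pathlib import Path

REQUIRED = ("patterns.json", "q.txt")


def check_assets(asset_dir: Path) -> list[str]:
    """Return a list of layout problems found under asset_dir (empty if none)."""
    problems: list[str] = []

    def require_files(folder: Path) -> None:
        # Every pattern folder is expected to hold patterns.json and q.txt.
        for name in REQUIRED:
            if not (folder / name).is_file():
                rel = folder.relative_to(asset_dir).as_posix()
                problems.append(f"{rel}: missing {name}")

    # concept/<main topic>/<sub-topic>/ (two levels deep)
    for sub in sorted(asset_dir.glob("concept/*/*")):
        if sub.is_dir():
            require_files(sub)

    # artifact/<name>/ (assumed to be one level deep)
    for sub in sorted(asset_dir.glob("artifact/*")):
        if sub.is_dir():
            require_files(sub)

    # text/ should contain only .txt files
    for f in sorted(asset_dir.glob("text/*")):
        if f.is_file() and f.suffix != ".txt":
            problems.append(f"{f.relative_to(asset_dir).as_posix()}: not a .txt file")

    return problems
```

Calling `check_assets(Path("corpus-assets"))` and printing the result gives a quick report of missing files before the slower model-loading step.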
Language customization
Assuming familiarity with spaCy:
nlp.tokenizer = customize_tokenizer(nlp, special_token_rules)  # custom tokenizer
ruler = nlp.add_pipe(
    "span_ruler",
    config={
        "spans_key": "ruler",
        "phrase_matcher_attr": "LOWER",
        "spans_filter": {"@misc": "spacy.first_longest_spans_filter.v1"},  # longest spans only
    },
)
ruler.add_patterns(patterns)  # created patterns from this library and corpus-assets
[!NOTE] Loading model with 130k pattern lines takes ~2 min.
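The `spans_filter` setting above keeps only the longest span when matches overlap. To make that behavior concrete, here is a pure-Python sketch of the same idea on `(start, end, label)` token offsets. It is an illustration only, not spaCy's implementation, and `first_longest` is a hypothetical name.

```python
Span = tuple[int, int, str]  # (start token, end token exclusive, label)


def first_longest(spans: list[Span]) -> list[Span]:
    """Keep non-overlapping spans, preferring longer ones, then earlier ones."""
    # Longest first; among equal lengths, the span that starts earlier wins.
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept: list[Span] = []
    seen: set[int] = set()  # token indices already covered by a kept span
    for start, end, label in ordered:
        if seen.isdisjoint(range(start, end)):
            kept.append((start, end, label))
            seen.update(range(start, end))
    return sorted(kept)


candidates = [(0, 2, "ref"), (0, 5, "title"), (4, 6, "date"), (6, 7, "unit")]
print(first_longest(candidates))  # [(0, 5, 'title'), (6, 7, 'unit')]
```

The shorter `ref` and `date` candidates are dropped because they share tokens with the longer `title` span.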
Training data
Concept spans
from spacy.tokens import DocBin

for folder in get_concepts(asset_dir.joinpath("concept")):
    bn = DocBin()
    # each folder's q.txt supplies the queries run against the db;
    # max_segments caps the number of segments fetched per q.txt
    docs = apply_concept_q_filter(nlp, db_file, filter_path=folder, max_segments=500)
    for doc in docs:
        bn.add(doc)
    bn.to_disk(asset_dir.joinpath(f"train/{folder.stem}.spacy"))
Each concept_dir contains subtopics:
- corpus-assets: # folder structure
  - concept: # must be two-level nested
    - political: # main subject category
      - bill_of_rights: # sub-topic
        - patterns.json # contains matcher files
        - q.txt # contains lines which can be used to query the database
Because of this structure, it's possible to train a textcat_multilabel component:
from spacy.language import Language
from spacy.tokens import Doc

# main topic of every concept pattern, e.g. "political" from "political/bill_of_rights"
textcat_options = [concept["id"].split("/")[0] for concept in concept_patterns]


@Language.factory(name="add_cats_from_spans")
class AddTextCatComponent:
    def __init__(self, nlp: Language, name: str, options: list[str]):
        self.nlp = nlp
        self.options = options

    def __call__(self, doc: Doc) -> Doc:
        doc.cats = {op: 0.0 for op in self.options}
        # spans are read from the "sc" key; adjust if yours live elsewhere (e.g. "ruler")
        for span in doc.spans["sc"]:
            if span.id:  # some spans won't have an id
                value = self.nlp.vocab.strings[span.id]
                if "/" in value:  # e.g. political/bill_of_rights
                    main_topic = value.split("/")[0]  # just political
                    if main_topic in self.options:
                        doc.cats[main_topic] = 1.0
        return doc
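The mapping from span ids to categories can be tried without loading a model. The sketch below mirrors the component's logic on plain strings; `cats_from_ids` is a hypothetical helper, and every sample id other than `political/bill_of_rights` is made up for illustration.

```python
def cats_from_ids(span_ids: list[str], options: list[str]) -> dict[str, float]:
    """Score each main topic 1.0 if any span id starts with it, else 0.0."""
    cats = {op: 0.0 for op in options}
    for value in span_ids:
        if "/" in value:  # e.g. political/bill_of_rights
            main_topic = value.split("/")[0]
            if main_topic in cats:
                cats[main_topic] = 1.0
    return cats


# Hypothetical pattern ids; a set removes repeated main topics.
pattern_ids = ["political/bill_of_rights", "political/suffrage", "civil/property"]
options = sorted({pid.split("/")[0] for pid in pattern_ids})

print(options)  # ['civil', 'political']
print(cats_from_ids(["political/bill_of_rights", "date"], options))
# {'civil': 0.0, 'political': 1.0}
```

Ids without a `/` (such as a plain `date` label) are ignored, so only concept spans contribute to the text categories.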
Non-concept spans
Although patterns from set_patterns() are already included in the constructed nlp object,
you can also ensure that a set number of rows (filter_count) fetched from the database contain spans labeled
title, serial, etc.
for label in {"unit", "ref", "serial", "title", "axiom", "date", "juridical"}:
    bn = DocBin()
    docs = apply_label_filter(nlp, db_file, filter_labels={label}, filter_count=1500)
    for doc in docs:
        bn.add(doc)
    bn.to_disk(asset_dir.joinpath(f"train/{label}.spacy"))
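The idea of collecting a fixed number of rows that carry a given label can be sketched in plain Python. This is not how `apply_label_filter` is implemented; `take_labeled` and the row shape (a dict with a `labels` list) are assumptions made purely for illustration.

```python
from itertools import islice
from typing import Iterable, Iterator


def take_labeled(
    rows: Iterable[dict], filter_labels: set[str], filter_count: int
) -> Iterator[dict]:
    """Yield at most filter_count rows that have at least one wanted label."""
    matches = (row for row in rows if filter_labels & set(row["labels"]))
    # islice stops consuming rows as soon as enough matches are found
    return islice(matches, filter_count)


rows = [
    {"id": 1, "labels": ["title"]},
    {"id": 2, "labels": ["date"]},
    {"id": 3, "labels": ["title", "serial"]},
    {"id": 4, "labels": ["title"]},
]
print([row["id"] for row in take_labeled(rows, {"title"}, 2)])  # [1, 3]
```

Because the filter is lazy, a large table is only scanned until the requested count is reached.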