
Project description

corpus-preprocess


Utility functions to preprocess Philippine legalese in weasel-based flows:

  1. lexcat-proj; and
  2. lexcat-multi

[!IMPORTANT] Requires the private corpus-assets folder and the sqlite3 database in citelaws-data to be cloned locally.

- corpus-assets: # folder structure
  - concept: # two-level nested folders; each leaf holds patterns.json + q.txt
  - artifact: # single-level folders, each holding patterns.json + q.txt
  - text: # each file is a .txt
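The patterns.json files in this layout can be gathered with a short directory walk. A minimal sketch, assuming each patterns.json holds a JSON list of matcher patterns; collect_patterns and the asset_dir path are illustrative, not part of this library's API:

import json
from pathlib import Path

asset_dir = Path("corpus-assets")  # hypothetical local checkout

def collect_patterns(root: Path) -> list[dict]:
    """Gather every patterns.json found under a corpus-assets subfolder."""
    found: list[dict] = []
    for file in root.rglob("patterns.json"):
        found.extend(json.loads(file.read_text()))
    return found

concept_patterns = collect_patterns(asset_dir / "concept")    # two-level nested
artifact_patterns = collect_patterns(asset_dir / "artifact")  # single folders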

Language customization

Assuming familiarity with spaCy:

# `nlp` is an existing spaCy Language; `customize_tokenizer`, `special_token_rules`,
# and `patterns` come from this library and corpus-assets
nlp.tokenizer = customize_tokenizer(nlp, special_token_rules)  # custom tokenizer
ruler = nlp.add_pipe(
    "span_ruler",
    config={
        "spans_key": "ruler",
        "phrase_matcher_attr": "LOWER",
        "spans_filter": {"@misc": "spacy.first_longest_spans_filter.v1"},  # keep only the longest spans
    },
)
ruler.add_patterns(patterns)  # patterns created from this library and corpus-assets
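Once configured, matches land in the span group named by spans_key. A quick check (the sample sentence is illustrative):

doc = nlp("The accused invoked the Bill of Rights.")  # illustrative text
for span in doc.spans["ruler"]:  # key matches the "spans_key" set above
    print(span.text, span.label_)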

[!NOTE] Loading a model with ~130k pattern lines takes around 2 minutes.
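One possible mitigation, assuming the customized pipeline round-trips cleanly through spaCy's standard serialization, is to save the configured object once and reload it later; how much time this saves depends on how the ruler rebuilds its matchers at load time:

import spacy

nlp.to_disk("models/legalese_ruler")  # hypothetical path; ruler patterns are serialized too
nlp = spacy.load("models/legalese_ruler")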

Training data

Concept spans

for folder in get_concepts(asset_dir.joinpath("concept")):
    bn = DocBin()
    # each line of the folder's q.txt becomes a query to the db;
    # max_segments caps the number of segments fetched per q.txt
    docs = apply_concept_q_filter(nlp, db_file, filter_path=folder, max_segments=500)
    for doc in docs:
        bn.add(doc)
    bn.to_disk(asset_dir.joinpath(f"train/{folder.stem}.spacy"))
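To spot-check a serialized batch, it can be read back with the standard DocBin API (the filename here is illustrative):

from spacy.tokens import DocBin

bn = DocBin().from_disk(asset_dir / "train" / "bill_of_rights.spacy")
docs = list(bn.get_docs(nlp.vocab))
print(len(docs), docs[0].spans)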

Each directory under concept/ is a main subject category containing sub-topics:

- corpus-assets: # folder structure
  - concept: # must be two-level nested
    - political: # main subject category
      - bill_of_rights: # sub-topic
        - patterns.json # contains matcher patterns
        - q.txt # contains lines which can be used to query the database

Because of this structure, it's possible to train a textcat_multilabel component:

textcat_options = [concept["id"].split("/")[0] for concept in concept_patterns]

from spacy.language import Language
from spacy.tokens import Doc


@Language.factory(name="add_cats_from_spans", default_config={"options": []})
class AddTextCatComponent:
    def __init__(self, nlp: Language, name: str, options: list[str]):
        self.nlp = nlp
        self.options = options

    def __call__(self, doc: Doc) -> Doc:
        doc.cats = {op: 0.0 for op in self.options}
        for span in doc.spans["sc"]:
            if not span.id:  # some spans won't have an id
                continue
            value = self.nlp.vocab.strings[span.id]
            if "/" not in value:  # ids look like political/bill_of_rights
                continue
            main_topic = value.split("/")[0]  # just "political"
            if main_topic in self.options:
                doc.cats[main_topic] = 1.0
        return doc
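The factory can then be attached to the pipeline, passing the main-topic names collected above (a usage sketch; the component reads doc.spans["sc"], so it must run after whatever step populates that key):

nlp.add_pipe("add_cats_from_spans", config={"options": textcat_options}, last=True)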

Non-concept spans

Although patterns from set_patterns() are already included in the constructed nlp object, we can also ensure that a set number of rows (filter_count) is fetched from the database containing spans labeled title and/or serial, etc.

for label in {"unit", "ref", "serial", "title", "axiom", "date", "juridical"}:
    bn = DocBin()
    # fetch up to filter_count rows whose spans carry the given label
    docs = apply_label_filter(nlp, db_file, filter_labels={label}, filter_count=1500)
    for doc in docs:
        bn.add(doc)
    bn.to_disk(asset_dir.joinpath(f"train/{label}.spacy"))
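The per-concept and per-label .spacy files can then be combined into a single corpus. A hedged sketch; the merge step and output filename are assumptions, not library behavior:

from spacy.tokens import DocBin

merged = DocBin()
for path in sorted(asset_dir.joinpath("train").glob("*.spacy")):
    for doc in DocBin().from_disk(path).get_docs(nlp.vocab):
        merged.add(doc)
merged.to_disk(asset_dir.joinpath("combined.spacy"))  # written outside train/ to avoid re-reading it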

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

corpus_preprocess-0.0.7.tar.gz (21.9 kB)


Built Distribution

corpus_preprocess-0.0.7-py3-none-any.whl (27.2 kB)


File details

Details for the file corpus_preprocess-0.0.7.tar.gz.

File metadata

  • Download URL: corpus_preprocess-0.0.7.tar.gz
  • Size: 21.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.7.1 CPython/3.10.6 Darwin/23.2.0

File hashes

Hashes for corpus_preprocess-0.0.7.tar.gz

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 41bc54d0f12dfc3fa9ba4ba7b0f0dec86459d7a3f5dbfea07c67c77ead045e6f |
| MD5 | 85ee52bc4a1b444f81b4d2b0d6b83f19 |
| BLAKE2b-256 | eb3f5293b52d7734940cbffc9fee4b41994c4bb5ab58527c7a286bf8e6b9e335 |


File details

Details for the file corpus_preprocess-0.0.7-py3-none-any.whl.

File metadata

File hashes

Hashes for corpus_preprocess-0.0.7-py3-none-any.whl

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 5d83ebe3e7892ed7759b94877a02e07229641cfcb2a6d4976e7f9e3d136da60d |
| MD5 | c9acde61a643cc93bf8ff3c1d6504608 |
| BLAKE2b-256 | 627988d8f73fe64988d87e5b43972636d6c542c80002bb7c179b9dc773c83979 |

