Building blocks for spaCy Matcher patterns

corpus-patterns

A preparatory utilities library for building spaCy Matcher patterns.

Create a custom tokenizer

import spacy
from corpus_patterns import set_tokenizer

nlp = spacy.blank("en")
nlp.tokenizer = set_tokenizer(nlp)  # replace the default tokenizer with the customized one

The tokenizer:

  1. Removes dashes from infixes
  2. Adds prefix/suffix rules for parentheses and brackets
  3. Adds special exceptions to treat dotted text as a single token

Use with a modified config:

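import spacy
from corpus_patterns import set_tokenizer

@spacy.registry.tokenizers("test")  # type: ignore
def create_corpus_tokenizer():
    def create_tokenizer(nlp):
        return set_tokenizer(nlp)
    return create_tokenizer

# Override the pipeline's tokenizer with the registered "test" tokenizer.
nlp = spacy.load(
    "en_core_web_sm",
    config={"nlp": {"tokenizer": {"@tokenizers": "test"}}},
)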

Add .jsonl files to a directory

Each file contains spaCy Matcher patterns, one per line.

from corpus_patterns import create_rules
from pathlib import Path

create_rules(folder=Path("location-here"))  # check directory
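
To make the file layout concrete, here is a minimal standard-library sketch of writing and reading such a file. It assumes each line is a JSON array of token-attribute dictionaries, the shape a spaCy Matcher pattern takes; the exact schema that create_rules() expects is not documented here, so the file name, the sample patterns, and the read_patterns() helper are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def read_patterns(path: Path) -> list[list[dict]]:
    """Parse one pattern (a list of token dicts) per non-empty line."""
    patterns = []
    for line in path.read_text(encoding="utf-8").splitlines():
        if line.strip():
            patterns.append(json.loads(line))
    return patterns


# Hypothetical sample patterns; the real schema may differ.
sample_lines = [
    [{"LOWER": "republic"}, {"LOWER": "act"}, {"ORTH": "No."}, {"IS_DIGIT": True}],
    [{"ORTH": "RA"}, {"IS_DIGIT": True}],
]

with tempfile.TemporaryDirectory() as tmp:
    sample_file = Path(tmp) / "statutes.jsonl"  # hypothetical file name
    # JSONL: one JSON value per line.
    sample_file.write_text(
        "\n".join(json.dumps(p) for p in sample_lines) + "\n", encoding="utf-8"
    )
    patterns = read_patterns(sample_file)
```

Reading the file line by line keeps each pattern independent, so a malformed line can be located by its line number.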

Utils

  1. annotate_fragments(): given an nlp object and some *.txt files, creates a single annotation *.jsonl file.
  2. extract_lines_from_txt_files(): accepts an iterator of *.txt files and yields each line, after sorting the lines and ensuring each one is unique.
  3. split_data(): given a list of text strings, splits them into two groups according to the given ratio (defaults to 0.80) and returns a dictionary containing both groups.
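
The behavior described for the last two utilities can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's implementation: the function names unique_sorted_lines() and split_by_ratio(), the "train"/"dev" dictionary keys, the seeded shuffle, and the stripping of blank lines are all hypothetical choices.

```python
import random


def unique_sorted_lines(texts):
    """Yield each distinct non-empty line once, in sorted order."""
    seen = set()
    for text in texts:
        for line in text.splitlines():
            line = line.strip()
            if line:
                seen.add(line)
    yield from sorted(seen)


def split_by_ratio(items, ratio=0.80, seed=0):
    """Shuffle a copy of the items and split it into two groups at the ratio."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = int(len(shuffled) * ratio)
    # The key names are an assumption; the library's keys may differ.
    return {"train": shuffled[:cut], "dev": shuffled[cut:]}


lines = list(unique_sorted_lines(["b\na\n", "a\nc\n\n"]))
groups = split_by_ratio(range(10))
```

Deduplicating before splitting matters: a line that appears in both groups would leak evaluation data into training.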
