Building blocks for spaCy Matcher patterns
# corpus-patterns
A preparatory utils library.
## Create a custom tokenizer

```py
import spacy

from corpus_patterns import set_tokenizer

nlp = spacy.blank("en")
nlp.tokenizer = set_tokenizer(nlp)  # swap in the customized tokenizer
```

The tokenizer:
- Removes dashes from infixes
- Adds prefix/suffix rules for parentheses and brackets
- Adds special exceptions to treat dotted text as a single token

## Add .jsonl files to a directory

Each file contains lines of spaCy Matcher patterns.

```py
from pathlib import Path

from corpus_patterns import create_rules

create_rules(folder=Path("location-here"))  # check directory
```
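
To illustrate what a `.jsonl` file of Matcher patterns could look like, here is a minimal sketch using only the standard library. It assumes each line holds one pattern as a JSON array of token-attribute dictionaries. The exact layout that `create_rules` expects is not documented here, and `write_patterns`, `read_patterns`, the sample patterns, and `sample.jsonl` are hypothetical names chosen for the example.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical patterns: each one is a list of token-attribute dicts,
# the shape spaCy's Matcher accepts for a single pattern.
patterns = [
    [{"LOWER": "police"}, {"LOWER": "power"}],
    [{"LOWER": "due"}, {"LOWER": "process"}, {"LOWER": "clause", "OP": "?"}],
]


def write_patterns(path: Path, items: list) -> None:
    """Write one JSON-encoded pattern per line (assumed file layout)."""
    with path.open("w", encoding="utf-8") as handle:
        for item in items:
            handle.write(json.dumps(item) + "\n")


def read_patterns(path: Path) -> list:
    """Read the patterns back, skipping blank lines."""
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


folder = Path(tempfile.mkdtemp())  # stand-in for "location-here"
target = folder / "sample.jsonl"
write_patterns(target, patterns)
loaded = read_patterns(target)
```

Keeping one pattern per line means a file can be appended to or diffed without parsing the whole document.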

## Search database for text fragments

Assuming `DATA_PATH` is declared in the `.env`:

```py
from pathlib import Path

from corpus_patterns import get_segments, load_from_query

load_from_query('<fts-5-query>', limit=5)  # returns first 5 results

# If ASSETS_DIR contains q.txt files, this returns an iterator of string
# matches based on the queries found in the location's q.txt files
get_segments(path=Path("location-here"))
```

## Custom loader for main database queries (for prodigy)
See the purpose of custom loaders in the [Prodigy docs](https://prodi.gy/docs/api-loaders):

```py
from corpus_patterns import fts

fts('"police power"', limit=10)  # note the FTS search expression
```
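
The quoting in `'"police power"'` matters because, in SQLite FTS5 syntax, a double-quoted string is a phrase query. The sketch below shows the difference on an in-memory database. It assumes the Python build's SQLite includes FTS5, and the `segments` table, its `body` column, and the `search` helper are hypothetical: the schema used by `fts` is not described here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical full-text table; requires SQLite compiled with FTS5.
conn.execute("CREATE VIRTUAL TABLE segments USING fts5(body)")
conn.executemany(
    "INSERT INTO segments (body) VALUES (?)",
    [
        ("The police power of the state is broad.",),
        ("The power of the police is limited.",),
        ("Due process protects property rights.",),
    ],
)


def search(db: sqlite3.Connection, query: str, limit: int = 10) -> list:
    """Run an FTS5 MATCH query and return at most `limit` matching rows."""
    sql = "SELECT body FROM segments WHERE segments MATCH ? LIMIT ?"
    return [row[0] for row in db.execute(sql, (query, limit))]


# Double quotes make a phrase query: the two words must be adjacent, in order.
phrase_hits = search(conn, '"police power"')

# Without quotes, the terms are combined with an implicit AND in any position.
term_hits = search(conn, "police power")
```

Here the phrase query matches only the first row, while the unquoted query also matches the second.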

## Utils

- `annotate_fragments()`: given an nlp object and some `*.txt` files, create a single annotation `*.jsonl` file
- `extract_lines_from_txt_files()`: accepts an iterator of `*.txt` files and yields each line (after sorting the lines and ensuring uniqueness of content)
- `split_data()`: given a list of text strings, split them into two groups and return a dictionary containing these groups based on the ratio provided (defaults to 0.80)
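
The line extraction and ratio split described above can be sketched with the standard library alone. This is not the library's implementation: `unique_sorted_lines` and `split_by_ratio` are hypothetical stand-ins, the `"train"` and `"eval"` keys are assumed names, and stripping whitespace, skipping blank lines, and splitting without shuffling are assumptions made for the example.

```python
import tempfile
from pathlib import Path
from typing import Iterable, Iterator


def unique_sorted_lines(files: Iterable[Path]) -> Iterator[str]:
    """Yield each distinct non-blank line across the files, in sorted order."""
    seen = set()
    for file in files:
        for line in file.read_text(encoding="utf-8").splitlines():
            text = line.strip()
            if text:  # assumption: blank lines are ignored
                seen.add(text)
    yield from sorted(seen)


def split_by_ratio(texts: list, ratio: float = 0.80) -> dict:
    """Split texts into two groups; the first receives `ratio` of the items."""
    if not 0 <= ratio <= 1:
        raise ValueError("ratio must be between 0 and 1")
    cutoff = int(len(texts) * ratio)
    # Assumed key names; the real dictionary keys are not documented here.
    return {"train": texts[:cutoff], "eval": texts[cutoff:]}


# Usage with two small temporary files that share one duplicate line.
folder = Path(tempfile.mkdtemp())
(folder / "a.txt").write_text("beta\nalpha\n", encoding="utf-8")
(folder / "b.txt").write_text("alpha\ngamma\ndelta\nepsilon\n", encoding="utf-8")

lines = list(unique_sorted_lines(sorted(folder.glob("*.txt"))))
groups = split_by_ratio(lines)
```

With five unique lines and the default ratio, the first group holds four items and the second holds one.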