Scikit-learn-like Named Entity Recognition modules
sequence-learn
Sklearn-like API for Sequence Learning tasks like Named Entity Recognition.
sequence-learn takes embedded token lists as input. You can produce these with, e.g., spaCy or NLTK for tokenization and scikit-learn or Hugging Face for the embedding step. Labels are token-level: for each text, you provide a plain list containing one label per token.
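The expected input shape can be sketched as follows. This is a minimal illustration with made-up two-dimensional vectors; the helper `check_alignment` is hypothetical and not part of sequence-learn. It only verifies that every text comes with exactly one label per token embedding.

```python
from typing import Sequence


def check_alignment(
    embeddings: Sequence[Sequence[Sequence[float]]],
    labels: Sequence[Sequence[str]],
) -> None:
    """Raise ValueError unless every text has exactly one label per token.

    Hypothetical helper for illustration; not part of sequence-learn.
    """
    if len(embeddings) != len(labels):
        raise ValueError(
            f"{len(embeddings)} embedded texts but {len(labels)} label lists"
        )
    for i, (text_vectors, text_labels) in enumerate(zip(embeddings, labels)):
        if len(text_vectors) != len(text_labels):
            raise ValueError(
                f"text {i}: {len(text_vectors)} tokens but {len(text_labels)} labels"
            )


# Two texts with 3 and 2 tokens; each token is a made-up 2-dimensional vector.
toy_embeddings = [
    [[0.1, 0.0], [0.3, 0.2], [0.9, 0.7]],
    [[0.2, 0.1], [0.8, 0.9]],
]
toy_labels = [
    ["OUTSIDE", "OUTSIDE", "CITY"],
    ["OUTSIDE", "YEAR"],
]

check_alignment(toy_embeddings, toy_labels)  # passes silently
```

Running a check like this before fitting can catch tokenizer mismatches early, for example when the labels were produced with a different tokenizer than the embeddings.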
How to install
You can install this library either by running pip install sequencelearn, or by cloning this repository and running pip install -r requirements.txt inside the cloned repository.
This works great together with the embedders library, which converts your documents into embeddings within only a few lines of code.
Caution: This library is currently tested with Python 3 versions up to Python 3.9. If you run into installation issues, please contact us.
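If you want to check your interpreter before installing, a small sketch like the following may help. It assumes that "up to Python 3.9" includes 3.9 itself; the function name `is_tested_python` and the constant `MAX_TESTED` are hypothetical and only serve this example.

```python
import sys
from typing import Tuple

# Assumption: 3.9 is the newest tested minor version (inclusive).
MAX_TESTED = (3, 9)


def is_tested_python(version: Tuple[int, int]) -> bool:
    """Return True if (major, minor) is Python 3 and not newer than MAX_TESTED."""
    return version[0] == 3 and version <= MAX_TESTED


if not is_tested_python(sys.version_info[:2]):
    print(
        f"Python {sys.version_info.major}.{sys.version_info.minor} "
        "is outside the tested range; installation may run into issues."
    )
```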
Example
from embedders.extraction.count_based import CharacterTokenEmbedder
from sequencelearn.point_tagger import TreeTagger
corpus = [
    "I went to Cologne in 2009",
    "My favorite number is 41",
]
labels = [
    ["OUTSIDE", "OUTSIDE", "OUTSIDE", "CITY", "OUTSIDE", "YEAR"],
    ["OUTSIDE", "OUTSIDE", "OUTSIDE", "OUTSIDE", "DIGIT"],
]
embedder = CharacterTokenEmbedder("en_core_web_sm")
embeddings = embedder.encode(corpus) # contains a list of ragged shape [num_texts, num_tokens (text-specific), embedding_dimension]
tagger = TreeTagger()
tagger.fit(embeddings, labels)
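sentence = "My birthyear is 1998"
# the tagger is fitted on embeddings, so embed the new sentence first
# (assumes embedder.encode can be reused on unseen text)
print(tagger.predict(embedder.encode([sentence])))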
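To make the data flow in this example more concrete, here is a self-contained sketch of token-wise ("point") tagging in plain Python. It is not the library's implementation: it assumes that a point tagger classifies each token independently of its neighbours, swaps the tree model for a 1-nearest-neighbour lookup, and replaces the embedder with a hypothetical `toy_embed` function that builds three hand-made features per whitespace token.

```python
from typing import List

Vector = List[float]


def toy_embed(text: str) -> List[Vector]:
    """Hypothetical stand-in for an embedder: one vector per whitespace token,
    holding (share of digits, share of uppercase letters, token length)."""
    vectors = []
    for token in text.split():
        digits = sum(ch.isdigit() for ch in token)
        uppers = sum(ch.isupper() for ch in token)
        vectors.append([digits / len(token), uppers / len(token), float(len(token))])
    return vectors


class NearestNeighbourPointTagger:
    """Toy tagger: labels each token by its closest training token."""

    def fit(self, embeddings: List[List[Vector]], labels: List[List[str]]):
        # Flatten the ragged [texts][tokens][dims] structure into one token list.
        self.points_ = [vec for text in embeddings for vec in text]
        self.labels_ = [lab for text in labels for lab in text]
        if len(self.points_) != len(self.labels_):
            raise ValueError("need exactly one label per token")
        return self

    def _predict_one(self, vec: Vector) -> str:
        # Squared Euclidean distance to every training token; first minimum wins.
        distances = [
            sum((a - b) ** 2 for a, b in zip(vec, point)) for point in self.points_
        ]
        return self.labels_[distances.index(min(distances))]

    def predict(self, embeddings: List[List[Vector]]) -> List[List[str]]:
        # Predict token by token, keeping the per-text grouping of the input.
        return [[self._predict_one(vec) for vec in text] for text in embeddings]


corpus = [
    "I went to Cologne in 2009",
    "My favorite number is 41",
]
labels = [
    ["OUTSIDE", "OUTSIDE", "OUTSIDE", "CITY", "OUTSIDE", "YEAR"],
    ["OUTSIDE", "OUTSIDE", "OUTSIDE", "OUTSIDE", "DIGIT"],
]

tagger = NearestNeighbourPointTagger().fit([toy_embed(t) for t in corpus], labels)
predictions = tagger.predict([toy_embed("My birthyear is 1998")])
print(predictions)  # [['OUTSIDE', 'OUTSIDE', 'OUTSIDE', 'YEAR']]
```

The key point of the sketch is the flatten-then-regroup pattern: training ignores sentence boundaries, while prediction returns one label list per input text, so the output lines up with the tokens of each sentence.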
How to contribute
Currently, the best way to contribute is by opening issues for the kinds of transformations you would like to see and by starring this repository :-)
Download files
Built Distribution
Hashes for sequencelearn-0.0.3-py2.py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | c4813e6b826eb0431d629e92b4897748025547a1f354d1024294fd857fa53a9d
MD5 | aec08c63fa70cb0067193c75fd570b43
BLAKE2b-256 | 6bd79cf49cb489bfdec8dcd8a2f16e4a645fe49d10b5a83cc86bfce88b12fdfd