Simple, extensible NLP engine that extracts data based on provided conditions.
prosecco
Description
An NLP engine with text-extraction capabilities that can be easily extended to suit your needs.
It can be used to build chatbots, question-answering machines (see example/qa.py), and text converters.
It can extract single words or whole sentences, in order.
It reports the position of each piece of found text.
It has a built-in Condition class that can mark data using a regex or string comparison.
Each part of it can be quickly and easily replaced or extended.
See example/custom_condition_class.py to build your own conditions by adding 3 properties and overloading the __contains__ method.
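The idea of a condition that answers membership checks through __contains__ can be sketched without the library. The class below is a hypothetical stand-in, not prosecco's Condition: the attribute names lemma_type and lower are borrowed from the usage examples, pattern is an invented name, and the three properties the library actually requires are the ones shown in example/custom_condition_class.py.

```python
import re


class RegexCondition:
    """Hypothetical condition-like class; it does not inherit from prosecco."""

    def __init__(self, lemma_type, pattern, lower=True):
        self.lemma_type = lemma_type         # label attached to matching words
        self.pattern = re.compile(pattern)   # assumed name, for illustration only
        self.lower = lower                   # lowercase the word before matching

    def __contains__(self, item):
        # Called by the `in` operator: `word in condition`.
        word = item.lower() if self.lower else item
        return self.pattern.fullmatch(word) is not None


year = RegexCondition("year", r"(19|20)\d{2}")
print("1999" in year)
print("abc" in year)
```

Because the check lives in __contains__, the calling code only needs `word in condition` and does not care whether the condition compares strings, applies a regex, or does something else.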
Install
pip install prosecco
Usage
Basic
from prosecco import Prosecco, Condition, EnglishWordNormalizer
# Read the Wikipedia article https://en.wikipedia.org/wiki/Superhero (saved locally as superhero.txt)
with open("superhero.txt") as f:
    text = f.read()
# 1. Create conditions with hero names
conditions = [
    Condition(lemma_type="hero|dc", compare=["batman", "superman", "wonder woman"], lower=True),
    Condition(lemma_type="hero|marvel", normalizer=EnglishWordNormalizer(),
              compare=["spiderman", "iron man", "black panther"], lower=True)
]
# 2. Create prosecco
p = Prosecco(conditions=conditions)
# 3. Let's drink and print output
p.drink(text, progress=True)
lemmas = set(p.get_lemmas(type="hero"))
print(" ".join(map(str, lemmas)))
Output
Batman[hero|dc][start:1090] Wonder Woman[hero|dc][start:2101] Captain Marvel[hero|marvel][start:3703] Superman[hero|dc][start:2071] Spider-Man[hero|marvel][start:2081] Black Panther[hero|marvel][start:17691]
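The Basic example collects lemmas into a set, so the printed order is arbitrary. If you only have the printed text, it can be parsed back into records and sorted by position. This is a minimal sketch that assumes the Text[type][start:N] format shown above; ParsedLemma and parse_lemmas are hypothetical helpers and not part of prosecco.

```python
import re
from typing import NamedTuple


class ParsedLemma(NamedTuple):
    text: str
    lemma_type: str
    start: int


# Text, then [type], then [start:N]; the text part may contain spaces.
LEMMA_RE = re.compile(r"(.+?)\[([^\]]+)\]\[start:(\d+)\]")


def parse_lemmas(line):
    """Parse one printed output line into a list of ParsedLemma records."""
    return [
        ParsedLemma(m.group(1).strip(), m.group(2), int(m.group(3)))
        for m in LEMMA_RE.finditer(line)
    ]


output = ("Batman[hero|dc][start:1090] Wonder Woman[hero|dc][start:2101] "
          "Spider-Man[hero|marvel][start:2081]")
ordered = sorted(parse_lemmas(output), key=lambda lemma: lemma.start)
for lemma in ordered:
    print(lemma.start, lemma.lemma_type, lemma.text)
```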
Advanced
from prosecco import *
text = """Chrząszcz brzmi w trzcinie w Szczebrzeszynie.
Ząb zupa zębowa, dąb zupa dębowa.
Gdzie Rzym, gdzie Krym. W Pacanowie kozy kują.
Tak, jeśli mam szczęśliwy być, to w Gdańsku muszę żyć!
"""
# 1. Define the city and animal names to match
cities = ["szczebrzeszyn", "pacanow", "gdansk", "rzym", "krym"]
animals = ["koz", "chrzaszcz"]
# 2. Normalizer to replace Polish-specific characters
n = CharsetNormalizer(Charset.PL_EN)
# 3. Stemmer to strip suffixes
s = SuffixStemmer(language="pl")
# 4. Conditions for city and animal
city_condition = Condition(lemma_type="city", compare=cities, normalizer=n, stemmer=s, lower=True)
animal_condition = Condition(lemma_type="animal", compare=animals, normalizer=n, stemmer=s, lower=True)
conditions = [city_condition, animal_condition]
# 5. Create tokenizer for the Polish charset
tokenizer = LanguageTokenizer(Charset.PL)
# 6. Get list of tokens
tokens = tokenizer.tokenize(text)
# 7. Create visitor with the conditions built in step 4
visitor = Visitor(conditions=conditions)
# 8. Parse tokens based on visitor conditions
lexer = Lexer(tokens=tokens, visitor=visitor)
# 9. Get list of lemmas
lemmas = lexer.lex()
# 10. Filter found cities and print output
found = filter(lambda l: l.type == "city", lemmas)
print(" ".join(map(str, found)))
# 11. Filter found animals and print output
found = filter(lambda l: l.type == "animal", lemmas)
print(" ".join(map(str, found)))
Output
Szczebrzeszynie[city][start:29] Rzym[city][start:86] Krym[city][start:98] Pacanowie[city][start:106] Gdańsku[city][start:163]
Chrząszcz[animal][start:0] kozy[animal][start:116]
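To see why inflected forms such as "Gdańsku" match the plain entry "gdansk", the pipeline can be approximated in plain Python: tokenize, normalize the charset, strip a suffix, compare, and report the character offset. This is only a sketch under stated assumptions. The translation table and the three-suffix list are illustrative stand-ins chosen for this text and are not the rules used by CharsetNormalizer or SuffixStemmer.

```python
import re

TEXT = """Chrząszcz brzmi w trzcinie w Szczebrzeszynie.
Ząb zupa zębowa, dąb zupa dębowa.
Gdzie Rzym, gdzie Krym. W Pacanowie kozy kują.
Tak, jeśli mam szczęśliwy być, to w Gdańsku muszę żyć!
"""

CITIES = {"szczebrzeszyn", "pacanow", "gdansk", "rzym", "krym"}
ANIMALS = {"koz", "chrzaszcz"}

# Assumed stand-ins for the library's normalizer and stemmer.
PL_TO_EN = str.maketrans("ąćęłńóśźż", "acelnoszz")
SUFFIXES = ("ie", "u", "y")  # illustrative only


def normalize(word):
    """Lowercase the word and replace Polish-specific letters."""
    return word.lower().translate(PL_TO_EN)


def stem(word):
    """Strip one known suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def find_lemmas(text):
    """Return (original word, type, start offset) for every match."""
    found = []
    for match in re.finditer(r"\w+", text):
        key = stem(normalize(match.group()))
        if key in CITIES:
            found.append((match.group(), "city", match.start()))
        elif key in ANIMALS:
            found.append((match.group(), "animal", match.start()))
    return found


lemmas = find_lemmas(TEXT)
for kind in ("city", "animal"):
    print(" ".join(f"{w}[{t}][start:{s}]" for w, t, s in lemmas if t == kind))
```

Each start value is a character offset into the original text, so text[start:start + len(word)] returns the word exactly as it was written, diacritics included.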