Simple, extensible NLP engine that extracts data based on provided conditions.

prosecco

Description

NLP engine with text extraction capabilities that can be easily extended to fit your needs.

Can be used to build chatbots, question-answering systems (see example/qa.py), and text converters.

Can extract words or even whole sentences in an ordered manner.
Provides the position of found text.
Has a built-in Condition class that can mark data using regex or string comparison.
Each part can be quickly and easily replaced or extended.
See example/custom_condition_class.py to build your own conditions by adding 3 properties and overriding the __contains__ method.
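
The custom-condition idea can be sketched in plain Python. The class below does not use prosecco, and the three attribute names (lemma_type, compare, lower) are borrowed from the Condition arguments shown under Usage; which three properties a real custom class needs is an assumption here, so treat example/custom_condition_class.py as the reference.

```python
import re


class RegexCondition:
    """Hypothetical stand-alone condition: matches a word against regex patterns."""

    def __init__(self, lemma_type, compare, lower=False):
        # Assumed properties, modelled on the Condition arguments in this README.
        self.lemma_type = lemma_type
        self.lower = lower
        flags = re.IGNORECASE if lower else 0
        self.compare = [re.compile(pattern, flags) for pattern in compare]

    def __contains__(self, word):
        # Implementing __contains__ enables the `word in condition` syntax.
        return any(pattern.fullmatch(word) for pattern in self.compare)


year = RegexCondition(lemma_type="year", compare=[r"(19|20)\d{2}"])
print("1999" in year)  # True
print("99" in year)    # False
```

The membership test keeps the calling code short: anything that decides whether a word belongs to a category can sit behind `in`, whether it compares strings, runs a regex, or queries a dictionary.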

Install

pip install prosecco

Usage

Basic

example/basic.py

from prosecco import Prosecco, Condition, EnglishWordNormalizer

# Read the text of the Wikipedia article https://en.wikipedia.org/wiki/Superhero
with open("superhero.txt") as f:
    text = f.read()

# 1. Create conditions with hero names
conditions = [
    Condition(lemma_type="hero|dc", compare=["batman", "superman", "wonder woman"], lower=True),
    Condition(lemma_type="hero|marvel", normalizer=EnglishWordNormalizer(),
              compare=["spiderman", "iron man", "black panther"], lower=True)
]
# 2. Create prosecco
p = Prosecco(conditions=conditions)
# 3. Let's drink and print output
p.drink(text, progress=True)
lemmas = set(p.get_lemmas(type="hero"))
print(" ".join(map(str, lemmas)))

Output

Batman[hero|dc][start:1090] Wonder Woman[hero|dc][start:2101] Captain Marvel[hero|marvel][start:3703] Superman[hero|dc][start:2071] Spider-Man[hero|marvel][start:2081] Black Panther[hero|marvel][start:17691]
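
In the output above, get_lemmas(type="hero") returns lemmas typed both hero|dc and hero|marvel, which suggests that "|" separates levels of a type hierarchy. A minimal sketch of such prefix matching, assuming that interpretation is right (the function type_matches is hypothetical and not part of prosecco):

```python
def type_matches(lemma_type, wanted):
    """Return True if `wanted` is a leading part of the pipe-separated `lemma_type`."""
    have = lemma_type.split("|")
    want = wanted.split("|")
    # Compare whole segments so that "hero" does not match "heroine".
    return have[:len(want)] == want


print(type_matches("hero|dc", "hero"))         # True
print(type_matches("hero|dc", "hero|dc"))      # True
print(type_matches("hero|marvel", "hero|dc"))  # False
print(type_matches("heroine", "hero"))         # False
```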

Advanced

example/advanced.py

from prosecco import *

text = """Chrząszcz brzmi w trzcinie w Szczebrzeszynie.
Ząb zupa zębowa, dąb zupa dębowa.
Gdzie Rzym, gdzie Krym. W Pacanowie kozy kują.
Tak, jeśli mam szczęśliwy być, to w Gdańsku muszę żyć! 
"""

# 1. Lists of city and animal names to compare against
cities = ["szczebrzeszyn", "pacanow", "gdansk", "rzym", "krym"]
animals = ["koz", "chrzaszcz"]
# 2. Normalizer that replaces Polish-specific characters
n = CharsetNormalizer(Charset.PL_EN)
# 3. Stemmer to remove suffixes
s = SuffixStemmer(language="pl")
# 4. Conditions for city and animal
city_condition = Condition(lemma_type="city", compare=cities, normalizer=n, stemmer=s, lower=True)
animal_condition = Condition(lemma_type="animal", compare=animals, normalizer=n, stemmer=s, lower=True)
conditions = [city_condition, animal_condition]
# 5. Create tokenizer for the Polish charset
tokenizer = LanguageTokenizer(Charset.PL)
# 6. Get list of tokens
tokens = tokenizer.tokenize(text)
# 7. Create visitor with the conditions from step 4
visitor = Visitor(conditions=conditions)
# 8. Parse tokens based on visitor conditions
lexer = Lexer(tokens=tokens, visitor=visitor)
# 9. Get list of lemmas
lemmas = lexer.lex()
# 10. Filter found cities and print output
found = filter(lambda l: l.type == "city", lemmas)
print(" ".join(map(str, found)))
# 11. Filter found animals and print output
found = filter(lambda l: l.type == "animal", lemmas)
print(" ".join(map(str, found)))

Output

Szczebrzeszynie[city][start:29] Rzym[city][start:86] Krym[city][start:98] Pacanowie[city][start:106] Gdańsku[city][start:163]
Chrząszcz[animal][start:0] kozy[animal][start:116]
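
To see why the inflected form "Szczebrzeszynie" matches the plain entry "szczebrzeszyn", here is a rough plain-Python sketch of the normalize, stem, compare sequence with start positions. It does not use prosecco: the character table, the suffix list and the helper names (normalize, stem, find) are illustrative assumptions, and the library's Charset.PL_EN mapping and SuffixStemmer rules may well differ.

```python
import re

# Assumption: a hand-written table for Polish diacritics.
PL_TO_ASCII = str.maketrans("ąćęłńóśźż", "acelnoszz")
# Assumption: a tiny illustrative suffix list, not the library's.
SUFFIXES = ("ie", "u", "y")


def normalize(word):
    """Lowercase the word and replace Polish-specific characters."""
    return word.lower().translate(PL_TO_ASCII)


def stem(word):
    """Strip the first matching suffix, leaving at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def find(text, names, lemma_type):
    """Return (word, type, start) for every word whose stem is in `names`."""
    hits = []
    for match in re.finditer(r"\w+", text):
        if stem(normalize(match.group())) in names:
            hits.append((match.group(), lemma_type, match.start()))
    return hits


text = "Chrząszcz brzmi w trzcinie w Szczebrzeszynie."
print(find(text, {"szczebrzeszyn"}, "city"))
# [('Szczebrzeszynie', 'city', 29)]
print(find(text, {"chrzaszcz"}, "animal"))
# [('Chrząszcz', 'animal', 0)]
```

The original spelling and offset are reported, while the comparison runs on the normalized stem; that is the same split between what is matched and what is shown that the output above reflects.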

