
Simple, extensible NLP engine that can extract data based on provided conditions.

Project description

prosecco


Description

NLP engine with text extraction capabilities that can be easily extended to suit your needs.

It can be used to build chat bots, question-answering machines (see example/qa.py), and text converters.

Extract words or even whole sentences in an ordered manner.
Get the position of the found text.
Use the Condition class to mark data using regex or string comparison (a short sketch follows this list).
Extend each part of it easily (see example/custom_condition_class.py).
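
A minimal sketch of that workflow, reusing only the constructor arguments and lemma attributes that appear in the examples below (the exact printed format may differ):

from prosecco import Prosecco, Condition

# The same regex condition as in example/qa.py: mark numbers such as "100" or "30,3"
number = Condition(lemma_type="number", compare=r"\d+([\.\,]\d+)?", regex=True, until_character=" ")
p = Prosecco(conditions=[number])
p.drink("100 miles to km")
for lemma in p.get_lemmas("number"):
    # each lemma keeps the matched text and its start position in the input
    print(lemma.sentence, lemma.start)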

Install

pip install prosecco

Usage

Basic

example/basic.py

from prosecco import Prosecco, Condition, EnglishWordNormalizer

# Read the Wikipedia article https://en.wikipedia.org/wiki/Superhero
with open("superhero.txt") as f:
    text = f.read()

# 1. Create conditions with hero names
conditions = [
    Condition(lemma_type="hero|dc", compare=["batman", "superman", "wonder woman"], lower=True),
    Condition(lemma_type="hero|marvel", normalizer=EnglishWordNormalizer(),
              compare=["spiderman", "iron man", "black panther"], lower=True)
]
# 2. Create prosecco
p = Prosecco(conditions=conditions)
# 3. Let's drink and print the output
p.drink(text, progress=True)
lemmas = set(p.get_lemmas(type="hero"))
print(" ".join(map(str, lemmas)))

Output

Batman[hero|dc][start:1089] Wonder Woman[hero|dc][start:2100] Iron Man[hero|marvel][start:2184] Superman[hero|dc][start:2070] Spider-Man[hero|marvel][start:2080] Black Panther[hero|marvel][start:17690]
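
A set has no defined order; to print the heroes in the order they appear in the article, sort on the start attribute shown in the output above:

for lemma in sorted(lemmas, key=lambda l: l.start):
    print(lemma)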

Advanced

example/advanced.py

from prosecco import (Charset, CharsetNormalizer, Condition, LanguageTokenizer,
                      Lexer, SuffixStemmer, Visitor)

# Sample Polish text (tongue-twisters) mentioning city and animal names
text = """Chrząszcz brzmi w trzcinie w Szczebrzeszynie.
Ząb zupa zębowa, dąb zupa dębowa.
Gdzie Rzym, gdzie Krym. W Pacanowie kozy kują.
Tak, jeśli mam szczęśliwy być, to w Gdańsku muszę żyć! 
"""

# 1. City and animal names to look for (in their ASCII, suffix-free forms)
cities = ["szczebrzeszyn", "pacanow", "gdansk", "rzym", "krym"]
animals = ["koz", "chrzaszcz"]
# 2. Normalizer to strip Polish-specific characters
n = CharsetNormalizer(Charset.PL_EN)
# 3. Stemmer to remove suffixes
s = SuffixStemmer(language="pl")
# 4. Conditions for cities and animals
city_condition = Condition(lemma_type="city", compare=cities, normalizer=n, stemmer=s, lower=True)
animal_condition = Condition(lemma_type="animal", compare=animals, normalizer=n, stemmer=s, lower=True)
conditions = [city_condition, animal_condition]
# 5. Create a tokenizer for the Polish charset
tokenizer = LanguageTokenizer(Charset.PL)
# 6. Get list of tokens
tokens = tokenizer.tokenize(text)
# 7. Create a visitor with the conditions from step 4
visitor = Visitor(conditions=conditions)
# 8. Parse tokens based on visitor conditions
lexer = Lexer(tokens=tokens, visitor=visitor)
# 9. Get list of lemmas
lemmas = lexer.lex()
# 10. Filter found cities and print them
found = filter(lambda l: l.type == "city", lemmas)
print(" ".join(map(str, found)))
# 11. Filter found animals and print them
found = list(filter(lambda l: l.type == "animal", lemmas))
print(" ".join(map(str, found)))
# 12. Print the exact words from the text
for l in found:
    print(text[l.start:l.start+len(l.sentence)])

Output

Szczebrzeszynie[city][start:29] Rzym[city][start:86] Krym[city][start:98] Pacanowie[city][start:106] Gdańsku[city][start:163]
Chrząszcz[animal][start:0] kozy[animal][start:116]
Chrząszcz
kozy
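
As a follow-up, the two filter calls in steps 10 and 11 can be replaced by a single pass that groups lemmas on their type attribute; this is plain Python over the list returned by lexer.lex(), not a prosecco API:

from collections import defaultdict

by_type = defaultdict(list)
for lemma in lemmas:
    by_type[lemma.type].append(lemma)
print(" ".join(map(str, by_type["city"])))
print(" ".join(map(str, by_type["animal"])))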

QA (question-answering machine)

example/qa.py

from datetime import datetime
from prosecco import Prosecco, Condition, EnglishWordNormalizer, SuffixStemmer


messages = """Whats the time ?
How long boil egg?
100 miles to km
30,3 celsius to farenheit"""

# create conditions
question = ("what", "whats", "how")
measure = ("celsius", "farenheit", "mile", "km", "kilometer", "time", "long")
cooking = ("boil","cook", "fry")
food = ("egg",)
conditions = [
    Condition(lemma_type="question", compare=question, normalizer=EnglishWordNormalizer(), lower=True),
    Condition(lemma_type="measure", compare=measure,
              normalizer=EnglishWordNormalizer(),
              stemmer=SuffixStemmer(language="en"),
              lower=True),
    Condition(lemma_type="cooking", compare=cooking, normalizer=EnglishWordNormalizer(), lower=True),
    Condition(lemma_type="food", compare=food,
              normalizer=EnglishWordNormalizer(),
              stemmer=SuffixStemmer(language="en"),
              lower=True),
    Condition(lemma_type="number", compare=r"\d+([\.\,]\d+)?", regex=True, until_character=" "),
]

def printer(data):
    print("Robot : ", data)

# time resolver
def resolve_time(p):
    printer(datetime.now())

# cooking resolver
def resolve_cooking(p):
    if check_condition(p.get_lemmas("cooking|food"), ["boil", "egg"]):
        printer("""
Hard for 9-15 minutes.
Soft for 6-8 minutes.""")
        return True

def resolve_measure(p):
    measures = p.get_lemmas("measure")
    fr = measures[0]
    to = measures[1]
    numbers = p.get_lemmas("number")
    if len(numbers) == 0:
        printer("No number for conversion provided")
        return True
    value = float(numbers[0].sentence.replace(",", "."))
    if fr.condition == "mile" and to.condition == "km":
        printer(value / 0.62137119)
        return True
    elif fr.condition == "km" and to.condition == "mile":
        printer(value * 0.62137119)
        return True
    elif fr.condition == "celsius" and to.condition == "farenheit":
        printer(9/5 * value + 32)
        return True
    elif fr.condition == "farenheit" and to.condition == "celsius":
        printer((value - 32) * 5/9)
        return True
    return False

def check_condition(lemmas, conditions):
    # Work on a copy so the caller's list is not mutated while iterating
    remaining = list(conditions)
    for l in lemmas:
        if l.condition in remaining:
            remaining.remove(l.condition)
    return len(remaining) == 0

def resolve(p, m):
    if len(p.get_lemmas("question")) > 0:
        if check_condition(p.get_lemmas("measure"), ["time"]):
            resolve_time(p)
            return True
        elif len(p.get_lemmas("cooking")) > 0:
            return resolve_cooking(p)
    elif len(p.get_lemmas("measure")) > 0:
        return resolve_measure(p)
    return False

for m in messages.split('\n'):
    print("Question : ", m)
    p = Prosecco(conditions=conditions)
    p.drink(m)
    if not resolve(p, m):
        print("Unsupported resolver : ", p.lemmas)

Output

Question :  Whats the time ?
Robot :  2019-08-13 20:38:06.948720
Question :  How long boil egg?
Robot :  
Hard for 9-15 minutes.
Soft for 6-8 minutes.
Question :  100 miles to km
Robot :  160.93440057946685
Question :  30,3 celsius to farenheit
Robot :  86.53999999999999
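
A possible refactoring of resolve_measure: the elif chain grows with every new unit pair, so a dictionary keyed by the (from, to) condition names keeps the conversions in one place. This sketch relies only on the lemma attributes used above and the printer helper from example/qa.py; it is not part of the prosecco API:

CONVERSIONS = {
    ("mile", "km"): lambda v: v / 0.62137119,
    ("km", "mile"): lambda v: v * 0.62137119,
    ("celsius", "farenheit"): lambda v: 9/5 * v + 32,   # spelling matches the condition strings above
    ("farenheit", "celsius"): lambda v: (v - 32) * 5/9,
}

def resolve_measure(p):
    measures = p.get_lemmas("measure")
    numbers = p.get_lemmas("number")
    if len(measures) < 2:
        return False
    if len(numbers) == 0:
        printer("No number for conversion provided")
        return True
    value = float(numbers[0].sentence.replace(",", "."))
    convert = CONVERSIONS.get((measures[0].condition, measures[1].condition))
    if convert is None:
        return False
    printer(convert(value))
    return True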

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

prosecco-0.0.7.tar.gz (31.2 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

prosecco-0.0.7-py2.py3-none-any.whl (32.6 kB)

Uploaded Python 2, Python 3

File details

Details for the file prosecco-0.0.7.tar.gz.

File metadata

  • Download URL: prosecco-0.0.7.tar.gz
  • Upload date:
  • Size: 31.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.33.0 CPython/3.7.3

File hashes

Hashes for prosecco-0.0.7.tar.gz
Algorithm Hash digest
SHA256 38ea9c640dbe8123b6fbe1e67b1c98a7bc5296d37d7de83de202ff9c6b4ca7d0
MD5 b3cbf57ab0e7b98ff1247cf3e00ce739
BLAKE2b-256 be4f88592b3fc7703277e5e8e37c6918006d4c11c6c40253c643952274db0e7a

See more details on using hashes here.

File details

Details for the file prosecco-0.0.7-py2.py3-none-any.whl.

File metadata

  • Download URL: prosecco-0.0.7-py2.py3-none-any.whl
  • Upload date:
  • Size: 32.6 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.33.0 CPython/3.7.3

File hashes

Hashes for prosecco-0.0.7-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 6c387a7f2128edb77817e49de45463b7558e5ca2a34864b75f3ce2e80a91a4f2
MD5 d24ddab1704ea455328f4486d2cd09a6
BLAKE2b-256 8c15088adaf5316639c3f74325dcbc6ef9371efd07d5b31a79588f7a67351595

See more details on using hashes here.
