Slim, flexible and extendable NLP engine that can produce a list of features from text based on provided conditions.

Project description

prosecco

Description

Slim, flexible and extendable NLP engine that can produce a list of features from text based on provided conditions.

Features

word categorisation
feature extraction

Install

pip install prosecco

Usage

python example.py
python example_basic.py

Examples

Basic

from prosecco import Prosecco, Condition

# Read the sample text, taken from https://en.wikipedia.org/wiki/Superhero
with open('sample/superhero.txt') as f:
    text = f.read()

# 1. Create conditions based on superhero names
superheroes = ["batman", "spiderman", "superman", "captain marvel", "black panther"]
conditions = [Condition(lemma_type="hero", compare=hero, lower=True) for hero in superheroes]
# 2. Create prosecco
p = Prosecco(conditions=conditions)
# 3. Let's drink and print output
p.drink(text, progress=True)
# set() drops duplicate matches; the order of the printed lemmas is not guaranteed
lemmas = set(p.get_lemmas(type='hero'))
print(" ".join(map(str, lemmas)))

Output

Batman[hero] Black Panther[hero] Superman[hero] Captain Marvel[hero]
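
To make the shape of that output easier to read, here is a minimal standard-library sketch of the same idea: find phrases case-insensitively, keep the casing found in the text, and render each hit as `Word[type]`. It does not use prosecco and is not how the library is implemented; `tag_phrases` and the sample sentence are hypothetical names and data introduced only for illustration.

```python
import re


def tag_phrases(text, phrases, label):
    """Return 'Match[label]' strings, one per distinct phrase, in order of first appearance."""
    # Try longer phrases first so multi-word names are not shadowed by shorter ones.
    ordered = sorted(phrases, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(p) for p in ordered) + r")\b",
        re.IGNORECASE,
    )
    seen = {}
    for match in pattern.finditer(text):
        # Key on the lowercased hit to deduplicate, keep the original casing as value.
        seen.setdefault(match.group(0).lower(), match.group(0))
    return [f"{original}[{label}]" for original in seen.values()]


# Hypothetical sample sentence, not the Wikipedia text used above.
sample = "Batman and Captain Marvel met Superman; later Batman left."
tagged = tag_phrases(sample, ["batman", "superman", "captain marvel", "spiderman"], "hero")
print(" ".join(tagged))  # Batman[hero] Captain Marvel[hero] Superman[hero]
```

In this sketch a phrase that never occurs in the text ("spiderman" here) simply produces no entry, and repeated hits are collapsed into one.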

Advanced

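# Explicit imports of the names used below (equivalent to the wildcard import)
from prosecco import (Charset, CharsetNormalizer, Condition, LanguageTokenizer,
                      Lexer, Visitor, WordStemmer)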

text = """Chrząszcz brzmi w trzcinie w Szczebrzeszynie.
Ząb zupa zębowa, dąb zupa dębowa.
Gdzie Rzym, gdzie Krym. W Pacanowie kozy kują.
Tak, jeśli mam szczęśliwy być, to w Gdańsku muszę żyć! 
"""

# 1. Create conditions based on city names
cities = ["szczebrzeszyn", "pacanow", "gdansk", "rzym", "krym"]
conditions = []
for city in cities:
    conditions.append(Condition(lemma_type="city",
                                compare=city,
                                normalizer=CharsetNormalizer(Charset.PL_EN),
                                stemmer=WordStemmer(language="pl"),
                                lower=True))
# 2. Create tokenizer for the Polish charset
tokenizer = LanguageTokenizer(Charset.PL)
# 3. Get list of tokens
tokens = tokenizer.tokenize(text)
# 4. Create visitor with conditions provided in step 1
visitor = Visitor(conditions=conditions)
# 5. Parse tokens based on visitor conditions
lexer = Lexer(tokens=tokens, visitor=visitor)
# 6. Get list of lemmas
lemmas = lexer.lex()
# 7. Filter found cities
found_cities = filter(lambda l: l.type == "city", lemmas)
# 8. Print output
print(" ".join(map(str, found_cities)))

Output

Szczebrzeszynie[city] Rzym[city] Krym[city] Pacanowie[city] Gdańsku[city]
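
The output shows inflected, accented forms such as "Gdańsku" being matched against plain ASCII conditions such as "gdansk". The sketch below illustrates one way such a match can work, using only the standard library: strip Polish diacritics, lowercase, then compare by prefix. This is an assumption-laden illustration, not prosecco's `CharsetNormalizer` or `WordStemmer`; the names `PL_TO_EN`, `normalize` and `matches_city` are hypothetical, and the prefix check is a crude stand-in for real stemming.

```python
# Hypothetical mapping from Polish diacritics to their closest ASCII letters.
PL_TO_EN = str.maketrans("ąćęłńóśźżĄĆĘŁŃÓŚŹŻ", "acelnoszzACELNOSZZ")


def normalize(word):
    """Replace Polish diacritics with ASCII letters and lowercase the result."""
    return word.translate(PL_TO_EN).lower()


def matches_city(word, city):
    """Crude stand-in for stemming: the normalised word must start with the city name."""
    return normalize(word).startswith(city)


cities = ["szczebrzeszyn", "pacanow", "gdansk", "rzym", "krym"]
words = ["Szczebrzeszynie", "Pacanowie", "Gdańsku", "Rzym", "kozy"]

found = [w for w in words if any(matches_city(w, c) for c in cities)]
print(found)  # ['Szczebrzeszynie', 'Pacanowie', 'Gdańsku', 'Rzym']
```

A prefix check like this would also accept unrelated words that happen to begin with a city name, which is one reason a real stemmer is preferable.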

Release history

0.0.7: Aug 14, 2019
0.0.6: Aug 14, 2019
0.0.5: Aug 12, 2019
0.0.4: Aug 11, 2019
0.0.3: Aug 11, 2019 (this version)
0.0.2: Aug 10, 2019

Download files

Source Distribution

prosecco-0.0.3.tar.gz (2.3 kB), uploaded Aug 11, 2019

Built Distribution

prosecco-0.0.3-py3-none-any.whl (3.0 kB), uploaded Aug 11, 2019, Python 3

Hashes for prosecco-0.0.3.tar.gz

Algorithm	Hash digest
SHA256	`751e639d6a195f8fb8f4bb3ad7545e16ff6d93ba33c72b5f7da9001de3ccd280`
MD5	`e5e191f86a42716d85b35e2bc1790728`
BLAKE2b-256	`42a59e56ac5aea984904f7aad98f31e30320456899f1fb5c16e466094ba846bb`

Hashes for prosecco-0.0.3-py3-none-any.whl

Algorithm	Hash digest
SHA256	`2704d8fd0615df17bbe942d163a205dabc57a12df4c61e2e973c0daf748a8919`
MD5	`23b205b67babd79ceb3c89fcba3f974c`
BLAKE2b-256	`f9fa24b3280a1fa752513e8e529ab2a9c3eb6f4fb7f556973fef7852ef7117f2`
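
The digests above can be used to check a downloaded file. A minimal sketch with Python's standard `hashlib` follows; it assumes the source archive was saved in the current directory under its published name, and `sha256_of_file` is a hypothetical helper name, not part of prosecco or pip.

```python
import hashlib
from pathlib import Path

# SHA256 digest listed above for prosecco-0.0.3.tar.gz
EXPECTED_SHA256 = "751e639d6a195f8fb8f4bb3ad7545e16ff6d93ba33c72b5f7da9001de3ccd280"


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


archive = Path("prosecco-0.0.3.tar.gz")  # assumed download location
if archive.exists():
    matches = sha256_of_file(archive) == EXPECTED_SHA256
    print("checksum OK" if matches else "checksum MISMATCH")
else:
    print(f"{archive} not found; download it first")
```

The same function works for the wheel by swapping in its file name and the SHA256 digest listed for it.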