A lemmatizing node for Deepset Haystack
Lemmatization
Lemmatization is a text pre-processing technique used in natural language processing (NLP) models to reduce a word to its root form so that related word forms can be recognized as the same word. For example, a lemmatization algorithm would reduce the word "better" to its root word, or lemma, "good".
This node can be placed within a pipeline to lemmatize documents returned by a Retriever before they are added as context to a prompt (for a PromptNode or similar). Lemmatizing the document content can reduce the number of tokens used by up to 30%, without drastically affecting the meaning of the document.
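As a rough illustration of the idea, here is a minimal sketch of lookup-based lemmatization. The `TOY_LEMMAS` table, `toy_lemmatize`, and `char_savings` are hypothetical names made up for this example and are not part of this package; the node itself points to simplemma for its language support, which covers a far larger vocabulary. Character count is used only as a crude stand-in for token count.

```python
import re

# Hand-written toy table for illustration only (an assumption, not real data
# from this package). A real lemmatizer ships a much larger vocabulary.
TOY_LEMMAS = {
    "better": "good",
    "statues": "statue",
    "were": "be",
    "was": "be",
    "standing": "stand",
    "built": "build",
}


def toy_lemmatize(text: str) -> str:
    """Replace each alphabetic word with its lemma if the toy table knows it.

    Unknown words are left unchanged. Capitalization of replaced words is
    not preserved, which is acceptable for this sketch.
    """
    return re.sub(
        r"[A-Za-z]+",
        lambda m: TOY_LEMMAS.get(m.group(0).lower(), m.group(0)),
        text,
    )


def char_savings(before: str, after: str) -> float:
    """Fraction of characters saved; a rough proxy, not a real token count."""
    return 1 - len(after) / len(before)


original = "The statues were standing where they had been built."
lemmatized = toy_lemmatize(original)
print(lemmatized)
print(f"{char_savings(original, lemmatized):.0%} fewer characters")
```

The lemmatized text is less grammatical but keeps the content words, which is why it can still serve as context for a prompt. Actual savings depend on the language, the document, and the tokenizer of the model being used.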
[Images: example document content before and after lemmatization]
Installation
Clone the repo to a directory, change to that directory, then run pip install '.' to install the package into your Python libraries.
Usage
Include the node in your pipeline, as in the following example:
import logging
import re
from datasets import load_dataset
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import PromptNode, PromptTemplate, AnswerParser, BM25Retriever
from haystack.pipelines import Pipeline
from haystack_lemmatize_node import LemmatizeDocuments
logging.basicConfig(format="%(levelname)s - %(name)s - %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)
document_store = InMemoryDocumentStore(use_bm25=True)
dataset = load_dataset("bilgeyucel/seven-wonders", split="train")
document_store.write_documents(dataset)
retriever = BM25Retriever(document_store=document_store, top_k=2)
lfqa_prompt = PromptTemplate(
    name="lfqa",
    prompt_text="Given the context please answer the question using your own words. Generate a comprehensive, summarized answer. If the information is not included in the provided context, reply with 'Provided documents didn't contain the necessary information to provide the answer'\n\nContext: {documents}\n\nQuestion: {query} \n\nAnswer:",
    output_parser=AnswerParser(),
)
prompt_node = PromptNode(
    model_name_or_path="text-davinci-003",
    default_prompt_template=lfqa_prompt,
    max_length=500,
    api_key="sk-OPENAIKEY",  # placeholder: replace with your own OpenAI API key
)
# You can also pass the `base_lang=XX` argument here, where XX is a language
# as listed at https://pypi.org/project/simplemma/
lemmatize = LemmatizeDocuments()
pipe = Pipeline()
pipe.add_node(component=retriever, name="Retriever", inputs=["Query"])
pipe.add_node(component=lemmatize, name="Lemmatize", inputs=["Retriever"])
pipe.add_node(component=prompt_node, name="prompt_node", inputs=["Lemmatize"])
query = "What does the Rhodes Statue look like?"
output = pipe.run(query)
print(output['answers'][0].answer)
Caveats
Sometimes lemmatization can be slow for large document content, but in the world of AI where we can potentially wait 30+ seconds for an LLM to respond (hello GPT-4), what's a couple more seconds?
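To check whether the extra time matters for your documents, you can time the lemmatization step on its own. The following is a minimal sketch using only the standard library; `timed` and `stand_in_lemmatize` are hypothetical helpers written for this example, and the stand-in function is a crude placeholder for whatever lemmatization call you actually want to measure.

```python
import time


def timed(fn, text):
    """Run fn(text) once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(text)
    return result, time.perf_counter() - start


def stand_in_lemmatize(text: str) -> str:
    """Crude placeholder (strips a trailing 's' from each word).

    Swap in your real lemmatization callable to get a meaningful timing.
    """
    return " ".join(word.rstrip("s") for word in text.split())


# Build a large synthetic document by repeating one sentence.
large_document = "The wonders of the ancient world attract visitors. " * 10_000
result, seconds = timed(stand_in_lemmatize, large_document)
print(f"Processed {len(large_document):,} characters in {seconds:.3f} s")
```

Comparing this figure with the response time of your LLM gives a sense of how much of the total latency the lemmatization step accounts for.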
Hashes for haystack_lemmatize_node-0.0.2.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 6c043f49df2e605f0892ed9461c6b437ed1a323a8edd96f58ba86177486dca57
MD5 | a23ff7997c18b3ba850e3405490ddfd4
BLAKE2b-256 | bb804341bdf5e95ce4977df600cca1f0f9597dc0dd75935ec47fc8a60b633aed
Hashes for haystack_lemmatize_node-0.0.2-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | efe1784454e80cb64ff9c8cad28a8085133c0acf33b9ebe9477ed706b64c6b0c
MD5 | f9ae142f17d465744493212d04fade0a
BLAKE2b-256 | c5e29325adb62e500f3b4de1198bdab25e602fb54650cc692cd729fcd778fbcd