An integration package connecting Jina Late Chunking and LangChain
langchain-jina
This package contains the LangChain integration with Jina Late Chunking.
Installation
pip install -U langchain-jina
Environment Variable
Export your API key:
export JINA_API_KEY="jina_*"
Usage
1. Get Embeddings
Here is an example of how to use the LateChunkEmbeddings class:
import os

from langchain_jina import LateChunkEmbeddings

# read the API key from the environment variable exported above
text_embeddings = LateChunkEmbeddings(
    jina_api_key=os.environ.get("JINA_API_KEY"),
    model_name="jina-embeddings-v3"
)
text = [
    "Berlin is the capital and largest city of Germany, by both area and population.",
    "With 3.66 million inhabitants, it has the highest population within its city limits of any city in the European Union.",
    "The city is also one of the states of Germany, being the third smallest state in the country by area.",
]
# with late chunking
doc_result = text_embeddings.embed_documents(text, late_chunking=True)
print("With late_chunking")
for doc in doc_result:
    print(doc)
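To make the idea behind late chunking more concrete: the whole document is first embedded at token level, and only afterwards are the token embeddings averaged per chunk, so each chunk vector still carries context from the rest of the document. The block below is a minimal pure-Python sketch of that pooling step, not the code used by this package or by the Jina API. The function names `mean_pool` and `late_chunk_pool` and the toy values are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def mean_pool(vectors: Sequence[Vector]) -> Vector:
    """Average a non-empty sequence of equal-length vectors, dimension by dimension."""
    if not vectors:
        raise ValueError("cannot pool an empty span")
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


def late_chunk_pool(
    token_embeddings: Sequence[Vector],
    chunk_spans: Sequence[Tuple[int, int]],
) -> List[Vector]:
    """Pool the token embeddings of one full document into one vector per chunk.

    chunk_spans holds (start, end) token indices, with end exclusive.
    """
    return [mean_pool(token_embeddings[start:end]) for start, end in chunk_spans]


# Toy data: 4 tokens with 2-dimensional embeddings (hypothetical values).
token_embeddings = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 2.0]]
# Two chunks: tokens 0-1 and tokens 2-3.
chunk_spans = [(0, 2), (2, 4)]

chunk_vectors = late_chunk_pool(token_embeddings, chunk_spans)
print(chunk_vectors)
```

In practice the token embeddings come from the long-context model, and with `late_chunking=True` this step happens on the service side, so you never need to do it yourself.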
2. Build with Vectorstore
First, we need to make sure that the entire input text fits within the model's context length. We use the tokenizer from transformers to check this.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("jinaai/jina-embeddings-v3")
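The length check itself can be sketched as follows. This is a minimal example and not part of the package: the helper `fits_in_context`, the stand-in `whitespace_encode`, and the 8192-token limit are assumptions made for illustration, so check the model card for the limit of the model you actually use.

```python
from typing import Callable, List

# Assumed limit for jina-embeddings-v3; verify against the model card.
MAX_TOKENS = 8192


def fits_in_context(
    text: str,
    encode: Callable[[str], List[int]],
    max_tokens: int = MAX_TOKENS,
) -> bool:
    """Return True if the tokenized text is no longer than max_tokens."""
    return len(encode(text)) <= max_tokens


def whitespace_encode(text: str) -> List[int]:
    """Hypothetical stand-in tokenizer: one fake token id per whitespace-separated word."""
    return list(range(len(text.split())))


short_ok = fits_in_context("Berlin is the capital of Germany.", whitespace_encode)
too_long = fits_in_context("word " * 10, whitespace_encode, max_tokens=5)
print(short_ok, too_long)
```

With the real tokenizer loaded above, you would pass `tokenizer.encode` instead of `whitespace_encode`.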
Once the tokenizer is loaded, we can combine it with any LangChain text splitter. The example below follows the same approach as the authors.
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=0,
    length_function=len,
    is_separator_regex=False,
)
text_splitter.tokenizer = tokenizer
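Pooling per chunk requires knowing where each chunk sits in the original text. The block below is a minimal sketch of how such boundaries could be recovered, assuming `chunk_overlap=0` and that every chunk is an exact substring of the input. Both are assumptions for this illustration, and the helper `locate_chunks` is hypothetical and not part of the package.

```python
from typing import List, Tuple


def locate_chunks(text: str, chunks: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets of each chunk inside text.

    Assumes the chunks appear in order, without overlap, as exact substrings.
    """
    spans = []
    cursor = 0
    for chunk in chunks:
        # Search from the end of the previous chunk so repeated text maps correctly.
        start = text.find(chunk, cursor)
        if start == -1:
            raise ValueError(f"chunk not found after offset {cursor}: {chunk!r}")
        end = start + len(chunk)
        spans.append((start, end))
        cursor = end
    return spans


text = "Berlin is the capital of Germany. It is also a state."
chunks = ["Berlin is the capital of Germany.", "It is also a state."]
spans = locate_chunks(text, chunks)
print(spans)
```

Character offsets like these can then be mapped to token positions, for example through the offset mapping that fast tokenizers can return.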
Next, we create the vector store for the embeddings. Here we use LateChunkQdrant:
from qdrant_client import QdrantClient
from langchain_community.docstore.document import Document
from langchain_jina import LateChunkQdrant

client = QdrantClient()
vectorstore = LateChunkQdrant(
    client,
    collection_name="demo",
    embeddings=text_embeddings,
    text_splitter=text_splitter
)
# load documents
with open("./state_of_the_union.txt") as f:
    state_of_the_union = f.read()
documents = [
    Document(
        page_content=state_of_the_union,
        metadata={"source": "state_of_the_union.txt"}
    ),
]
vectorstore = vectorstore.from_documents(
    documents=documents,
    embedding=text_embeddings,
    text_splitter=text_splitter,
    path="test_db",
    collection_name="demo"
)
Finally, we can use the vector store for any purpose, for example a similarity search:
query = "What did the president say about ketanji brown jackson?"
results = vectorstore.similarity_search(query, k=3)
for res in results:
    print(f"* {res.page_content} [{res.metadata}]")
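Conceptually, a similarity search ranks the stored chunk vectors by their closeness to the query vector and returns the best k. The block below is a minimal pure-Python sketch of that ranking, assuming cosine similarity as the metric. The metric actually used depends on how the Qdrant collection is configured, and the helpers `cosine_similarity` and `top_k` are hypothetical, not part of the package.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    vectors: Sequence[Sequence[float]],
    k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k vectors most similar to the query."""
    scored = [(i, cosine_similarity(query, vec)) for i, vec in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-dimensional vectors (hypothetical values).
query_vector = [1.0, 0.0]
chunk_vectors = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]

ranking = top_k(query_vector, chunk_vectors, k=2)
print(ranking)
```

A vector database performs the same kind of ranking with an index, so it does not have to compare the query against every stored vector.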
License
This project is licensed under the MIT License
Download files
File details
Details for the file langchain_jina-0.0.1.dev0.tar.gz.
File metadata
- Download URL: langchain_jina-0.0.1.dev0.tar.gz
- Upload date:
- Size: 12.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.1.2 CPython/3.11.0 Linux/6.5.0-27-generic
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8570a89e33705bf746e504ccfbd218e9e8e083b06c35db10295bc8176ef28841 |
| MD5 | a12425515ea5f5e518977571f423c616 |
| BLAKE2b-256 | 43ed8a406f3818499c130e2a57346df6607008876144e5bde7fbb5fcebb347b6 |
File details
Details for the file langchain_jina-0.0.1.dev0-py3-none-any.whl.
File metadata
- Download URL: langchain_jina-0.0.1.dev0-py3-none-any.whl
- Upload date:
- Size: 12.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.1.2 CPython/3.11.0 Linux/6.5.0-27-generic
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b7dbe6992365c99674497f11869d2e128d932617f51a949e9602b440997de41a |
| MD5 | ed62e34a739a8477d860fbe54b88071f |
| BLAKE2b-256 | 8fed91871e1dd3f91b322d6024524eab51fb2ecd0373f6fe0fced7b49a7c97e1 |