An integration package connecting Jina Late Chunking and LangChain

langchain-jina

This package contains the LangChain integration with Jina Late Chunking.

Installation

pip install -U langchain-jina

Environment Variable

Export your API key: export JINA_API_KEY="jina_*"
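
The examples below read the key with os.environ.get, which silently returns None when the variable is missing. A minimal sketch of failing early instead; the helper name require_jina_api_key is hypothetical and not part of langchain-jina:

```python
import os


def require_jina_api_key(env=None):
    """Return JINA_API_KEY, or raise a clear error if it is missing.

    Hypothetical helper for illustration; not part of langchain-jina.
    `env` defaults to os.environ and can be replaced by a dict in tests.
    """
    env = os.environ if env is None else env
    key = env.get("JINA_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "JINA_API_KEY is not set; export it before running the examples."
        )
    return key
```

The returned value can then be passed as jina_api_key in the usage examples.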

Usage

1. Get Embeddings

Here is an example of using LateChunkEmbeddings:

import os

from langchain_jina import LateChunkEmbeddings

text_embeddings = LateChunkEmbeddings(
    jina_api_key=os.environ.get("JINA_API_KEY"),
    model_name="jina-embeddings-v3"
)

text = [
    "Berlin is the capital and largest city of Germany, by both area and population.",
    "With 3.66 million inhabitants, it has the highest population within its city limits of any city in the European Union.",
    "The city is also one of the states of Germany, being the third smallest state in the country by area.",
]

# with late chunking
doc_result = text_embeddings.embed_documents(text, late_chunking=True)
print("With late_chunking")
for doc in doc_result:
    print(doc)
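
For intuition about what late_chunking=True asks for: the general idea of late chunking is to run the whole text through the model first and only then pool the per-token embeddings chunk by chunk, so every chunk vector carries context from the full text. The sketch below illustrates that pooling step on toy data. It is a conceptual illustration only; the function names are hypothetical, and in this package the actual computation happens on the Jina API side, not in local code like this.

```python
def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors component-wise."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def late_chunk_pool(token_embeddings, chunk_spans):
    """Pool per-token embeddings into one vector per chunk.

    token_embeddings: one vector per token, computed over the WHOLE text.
    chunk_spans: (start, end) token indices per chunk, end exclusive.
    """
    return [mean_pool(token_embeddings[start:end]) for start, end in chunk_spans]


# Toy data: 4 tokens with 2-dimensional embeddings, split into 2 chunks.
tokens = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
chunks = late_chunk_pool(tokens, [(0, 2), (2, 4)])
print(chunks)  # [[2.0, 0.0], [0.0, 3.0]]
```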

2. Build with Vectorstore

First, we need to check the length of the entire input text against the model's context length. We use the tokenizer from transformers to do this.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jinaai/jina-embeddings-v3")
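
The check itself could look like the sketch below. The helper fits_in_context is hypothetical, and the default budget of 8192 tokens is an assumption about jina-embeddings-v3 that should be verified against the model card. To keep the sketch runnable without downloading a model, it uses a stand-in encoder; in practice tokenizer.encode from the tokenizer loaded above would be passed instead.

```python
def fits_in_context(text, encode, max_tokens=8192):
    """Return (fits, n_tokens) for `text` under a token budget.

    encode: any callable mapping a string to a list of token ids,
        for example tokenizer.encode.
    max_tokens: assumed budget; verify it against the model card.
    """
    n_tokens = len(encode(text))
    return n_tokens <= max_tokens, n_tokens


def whitespace_encode(text):
    """Stand-in encoder: one fake token id per whitespace-separated word."""
    return list(range(len(text.split())))


print(fits_in_context("Berlin is the capital of Germany", whitespace_encode, max_tokens=4))
# (False, 6)
```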

Once the tokenizer is loaded, we can combine it with any LangChain text splitter. The example below follows the same method as the authors.

from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=0,
    length_function=len,
    is_separator_regex=False,
)

text_splitter.tokenizer = tokenizer 
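
This README does not say how the package uses the tokenizer attached to the splitter. One plausible use, shown here purely as an assumption, is to translate the character-based chunks produced by the splitter into token index ranges, which is what chunk-wise pooling of token embeddings needs. The function char_spans_to_token_spans is hypothetical; the token offsets are hard-coded here, but a fast Hugging Face tokenizer can return them when called with return_offsets_mapping=True.

```python
def char_spans_to_token_spans(chunk_char_spans, token_offsets):
    """Map (start, end) character spans to (start, end) token index spans.

    token_offsets: (char_start, char_end) per token, in order.
    A token is assigned to a chunk when its first character lies inside it.
    Both input and output spans are end-exclusive.
    """
    spans = []
    for c_start, c_end in chunk_char_spans:
        inside = [
            i for i, (t_start, _) in enumerate(token_offsets)
            if c_start <= t_start < c_end
        ]
        if inside:
            spans.append((inside[0], inside[-1] + 1))
        else:
            spans.append((0, 0))  # no token starts inside this chunk
    return spans


# Text: "Berlin is big. It is old." with hand-written word-level offsets.
offsets = [(0, 6), (7, 9), (10, 13), (13, 14), (15, 17), (18, 20), (21, 24), (24, 25)]
print(char_spans_to_token_spans([(0, 14), (15, 25)], offsets))
# [(0, 4), (4, 8)]
```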

Next, we create the vector store. Here we use LateChunkQdrant:

from qdrant_client import QdrantClient
from langchain_community.docstore.document import Document
from langchain_jina import LateChunkQdrant


client = QdrantClient()

vectorstore = LateChunkQdrant(
    client, 
    collection_name="demo",
    embeddings=text_embeddings, 
    text_splitter=text_splitter
)

# load documents
with open("./state_of_the_union.txt") as f:
    state_of_the_union = f.read()

documents = [
    Document(
        page_content=state_of_the_union, 
        metadata={"source": "state_of_the_union.txt"}
    ),
]

vectorstore = vectorstore.from_documents(
    documents=documents, 
    embedding=text_embeddings,
    text_splitter=text_splitter,
    path="test_db", 
    collection_name="demo"
)

Finally, we can query the vector store for any purpose, for example with a similarity search:

query = "What did the president say about ketanji brown jackson?" 
results = vectorstore.similarity_search(query, k=3)

for res in results:
    print(f"* {res.page_content} [{res.metadata}]")
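
Conceptually, a similarity search embeds the query and ranks the stored chunk vectors by a similarity metric. The sketch below uses cosine similarity on toy vectors to show the ranking step. It is an illustration only: the function names are hypothetical, and the metric that LateChunkQdrant actually uses is not stated in this README.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=3):
    """Return the indices of the k vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(top_k([1.0, 0.1], docs, k=2))  # [0, 2]
```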

License

This project is licensed under the MIT License.
