Semantic Search Without LLMs
This package provides a document semantic search system that works without large language models (LLMs). It lets you process documents, web content, and scanned images, then perform efficient semantic searches over the resulting embeddings using cosine similarity.
Installation
Install the package using pip:
pip install semantic-search-no-llms
System Architecture
Here's a high-level overview of the system architecture:
graph TD
A[User] -->|Upload Document/Scan Image/Enter URL| B[Streamlit UI]
B -->|Process Document| C[DataLoader]
B -->|Crawl Web| H[WebCrawler]
C -->|Chunk Text| D[Text Splitter]
H -->|Extract Content| D
D -->|Generate Embeddings| E[HuggingFace Embeddings]
E -->|Store Vectors| F[FAISS Vector Store]
A -->|Enter Query| B
B -->|Retrieve Similar Chunks| F
F -->|Calculate Cosine Similarity| G[Semantic Search]
G -->|Return Results| B
B -->|Display Results| A
Usage
Document Processing
Load a Document
To load a document (PDF, DOCX, TXT, XLS, XLSX):
from semantic_search_no_llms.data_loader import DataLoader
# Load a document
data_loader = DataLoader("path/to/document.pdf")
documents = data_loader.load_document()
Chunk the Document
To chunk the document into smaller pieces:
# Chunk the document into smaller pieces
chunks = data_loader.chunk_document(documents, chunk_size=1024, chunk_overlap=80)
Process the Chunks
This step creates embeddings for the document chunks and builds a FAISS index for efficient similarity search:
from semantic_search_no_llms.utils import process_chunks
process_chunks(chunks)
Web Crawling
Crawl a Website
To crawl a website and fetch its content:
from semantic_search_no_llms.web_crawler import WebCrawler
# Crawl a website
crawler = WebCrawler("https://www.example.com")
content = crawler.fetch_content()
Process the Crawled Content
To process the content fetched from the website:
from langchain.schema import Document as LangChainDocument
document = LangChainDocument(page_content=content, metadata={"source": "https://www.example.com"})
chunks = data_loader.chunk_document([document], chunk_size=1024, chunk_overlap=80)
process_chunks(chunks)
Semantic Search
Perform a Semantic Search
To perform a semantic search using the FAISS index and retrieve the top matching document chunks:
from semantic_search_no_llms.utils import cosine_similarity, load_embeddings
import numpy as np
# Load the embeddings
embeddings = load_embeddings()
# Define a query
query = "What is the main topic of the document?"
query_embedding = embeddings.embed_query(query)
# Rebuild the vector store from the chunks and reconstruct the stored embeddings
from semantic_search_no_llms.utils import FAISS
vectorstore = FAISS.from_documents(documents=chunks, embedding=embeddings)
document_embeddings = vectorstore.index.reconstruct_n(0, vectorstore.index.ntotal)
# Calculate cosine similarity against every chunk and keep the top 5 results
similarities = [cosine_similarity(query_embedding, doc_embedding) for doc_embedding in document_embeddings]
top_k_indices = np.argsort(similarities)[-5:][::-1]
for i, idx in enumerate(top_k_indices):
    doc = vectorstore.docstore.search(vectorstore.index_to_docstore_id[idx])
    print(f"Match {i+1} - Similarity: {similarities[idx]:.4f}")
    print(doc.page_content)
This example shows how to load the embeddings, perform a semantic search using the FAISS index, and retrieve the top matching document chunks.
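The cosine similarity metric itself is easy to compute with NumPy alone. The function below is an illustrative reimplementation, not the package's actual cosine_similarity helper, but a library helper of that name would typically behave the same way:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: the dot product of the two vectors divided by
    # the product of their Euclidean norms. It measures direction, not
    # magnitude, so scaled copies of a vector score 1.0.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Parallel vectors score 1.0; orthogonal vectors score 0.0.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Because embeddings encode meaning as direction in vector space, ranking chunks by this score surfaces the passages most semantically related to the query.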
Customization
You can customize document processing by adjusting the chunk_size and chunk_overlap parameters when calling the chunk_document() method. Larger chunk sizes provide more context per result, while smaller chunks can improve search precision.
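To see how these two parameters interact, here is a deliberately simplified character-level chunker. It is a hypothetical stand-in for the package's splitter (which may split on tokens or separators rather than raw characters), but it illustrates the trade-off: for a fixed text, a smaller chunk_size yields more, shorter chunks, and chunk_overlap controls how much adjacent chunks share:

```python
def chunk_text(text, chunk_size, chunk_overlap):
    # Simplified character-level chunker: slide a window of chunk_size
    # characters forward by (chunk_size - chunk_overlap) each step, so
    # consecutive chunks share chunk_overlap characters of context.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "x" * 5000
# Larger chunks -> fewer pieces, more context per piece.
print(len(chunk_text(text, 1024, 80)))
# Smaller chunks -> more pieces, finer-grained search results.
print(len(chunk_text(text, 256, 80)))
```

The overlap matters because a sentence that straddles a chunk boundary would otherwise be split across two embeddings; the shared window keeps such passages intact in at least one chunk.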
Contributing
Contributions are welcome! If you encounter any issues or have suggestions for improvements, please feel free to open an issue or submit a pull request.
License
This project is licensed under the MIT License.