Project description

Semantic Search Without LLMs

This package provides an embedding-based semantic search system for documents that doesn't rely on large language models (LLMs). It lets you process documents, web content, and scanned images, and then run efficient semantic searches over them using cosine similarity.

Installation

Install the package using pip:

pip install semantic-search-no-llms

System Architecture

The following Mermaid diagram gives a high-level overview of the system architecture:

graph TD
    A[User] -->|Upload Document/Scan Image/Enter URL| B[Streamlit UI]
    B -->|Process Document| C[DataLoader]
    B -->|Crawl Web| H[WebCrawler]
    C -->|Chunk Text| D[Text Splitter]
    H -->|Extract Content| D
    D -->|Generate Embeddings| E[HuggingFace Embeddings]
    E -->|Store Vectors| F[FAISS Vector Store]
    A -->|Enter Query| B
    B -->|Retrieve Similar Chunks| F
    F -->|Calculate Cosine Similarity| G[Semantic Search]
    G -->|Return Results| B
    B -->|Display Results| A
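
The retrieval step ranks stored chunks by cosine similarity between the query embedding and each chunk embedding. For reference, here is a minimal NumPy sketch of that computation; the package ships its own cosine_similarity in semantic_search_no_llms.utils, so this standalone version is purely illustrative:

import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: a·b / (|a||b|)
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # identical vectors -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors -> 0.0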

Usage

Document Processing

Load a Document

To load a document (PDF, DOCX, TXT, XLS, XLSX):

from semantic_search_no_llms.data_loader import DataLoader

# Load a document
data_loader = DataLoader("path/to/document.pdf")
documents = data_loader.load_document()

Chunk the Document

To chunk the document into smaller pieces:

# Chunk the document into smaller pieces
chunks = data_loader.chunk_document(documents, chunk_size=1024, chunk_overlap=80)

Process the Chunks

This step creates embeddings for the document chunks and builds a FAISS index for efficient similarity search:

from semantic_search_no_llms.utils import process_chunks

process_chunks(chunks)
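
Under the hood, process_chunks presumably embeds each chunk with a HuggingFace model and stores the vectors in a FAISS index, as the architecture diagram suggests. A rough, hypothetical sketch of that step using LangChain's wrappers follows; the model name and save path are illustrative assumptions, not the package's actual defaults:

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

# Hypothetical sketch: embed the chunks and build a FAISS index
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(documents=chunks, embedding=embeddings)
vectorstore.save_local("faiss_index")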

Web Crawling

Crawl a Website

To crawl a website and fetch its content:

from semantic_search_no_llms.web_crawler import WebCrawler

# Crawl a website
crawler = WebCrawler("https://www.example.com")
content = crawler.fetch_content()
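
The package does not document WebCrawler's internals; here is a plausible sketch of what fetch_content might do, assuming requests and BeautifulSoup (the names and behavior below are assumptions, not the actual implementation):

import requests
from bs4 import BeautifulSoup

def fetch_content(url):
    # Download the page and reduce it to visible text
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    for tag in soup(["script", "style"]):  # drop non-content tags
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)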

Process the Crawled Content

To process the content fetched from the website:

from langchain.schema import Document as LangChainDocument

# Wrap the crawled text in a LangChain Document, then chunk and index it
# (reuses the DataLoader instance and process_chunks from the sections above)
document = LangChainDocument(page_content=content, metadata={"source": "https://www.example.com"})
chunks = data_loader.chunk_document([document], chunk_size=1024, chunk_overlap=80)
process_chunks(chunks)

Semantic Search

Perform a Semantic Search

To perform a semantic search using the FAISS index and retrieve the top matching document chunks:

from semantic_search_no_llms.utils import cosine_similarity, load_embeddings, FAISS
import numpy as np

# Load the embedding model
embeddings = load_embeddings()

# Embed the query
query = "What is the main topic of the document?"
query_embedding = embeddings.embed_query(query)

# Build the vector store from the chunks created earlier and
# reconstruct the stored document embeddings from the FAISS index
vectorstore = FAISS.from_documents(documents=chunks, embedding=embeddings)
document_embeddings = vectorstore.index.reconstruct_n(0, vectorstore.index.ntotal)

# Calculate cosine similarities and keep the five best matches
similarities = [cosine_similarity(query_embedding, doc_embedding) for doc_embedding in document_embeddings]
top_k_indices = np.argsort(similarities)[-5:][::-1]

for i, idx in enumerate(top_k_indices):
    doc = vectorstore.docstore.search(vectorstore.index_to_docstore_id[idx])
    print(f"Match {i+1} - Similarity: {similarities[idx]:.4f}")
    print(doc.page_content)

This example shows how to load the embeddings, perform a semantic search using the FAISS index, and retrieve the top matching document chunks.
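
Putting the pieces together, the whole pipeline fits in one small helper. Note that search_document below is an illustrative wrapper, not part of the package API:

import numpy as np
from semantic_search_no_llms.data_loader import DataLoader
from semantic_search_no_llms.utils import cosine_similarity, load_embeddings, FAISS

def search_document(path, query, top_k=5):
    # Load, chunk, embed, and rank in one call (hypothetical convenience wrapper)
    data_loader = DataLoader(path)
    documents = data_loader.load_document()
    chunks = data_loader.chunk_document(documents, chunk_size=1024, chunk_overlap=80)
    embeddings = load_embeddings()
    vectorstore = FAISS.from_documents(documents=chunks, embedding=embeddings)
    doc_embeddings = vectorstore.index.reconstruct_n(0, vectorstore.index.ntotal)
    query_embedding = embeddings.embed_query(query)
    similarities = [cosine_similarity(query_embedding, d) for d in doc_embeddings]
    top = np.argsort(similarities)[-top_k:][::-1]
    return [(similarities[i], vectorstore.docstore.search(vectorstore.index_to_docstore_id[i])) for i in top]

for score, doc in search_document("path/to/document.pdf", "What is the main topic?"):
    print(f"{score:.4f}  {doc.page_content[:80]}")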

Customization

You can customize the behavior of the document processing by adjusting the chunk_size and chunk_overlap parameters when calling the chunk_document() method. Larger chunk sizes provide more context, while smaller chunks can improve search precision.
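
For example, the same document can be chunked two ways depending on the retrieval behavior you want (the values below are illustrative starting points, not recommended defaults):

# Larger chunks: more context per result, fewer chunks overall
broad_chunks = data_loader.chunk_document(documents, chunk_size=2048, chunk_overlap=160)

# Smaller chunks: tighter, more precise matches
precise_chunks = data_loader.chunk_document(documents, chunk_size=512, chunk_overlap=40)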

Contributing

Contributions are welcome! If you encounter any issues or have suggestions for improvements, please feel free to open an issue or submit a pull request.

License

This project is licensed under the MIT License.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

semantic_search_nollms-0.3.tar.gz (5.7 kB)

Uploaded Source

Built Distribution

semantic_search_nollms-0.3-py3-none-any.whl (7.3 kB)

Uploaded Python 3

File details

Details for the file semantic_search_nollms-0.3.tar.gz.

File metadata

  • Download URL: semantic_search_nollms-0.3.tar.gz
  • Size: 5.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.11.9

File hashes

Hashes for semantic_search_nollms-0.3.tar.gz

  • SHA256: 29028553d7c1e1ab86d8bb1c75d93966a34d69e54b06d1d48a2bd401bdb29e6e
  • MD5: 3089b27aa89464d1a2aa8f47b221722c
  • BLAKE2b-256: 382477095efc80000eedec55a97a00caeeedb297933ed76f5ded706e543fd439


File details

Details for the file semantic_search_nollms-0.3-py3-none-any.whl.


File hashes

Hashes for semantic_search_nollms-0.3-py3-none-any.whl

  • SHA256: 69ec5150d9734042e76f169bea9c7e04639d2bc0e1346bd84ac727e53a9b13f1
  • MD5: 7ea9ee25fdcec7f1916211facd54b4cf
  • BLAKE2b-256: 2e487601d0c2b546ac0f75561fede486c9e40aaa54662672bb96e9eb47da2dec

