
AI-Powered Document Semantic Search with Web Crawling

This project implements an AI-powered document semantic search system using Streamlit, LangChain, and various document processing techniques. It allows users to upload documents, scan images, crawl web pages, process the content, and perform semantic searches without relying on a large language model.

Features

  • Document upload and processing (PDF, DOCX, TXT, XLS, XLSX)
  • Image scanning and OCR for text extraction
  • Web crawling for content retrieval
  • Text chunking and embedding using Hugging Face's sentence transformers
  • FAISS vector store for efficient similarity search
  • Cosine similarity-based semantic search without LLM
  • Customizable chunk size and overlap
  • Streamlit-based user interface for easy interaction
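The customizable chunk size and overlap mentioned above work roughly as follows. This is a minimal pure-Python sketch for illustration; the project itself uses LangChain's text splitters, and chunk_text here is a hypothetical helper, not the project's code:

```python
def chunk_text(text: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    """Split text into overlapping chunks (illustrative stand-in for
    LangChain's text splitters). Each chunk starts chunk_size - overlap
    characters after the previous one, so neighbors share `overlap` chars."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("a" * 250, chunk_size=100, overlap=20)
print(len(chunks))  # prints 4: starts at 0, 80, 160, 240
```

The overlap keeps sentences that straddle a chunk boundary retrievable from at least one chunk, at the cost of some duplicated embedding work.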

Setup

  1. Clone the repository:
    git clone https://github.com/FastianAbdullah/Semantic-Search-Without-LLM.git
    
  2. Install the required dependencies:
    pip install -r requirements.txt
    

Usage

To start the application, run:

streamlit run doc_search_and_crawl.py

The application will open in your default web browser.

API Endpoints

This project is primarily a Streamlit web application and does not expose traditional API endpoints. However, the main functionalities are accessible through the Streamlit UI:

1. Document Upload
  • Upload documents (PDF, DOCX, TXT, XLS, XLSX) using the file uploader in the "Document Upload" tab.
2. Document Scanning
  • Use your camera to scan documents in the "Document Scan" tab.
3. Semantic Search
  • Enter your query in the search bar after processing a document or scanned image.
4. Web Crawling
  • Enter a URL in the "Web Crawl" tab to fetch and process web content.
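The "Web Crawl" tab fetches a page and reduces it to searchable text before chunking. The project does this with BeautifulSoup; the sketch below shows the same idea using only the standard library's html.parser, with TextExtractor as a hypothetical stand-in for the project's WebCrawler parsing step:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text from HTML, skipping <script>/<style> contents
    (illustrative stand-in for the project's BeautifulSoup-based parsing)."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

html = "<html><body><h1>Title</h1><script>var x=1;</script><p>Body text.</p></body></html>"
parser = TextExtractor()
parser.feed(html)
print(" ".join(parser.parts))  # prints: Title Body text.
```

In the real application the HTML would come from an HTTP request for the user-supplied URL, and the extracted text would then flow into the same chunking and embedding pipeline as uploaded documents.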

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgements

  • Streamlit for the web app framework
  • LangChain for document processing utilities
  • Hugging Face for sentence transformers
  • FAISS for efficient similarity search
  • BeautifulSoup for web scraping

Dependencies

  • streamlit
  • langchain-text-splitters
  • langchain-community
  • sentence-transformers
  • faiss-cpu
  • pdfplumber
  • pandas
  • docx2txt
  • numpy
  • pillow
  • pytesseract
  • PyPDF2

Code Overview

The main script doc_search_and_crawl.py contains the following key components:

  1. DataLoader class: Handles document loading and chunking for various file types.
  2. WebCrawler class: Handles web content retrieval and parsing.
  3. cosine_similarity function: Calculates the cosine similarity between two vectors.
  4. scan_document function: Uses OCR to extract text from scanned images.
  5. main function: Sets up the Streamlit interface and manages the overall flow of the application.
  6. process_chunks function: Creates embeddings, builds the FAISS index, and performs the semantic search.
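The cosine_similarity function at the core of the search can be sketched with NumPy. This is an illustrative version under the usual definition (dot product over the product of vector norms), not necessarily the project's exact implementation:

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two embedding vectors:
    1.0 for identical direction, 0.0 for orthogonal vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # Guard against zero vectors, which have no defined direction.
    return float(np.dot(a, b) / denom) if denom else 0.0

print(cosine_similarity([1, 0], [1, 0]))  # prints 1.0
print(cosine_similarity([1, 0], [0, 1]))  # prints 0.0
```

Because sentence-transformer embeddings encode meaning in direction rather than magnitude, ranking chunks by cosine similarity to the query embedding surfaces semantically related passages even when they share no keywords with the query.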

Future Improvements

  • Add support for more file types
  • Implement multi-language support
  • Optimize performance for larger documents
  • Integrate with cloud storage services for document management
  • Implement advanced web crawling features (depth control, multiple page crawling)

Feel free to contribute to these improvements or suggest new features!
