AI-Powered Document Semantic Search with Web Crawling
This project implements an AI-powered document semantic search system using Streamlit, LangChain, and various document processing techniques. It allows users to upload documents, scan images, crawl web pages, process the content, and perform semantic searches without relying on a large language model.
Features
- Document upload and processing (PDF, DOCX, TXT, XLS, XLSX)
- Image scanning and OCR for text extraction
- Web crawling for content retrieval
- Text chunking and embedding using Hugging Face's sentence transformers
- FAISS vector store for efficient similarity search
- Cosine similarity-based semantic search without LLM
- Customizable chunk size and overlap
- Streamlit-based user interface for easy interaction
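The core idea behind LLM-free semantic search — embed each chunk, embed the query, and rank chunks by cosine similarity — can be sketched with plain NumPy. This is an illustrative stand-in: the toy 3-dimensional vectors below replace the real sentence-transformer embeddings, and the linear scan replaces the FAISS index.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings" standing in for sentence-transformer vectors.
chunks = ["intro", "methods", "results"]
chunk_vecs = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0],
                       [0.7, 0.7, 0.0]])
query_vec = np.array([1.0, 0.1, 0.0])

# Score every chunk against the query and keep the best match.
scores = [cosine_similarity(query_vec, v) for v in chunk_vecs]
best = chunks[int(np.argmax(scores))]  # "intro" — nearest to the query vector
```

With real embeddings the only change is where the vectors come from; the ranking logic stays the same.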
Setup
- Clone the repository:
git clone https://github.com/FastianAbdullah/Semantic-Search-Without-LLM.git
- Install the required dependencies:
pip install -r requirements.txt
Usage
To start the application, run:
streamlit run doc_search_and_crawl.py
The application will open in your default web browser.
API Endpoints
This project is primarily a Streamlit web application and does not expose traditional API endpoints. However, the main functionalities are accessible through the Streamlit UI:
1. Document Upload
- Upload documents (PDF, DOCX, TXT, XLS, XLSX) using the file uploader in the "Document Upload" tab.
2. Document Scanning
- Use your camera to scan documents in the "Document Scan" tab.
3. Semantic Search
- Enter your query in the search bar after processing a document or scanned image.
4. Web Crawling
- Enter a URL in the "Web Crawl" tab to fetch and process web content.
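The "Web Crawl" tab boils down to fetching a page and stripping its markup down to searchable text. The app uses BeautifulSoup for this; the sketch below shows the same tag-stripping idea with only the standard library's `html.parser`, so it runs without extra dependencies (the class name `TextExtractor` is hypothetical, not from the project).

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents —
    a stdlib stand-in for BeautifulSoup's get_text()."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # >0 while inside script/style

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

html = ("<html><head><style>p{color:red}</style></head>"
        "<body><p>Hello</p><p>World</p></body></html>")
parser = TextExtractor()
parser.feed(html)
text = " ".join(parser.parts)  # "Hello World"
```

The extracted text is then chunked and embedded exactly like an uploaded document.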
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgements
- Streamlit for the web app framework
- LangChain for document processing utilities
- Hugging Face for sentence transformers
- FAISS for efficient similarity search
- BeautifulSoup for web scraping
Dependencies
- streamlit
- langchain-text-splitters
- langchain-community
- sentence-transformers
- faiss-cpu
- pdfplumber
- pandas
- docx2txt
- numpy
- pillow
- pytesseract
- PyPDF2
Code Overview
The main script doc_search_and_crawl.py contains the following key components:
- DataLoader class: handles document loading and chunking for various file types.
- WebCrawler class: handles web content retrieval and parsing.
- cosine_similarity function: calculates the cosine similarity between two vectors.
- scan_document function: uses OCR to extract text from scanned images.
- main function: sets up the Streamlit interface and manages the overall flow of the application.
- process_chunks function: creates embeddings, builds the FAISS index, and performs the semantic search.
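The chunking step above supports a customizable chunk size and overlap; overlapping windows keep sentences that straddle a boundary searchable from either side. A minimal character-based sketch (the app itself uses LangChain's text splitters, and `chunk_text` here is a hypothetical helper):

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping character windows.

    Consecutive chunks share `overlap` characters, so the window start
    advances by chunk_size - overlap each step.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("a" * 500, chunk_size=200, overlap=50)
# Windows start at 0, 150, 300, 450 → 4 chunks; the last one is shorter.
```

Each chunk is then embedded and indexed, so search results point back to a specific window of the source document.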
Future Improvements
- Add support for more file types
- Implement multi-language support
- Optimize performance for larger documents
- Integrate with cloud storage services for document management
- Implement advanced web crawling features (depth control, multiple page crawling)
Feel free to contribute to these improvements or suggest new features!
File details
Details for the file semantic_search_nollms-0.2.tar.gz
File metadata
- Download URL: semantic_search_nollms-0.2.tar.gz
- Upload date:
- Size: 5.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.1
File hashes
Algorithm | Hash digest
---|---
SHA256 | c243435a31ac1ccc7c9ea24a15223547e14731529600a16785ad739ad57ed20b
MD5 | af54220177ca01bf27ecf0986af45d0a
BLAKE2b-256 | cbe3a3fd3b64ca9b8c8f907398230de1e3c78ff1b4553838ed81b97231edb813
File details
Details for the file semantic_search_nollms-0.2-py3-none-any.whl
File metadata
- Download URL: semantic_search_nollms-0.2-py3-none-any.whl
- Upload date:
- Size: 7.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.1
File hashes
Algorithm | Hash digest
---|---
SHA256 | 3885aca0ad2d7ddd9de765953e0bee01018c1a77bc88770b6a8ab2db798302e9
MD5 | 948c899a1629702bc328da314fd31b06
BLAKE2b-256 | 90e63f17f1aea708fbd3befed8f6862793a3504eb34136e8ab02f65c5e154016