An assistant helping you to index webpages into structured datasets.
SearchFlow
SearchFlow is an assistant designed to help you index webpages into structured datasets. It combines a web scraper, document processing, a PostgreSQL-backed store, vector search, and language models to scrape, process, and store web content.
Features
- Web Scraping: Uses trafilatura for focused crawling and web scraping.
- Document Processing: Supports chunking and processing of various document types.
- Database Management: Manages projects, documents, and prompts using PostgreSQL.
- Vector Search: Utilizes vector search for document retrieval.
- LLM Integration: Integrates with language models for question answering and document grading.
Installation
To set up the development environment, use the provided Dockerfile and .devcontainer/devcontainer.json for a consistent development setup.
Prerequisites
- Docker
- Python 3.11 or higher
Steps
- Clone the repository:
  git clone https://github.com/yourusername/searchflow.git
  cd searchflow
- Build the Docker container:
  docker build -t searchflow .
- Run the Docker container:
  docker run -it -p 8501:8501 searchflow
Usage
Web Scraping
To scrape a website and index the links:
from searchflow.importers.webscraper import WebScraper
scraper = WebScraper(project_name="example_project")
scraper.get_all_links(base_url="https://example.com")
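How get_all_links discovers links internally is not shown here. As a rough illustration of what indexing a site's links involves, the sketch below collects same-host links from an HTML string using only the standard library. The names `_LinkCollector` and `extract_same_site_links` are hypothetical and are not part of SearchFlow or trafilatura.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class _LinkCollector(HTMLParser):
    """Collects the raw href values of all <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_same_site_links(html, base_url):
    """Return sorted absolute links found in `html` that stay on base_url's host."""
    collector = _LinkCollector()
    collector.feed(html)
    base_host = urlparse(base_url).netloc
    links = set()
    for href in collector.hrefs:
        # Resolve relative links, then drop any #fragment so duplicates collapse.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        parsed = urlparse(absolute)
        if parsed.scheme in ("http", "https") and parsed.netloc == base_host:
            links.add(absolute)
    return sorted(links)


page = (
    '<a href="/docs">Docs</a> '
    '<a href="https://other.org/">Elsewhere</a> '
    '<a href="/docs#top">Top</a>'
)
links = extract_same_site_links(page, "https://example.com")
```

Restricting results to the base host keeps a crawl focused on one site, and stripping fragments avoids indexing the same page twice.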
Document Processing
To upload and process files:
from searchflow.importers.file_importer import FileImporter
# Assumption: FileImporter (the imported class) is the one that exposes upload_file.
files = FileImporter()
files.upload_file(document_data=[(b"file_content", "example.pdf")], project_name="example_project")
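The chunking strategy SearchFlow applies during processing is not described here. A minimal sketch of one common approach, fixed-size word windows with overlap, might look like the following. The function name `chunk_words` and its default sizes are assumptions for illustration, not SearchFlow's API.

```python
def chunk_words(text, chunk_size=200, overlap=20):
    """Split `text` into chunks of up to `chunk_size` words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    boundary still appears intact in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


chunks = chunk_words("a b c d e f g", chunk_size=3, overlap=1)
```

Larger overlaps improve the chance that a relevant passage lands whole in some chunk, at the cost of storing more text.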
Vector Search
To perform a similarity search:
from searchflow.db.postgresql import DB
db = DB()
results = db.similarity_search(project_name="example_project", query="example query")
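The internals of similarity_search are not shown here. Conceptually, a vector search ranks stored embeddings by how close they are to the query embedding. The sketch below assumes cosine similarity as the measure and uses toy two-dimensional vectors; `cosine_similarity`, `top_k`, and the sample data are hypothetical and do not reflect how SearchFlow queries PostgreSQL.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, documents, k=2):
    """Return the ids of the k documents most similar to query_vector."""
    scored = [
        (cosine_similarity(query_vector, vector), doc_id)
        for doc_id, vector in documents.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _score, doc_id in scored[:k]]


# Toy embeddings standing in for real model output.
toy_embeddings = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.7, 0.7],
    "doc_c": [0.0, 1.0],
}
ranked = top_k([1.0, 0.1], toy_embeddings, k=2)
```

In a real deployment the scoring would run inside the database rather than in Python, so only the best matches are returned to the application.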