An assistant helping you to index webpages into structured datasets.
SearchFlow
SearchFlow is an assistant designed to help you index webpages into structured datasets. It leverages various tools and models to scrape, process, and store web content efficiently.
Features
- Web Scraping: Uses trafilatura for focused crawling and web scraping.
- Document Processing: Supports chunking and processing of various document types.
- Database Management: Manages projects, documents, and prompts using PostgreSQL.
- Vector Search: Utilizes vector search for document retrieval.
- LLM Integration: Integrates with language models for question answering and document grading.
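The chunking mentioned under Document Processing can be pictured with a small standalone sketch. This is not SearchFlow's implementation: the function name `chunk_text`, the character-based windows, and the `chunk_size` and `overlap` parameters are all assumptions chosen for illustration.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap by `overlap` characters.

    Hypothetical helper for illustration only; SearchFlow's own chunker may differ.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=2))
```

Overlapping windows keep some shared context between neighbouring chunks, which tends to help retrieval when a relevant sentence straddles a chunk boundary.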
Installation
To set up the development environment, use the provided Dockerfile and .devcontainer/devcontainer.json for a consistent development setup.
Prerequisites
- Docker
- Python 3.11 or higher
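A quick way to confirm the Python prerequisite before installing is a version check like the one below. The helper name `meets_minimum` is hypothetical; only the 3.11 minimum comes from the list above.

```python
import sys

MINIMUM_PYTHON = (3, 11)  # taken from the prerequisites list


def meets_minimum(version, minimum=MINIMUM_PYTHON) -> bool:
    """Return True if the (major, minor) part of `version` is at least `minimum`."""
    return tuple(version[:2]) >= tuple(minimum)


# Report on the interpreter that is running this snippet.
if meets_minimum(sys.version_info):
    print("Python version OK")
else:
    print("Python 3.11 or higher is required")
```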
Steps
Usage
Install SearchFlow via pip:
pip install searchflow
Quickstart
- Initialize the Database
from searchflow.db.postgresql import DB
db = DB()
- Create a project
db.create_project(project_name="example_project")
- Import Data from a URL
from searchflow.importers import WebScraper
scraper = WebScraper(project_name='MyProject', db=db)
scraper.full_import("https://example.com", max_pages=100)
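To give a feel for what a page limit does during a crawl, here is a standalone sketch of a breadth-first crawl that stops after a fixed number of pages. It assumes `max_pages` caps the number of pages visited, which the text above does not confirm. The function `bounded_crawl`, the `get_links` callback, and the in-memory link graph are all hypothetical and do not touch the network or SearchFlow itself.

```python
from collections import deque


def bounded_crawl(start, get_links, max_pages=100):
    """Visit pages breadth-first from `start`, stopping after `max_pages` pages.

    `get_links(url)` is a caller-supplied function returning the links on a page.
    """
    visited = []          # pages in the order they were visited
    seen = {start}        # pages already queued, to avoid revisiting
    queue = deque([start])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Toy link graph standing in for a real website.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(bounded_crawl("a", lambda u: graph.get(u, []), max_pages=3))
```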
- Upload a File to the Project
from searchflow.importers import Files
with open("path/to/your/file.pdf", "rb") as f:
    bytes_data = f.read()
files = Files()
files.upload_file(
    document_data=[(bytes_data, "file.pdf")],
    project_name="MyProject",
    inference_type="local"
)
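When several files need uploading, the list of `(bytes, file name)` tuples shown above can be built with a small helper. The tuple shape follows the example; the helper name `build_document_data` is hypothetical, and whether `upload_file` accepts more than one tuple per call is an assumption.

```python
from pathlib import Path


def build_document_data(paths):
    """Read each file and return a list of (file_bytes, file_name) tuples."""
    document_data = []
    for p in paths:
        path = Path(p)
        # Store only the base name, matching the "file.pdf" style used above.
        document_data.append((path.read_bytes(), path.name))
    return document_data
```

The result could then be passed as `document_data=build_document_data([...])`, assuming the call accepts a list with more than one entry.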
- List Files in a Project
files.list_files(project_name="MyProject")
- Remove a File from a Project
files.remove_file(project_name="MyProject", file_name="file.pdf")
Question Answering
Vector Search
To perform a similarity search:
from searchflow.db.postgresql import DB
db = DB()
results = db.similarity_search(project_name="example_project", query="example query")
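For intuition about what a similarity search does, the sketch below ranks a handful of toy vectors by cosine similarity to a query vector. It is a conceptual illustration only: the functions `cosine_similarity` and `top_k` are hypothetical, the vectors are made up, and the text above does not say which distance metric or index SearchFlow uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vectors, k=3):
    """Return the names of the k documents most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in doc_vectors.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


# Toy embeddings standing in for real document vectors.
docs = {"doc_a": [1.0, 0.0], "doc_b": [0.0, 1.0], "doc_c": [1.0, 1.0]}
print(top_k([1.0, 0.0], docs, k=2))
```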
File details
Details for the file searchflow-0.0.111.tar.gz.
File metadata
- Download URL: searchflow-0.0.111.tar.gz
- Upload date:
- Size: 33.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.3 CPython/3.12.7 Linux/6.5.0-1025-azure
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | dde33cd007a8c56babc12bf973759a0f1e5a617fc6dc89fe8505b1b0b6116e7f |
| MD5 | c574704340392a7defd678c68c83f9a8 |
| BLAKE2b-256 | 78c55b747bfe7c1f5059d066d268fa7037e7980c6ab3ed54e78a9bddb114fda9 |
File details
Details for the file searchflow-0.0.111-py3-none-any.whl.
File metadata
- Download URL: searchflow-0.0.111-py3-none-any.whl
- Upload date:
- Size: 41.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.3 CPython/3.12.7 Linux/6.5.0-1025-azure
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8fabd2cad973a663b4386aad11595cd9f59146d624ad3d0009193d112ad82fcc |
| MD5 | 5d763292febda2073be0edc981f047f0 |
| BLAKE2b-256 | 4436cc6c73b5e3633563f6a3c148d91416726a03cb61a761ddf9020d4203e724 |
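A downloaded archive can be checked against the digests listed above using Python's standard `hashlib` module. This is a minimal sketch; the helper name `file_digests` and the local file path in the usage note are assumptions.

```python
import hashlib


def file_digests(path, chunk_size=65536):
    """Compute the SHA256 and BLAKE2b-256 hex digests of a file, reading it in chunks."""
    sha256 = hashlib.sha256()
    blake2b_256 = hashlib.blake2b(digest_size=32)  # 32 bytes = 256 bits
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            sha256.update(block)
            blake2b_256.update(block)
    return {"sha256": sha256.hexdigest(), "blake2b_256": blake2b_256.hexdigest()}
```

For example, `file_digests("searchflow-0.0.111.tar.gz")["sha256"]` can be compared with the SHA256 value listed for the source distribution, assuming the archive was saved under that name in the current directory.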