Convert knowledge base websites into vectors to load them into vector databases
Description
Convert your knowledge base website (or any page) into a text file that you can store as a vector in any existing vector database. page2vec uses AI browser automation to scrape and process web content.
The databases supported are:
- Pinecone
- ChromaDB
- Milvus
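Because the output is plain text, you can also embed it yourself and load it into a store that is not listed above. The following is a minimal sketch, not part of page2vec, that assumes the generated text has been saved to a local file named output.txt (an illustrative name) and uses the OpenAI embeddings API with an assumed model:

from openai import OpenAI

client = OpenAI(api_key="[YOUR OPENAI API KEY]")

# Read the text produced by page2vec (file name is an assumption)
with open("output.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Turn the text into an embedding you can upsert into any vector database
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=text,
)
vector = response.data[0].embedding
print(len(vector))  # dimensionality of the resulting vector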
Installation
pip install page2vec
Example Usage
Command Line Interface
After installation, you can use the page2vec command. All commands require an OpenAI API key for the LLM that the browser automation agent uses.
Pinecone
Update all values with the ones from your account: API key, index name, namespace, etc.
page2vec \
--database pinecone \
--url "[INSERT A URL]" \
--openai-api-key "[YOUR OPENAI API KEY]" \
--pinecone-api-key "[YOUR PINECONE API KEY]" \
--pinecone-index "page2vec-testing" \
--pinecone-namespace "page2vec-default"
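After the command finishes, you can verify that the vectors reached your index with the official Pinecone Python client. A minimal sketch, assuming a current pinecone SDK and the index name used above:

from pinecone import Pinecone

pc = Pinecone(api_key="[YOUR PINECONE API KEY]")
index = pc.Index("page2vec-testing")

# Print vector counts per namespace to confirm the upsert
print(index.describe_index_stats())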
ChromaDB
page2vec \
--url "[INSERT A URL]" \
--database chromadb \
--openai-api-key "[YOUR OPENAI API KEY]" \
--chromadb-api-key "[YOUR CHROMADB API KEY]" \
--chromadb-database-name "page2vec-testing"
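You can check the result with the chromadb Python client in the same way. A minimal sketch, assuming a recent chromadb release with Chroma Cloud support; the tenant ID is your own account value and is not a page2vec flag:

import chromadb

client = chromadb.CloudClient(
    tenant="[YOUR CHROMA TENANT ID]",
    database="page2vec-testing",
    api_key="[YOUR CHROMADB API KEY]",
)

# List the collections that now exist in the database
print(client.list_collections())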
Milvus
page2vec \
--url "[INSERT A URL]" \
--database milvus \
--openai-api-key "[YOUR OPENAI API KEY]" \
--milvus-output-file "milvus_data.db" \
--milvus-collection-name "page2vec-testing"
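The --milvus-output-file flag points at a local file, which suggests a Milvus Lite database; if so, you can open it afterwards with pymilvus. A minimal sketch under that assumption:

from pymilvus import MilvusClient

# Open the local Milvus Lite file produced by the run
client = MilvusClient("milvus_data.db")

print(client.list_collections())
print(client.describe_collection("page2vec-testing"))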
Custom prompt example
page2vec \
--database pinecone \
--url "[INSERT A URL]" \
--openai-api-key "[YOUR OPENAI API KEY]" \
--pinecone-api-key "[YOUR PINECONE API KEY]" \
--pinecone-index "page2vec-testing" \
--pinecone-namespace "page2vec-default" \
--custom-prompt "Find all the paragraphs of the documentation in https://docs.trychroma.com/docs/overview/contributing. Store only the first 3 paragraphs in a separate row in a CSV."
Python API
You can also use page2vec programmatically:
import asyncio
from page2vec import async_main
from argparse import Namespace

# Create arguments
args = Namespace(
    database="pinecone",
    url="https://docs.example.com",
    openai_api_key="your-openai-api-key",
    pinecone_api_key="your-pinecone-api-key",
    pinecone_index="your-index",
    pinecone_namespace="your-namespace",
    test_mode=False,
)

# Run the process
asyncio.run(async_main(args))
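The same entry point should work for the other databases. The Namespace field names below are an assumption: they mirror the CLI flags with underscores instead of dashes and are not confirmed by the documentation above.

import asyncio
from page2vec import async_main
from argparse import Namespace

# Assumed field names derived from the --milvus-* CLI flags
args = Namespace(
    database="milvus",
    url="https://docs.example.com",
    openai_api_key="your-openai-api-key",
    milvus_output_file="milvus_data.db",
    milvus_collection_name="page2vec-testing",
    test_mode=False,
)

asyncio.run(async_main(args))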
Dependencies
This project uses:
- browser-use: AI-powered browser automation for intelligent web scraping
- Database SDKs: Pinecone, ChromaDB, and Milvus clients
Future Support
These databases will be supported in the future (or upon request):
- PostgreSQL
- Elasticsearch
Development Installation
If you'd like to debug something or contribute:
git clone <repository-url>
cd page2vec
pip install -e ".[dev]"
Download files
File details
Details for the file page2vec-0.1.0.tar.gz.
File metadata
- Download URL: page2vec-0.1.0.tar.gz
- Upload date:
- Size: 126.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 049ca62ac552ca66a0a350431beed9584ab654faebc9b521a37a66eed3db06c0 |
| MD5 | 1d31b9292dc85c7e1e50e519d1bd4d70 |
| BLAKE2b-256 | baa4b4a0abc18f6ad9df162bb44524d7460a1cc9310aadbdd954a183e30a8f38 |
File details
Details for the file page2vec-0.1.0-py3-none-any.whl.
File metadata
- Download URL: page2vec-0.1.0-py3-none-any.whl
- Upload date:
- Size: 9.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b5bd6fc825b6d45c5aa3aac7ee9d3920dddd3eae4ac58fdb7c958a6c389558cd |
| MD5 | db9051d35b4c65bd858255ea170cf80b |
| BLAKE2b-256 | b8058f950e733e81ff16c867375910c7c13d953eba685606c0f7c3556816c489 |