
Convert knowledge base websites into vectors to load them into vector databases


Description

Convert your knowledge base website (or any page) into a text file that you can store as a vector in any of the existing vector databases. Uses AI browser automation to scrape and process web content.

The databases supported are:

  • Pinecone
  • ChromaDB
  • Milvus

Installation

pip install page2vec

Example Usage

Command Line Interface

After installation, you can use the page2vec command. All commands require an OpenAI API key for the LLM that the browser automation agent uses.

Pinecone

Update all values with the ones from your account: API key, index name, namespace, etc.

page2vec \
  --database pinecone \
  --url "[INSERT A URL]"  \
  --openai-api-key "[YOUR OPENAI API KEY]"  \
  --pinecone-api-key "[YOUR PINECONE API KEY]"  \
  --pinecone-index "page2vec-testing" \
  --pinecone-namespace "page2vec-default"
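
Once the command finishes, you can check that vectors actually landed in the index. A minimal sketch (not part of page2vec) using the official Pinecone Python client, with the index and namespace names from the command above:

import os
from pinecone import Pinecone

# Connect with the same API key used in the CLI command.
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index("page2vec-testing")

# Shows vector counts per namespace, including "page2vec-default".
print(index.describe_index_stats())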

ChromaDB

page2vec \
  --url "[INSERT A URL]"  \
  --database chromadb \
  --openai-api-key "[YOUR OPENAI API KEY]"  \
  --chromadb-api-key "[YOUR CHROMADB API KEY]"  \
  --chromadb-database-name "page2vec-testing"
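
To inspect the result afterwards, a minimal sketch (not part of page2vec) using the chromadb Python client. This assumes a Chroma Cloud deployment; the tenant value is a placeholder, and the exact client constructor may vary by chromadb version:

import os
import chromadb

# Connect to the same Chroma Cloud database targeted by the CLI command.
client = chromadb.CloudClient(
    api_key=os.environ["CHROMA_API_KEY"],
    tenant="[YOUR TENANT]",          # placeholder, taken from your Chroma account
    database="page2vec-testing",
)

# List the collections that now hold the scraped content.
print(client.list_collections())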

Milvus

page2vec \
  --url "[INSERT A URL]"  \
  --database milvus \
  --openai-api-key "[YOUR OPENAI API KEY]"  \
  --milvus-output-file "milvus_data.db"  \
  --milvus-collection-name "page2vec-testing"
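
Because the command writes to a local file, you can open it afterwards with Milvus Lite. A minimal sketch (not part of page2vec) using the pymilvus MilvusClient, pointed at the same output file:

from pymilvus import MilvusClient

# Open the local Milvus Lite file produced by --milvus-output-file.
client = MilvusClient("milvus_data.db")

# The collection named in --milvus-collection-name should appear here.
print(client.list_collections())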

Custom prompt example

page2vec \
  --database pinecone \
  --url "[INSERT A URL]" \
  --openai-api-key "[YOUR OPENAI API KEY]" \
  --pinecone-api-key "[YOUR PINECONE API KEY]" \
  --pinecone-index "page2vec-testing" \
  --pinecone-namespace "page2vec-default" \
  --custom-prompt "Find all the paragraphs of the documentation in https://docs.trychroma.com/docs/overview/contributing. Store only the first 3 paragraphs in a separate row in a CSV."

Python API

You can also use page2vec programmatically:

import asyncio
from page2vec import async_main
from argparse import Namespace

# Create arguments
args = Namespace(
    database="pinecone",
    url="https://docs.example.com",
    openai_api_key="your-openai-api-key",
    pinecone_api_key="your-pinecone-api-key",
    pinecone_index="your-index",
    pinecone_namespace="your-namespace",
    test_mode=False
)

# Run the process
asyncio.run(async_main(args))
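
If you prefer to keep credentials out of the source, a small sketch along the same lines that reads keys from environment variables. The helper function and the environment variable names are illustrative, not defined by page2vec:

import asyncio
import os
from argparse import Namespace

from page2vec import async_main

# Hypothetical helper: assemble the arguments from environment variables.
def build_args(url: str) -> Namespace:
    return Namespace(
        database="pinecone",
        url=url,
        openai_api_key=os.environ["OPENAI_API_KEY"],
        pinecone_api_key=os.environ["PINECONE_API_KEY"],
        pinecone_index="page2vec-testing",
        pinecone_namespace="page2vec-default",
        test_mode=False,
    )

asyncio.run(async_main(build_args("https://docs.example.com")))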

Dependencies

This project uses:

  • browser-use: AI-powered browser automation for intelligent web scraping
  • Database SDKs: Pinecone, ChromaDB, and Milvus clients

Future Support

These databases will be supported in the future (or upon request):

  • PostgreSQL
  • Elasticsearch

Development Installation

If you'd like to debug something or contribute:

git clone <repository-url>
cd page2vec
pip install -e ".[dev]"

Download files

Download the file for your platform.

Source Distribution

page2vec-0.1.0.tar.gz (126.6 kB)


Built Distribution


page2vec-0.1.0-py3-none-any.whl (9.2 kB)


File details

Details for the file page2vec-0.1.0.tar.gz.

File metadata

  • Download URL: page2vec-0.1.0.tar.gz
  • Upload date:
  • Size: 126.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.3

File hashes

Hashes for page2vec-0.1.0.tar.gz:

  • SHA256: 049ca62ac552ca66a0a350431beed9584ab654faebc9b521a37a66eed3db06c0
  • MD5: 1d31b9292dc85c7e1e50e519d1bd4d70
  • BLAKE2b-256: baa4b4a0abc18f6ad9df162bb44524d7460a1cc9310aadbdd954a183e30a8f38
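
To check a downloaded file against the published digests, a minimal sketch using Python's standard hashlib (it assumes the sdist sits in the current directory):

import hashlib

# Compare the downloaded sdist against the SHA256 digest listed above.
expected = "049ca62ac552ca66a0a350431beed9584ab654faebc9b521a37a66eed3db06c0"
with open("page2vec-0.1.0.tar.gz", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()
print("OK" if digest == expected else "hash mismatch")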


File details

Details for the file page2vec-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: page2vec-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 9.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.3

File hashes

Hashes for page2vec-0.1.0-py3-none-any.whl:

  • SHA256: b5bd6fc825b6d45c5aa3aac7ee9d3920dddd3eae4ac58fdb7c958a6c389558cd
  • MD5: db9051d35b4c65bd858255ea170cf80b
  • BLAKE2b-256: b8058f950e733e81ff16c867375910c7c13d953eba685606c0f7c3556816c489

