
Convert knowledge base websites into vectors to load them into vector databases


Description

Convert your knowledge base website (or any page) into a text file that you can store as a vector in any of the existing vector databases. Uses AI browser automation to scrape and process web content.

The databases supported are:

  • Pinecone
  • ChromaDB
  • Milvus

Installation

pip install page2vec

Example Usage

Command Line Interface

After installation, you can use the page2vec command. All commands require an OpenAI API key for the LLM that the browser automation agent uses.

Pinecone

Update all values with the ones from your account: API key, index name, namespace, etc.

page2vec \
  --database pinecone \
  --url "[INSERT A URL]"  \
  --openai-api-key "[YOUR OPENAI API KEY]"  \
  --pinecone-api-key "[YOUR PINECONE API KEY]"  \
  --pinecone-index "page2vec-testing" \
  --pinecone-namespace "page2vec-default"
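
Once the command finishes, you can check that vectors actually landed in the index. A minimal sketch (not part of page2vec) using the official Pinecone Python client, with the index and namespace names from the command above:

import os
from pinecone import Pinecone

# Connect with the same API key used in the CLI command.
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index("page2vec-testing")

# Shows vector counts per namespace, including "page2vec-default".
print(index.describe_index_stats())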

ChromaDB

page2vec \
  --url "[INSERT A URL]"  \
  --database chromadb \
  --openai-api-key "[YOUR OPENAI API KEY]"  \
  --chromadb-api-key "[YOUR CHROMADB API KEY]"  \
  --chromadb-database-name "page2vec-testing"
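
To inspect the result afterwards, a minimal sketch (not part of page2vec) using the chromadb Python client. This assumes a Chroma Cloud deployment; the tenant value is a placeholder, and the exact client constructor may vary by chromadb version:

import os
import chromadb

# Connect to the same Chroma Cloud database targeted by the CLI command.
client = chromadb.CloudClient(
    api_key=os.environ["CHROMA_API_KEY"],
    tenant="[YOUR TENANT]",          # placeholder, taken from your Chroma account
    database="page2vec-testing",
)

# List the collections that now hold the scraped content.
print(client.list_collections())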

Milvus

page2vec \
  --url "[INSERT A URL]"  \
  --database milvus \
  --openai-api-key "[YOUR OPENAI API KEY]"  \
  --milvus-output-file "milvus_data.db"  \
  --milvus-collection-name "page2vec-testing"
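
Because the command writes to a local file, you can open it afterwards with Milvus Lite. A minimal sketch (not part of page2vec) using the pymilvus MilvusClient, pointed at the same output file:

from pymilvus import MilvusClient

# Open the local Milvus Lite file produced by --milvus-output-file.
client = MilvusClient("milvus_data.db")

# The collection named in --milvus-collection-name should appear here.
print(client.list_collections())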

Custom prompt example

page2vec \
  --database pinecone \
  --url "[INSERT A URL]" \
  --openai-api-key "[YOUR OPENAI API KEY]" \
  --pinecone-api-key "[YOUR PINECONE API KEY]" \
  --pinecone-index "page2vec-testing" \
  --pinecone-namespace "page2vec-default" \
  --custom-prompt "Find all the paragraphs of the documentation in https://docs.trychroma.com/docs/overview/contributing. Store only the first 3 paragraphs in a separate row in a CSV."

Python API

You can also use page2vec programmatically:

import asyncio
from page2vec import async_main
from argparse import Namespace

# Create arguments
args = Namespace(
    database="pinecone",
    url="https://docs.example.com",
    openai_api_key="your-openai-api-key",
    pinecone_api_key="your-pinecone-api-key",
    pinecone_index="your-index",
    pinecone_namespace="your-namespace",
    test_mode=False
)

# Run the process
asyncio.run(async_main(args))
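
If you prefer to keep credentials out of the source, a small sketch along the same lines that reads keys from environment variables. The helper function and the environment variable names are illustrative, not defined by page2vec:

import asyncio
import os
from argparse import Namespace

from page2vec import async_main

# Hypothetical helper: assemble the arguments from environment variables.
def build_args(url: str) -> Namespace:
    return Namespace(
        database="pinecone",
        url=url,
        openai_api_key=os.environ["OPENAI_API_KEY"],
        pinecone_api_key=os.environ["PINECONE_API_KEY"],
        pinecone_index="page2vec-testing",
        pinecone_namespace="page2vec-default",
        test_mode=False,
    )

asyncio.run(async_main(build_args("https://docs.example.com")))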

Dependencies

This project uses:

  • browser-use: AI-powered browser automation for intelligent web scraping
  • Database SDKs: Pinecone, ChromaDB, and Milvus clients

Future Support

These databases will be supported in the future (or upon request):

  • PostgreSQL
  • Elasticsearch

Development Installation

If you'd like to debug something or contribute:

git clone <repository-url>
cd page2vec
pip install -e ".[dev]"

Download files

Download the file for your platform.

Source Distribution

page2vec-0.1.0.tar.gz (126.6 kB)


Built Distribution


page2vec-0.1.0-py3-none-any.whl (9.2 kB)


File details

Details for the file page2vec-0.1.0.tar.gz.

File metadata

  • Download URL: page2vec-0.1.0.tar.gz
  • Upload date:
  • Size: 126.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.3

File hashes

Hashes for page2vec-0.1.0.tar.gz:

  • SHA256: 049ca62ac552ca66a0a350431beed9584ab654faebc9b521a37a66eed3db06c0
  • MD5: 1d31b9292dc85c7e1e50e519d1bd4d70
  • BLAKE2b-256: baa4b4a0abc18f6ad9df162bb44524d7460a1cc9310aadbdd954a183e30a8f38
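
To check a downloaded file against the published digests, a minimal sketch using Python's standard hashlib (it assumes the sdist sits in the current directory):

import hashlib

# Compare the downloaded sdist against the SHA256 digest listed above.
expected = "049ca62ac552ca66a0a350431beed9584ab654faebc9b521a37a66eed3db06c0"
with open("page2vec-0.1.0.tar.gz", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()
print("OK" if digest == expected else "hash mismatch")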


File details

Details for the file page2vec-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: page2vec-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 9.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.3

File hashes

Hashes for page2vec-0.1.0-py3-none-any.whl:

  • SHA256: b5bd6fc825b6d45c5aa3aac7ee9d3920dddd3eae4ac58fdb7c958a6c389558cd
  • MD5: db9051d35b4c65bd858255ea170cf80b
  • BLAKE2b-256: b8058f950e733e81ff16c867375910c7c13d953eba685606c0f7c3556816c489

