LlamaIndex-powered web content extractor for RAG applications

webpage-to-text

Extract clean, structured text from web pages using LlamaIndex's HTML parsing. The output is suited to preparing content for RAG (Retrieval-Augmented Generation) systems, vector databases, and knowledge bases.

Features

🚀 LlamaIndex Integration: Leverages LlamaIndex's SimpleWebPageReader for high-quality text extraction
📄 Clean Text Output: Converts HTML to structured, readable text with preserved formatting
⚙️ Configuration-Driven: Use YAML/JSON files to define extraction jobs
🔧 CLI Interface: Simple command-line tool for batch processing
📊 Batch Processing: Extract from multiple URLs with automatic rate limiting
🎯 RAG-Ready: Output format optimized for vector databases and RAG applications
🔄 Flexible Output: Support for custom filenames and directory structures
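
The rate limiting mentioned in the feature list can be pictured as a fixed pause between consecutive requests. The sketch below is an illustration only, not the package's implementation: `fetch_all`, `fetch`, and `sleep` are hypothetical names, and the fetcher is passed in so the example runs without network access.

```python
import time
from typing import Callable, Dict, Iterable


def fetch_all(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    rate_limit: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Call fetch(url) for each URL, pausing rate_limit seconds between calls."""
    results: Dict[str, str] = {}
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first one.
            sleep(rate_limit)
        results[url] = fetch(url)
    return results


# Stand-in fetcher and a recording "sleep", so nothing touches the network.
pauses = []
pages = fetch_all(
    ["https://example.com", "https://example.com/about"],
    fetch=lambda url: f"text of {url}",
    rate_limit=1.0,
    sleep=pauses.append,
)
```

Injecting `sleep` keeps the pause testable; in real use the default `time.sleep` applies.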

Installation

From PyPI

pip install webpage-to-text

From Source

git clone https://github.com/yourusername/webpage-to-text.git
cd webpage-to-text
pip install -e .

Quick Start

Command Line Usage

Extract from a single URL:

webpage-to-text --url https://example.com --output ./texts/

Extract from multiple URLs:

webpage-to-text --url https://example.com --url https://example.com/about --output ./texts/

Use a configuration file:

webpage-to-text --config sites.yaml

Create a sample configuration:

webpage-to-text --create-config sample.yaml

Python API Usage

from webpage_to_text import WebPageExtractor, Config

# Basic usage
extractor = WebPageExtractor(output_dir="./texts")
result = extractor.extract_url("https://example.com")

# Batch processing
urls = ["https://example.com", "https://example.com/about"]
results = extractor.extract_urls(urls)

# Using configuration
config = Config("sites.yaml")
results = extractor.extract_from_config(config.config)

Configuration

YAML Configuration Example

name: "Hotel Chain Extraction"
description: "Extract content from hotel websites"
output_dir: "./hotel_texts"
rate_limit: 1.0

urls:
  - "https://www.hotel.com/"
  - "https://www.hotel.com/rooms"
  - "https://www.hotel.com/amenities"
  - "https://www.hotel.com/contact"

filenames:
  - "001_home.txt"
  - "002_rooms.txt"
  - "003_amenities.txt"
  - "004_contact.txt"
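
The example lists `urls` and `filenames` in parallel, which suggests that entries are matched by position. The sketch below shows that pairing under this assumption; `pair_outputs` is a hypothetical helper, not part of the package, and a plain dict stands in for the parsed YAML so the code needs only the standard library.

```python
from pathlib import Path
from typing import Dict, List, Tuple


def pair_outputs(config: Dict) -> List[Tuple[str, Path]]:
    """Match each URL with its output path, assuming positional pairing."""
    urls = config["urls"]
    filenames = config.get("filenames", [])
    if filenames and len(filenames) != len(urls):
        raise ValueError(
            f"{len(urls)} urls but {len(filenames)} filenames; counts must match"
        )
    output_dir = Path(config.get("output_dir", "."))
    # With no filenames given, zip() yields nothing and the result is empty.
    return [(url, output_dir / name) for url, name in zip(urls, filenames)]


# The same data as the YAML example, shortened to two entries.
config = {
    "output_dir": "./hotel_texts",
    "urls": ["https://www.hotel.com/", "https://www.hotel.com/rooms"],
    "filenames": ["001_home.txt", "002_rooms.txt"],
}
pairs = pair_outputs(config)
```

Checking the counts up front catches a forgotten filename before any page is fetched.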

JSON Configuration Example

{
  "name": "E-commerce Site Extraction",
  "description": "Extract product and category pages",
  "output_dir": "./ecommerce_texts",
  "rate_limit": 2.0,
  "urls": [
    "https://shop.example.com/",
    "https://shop.example.com/categories/electronics",
    "https://shop.example.com/categories/clothing"
  ]
}
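
The JSON example omits `filenames`, so output names have to come from somewhere else. How the package names files in that case is not documented here; the sketch below shows one plausible approach, deriving a name from each URL path. `load_config`, `default_filename`, the naming scheme, and the `./texts` fallback directory are all assumptions made for illustration. The `rate_limit` fallback of 1.0 matches the default shown in the CLI help.

```python
import json
from urllib.parse import urlparse


def default_filename(url: str, index: int) -> str:
    """Build a name like '002_categories_electronics.txt' (hypothetical scheme)."""
    path = urlparse(url).path.strip("/")
    slug = path.replace("/", "_") or "home"
    return f"{index:03d}_{slug}.txt"


def load_config(text: str) -> dict:
    """Parse a JSON config and fill in values the file leaves out."""
    config = json.loads(text)
    config.setdefault("rate_limit", 1.0)  # default documented in the CLI help
    config.setdefault("output_dir", "./texts")  # assumed fallback, not from the package
    if "filenames" not in config:
        config["filenames"] = [
            default_filename(url, i)
            for i, url in enumerate(config["urls"], start=1)
        ]
    return config


raw = """
{
  "name": "E-commerce Site Extraction",
  "rate_limit": 2.0,
  "urls": [
    "https://shop.example.com/",
    "https://shop.example.com/categories/electronics"
  ]
}
"""
config = load_config(raw)
```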

Use Cases

RAG Applications

Perfect for creating knowledge bases for chatbots and Q&A systems:

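from webpage_to_text import WebPageExtractor

# Collect support pages into a single knowledge-base directory
extractor = WebPageExtractor(output_dir="./knowledge_base")
results = extractor.extract_urls([
    "https://company.com/faq",
    "https://company.com/documentation",
    "https://company.com/support"
])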

Content Migration

Move content between systems while preserving structure:

from webpage_to_text import WebPageExtractor, Config

# Extract from old site
extractor = WebPageExtractor(output_dir="./migrated_content")
config = Config("old_site_pages.yaml")
results = extractor.extract_from_config(config.config)

Research Data Collection

Collect structured data for analysis:

from webpage_to_text import WebPageExtractor

# Research paper extraction
extractor = WebPageExtractor(output_dir="./research_papers")
urls = ["https://arxiv.org/abs/1234.5678", "https://arxiv.org/abs/8765.4321"]
results = extractor.extract_urls(urls)

Output Format

The extracted text maintains structure while being clean and readable:

# Page Title

## Section Header

Content paragraph with proper formatting.

* List item 1
* List item 2

[Link Text](https://example.com)

### Subsection

More content here...
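
Because the output keeps Markdown-style headings, a downstream RAG pipeline can split it into one chunk per heading before embedding. The sketch below is one simple way to do that with the standard library; `split_by_heading` is a hypothetical helper and not something the package provides.

```python
import re
from typing import Dict, List, Tuple

# A heading is one to six '#' characters followed by whitespace and a title.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_heading(text: str) -> List[Dict[str, str]]:
    """Split Markdown-style text into one chunk per heading."""
    sections: List[Tuple[str, List[str]]] = [("", [])]
    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            # Each heading starts a new section.
            sections.append((match.group(2).strip(), []))
        else:
            sections[-1][1].append(line)
    chunks = []
    for heading, lines in sections:
        body = "\n".join(lines).strip()
        if heading or body:  # drop the empty section before the first heading
            chunks.append({"heading": heading, "body": body})
    return chunks


sample = """# Page Title

## Section Header

Content paragraph with proper formatting.

* List item 1
* List item 2

### Subsection

More content here...
"""
chunks = split_by_heading(sample)
```

This treats every line starting with `#` as a heading, so extracted pages that contain code with `#` comments would need a more careful splitter.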

CLI Options

webpage-to-text --help

options:
  -h, --help            show this help message and exit
  --config CONFIG, -c CONFIG
                        Path to configuration file (YAML or JSON)
  --url URL, -u URL     URL to extract (can be used multiple times)
  --create-config CREATE_CONFIG
                        Create a sample configuration file
  --output OUTPUT, -o OUTPUT
                        Output directory for extracted text files
  --rate-limit RATE_LIMIT, -r RATE_LIMIT
                        Rate limit between requests in seconds (default: 1.0)
  --filename FILENAME, -f FILENAME
                        Custom filename for output
  --verbose, -v         Enable verbose output
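
For readers who want to script against the same options, the help text above can be mirrored with `argparse`. This is a reconstruction from the help output, not the package's actual source; in particular, using `action="append"` for the repeatable `--url` flag is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Recreate the documented options (a sketch, not the package's own parser)."""
    parser = argparse.ArgumentParser(prog="webpage-to-text")
    parser.add_argument("--config", "-c",
                        help="Path to configuration file (YAML or JSON)")
    parser.add_argument("--url", "-u", action="append",
                        help="URL to extract (can be used multiple times)")
    parser.add_argument("--create-config",
                        help="Create a sample configuration file")
    parser.add_argument("--output", "-o",
                        help="Output directory for extracted text files")
    parser.add_argument("--rate-limit", "-r", type=float, default=1.0,
                        help="Rate limit between requests in seconds (default: 1.0)")
    parser.add_argument("--filename", "-f",
                        help="Custom filename for output")
    parser.add_argument("--verbose", "-v", action="store_true",
                        help="Enable verbose output")
    return parser


args = build_parser().parse_args(
    ["-u", "https://example.com", "-u", "https://example.com/about",
     "-o", "./texts/", "-r", "2"]
)
```

Repeated `-u` flags accumulate into the list `args.url`, and `-r 2` is converted to the float `args.rate_limit`.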

Development

Setup Development Environment

git clone https://github.com/yourusername/webpage-to-text.git
cd webpage-to-text
pip install -e ".[dev]"

Run Tests

pytest

Code Formatting

black src/
flake8 src/
mypy src/

Contributing

1. Fork the repository
2. Create a feature branch (git checkout -b feature/amazing-feature)
3. Commit your changes (git commit -m 'Add amazing feature')
4. Push to the branch (git push origin feature/amazing-feature)
5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

Built on LlamaIndex for robust web content extraction
Inspired by the need for high-quality text extraction for RAG applications
Thanks to the open-source community for foundational libraries

Support

Made with ❤️ for the RAG and AI community

Release history

Version 0.1.0, released Jul 9, 2025

Download files

Download the file for your platform.

Source Distribution

webpage_to_text-0.1.0.tar.gz (14.1 kB)

Uploaded Jul 9, 2025 Source

Built Distribution

webpage_to_text-0.1.0-py3-none-any.whl (9.6 kB)

Uploaded Jul 9, 2025 Python 3

File details

Details for the file webpage_to_text-0.1.0.tar.gz.

File metadata

Download URL: webpage_to_text-0.1.0.tar.gz
Upload date: Jul 9, 2025
Size: 14.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.3

File hashes

Hashes for webpage_to_text-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`4f693dc746aa8aa6467d7eb927260bd59f08c674c327033da039ff12f7439bc3`
MD5	`4445bb5b84b14408cae21b45b24449c8`
BLAKE2b-256	`3d7a38777b88c83203b3a0f16140e0f9b75882420955a6028624a1dd04bf42c8`

File details

Details for the file webpage_to_text-0.1.0-py3-none-any.whl.

File metadata

Download URL: webpage_to_text-0.1.0-py3-none-any.whl
Upload date: Jul 9, 2025
Size: 9.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.3

File hashes

Hashes for webpage_to_text-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a8cd485a570b9d665818d9eac025317ff95143088e4e6c92ccb285da90a778c6`
MD5	`5de810dc1436672c017d524ee6e2cdbd`
BLAKE2b-256	`b8180dfca4655014f0d6f78817871c53c3667533386411dacfe63e803430bec6`
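
A downloaded file can be checked against the SHA256 digests listed above with the standard library's `hashlib`. The helper names `sha256_of` and `matches` below are illustrative; the demo hashes a temporary file so the sketch runs without downloading anything.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with an expected hex string, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()


# Demo on a throwaway file; substitute the downloaded archive and its listed digest.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"hello")
    ok = matches(sample, hashlib.sha256(b"hello").hexdigest())
    bad = matches(sample, "0" * 64)
```

For the wheel, the call would be `matches(Path("webpage_to_text-0.1.0-py3-none-any.whl"), ...)` with the SHA256 value from the table above as the second argument.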
