A comprehensive web scraping helper package with XPath selectors, regex patterns, and CSS selectors

WebExtractionHelper

A Python package providing XPath selectors, regex patterns, and CSS selectors for scraping web content, including Google search features such as featured snippets, related questions, and other SERP elements.

🚀 Features

95+ Pre-built Selectors: Comprehensive collection of XPath selectors for web scraping
Google Search Support: Specialized selectors for Google SERP features
Multiple Content Types: Support for featured snippets, related questions, images, links, and more
Easy to Use: Simple API with clear explanations for each selector
Well Documented: Each selector includes detailed explanations and usage examples

📦 Installation

From PyPI (Recommended)

pip install webextractionhelper

From Source

git clone https://github.com/Artistotle-ai/webextractionhelper.git
cd webextractionhelper
pip install -e .

🔧 Requirements

Python 3.7+
lxml >= 4.6.0

📚 Quick Start

from webextractionhelper import Selectors

# Create a Selectors instance
selectors = Selectors()

# Access Google featured snippet selectors
featured_title_xpath = selectors.selectors['google.featured_snippet_title']['xpath']
featured_text_xpath = selectors.selectors['google.featured_snippet_text']['xpath']

# Access related questions selectors
related_questions_xpath = selectors.selectors['google.related_questions_all']['xpath']

print(f"Featured snippet title XPath: {featured_title_xpath}")
print(f"Featured snippet text XPath: {featured_text_xpath}")
print(f"Related questions XPath: {related_questions_xpath}")
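
A missing or misspelled key in the lookups above would presumably surface as a bare KeyError. The sketch below wraps the lookup in a small helper that fails with a clearer message. It is not part of the package: `get_xpath` and the `demo_selectors` dictionary are hypothetical stand-ins that mirror the layout used in this README, and the XPath strings are placeholders rather than the package's real selectors.

```python
from typing import Dict, Mapping

# Hypothetical stand-in for Selectors().selectors; the XPath values are
# placeholders, not the expressions shipped with the package.
demo_selectors: Dict[str, Dict[str, str]] = {
    "google.featured_snippet_title": {
        "explanation": "Title of the featured snippet",
        "xpath": "//placeholder-title",
    },
    "meta.description": {
        "explanation": "Content of the meta description tag",
        "xpath": "//meta[@name='description']",
    },
}


def get_xpath(selector_map: Mapping[str, Mapping[str, str]], key: str) -> str:
    """Return the XPath stored under key, or raise a descriptive KeyError."""
    try:
        return selector_map[key]["xpath"]
    except KeyError:
        known = ", ".join(sorted(selector_map))
        raise KeyError(f"No XPath for {key!r}; known keys: {known}") from None


print(get_xpath(demo_selectors, "meta.description"))
```

With the package installed, passing `Selectors().selectors` instead of `demo_selectors` should work the same way, provided the dictionary has the layout shown in this README.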

🎯 Available Selector Categories

Google Search Selectors (21 selectors)

Featured Snippets: Title, text, bullet points, numbered lists, tables, URLs, images
Related Questions: Individual questions, all questions, answer snippets, source titles/URLs
Search Results: Main containers, links, titles, descriptions

Meta & Open Graph Selectors (11 selectors)

Meta Tags: Title, description, keywords, robots, viewport
Open Graph: Title, description, image, URL, type, site name

Social Media Selectors (6 selectors)

Twitter/X: Card type, title, description, image, creator, site

Content Selectors (10 selectors)

Headings: H1, H2, H3, H4
Text Content: Paragraphs, lists, blockquotes
Forms: Input fields, buttons, labels

Media Selectors (5 selectors)

Images: Source, alt text, title, dimensions
Videos: Source, poster, dimensions

Link Selectors (7 selectors)

Navigation: Main nav, footer links, breadcrumbs
Content Links: Internal, external, download links
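
The keys used elsewhere in this README (`google.featured_snippet_title`, `meta.description`) suggest that the part before the first dot names the category. Assuming that convention holds for every key, which the README does not state outright, a tally per category can be sketched as follows. `count_by_category` is a hypothetical helper and `demo_keys` is made-up sample data.

```python
from collections import Counter
from typing import Dict, Iterable


def count_by_category(keys: Iterable[str]) -> Dict[str, int]:
    """Count selector keys by the text before the first dot."""
    return dict(Counter(key.split(".", 1)[0] for key in keys))


# Made-up sample; with the package installed, Selectors().selectors would
# supply the real keys (iterating a dict yields its keys).
demo_keys = [
    "google.featured_snippet_title",
    "google.featured_snippet_text",
    "google.related_questions_all",
    "meta.description",
]

for category, total in sorted(count_by_category(demo_keys).items()):
    print(f"{category}: {total}")
```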

🔍 Usage Examples

Example 1: Extract Google Featured Snippet

from webextractionhelper import Selectors
import requests  # not listed under Requirements; install it separately
from lxml import html

selectors = Selectors()

# Get the page content (fail fast on network problems or HTTP errors)
url = "https://www.google.com/search?q=python+programming"
response = requests.get(url, timeout=10)
response.raise_for_status()
tree = html.fromstring(response.content)

# Extract featured snippet title
title_xpath = selectors.selectors['google.featured_snippet_title']['xpath']
title_elements = tree.xpath(title_xpath)

if title_elements:
    title = title_elements[0].text_content()
    print(f"Featured snippet title: {title}")
else:
    print("No featured snippet found")

Example 2: Extract All Related Questions

# Get all related questions (reuses `selectors` and `tree` from Example 1)
questions_xpath = selectors.selectors['google.related_questions_all']['xpath']
question_elements = tree.xpath(questions_xpath)

for i, question in enumerate(question_elements, 1):
    print(f"Question {i}: {question.text_content()}")

Example 3: Extract Meta Information

# Get page meta description (reuses `selectors` and `tree` from Example 1)
meta_desc_xpath = selectors.selectors['meta.description']['xpath']
meta_desc_elements = tree.xpath(meta_desc_xpath)

if meta_desc_elements:
    description = meta_desc_elements[0].get('content')
    print(f"Meta description: {description}")
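
The examples above need network access and a live results page. To try the same extraction pattern offline, the sketch below runs an XPath against a small inline document using only the standard library. Everything in it is an assumption made for illustration: the sample markup, the `extract_meta_description` helper, and the XPath itself, which stands in for `meta.description` and may differ from the expression the package ships. `xml.etree.ElementTree` understands only a subset of XPath and needs well-formed markup, so real pages are better handled with lxml.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Small, well-formed sample page (made up for this sketch).
SAMPLE_PAGE = """
<html>
  <head>
    <title>Sample page</title>
    <meta name="description" content="A short sample description." />
  </head>
  <body><h1>Hello</h1></body>
</html>
""".strip()

# Placeholder XPath; the package's own 'meta.description' selector may differ.
META_DESCRIPTION_XPATH = ".//meta[@name='description']"


def extract_meta_description(markup: str, xpath: str) -> Optional[str]:
    """Return the content attribute of the first match, or None."""
    root = ET.fromstring(markup)
    element = root.find(xpath)
    return None if element is None else element.get("content")


print(extract_meta_description(SAMPLE_PAGE, META_DESCRIPTION_XPATH))
```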

📋 Selector Structure

Each selector in the package follows this structure:

{
    'explanation': 'Human-readable description of what this selector extracts',
    'xpath': 'The XPath expression to extract the content',
    'regex': 'Optional regex pattern for text processing',
    'css': 'Optional CSS selector alternative'
}
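
Since `regex` and `css` are optional, code that consumes a selector has to cope with entries that lack them. The README does not say whether a missing field is left out or stored as an empty value, so the sketch below treats both cases the same way. `refine_text` is a hypothetical helper, and both sample entries are invented for illustration.

```python
import re
from typing import Mapping, Optional


def refine_text(selector: Mapping[str, Optional[str]], text: str) -> str:
    """Apply the selector's regex to text if it has one.

    Returns the matched text, or the stripped input when the selector has
    no pattern or the pattern does not match.
    """
    pattern = selector.get("regex")
    if not pattern:
        return text.strip()
    match = re.search(pattern, text)
    if match is None:
        return text.strip()
    return match.group(0)


# Invented entries that follow the structure shown above.
with_regex = {"explanation": "Price", "xpath": "//span", "regex": r"\d+\.\d{2}"}
without_regex = {"explanation": "Heading", "xpath": "//h1"}

print(refine_text(with_regex, "Total: 19.99 EUR"))
print(refine_text(without_regex, "  Welcome  "))
```

Falling back to the unmodified text keeps the XPath result usable even when a pattern is absent or does not match.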

🛠️ Development

Setting up development environment

git clone https://github.com/Artistotle-ai/webextractionhelper.git
cd webextractionhelper
pip install -e ".[dev]"

Running tests

python test_package.py
python example_usage.py

Building the package

python -m build

📄 License

This project is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License - see the LICENSE.txt file for details.

👨‍💻 Author

Jens Verneuer

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

📞 Support

If you have any questions or need help, please:

Check the GitHub Issues
Create a new issue if your problem isn't already addressed

🔗 Links

GitHub Repository: https://github.com/Artistotle-ai/webextractionhelper
PyPI Package: https://pypi.org/project/webextractionhelper/
Documentation: https://github.com/Artistotle-ai/webextractionhelper#readme

📈 Version History

0.1.0 - Initial release with 95+ selectors for web scraping

Note: This package is designed to help with web scraping tasks. Please ensure you comply with the terms of service of the websites you're scraping and respect robots.txt files.
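
One way to act on the robots.txt advice is to check each URL with the standard library's `urllib.robotparser` before requesting it. The sketch below parses an inline, made-up robots.txt so that it runs without network access; `is_allowed` and the example.com rules are illustrative only. For a real site, `RobotFileParser.set_url()` followed by `read()` would fetch the live file instead.

```python
from typing import Iterable
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration; a real robots.txt will differ.
SAMPLE_ROBOTS_TXT = [
    "User-agent: *",
    "Disallow: /search",
]


def is_allowed(robots_lines: Iterable[str], user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt lines permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/search?q=python"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/about"))
```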

Release history

0.1.1 (this version), released Sep 3, 2025

Download files

Source Distribution

webextractionhelper-0.1.1.tar.gz (19.6 kB)

Uploaded Sep 3, 2025 Source

File details

Details for the file webextractionhelper-0.1.1.tar.gz.

File metadata

Download URL: webextractionhelper-0.1.1.tar.gz
Upload date: Sep 3, 2025
Size: 19.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for webextractionhelper-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`39604ee86e0b04ea0c30dd86ccb0ffb285d3320d35c56528f3aed7a9ccbcfba9`
MD5	`d569a5dcbe8ab31d150ba2a31f773053`
BLAKE2b-256	`3ecaae28b7b4901ab1b1cb6c091620b9f6241ce85fe2d4cbf28cbb26c711bf2e`
