A comprehensive web scraping helper package with XPath selectors, regex patterns, and CSS selectors
WebExtractionHelper
A Python package providing XPath selectors, regex patterns, and CSS selectors for scraping web content, including Google search features such as featured snippets, related questions, and other SERP elements.
🚀 Features
- 95+ Pre-built Selectors: Comprehensive collection of XPath selectors for web scraping
- Google Search Support: Specialized selectors for Google SERP features
- Multiple Content Types: Support for featured snippets, related questions, images, links, and more
- Easy to Use: Simple API with clear explanations for each selector
- Well Documented: Each selector includes detailed explanations and usage examples
📦 Installation
From PyPI (Recommended)
pip install webextractionhelper
From Source
git clone https://github.com/Artistotle-ai/webextractionhelper.git
cd webextractionhelper
pip install -e .
🔧 Requirements
- Python 3.7+
- lxml >= 4.6.0
📚 Quick Start
from webextractionhelper import Selectors
# Create a Selectors instance
selectors = Selectors()
# Access Google featured snippet selectors
featured_title_xpath = selectors.selectors['google.featured_snippet_title']['xpath']
featured_text_xpath = selectors.selectors['google.featured_snippet_text']['xpath']
# Access related questions selectors
related_questions_xpath = selectors.selectors['google.related_questions_all']['xpath']
print(f"Featured snippet title XPath: {featured_title_xpath}")
print(f"Featured snippet text XPath: {featured_text_xpath}")
print(f"Related questions XPath: {related_questions_xpath}")
🎯 Available Selector Categories
Google Search Selectors (21 selectors)
- Featured Snippets: Title, text, bullet points, numbered lists, tables, URLs, images
- Related Questions: Individual questions, all questions, answer snippets, source titles/URLs
- Search Results: Main containers, links, titles, descriptions
Meta & Open Graph Selectors (11 selectors)
- Meta Tags: Title, description, keywords, robots, viewport
- Open Graph: Title, description, image, URL, type, site name
Social Media Selectors (6 selectors)
- Twitter/X: Card type, title, description, image, creator, site
Content Selectors (10 selectors)
- Headings: H1, H2, H3, H4
- Text Content: Paragraphs, lists, blockquotes
- Forms: Input fields, buttons, labels
Media Selectors (5 selectors)
- Images: Source, alt text, title, dimensions
- Videos: Source, poster, dimensions
Link Selectors (7 selectors)
- Navigation: Main nav, footer links, breadcrumbs
- Content Links: Internal, external, download links
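The selector keys shown in this README use a `category.name` form (for example `google.featured_snippet_title` and `meta.description`). Assuming every key follows that pattern, which this README does not confirm, a small helper can list what is available per category. The dictionary below is a hand-written stand-in for `Selectors().selectors`, and `group_by_category` is a hypothetical helper, not part of the package.

```python
from collections import defaultdict

# Stand-in for Selectors().selectors: only the key names come from this README,
# the values are left empty on purpose.
sample_selectors = {
    "google.featured_snippet_title": {},
    "google.featured_snippet_text": {},
    "google.related_questions_all": {},
    "meta.description": {},
}


def group_by_category(selectors):
    """Group selector names by the text before the first dot in each key."""
    groups = defaultdict(list)
    for key in sorted(selectors):
        category, _, name = key.partition(".")
        groups[category].append(name)
    return dict(groups)


grouped = group_by_category(sample_selectors)
for category, names in grouped.items():
    print(f"{category}: {', '.join(names)}")
```

With the real package, passing `Selectors().selectors` instead of `sample_selectors` should work the same way as long as the keys keep the dotted form.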
🔍 Usage Examples
Example 1: Extract Google Featured Snippet
from webextractionhelper import Selectors
import requests
from lxml import html
selectors = Selectors()
# Get the page content (requests is not listed under Requirements; install it separately)
url = "https://www.google.com/search?q=python+programming"
response = requests.get(url, timeout=10)
response.raise_for_status()
tree = html.fromstring(response.content)
# Extract featured snippet title
title_xpath = selectors.selectors['google.featured_snippet_title']['xpath']
title_elements = tree.xpath(title_xpath)
if title_elements:
    title = title_elements[0].text_content()
    print(f"Featured snippet title: {title}")
Example 2: Extract All Related Questions
# Get all related questions (reuses `selectors` and `tree` from Example 1)
questions_xpath = selectors.selectors['google.related_questions_all']['xpath']
question_elements = tree.xpath(questions_xpath)
for i, question in enumerate(question_elements, 1):
    print(f"Question {i}: {question.text_content()}")
Example 3: Extract Meta Information
# Get page meta description (reuses `selectors` and `tree` from Example 1)
meta_desc_xpath = selectors.selectors['meta.description']['xpath']
meta_desc_elements = tree.xpath(meta_desc_xpath)
if meta_desc_elements:
    description = meta_desc_elements[0].get('content')
    print(f"Meta description: {description}")
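The examples above read results in two ways: `text_content()` for elements and `get('content')` for attributes. An XPath query can also return plain strings, for instance when it ends in `/@content`. The sketch below shows one way to normalise both cases without a network request. The sample page, the XPath expressions, and the `first_value` helper are all illustrative assumptions, not the expressions or API shipped in the package.

```python
from lxml import html

# Hand-written page for illustration only.
SAMPLE_PAGE = """
<html><head>
  <title>Sample page</title>
  <meta name="description" content="A short sample description.">
</head><body><h1> Hello world </h1></body></html>
"""


def first_value(tree, xpath):
    """Return the first XPath match as a stripped string, or None if nothing matches."""
    matches = tree.xpath(xpath)
    # Expressions such as count(...) return a number rather than a list.
    if not isinstance(matches, list) or not matches:
        return None
    first = matches[0]
    # Attribute and text() queries yield strings; element queries yield elements.
    if isinstance(first, str):
        return first.strip()
    return first.text_content().strip()


tree = html.fromstring(SAMPLE_PAGE)
heading = first_value(tree, "//h1")
description = first_value(tree, '//meta[@name="description"]/@content')
missing = first_value(tree, "//h2")
print(heading, description, missing, sep="\n")
```

Testing against a saved HTML string like this also makes it easier to notice when a selector stops matching after a site changes its markup.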
📋 Selector Structure
Each selector in the package follows this structure:
{
'explanation': 'Human-readable description of what this selector extracts',
'xpath': 'The XPath expression to extract the content',
'regex': 'Optional regex pattern for text processing',
'css': 'Optional CSS selector alternative'
}
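The README does not say how the optional `regex` field is meant to be combined with `xpath`. One plausible reading, sketched here as an assumption, is to run the XPath first and then keep only the regex match from each result. The `apply_selector` helper and the `price_entry` dictionary are hypothetical; they only mirror the documented structure.

```python
import re

from lxml import html


def apply_selector(tree, entry):
    """Run entry['xpath']; if entry has a 'regex', keep its first match per result."""
    pattern = re.compile(entry["regex"]) if entry.get("regex") else None
    values = []
    for node in tree.xpath(entry["xpath"]):
        # XPath results may be elements or plain strings.
        text = node if isinstance(node, str) else node.text_content()
        text = text.strip()
        if pattern is not None:
            match = pattern.search(text)
            if match is None:
                continue  # skip results the regex does not match
            text = match.group(0)
        values.append(text)
    return values


# Hypothetical entry that follows the documented structure.
price_entry = {
    "explanation": "Numeric part of each price",
    "xpath": '//span[@class="price"]',
    "regex": r"\d+(?:\.\d+)?",
    "css": "span.price",
}

page = html.fromstring(
    '<div><span class="price">EUR 12.50</span>'
    '<span class="price">n/a</span>'
    '<span class="price">USD 8</span></div>'
)
prices = apply_selector(page, price_entry)
print(prices)
```

The `css` field is ignored in this sketch, since evaluating CSS selectors with lxml needs an additional package.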
🛠️ Development
Setting up development environment
git clone https://github.com/Artistotle-ai/webextractionhelper.git
cd webextractionhelper
pip install -e ".[dev]"
Running tests
python test_package.py
python example_usage.py
Building the package
python -m build
📄 License
This project is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License - see the LICENSE.txt file for details.
👨‍💻 Author
Jens Verneuer
🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
📞 Support
If you have any questions or need help, please:
- Check the GitHub Issues
- Create a new issue if your problem isn't already addressed
🔗 Links
- GitHub Repository: https://github.com/Artistotle-ai/webextractionhelper
- PyPI Package: https://pypi.org/project/webextractionhelper/
- Documentation: https://github.com/Artistotle-ai/webextractionhelper#readme
📈 Version History
- 0.1.0 - Initial release with 95+ selectors for web scraping
Note: This package is designed to help with web scraping tasks. Please ensure you comply with the terms of service of the websites you're scraping and respect robots.txt files.
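As a minimal sketch of the robots.txt check mentioned in the note, Python's standard `urllib.robotparser` module can decide whether a URL may be fetched. The rules, the `my-scraper` user agent, and the `example.com` URLs below are made up for illustration, and `is_allowed` is a hypothetical helper.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


search_ok = is_allowed(ROBOTS_TXT, "my-scraper", "https://example.com/search?q=python")
about_ok = is_allowed(ROBOTS_TXT, "my-scraper", "https://example.com/about")
print(f"search allowed: {search_ok}, about allowed: {about_ok}")
```

For a live site, `RobotFileParser.set_url()` followed by `read()` downloads the real robots.txt instead of parsing a string. A robots.txt check does not replace reading the site's terms of service.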
File details
Details for the file webextractionhelper-0.1.1.tar.gz.
File metadata
- Download URL: webextractionhelper-0.1.1.tar.gz
- Upload date:
- Size: 19.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 39604ee86e0b04ea0c30dd86ccb0ffb285d3320d35c56528f3aed7a9ccbcfba9 |
| MD5 | d569a5dcbe8ab31d150ba2a31f773053 |
| BLAKE2b-256 | 3ecaae28b7b4901ab1b1cb6c091620b9f6241ce85fe2d4cbf28cbb26c711bf2e |