Project description
scrapysub
ScrapySub is a Python library designed to recursively scrape website content, including subpages. It fetches the visible text from web pages and stores it in a structured format for easy access and analysis. This library is particularly useful for NLP and AI developers who need to gather large amounts of web content for their projects.
Features
- Recursive Scraping: Automatically follows and scrapes links within the same domain.
- Custom User-Agent: Mimics browser requests to avoid being blocked by websites.
- Error Handling: Retries failed requests and handles common HTTP errors.
- Metadata Storage: Stores additional metadata about the scraped content.
- Politeness: Adds a delay between requests to avoid overwhelming servers (see the sketch after this list).
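The custom User-Agent and politeness delay follow a common scraping pattern. Below is a minimal, generic sketch of that pattern using the requests library; the header string, URLs, and one-second delay are illustrative assumptions, not ScrapySub's actual internals:

```python
import time
import requests

# Generic pattern: browser-like headers plus a pause between requests
# (values here are illustrative, not ScrapySub's internals)
session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"
})

for page in ["https://example.com/", "https://example.com/about"]:
    response = session.get(page, timeout=10)
    print(page, response.status_code)
    time.sleep(1.0)  # brief pause so we don't overwhelm the server
```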
Installation
Install ScrapySub using pip:
```
pip install scrapysub
```
Usage
Here's a quick example to get you started with ScrapySub:
```python
from scrapysub import ScrapWeb

# Initialize the scraper
scraper = ScrapWeb()

# Start scraping from the given URL
url = "https://myportfolio-five-tau.vercel.app/"
scraper.scrape(url)

# Get all the scraped documents
documents = scraper.get_all_documents()

# Print the content of each document
for doc in documents:
    print(f"URL: {doc.metadata['url']}")
    print(f"Content: {doc.page_content[:200]}...")  # Print the first 200 characters
    print()
```
Detailed Example
Importing Required Libraries
```python
from scrapysub import ScrapWeb, Document
```
Initializing the Scraper
```python
scraper = ScrapWeb()
```
Starting the Scraping Process
```python
url = "https://myportfolio-five-tau.vercel.app/"
scraper.scrape(url)
```
Accessing Scraped Documents
```python
documents = scraper.get_all_documents()

for doc in documents:
    print(f"URL: {doc.metadata['url']}")
    print(f"Content: {doc.page_content[:200]}...")  # Print the first 200 characters
    print()
```
Class and Method Details
ScrapWeb Class
- `__init__(self)`: Initializes the scraper with a session and custom headers.
- `fetch_page(self, url)`: Fetches the HTML content of the given URL with retries and error handling.
- `scrape_text(self, html_content)`: Extracts visible text from the HTML content.
- `tag_visible(self, element)`: Helper method to filter out non-visible elements.
- `get_links(self, url, html_content)`: Finds all valid links on the page within the same domain.
- `is_valid_url(self, url, base_url)`: Checks if a URL is valid and belongs to the same domain.
- `scrape(self, url)`: Recursively scrapes the given URL and its subpages.
- `get_all_documents(self)`: Returns all scraped documents.
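The lower-level methods can also be called individually for finer-grained control. The sketch below is based only on the signatures listed above; it assumes `fetch_page` returns the page HTML (or `None` on failure) and that the other methods accept the arguments shown:

```python
from scrapysub import ScrapWeb

scraper = ScrapWeb()
url = "https://myportfolio-five-tau.vercel.app/"

# Fetch one page without recursing (assumes fetch_page returns HTML, or None on failure)
html = scraper.fetch_page(url)
if html:
    text = scraper.scrape_text(html)      # visible text only
    links = scraper.get_links(url, html)  # same-domain links found on the page
    print(text[:200])
    print(f"Found {len(links)} same-domain links")
```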
Document Class
- `__init__(self, page_content, **kwargs)`: Stores the text content and metadata of a web page.
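The quick-start example reads `doc.metadata['url']`, which suggests that keyword arguments passed to `Document` are stored in its `metadata` dictionary. A minimal sketch under that assumption:

```python
from scrapysub import Document

# Assumes keyword arguments end up in the metadata dict (values here are hypothetical)
doc = Document("Some visible page text.", url="https://example.com/")
print(doc.page_content)     # "Some visible page text."
print(doc.metadata["url"])  # "https://example.com/"
```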
Error Handling
ScrapySub handles common HTTP errors by retrying failed requests with a delay. If a request fails multiple times, it logs the error and continues with the next URL.
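This retry-with-delay behavior follows a standard pattern. The sketch below illustrates the general idea with the requests library; the function name, retry count, and delay are illustrative assumptions, not ScrapySub's actual internals:

```python
import time
import requests

def fetch_with_retries(url, retries=3, delay=2.0):
    """Hypothetical helper showing the retry-with-delay pattern described above."""
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()  # raise on 4xx/5xx responses
            return response.text
        except requests.RequestException as exc:
            print(f"Attempt {attempt}/{retries} failed for {url}: {exc}")
            time.sleep(delay)
    return None  # give up so the caller can move on to the next URL
```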
Contributing
Contributions are welcome! Please submit a pull request or open an issue to discuss your ideas.
License
This project is licensed under the MIT License. See the LICENSE file for details.
Contact
For any questions or suggestions, feel free to reach out to the maintainer.
Project details
File details
Details for the file scrapysub-0.1.2.tar.gz.
File metadata
- Download URL: scrapysub-0.1.2.tar.gz
- Upload date:
- Size: 4.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.10.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 01a4417f5bb582104504c4982ce470ba952226aa40379f64c3ca06fdb2c0d8d9 |
| MD5 | 250fd60e3570c9352fa229d9772aaf4c |
| BLAKE2b-256 | 22220baf6f4e9f907132de69d5dd23fdf6cb2e78499da88f09247076b7c1863c |
File details
Details for the file scrapysub-0.1.2-py3-none-any.whl.
File metadata
- Download URL: scrapysub-0.1.2-py3-none-any.whl
- Upload date:
- Size: 4.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.10.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | af46ff0e22b9ac249cd611ca4ae0b6933811e8218ca9ab5d807359f93a123aa2 |
| MD5 | bcc44342e7a13fa23e55717e57273565 |
| BLAKE2b-256 | 04eaffb79c0c7eb0ed8de84ca69cfce88f1a1ea436da859f6106a131ba68af90 |