
Project description

scrapysub

ScrapySub is a Python library designed to recursively scrape website content, including subpages. It fetches the visible text from web pages and stores it in a structured format for easy access and analysis. This library is particularly useful for NLP and AI developers who need to gather large amounts of web content for their projects.

Features

  • Recursive Scraping: Automatically follows and scrapes links within the same domain.
  • Custom User-Agent: Mimics browser requests to avoid being blocked by websites.
  • Error Handling: Retries failed requests and handles common HTTP errors.
  • Metadata Storage: Stores additional metadata about the scraped content.
  • Politeness: Adds a delay between requests to avoid overwhelming servers.
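
To make the recursive behavior concrete, here is a minimal sketch of the kind of breadth-first, same-domain crawl with a politeness delay that ScrapySub automates. It is illustrative only, not the library's internal code; the fetch and extract_links callables stand in for methods such as fetch_page and get_links described under Class and Method Details below.

import time
from collections import deque
from urllib.parse import urlparse

# Illustrative sketch only -- not ScrapySub's internal code.
# Breadth-first crawl that stays on the start URL's domain and
# sleeps between requests (the "politeness" delay).
def crawl(start_url, fetch, extract_links, delay=1.0):
    domain = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue:
        url = queue.popleft()
        html = fetch(url)  # returns HTML text, or None on failure
        if html is None:
            continue
        pages[url] = html
        for link in extract_links(url, html):
            # Follow only links within the same domain, once each.
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)
        time.sleep(delay)  # be polite to the server
    return pages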

Installation

Install ScrapySub using pip:

pip install scrapysub

Usage

Here's a quick example to get you started with ScrapySub:

from scrapysub import ScrapWeb

# Initialize the scraper
scraper = ScrapWeb()

# Start scraping from the given URL
url = "https://myportfolio-five-tau.vercel.app/"
scraper.scrape(url)

# Get all the scraped documents
documents = scraper.get_all_documents()

# Print the content of each document
for doc in documents:
    print(f"URL: {doc.metadata['url']}")
    print(f"Content: {doc.page_content[:200]}...")  # Print the first 200 characters
    print()

Detailed Example

Importing Required Libraries

from scrapysub import ScrapWeb, Document

Initializing the Scraper

scraper = ScrapWeb()

Starting the Scraping Process

url = "https://myportfolio-five-tau.vercel.app/"
scraper.scrape(url)

Accessing Scraped Documents

documents = scraper.get_all_documents()

for doc in documents:
    print(f"URL: {doc.metadata['url']}")
    print(f"Content: {doc.page_content[:200]}...")  # Print the first 200 characters
    print()

Class and Method Details

ScrapWeb Class

  • __init__(self): Initializes the scraper with a session and custom headers.
  • fetch_page(self, url): Fetches the HTML content of the given URL with retries and error handling.
  • scrape_text(self, html_content): Extracts visible text from the HTML content.
  • tag_visible(self, element): Helper method to filter out non-visible elements.
  • get_links(self, url, html_content): Finds all valid links on the page within the same domain.
  • is_valid_url(self, url, base_url): Checks if a URL is valid and belongs to the same domain.
  • scrape(self, url): Recursively scrapes the given URL and its subpages.
  • get_all_documents(self): Returns all scraped documents.
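
The lower-level methods listed above can also be used individually for one-off pages. The sketch below assumes fetch_page returns the page's HTML (or None on failure), scrape_text returns the visible text, and get_links returns a list of same-domain URLs; these return types are inferred from the method descriptions, not documented guarantees.

from scrapysub import ScrapWeb

scraper = ScrapWeb()

# Assumed return types (see note above): fetch_page -> HTML string or None,
# scrape_text -> visible text, get_links -> list of same-domain URLs.
url = "https://example.com/"
html = scraper.fetch_page(url)
if html:
    text = scraper.scrape_text(html)
    links = scraper.get_links(url, html)
    print(text[:200])
    print(f"Found {len(links)} same-domain links")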

Document Class

  • __init__(self, page_content, **kwargs): Stores the text content and metadata of a web page.
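
Based on the Usage section above (doc.page_content and doc.metadata['url']), a reasonable assumption is that the keyword arguments are stored as metadata. A hedged example:

from scrapysub import Document

# Assumption: keyword arguments end up in doc.metadata, matching the
# doc.metadata['url'] access shown in the Usage section above.
doc = Document("Visible text of a page...", url="https://example.com/")
print(doc.page_content)     # "Visible text of a page..."
print(doc.metadata["url"])  # "https://example.com/"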

Error Handling

ScrapySub handles common HTTP errors by retrying failed requests with a delay. If a request fails multiple times, it logs the error and continues with the next URL.
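
The retry behavior amounts to a loop like the following sketch, written here with the requests library. It illustrates the pattern rather than ScrapySub's actual implementation; the function name and default values are made up for the example.

import time
import requests

# Illustrative retry-with-delay pattern -- not ScrapySub's internal code.
def fetch_with_retries(url, retries=3, delay=2.0):
    for attempt in range(1, retries + 1):
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()  # raise on 4xx/5xx responses
            return resp.text
        except requests.RequestException as exc:
            print(f"Attempt {attempt}/{retries} failed for {url}: {exc}")
            time.sleep(delay)
    return None  # give up; the scraper logs and moves to the next URL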

Contributing

Contributions are welcome! Please submit a pull request or open an issue to discuss your ideas.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contact

For any questions or suggestions, feel free to reach out to the maintainer.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

scrapysub-0.1.2.tar.gz (4.8 kB)

Built Distribution

scrapysub-0.1.2-py3-none-any.whl (4.7 kB)

File details

Details for the file scrapysub-0.1.2.tar.gz.

File metadata

  • Download URL: scrapysub-0.1.2.tar.gz
  • Upload date:
  • Size: 4.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.10.9

File hashes

Hashes for scrapysub-0.1.2.tar.gz

  • SHA256: 01a4417f5bb582104504c4982ce470ba952226aa40379f64c3ca06fdb2c0d8d9
  • MD5: 250fd60e3570c9352fa229d9772aaf4c
  • BLAKE2b-256: 22220baf6f4e9f907132de69d5dd23fdf6cb2e78499da88f09247076b7c1863c

See more details on using hashes here.
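
To check a downloaded file against the SHA256 published above, Python's standard hashlib module is enough; the sketch below hard-codes the sdist hash from this page.

import hashlib

# Verify a downloaded sdist against the SHA256 published on this page.
expected = "01a4417f5bb582104504c4982ce470ba952226aa40379f64c3ca06fdb2c0d8d9"
with open("scrapysub-0.1.2.tar.gz", "rb") as f:
    actual = hashlib.sha256(f.read()).hexdigest()
print("OK" if actual == expected else "MISMATCH")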

File details

Details for the file scrapysub-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: scrapysub-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 4.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.10.9

File hashes

Hashes for scrapysub-0.1.2-py3-none-any.whl

  • SHA256: af46ff0e22b9ac249cd611ca4ae0b6933811e8218ca9ab5d807359f93a123aa2
  • MD5: bcc44342e7a13fa23e55717e57273565
  • BLAKE2b-256: 04eaffb79c0c7eb0ed8de84ca69cfce88f1a1ea436da859f6106a131ba68af90

See more details on using hashes here.
