
A tool to scrape websites and update Google Sheets


SiteToSheet

SiteToSheet is a Python project that combines web scraping, natural language processing, and Google API integration to extract data from websites and store it in Google Sheets.


Installation

This project requires Python 3.7 or later. To set up the project environment:

  1. Clone the repository.
  2. Create a virtual environment:
    python -m venv venv
    
  3. Activate the virtual environment:
  • On Windows:
    venv\Scripts\activate
    
  • On macOS and Linux:
    source venv/bin/activate
    
  4. Install the required packages:
    pip install -r requirements.txt
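
To confirm the environment before installing anything, a quick check like the one below may help. It is only a sketch: the helper names `python_is_supported` and `in_virtualenv` are made up for illustration and are not part of SiteToSheet.

```python
import sys

MINIMUM_PYTHON = (3, 7)  # the minimum version stated in this README


def python_is_supported(version=None) -> bool:
    """Return True if the given (or running) interpreter meets the minimum version."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) >= MINIMUM_PYTHON


def in_virtualenv() -> bool:
    """Return True when running inside a venv-style virtual environment."""
    # Inside a venv, sys.prefix points at the environment,
    # while sys.base_prefix still points at the base installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("Python supported:", python_is_supported())
    print("Virtual environment active:", in_virtualenv())
```

Running this after the activation step should report that a virtual environment is active.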

Dependencies

This project relies on several key libraries:

  • Web Scraping: BeautifulSoup4
  • Natural Language Processing: spaCy (with en_core_web_sm model)
  • Google API Integration: google-api-python-client, gspread
  • Data Manipulation: pandas, numpy
  • Mapping: googlemaps
  • Environment Management: python-dotenv
  • Rate Limiting: ratelimit

For a complete list of dependencies, see the requirements.txt file.
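
As an illustration of how two of these libraries could fit together, the sketch below wraps a googlemaps geocoding call in the ratelimit decorators. It is not taken from the SiteToSheet source: `make_geocoder`, `first_location`, and the limit of 10 calls per second are assumptions chosen for the example.

```python
from typing import Callable, Optional, Tuple


def make_geocoder(api_key: str, calls: int = 10, period: int = 1) -> Callable[[str], list]:
    """Build a geocoding function limited to `calls` requests per `period` seconds.

    The default limit is illustrative; check your own API quota.
    """
    import googlemaps
    from ratelimit import limits, sleep_and_retry

    client = googlemaps.Client(key=api_key)

    @sleep_and_retry  # wait for the window to reset instead of raising
    @limits(calls=calls, period=period)
    def geocode(address: str) -> list:
        return client.geocode(address)

    return geocode


def first_location(results: list) -> Optional[Tuple[float, float]]:
    """Return (lat, lng) from the first geocoding result, or None if there is none."""
    if not results:
        return None
    location = results[0]["geometry"]["location"]
    return location["lat"], location["lng"]
```

Keeping the third-party imports inside `make_geocoder` means `first_location` can be used and tested without an API key.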

Configuration

  1. Set up a Google Cloud project and enable the necessary APIs (Sheets, Maps).
  2. Create and download a credentials.json file for Google API authentication.
  3. Create a .env file in the project root and add your API keys.
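
Once the .env file exists, the keys can be read with python-dotenv. The sketch below shows one way to do that and to fail early when a key is missing. The variable names `GOOGLE_MAPS_API_KEY` and `SPREADSHEET_ID` are hypothetical placeholders, and `read_settings` and `load_settings` are illustrative helpers, not SiteToSheet functions.

```python
import os
from typing import Mapping

# Hypothetical names: substitute the keys your own .env file defines.
REQUIRED_KEYS = ("GOOGLE_MAPS_API_KEY", "SPREADSHEET_ID")


def read_settings(env: Mapping[str, str]) -> dict:
    """Return the required settings, or raise if any are missing or empty."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise ValueError("Missing settings: " + ", ".join(missing))
    return {key: env[key] for key in REQUIRED_KEYS}


def load_settings() -> dict:
    """Load a .env file into os.environ, then validate the result."""
    from dotenv import load_dotenv  # provided by python-dotenv

    load_dotenv()  # by default, variables that are already set are not overridden
    return read_settings(os.environ)
```

Keep both .env and credentials.json out of version control, since they hold secrets.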

Usage

An example of how to use the tool will be provided on my portfolio website. For now, you just have to ask :)

Features

  • Web scraping with BeautifulSoup4
  • Natural language processing with spaCy
  • Google Sheets integration for data storage
  • Google Maps API for geolocation services
  • Rate limiting to respect API usage limits
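
The features above can be pictured as a small scrape, analyse, and store pipeline. The following is a minimal sketch, not the SiteToSheet implementation: every function name is hypothetical, the scraped fields (page title and first h1) are arbitrary, the page is fetched with the standard library, and credentials.json is assumed to be a service account key.

```python
from typing import Iterable, List, Sequence, Tuple


def scrape_page(url: str) -> dict:
    """Fetch a page and pull out a few illustrative fields with BeautifulSoup4."""
    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    with urlopen(url, timeout=10) as response:
        soup = BeautifulSoup(response.read(), "html.parser")
    heading = soup.find("h1")
    return {
        "url": url,
        "title": soup.title.get_text(strip=True) if soup.title else "",
        "heading": heading.get_text(strip=True) if heading else "",
    }


def extract_entities(text: str, model: str = "en_core_web_sm") -> List[Tuple[str, str]]:
    """Return (text, label) pairs for the named entities spaCy finds."""
    import spacy

    nlp = spacy.load(model)
    return [(ent.text, ent.label_) for ent in nlp(text).ents]


def to_rows(records: Iterable[dict], columns: Sequence[str]) -> List[List[str]]:
    """Flatten records into rows of strings, one cell per column name."""
    return [[str(record.get(column, "")) for column in columns] for record in records]


def append_to_sheet(rows: List[List[str]], spreadsheet_id: str,
                    credentials_path: str = "credentials.json") -> None:
    """Append rows to the first worksheet of a spreadsheet using gspread."""
    import gspread

    client = gspread.service_account(filename=credentials_path)
    worksheet = client.open_by_key(spreadsheet_id).sheet1
    worksheet.append_rows(rows)
```

A run would call `scrape_page` for each URL, pass the records through `to_rows`, and hand the result to `append_to_sheet`.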

License

This project is licensed under the MIT License - see the LICENSE file for details.

This project uses third-party services and libraries, including those listed under Dependencies. Users of this software are responsible for ensuring their own compliance with the terms of these services and libraries.

Acknowledgments

This project makes use of several open-source libraries and APIs. We thank the maintainers and contributors of these projects.
