A tool to scrape websites and update Google Sheets
SiteToSheet
SiteToSheet is a Python project that combines web scraping, natural language processing, and Google API integration to extract data from websites and store it in Google Sheets.
Installation
This project requires Python 3.7 or later. To set up the project environment:
- Clone the repository:
- Create a virtual environment:
python -m venv venv
- Activate the virtual environment:
- On Windows:
venv\Scripts\activate
- On macOS and Linux:
source venv/bin/activate
- Install the required packages:
pip install -r requirements.txt
Dependencies
This project relies on several key libraries:
- Web Scraping: BeautifulSoup4
- Natural Language Processing: spaCy (with en_core_web_sm model)
- Google API Integration: google-api-python-client, gspread
- Data Manipulation: pandas, numpy
- Mapping: googlemaps
- Environment Management: python-dotenv
- Rate Limiting: ratelimit
For a complete list of dependencies, see the requirements.txt file.
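After installation, it can help to confirm that the libraries listed above are importable. The following is a minimal sketch using only the standard library. The `missing_modules` helper is hypothetical and not part of SiteToSheet; the mapping uses the names these packages are normally imported under (for example, BeautifulSoup4 is imported as `bs4`).

```python
from importlib.util import find_spec

# pip name -> import name. The two can differ, e.g. python-dotenv is
# imported as `dotenv`.
IMPORT_NAMES = {
    "beautifulsoup4": "bs4",
    "spacy": "spacy",
    "google-api-python-client": "googleapiclient",
    "gspread": "gspread",
    "pandas": "pandas",
    "numpy": "numpy",
    "googlemaps": "googlemaps",
    "python-dotenv": "dotenv",
    "ratelimit": "ratelimit",
}


def missing_modules(import_names=IMPORT_NAMES):
    """Return the pip names of packages whose modules cannot be found."""
    return [
        pip_name
        for pip_name, module in import_names.items()
        if find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All listed packages are importable.")
```

This check only covers the Python packages. The spaCy model en_core_web_sm is installed separately and is not verified by this sketch.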
Configuration
- Set up a Google Cloud project and enable the necessary APIs (Sheets, Maps).
- Create and download a credentials.json file for Google API authentication.
- Create a .env file in the project root and add your API keys:
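The exact key names expected in the .env file are not listed here. As a rough sketch of how such settings might be read and validated at startup, assuming the hypothetical names `GOOGLE_MAPS_API_KEY` and `SPREADSHEET_ID` (neither is confirmed by the project), one could use python-dotenv's `load_dotenv` together with a small check. The `read_settings` helper is illustrative only.

```python
import os
from typing import Dict, Mapping

try:
    # python-dotenv copies entries from a .env file into os.environ.
    from dotenv import load_dotenv
except ImportError:  # keeps the sketch runnable without python-dotenv
    load_dotenv = None

# Hypothetical variable names; use whatever names your .env file defines.
REQUIRED_KEYS = ("GOOGLE_MAPS_API_KEY", "SPREADSHEET_ID")


def read_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return the required settings, raising if any is missing or empty."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError("Missing settings: " + ", ".join(missing))
    return {key: env[key] for key in REQUIRED_KEYS}


if __name__ == "__main__":
    if load_dotenv is not None:
        load_dotenv()  # looks for a .env file and loads it into os.environ
    try:
        # Print only the setting names, never the secret values.
        print(sorted(read_settings()))
    except RuntimeError as exc:
        print(exc)
```

Failing early with a clear message tends to be easier to debug than an authentication error raised later by one of the Google client libraries.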
Usage
An example of how to use the tool will be provided on my portfolio website. For now, you just have to ask :)
Features
- Web scraping with BeautifulSoup4
- Natural language processing with spaCy
- Google Sheets integration for data storage
- Google Maps API for geolocation services
- Rate limiting to respect API usage limits
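The project lists the ratelimit package for the last feature, but how it is wired in is not shown here. To illustrate the general idea only, the sketch below is a standard-library stand-in (not the project's code and not the ratelimit API). The names `rate_limited` and `lookup` are hypothetical.

```python
import time
from collections import deque
from functools import wraps


def rate_limited(calls, period, clock=time.monotonic, sleep=time.sleep):
    """Allow at most `calls` invocations per `period` seconds, sleeping if needed."""

    def decorator(func):
        stamps = deque()  # start times of the most recent calls

        @wraps(func)
        def wrapper(*args, **kwargs):
            now = clock()
            # Forget calls that have left the current window.
            while stamps and now - stamps[0] >= period:
                stamps.popleft()
            if len(stamps) >= calls:
                # Window is full: wait until the oldest call expires.
                sleep(period - (now - stamps[0]))
                now = clock()
                stamps.popleft()
            stamps.append(now)
            return func(*args, **kwargs)

        return wrapper

    return decorator


@rate_limited(calls=5, period=1.0)
def lookup(address):
    # Placeholder for a request to an external API.
    return address.upper()
```

The `clock` and `sleep` parameters exist only so the limiter can be exercised without real waiting; in normal use the defaults are enough.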
License
This project is licensed under the MIT License - see the LICENSE file for details.
This project uses the following third-party services and libraries:
- Google Maps Distance Matrix API: Subject to the Google Maps Platform Terms of Service
- Google Sheets API: Subject to the Google APIs Terms of Service
- spaCy: Licensed under the MIT License
Users of this software are responsible for ensuring their own compliance with the terms of these services and libraries.
Acknowledgments
This project makes use of several open-source libraries and APIs. We thank the maintainers and contributors of these projects.