# 🕷️ Officely AI Web Scraper
A powerful, recursive URL-smart web scraping tool designed to efficiently collect and organize content from websites. This tool is perfect for developers, researchers, and data enthusiasts who need to extract large amounts of textual data from web pages.
## Features
- 🌐 Recursive URL Crawling: Intelligently traverses websites to discover and scrape linked pages.
- 🎯 Configurable Depth: Set the maximum depth for URL recursion to control the scope of your scraping.
- 🔍 Smart URL Filtering: Include or exclude URLs based on keywords or prefixes (see the sketch after this list).
- 📁 Organized Output: Automatically creates a directory structure based on the domain being scraped.
- 🛡️ Respectful Scraping: Implements user-agent rotation and retry logic with exponential backoff to respect website policies.
- ⚙️ Highly Configurable: Easy-to-use configuration file for customizing scraping behavior.
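The URL filters mentioned above combine roughly as follows. This is a simplified sketch of the idea, not the package's actual code; the helper name `should_visit` and its signature are hypothetical:

```python
def should_visit(url, include_keywords=None, exclude_keywords=None, start_with=None):
    """Decide whether a discovered URL should be crawled, per the config filters."""
    if start_with and not any(url.startswith(prefix) for prefix in start_with):
        return False  # outside the allowed URL prefixes
    if exclude_keywords and any(word in url for word in exclude_keywords):
        return False  # URL contains an excluded keyword
    if include_keywords and not any(word in url for word in include_keywords):
        return False  # URL lacks every required keyword
    return True

# Example: should_visit("https://example.com/blog/post", include_keywords=["blog"]) -> True
```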
## Prerequisites
- Python 3.7 or higher
- pip (Python package installer)
## Installation and Setup
1. Clone this repository:

   ```bash
   git clone https://github.com/Royofficely/Web-Scraper.git
   ```

2. Change to the project directory:

   ```bash
   cd Web-Scraper
   ```

3. (Optional but recommended) Create and activate a virtual environment:

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
   ```

4. Install the scraper and its dependencies:

   ```bash
   python agentim.py install
   ```

   This command installs the package and its dependencies, and creates the initial configuration.
## Usage
After installation, you can run the scraper from the project directory:

```bash
python agentim.py run
```
## Configuration

The scraper's behavior can be customized by editing the `config.py` file in the `officely_web_scraper` directory:
```python
config = {
    "domain": "https://www.example.com",  # The main domain URL for scraping
    "include_keywords": None,  # List of keywords a URL must contain (None disables the filter)
    "exclude_keywords": None,  # List of keywords that exclude a URL
    "max_depth": 1,  # Maximum recursion depth (None for unlimited)
    "target_div": None,  # Specific div class to target (None for the whole page)
    "start_with": None,  # List of URL prefixes to crawl, e.g. ["https://example.com/blog"]
}
```
Adjust these settings according to your scraping needs.
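For example, to crawl only a site's blog section two levels deep while skipping account pages, the configuration might look like this (the domain, keywords, and prefixes are illustrative):

```python
config = {
    "domain": "https://www.example.com",
    "include_keywords": ["blog"],             # keep only URLs containing "blog"
    "exclude_keywords": ["login", "signup"],  # skip account-related pages
    "max_depth": 2,                           # follow links up to two levels deep
    "target_div": None,                       # extract the whole page
    "start_with": ["https://www.example.com/blog"],  # stay within the blog section
}
```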
## Output
The scraped content will be saved in a directory named after the domain you're scraping, with each page's content stored in a separate text file.
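For example, scraping `https://www.example.com` might produce a layout like this (the file names are illustrative; actual names depend on the URLs scraped):

```
www.example.com/
├── index.txt
├── about.txt
└── blog.txt
```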
## Troubleshooting
If you encounter any issues:
- Ensure you're in the project directory when running the install and run commands.
- Check that all required files are present in the project directory.
- Verify that you have the necessary permissions to install packages and write to the directory.
- Make sure your virtual environment is activated if you're using one.
- If you encounter 503 errors or other connection issues, the scraper retries automatically with exponential backoff (see the sketch after this list).
- Check the console output for any error messages or debugging information.
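For reference, retry with exponential backoff and user-agent rotation typically works as sketched below. This is a minimal illustration of the technique, not the package's actual implementation; the function name, delays, and user-agent strings are assumptions:

```python
import random
import time

import requests  # assumed HTTP client for this sketch

# Illustrative pool; the scraper's real user-agent list is internal to the package.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]

def fetch_with_backoff(url, max_retries=3, base_delay=1.0):
    """Retry transient failures such as HTTP 503 with exponentially growing waits."""
    response = None
    for attempt in range(max_retries):
        headers = {"User-Agent": random.choice(USER_AGENTS)}  # rotate user agents
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code != 503:
            return response
        time.sleep(base_delay * (2 ** attempt))  # wait 1s, 2s, 4s, ...
    return response  # still failing after max_retries attempts
```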
## Development
To set up the project for development:
- Follow the installation steps above, using `python agentim.py install` for installation.
- Make your changes to the code.
- Run tests (if available) to ensure functionality.
## Project Structure
```
.
├── LICENSE
├── README.md
├── agentim.py
├── install.sh
├── officely-scraper
├── officely_web_scraper
│   ├── __init__.py
│   ├── config.py
│   └── scan.py
├── requirements.txt
└── setup.py
```
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## License
This project is licensed under the MIT License - see the LICENSE file for details.
Created with ❤️ by Roy Nativ/Officely AI
For any questions or support, please open an issue on the GitHub repository.