
🕷️ Officely AI Web Scraper

A powerful, recursive URL-smart web scraping tool designed to efficiently collect and organize content from websites. This tool is perfect for developers, researchers, and data enthusiasts who need to extract large amounts of textual data from web pages.

Features

  • 🌐 Recursive URL Crawling: Intelligently traverses websites to discover and scrape linked pages.
  • 🎯 Configurable Depth: Set the maximum depth for URL recursion to control the scope of your scraping.
  • 🔍 Smart URL Filtering: Include or exclude URLs based on keywords or prefixes.
  • 📁 Organized Output: Automatically creates a directory structure based on the domain being scraped.
  • 🛡️ Respectful Scraping: Implements user-agent rotation and retry logic with exponential backoff to respect website policies (see the sketch after this list).
  • ⚙️ Highly Configurable: Easy-to-use configuration file for customizing scraping behavior.
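
The retry behavior works along the lines of the following minimal sketch. This illustrates the general pattern rather than the scraper's actual code; the function name, user-agent strings, and timing constants are assumptions:

import random
import time

import requests

# Illustrative user-agent strings; the real rotation list may differ.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]

def fetch_with_backoff(url, max_retries=5, base_delay=1.0):
    """Fetch a URL, rotating user agents and backing off exponentially on failures."""
    for attempt in range(max_retries):
        headers = {"User-Agent": random.choice(USER_AGENTS)}  # rotate per request
        try:
            response = requests.get(url, headers=headers, timeout=10)
            if response.status_code < 500:
                return response
        except requests.RequestException:
            pass  # treat connection errors as retryable, like 5xx responses
        time.sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...
    raise RuntimeError(f"Giving up on {url} after {max_retries} attempts")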

Prerequisites

  • Python 3.7 or higher
  • pip (Python package installer)

Installation and Setup

  1. Clone this repository:

    git clone https://github.com/Royofficely/Web-Scraper.git
    
  2. Change to the project directory:

    cd Web-Scraper
    
  3. (Optional but recommended) Create and activate a virtual environment:

    python -m venv venv
    source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
    
  4. Install the scraper and its dependencies:

    python agentim.py install
    

    This command will install the package, its dependencies, and create the initial configuration.

Usage

After installation, you can run the scraper from the project directory:

python agentim.py run

Configuration

The scraper's behavior can be customized by editing the config.py file in the officely_web_scraper directory:

config = {
    "domain": "https://www.example.com",  # The main domain URL for scraping
    "include_keywords": None,  # List of keywords to include in URLs
    "exclude_keywords": None,  # List of keywords to exclude from URLs
    "max_depth": 1,  # Maximum recursion depth (None for unlimited)
    "target_div": None,  # Specific div class to target (None for whole page)
    "start_with": None,  # Filter by "start with" the url. For example: ["https://example.com/blog"]
}

Adjust these settings according to your scraping needs.
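
For example, a configuration that scrapes only blog pages up to two links deep might look like this (the domain, keywords, and div class below are placeholder values):

config = {
    "domain": "https://www.example.com",
    "include_keywords": ["blog", "article"],   # only URLs containing these
    "exclude_keywords": ["login", "signup"],   # skip account pages
    "max_depth": 2,                            # follow links two levels deep
    "target_div": "post-content",              # scrape only this div class
    "start_with": ["https://www.example.com/blog"],
}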

Output

The scraped content will be saved in a directory named after the domain you're scraping, with each page's content stored in a separate text file.
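
For example, scraping https://www.example.com might produce a layout like the following; the exact file names depend on the pages discovered and are illustrative here:

www.example.com/
├── index.txt
├── about.txt
└── blog_first-post.txt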

Troubleshooting

If you encounter any issues:

  1. Ensure you're in the project directory when running the install and run commands.
  2. Check that all required files are present in the project directory.
  3. Verify that you have the necessary permissions to install packages and write to the directory.
  4. Make sure your virtual environment is activated if you're using one.
  5. If you encounter 503 errors or other connection issues, the scraper will automatically retry with exponential backoff.
  6. Check the console output for any error messages or debugging information.

Development

To set up the project for development:

  1. Follow the installation steps above (python agentim.py install).
  2. Make your changes to the code.
  3. Run tests (if available) to ensure functionality.

Project Structure

.
├── LICENSE
├── README.md
├── agentim.py
├── install.sh
├── officely-scraper
├── officely_web_scraper
│   ├── __init__.py
│   ├── config.py
│   └── scan.py
├── requirements.txt
└── setup.py

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.


Created with ❤️ by Roy Nativ/Officely AI

For any questions or support, please open an issue on the GitHub repository.
