
AI web scraping workflow.

Project description

Scraipe

Scraipe is a high-performance, asynchronous scraping and analysis framework that leverages Large Language Models (LLMs) to extract structured information from scraped content.

Installation

Ensure you have Python 3.10+ installed. Install Scraipe along with its extra scrapers and analyzers:

pip install scraipe[extras]

Features

  • High Performance: IO-bound tasks such as scraping and querying LLMs are fully asynchronous under the hood.
  • Custom Scraping: Scraipe ships with ready-made scrapers (such as NewsScraper in scraipe.extras) and supports plugging in your own scraper implementations.
  • LLM Analysis: Process text using OpenAI’s API with built-in validation via Pydantic.
  • Workflow Management: Combine scraping and analysis in a single workflow, ideal for use in Jupyter notebooks.
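
The "High Performance" bullet refers to standard asyncio-style concurrency. A minimal toy sketch of the idea (not Scraipe's actual internals; the sleep stands in for a network request):

```python
import asyncio

async def fetch(url: str) -> str:
    # Simulated IO-bound task: a real scraper would await an HTTP request here
    await asyncio.sleep(0.05)
    return f"content of {url}"

async def gather_all(urls: list[str]) -> list[str]:
    # Run all fetches concurrently instead of one at a time
    return await asyncio.gather(*(fetch(u) for u in urls))

results = asyncio.run(gather_all(["https://a.example", "https://b.example"]))
```

With many URLs, total wall time stays close to the slowest single request rather than the sum of all of them, which is why scraping and LLM queries benefit from this pattern.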

Usage Example

  1. Setup:

    • Import the required modules:
    from scraipe import Workflow
    from scraipe.extras import NewsScraper, OpenAiAnalyzer
    
  2. Configure Scraper and Analyzer:

    # Configure the scraper
    scraper = NewsScraper()
    
    # Define an instruction and optional Pydantic schema for the analyzer
    instruction = '''
    Extract a list of celebrities mentioned in the article text.
    Return a JSON dictionary with the schema: {"celebrities": ["celebrity1", "celebrity2", ...]}
    '''
    
    from pydantic import BaseModel
    from typing import List
    class ExpectedOutput(BaseModel):
        celebrities: List[str]
    
    analyzer = OpenAiAnalyzer("YOUR_OPENAI_API_KEY", instruction, pydantic_schema=ExpectedOutput)
    
  3. Use the Workflow:

    workflow = Workflow(scraper, analyzer)
    
    # Provide a list of URLs to scrape
    news_links = ["https://example.com/article1", "https://example.com/article2"]
    workflow.scrape(news_links)
    
    # Analyze the scraped content
    workflow.analyze()
    
    # Export results as a pandas DataFrame, then save to CSV
    export_df = workflow.export()
    export_df.to_csv('celebrities.csv', index=False)
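
A small aside: rather than hard-coding "YOUR_OPENAI_API_KEY" as in step 2, the key can be read from an environment variable. The variable name OPENAI_API_KEY is an assumption here, not a Scraipe requirement:

```python
import os

# Read the OpenAI key from the environment instead of hard-coding it;
# falls back to the placeholder if the variable is unset.
api_key = os.environ.get("OPENAI_API_KEY", "YOUR_OPENAI_API_KEY")
```

Keeping the key out of source code avoids accidentally committing it to version control.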
    

Contributing

Contributions are welcome. Please open an issue or submit a pull request for improvements.

License

This project is licensed under the MIT License.

Maintainer

This project is maintained by Nibs.

Project details


Download files

Download the file for your platform.

Source Distribution

scraipe-0.1.24.tar.gz (14.3 kB view details)

Uploaded Source

Built Distribution


scraipe-0.1.24-py3-none-any.whl (20.2 kB view details)

Uploaded Python 3

File details

Details for the file scraipe-0.1.24.tar.gz.

File metadata

  • Download URL: scraipe-0.1.24.tar.gz
  • Upload date:
  • Size: 14.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.2 CPython/3.10.16 Linux/5.15.167.4-microsoft-standard-WSL2

File hashes

Hashes for scraipe-0.1.24.tar.gz
Algorithm Hash digest
SHA256 eec004f0472e5bcd39b4f6fc71a02d5ec7a50a64bae3854bcc4b529e16c5b624
MD5 ee6af846511800dfecb61d66d2beae9f
BLAKE2b-256 fe08c9ab03b204187eb11544b6c624cfec1e93ff14385f659867bcada6bcefd5

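To verify a downloaded file against the hashes above, Python's standard hashlib module is enough. A minimal sketch (the file path is illustrative):

```python
import hashlib

def sha256_of(path: str) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in fixed-size chunks so large files do not load into memory at once
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# e.g. sha256_of("scraipe-0.1.24.tar.gz") should match the SHA256 value above
```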

File details

Details for the file scraipe-0.1.24-py3-none-any.whl.

File metadata

  • Download URL: scraipe-0.1.24-py3-none-any.whl
  • Upload date:
  • Size: 20.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.2 CPython/3.10.16 Linux/5.15.167.4-microsoft-standard-WSL2

File hashes

Hashes for scraipe-0.1.24-py3-none-any.whl
Algorithm Hash digest
SHA256 52c92697771ec125d52fbb5c226d81d784e314c24e962ae978d6aae8ef0f54e6
MD5 1b3b7811e8eec2076646bafb8562b387
BLAKE2b-256 487cf533f07baf5f938f2ee622f0a40f63476bde484d7b378cbfea3cf8c770dd

