
Scraipe

Scraipe is a high-performance, asynchronous scraping and analysis framework that leverages Large Language Models (LLMs) to extract structured information.

Installation

Ensure you have Python 3.10+ installed. Install Scraipe with all built-in scrapers/analyzers:

pip install scraipe[extended]

Alternatively, install the core library and develop your own scrapers/analyzers with:

pip install scraipe
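
Note: zsh treats square brackets as glob patterns, so quote the extras spec there. A bare import is a quick sanity check for either install:

    pip install "scraipe[extended]"
    python -c "import scraipe"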

Features

  • Versatile Scraping: Use custom scrapers that handle Telegram messages, news articles, and links that need multiple ingress rules.
  • LLM Analysis: Process text with OpenAI models and built-in Pydantic validation (a rough sketch follows this list).
  • Workflow Management: Combine scraping and analysis in a single fault-tolerant workflow, ideal for Jupyter notebooks.
  • High Performance: IO-bound tasks run asynchronously behind the synchronous API.
  • Modular: Extend the framework with new scrapers or analyzers as your data sources evolve.
  • Customizable Ingress: Easily define and update rules that route different types of links to the appropriate scrapers.
  • Detailed Logging: Monitor scraping and analysis operations through comprehensive logging for easier debugging and greater transparency.
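
As a rough illustration of the built-in Pydantic validation mentioned under LLM Analysis: Scraipe applies this internally, but the idea is that the model's JSON reply is parsed into a schema before downstream use. The schema and data below are hypothetical, for illustration only, and assume Pydantic v2:

    from pydantic import BaseModel, ValidationError

    # Hypothetical schema for illustration; not part of Scraipe's API
    class ArticleFacts(BaseModel):
        title: str
        topics: list[str]

    # A JSON-parsed LLM reply is validated before downstream use
    reply = {"title": "Example headline", "topics": ["ai", "scraping"]}
    try:
        facts = ArticleFacts.model_validate(reply)  # Pydantic v2 API
        print(facts.topics)
    except ValidationError as err:
        print(f"Model output failed validation: {err}")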

Usage Example

  1. Setup:

    • Import the required modules:
    from scraipe import Workflow
    from scraipe.extended import NewsScraper, OpenAiAnalyzer
    
  2. Configure Scraper and Analyzer:

    # Configure the scraper
    scraper = NewsScraper()
    
    # Define an instruction for the analyzer
    instruction = '''
    Extract a list of celebrities mentioned in the article text.
    Return a JSON dictionary with the schema: {"celebrities": ["celebrity1", "celebrity2", ...]}
    '''   
    analyzer = OpenAiAnalyzer("YOUR_OPENAI_API_KEY", instruction)
    
  3. Use the Workflow:

    workflow = Workflow(scraper, analyzer)
    
    # Provide a list of URLs to scrape
    news_links = ["https://example.com/article1", "https://example.com/article2"]
    workflow.scrape(news_links)
    
    # Analyze the scraped content
    workflow.analyze()
    
    # Export results to a DataFrame and save as a CSV file
    export_df = workflow.export()
    export_df.to_csv('celebrities.csv', index=False)
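
Because export() returns a pandas DataFrame (the to_csv() call above relies on this), results can be inspected with standard pandas tools before saving:

    # Peek at the exported results before writing the CSV
    print(export_df.head())
    print(export_df.columns.tolist())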
    

Contributing

Contributions are welcome. Please open an issue or submit a pull request for improvements.

License

This project is licensed under the MIT License.

Maintainer

This project is maintained by Nibs.

Download files

Download the file for your platform.

Source Distribution

scraipe-0.1.27.tar.gz (15.0 kB, Source)

Built Distribution


scraipe-0.1.27-py3-none-any.whl (20.2 kB, Python 3)

File details

Details for the file scraipe-0.1.27.tar.gz.

File metadata

  • Download URL: scraipe-0.1.27.tar.gz
  • Size: 15.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.2 CPython/3.10.16 Linux/5.15.167.4-microsoft-standard-WSL2

File hashes

Hashes for scraipe-0.1.27.tar.gz:

  • SHA256: 0ad76438fab9194160beb86f9ade56ec2bc7b985c1563e8e19e0badfab577ee3
  • MD5: f7f72d4d5dd82535e283d00592b57736
  • BLAKE2b-256: f7582dd9e762b58dc24b554423409240d92dd03f2ce13e1dbf9fc81228169795


File details

Details for the file scraipe-0.1.27-py3-none-any.whl.

File metadata

  • Download URL: scraipe-0.1.27-py3-none-any.whl
  • Size: 20.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.2 CPython/3.10.16 Linux/5.15.167.4-microsoft-standard-WSL2

File hashes

Hashes for scraipe-0.1.27-py3-none-any.whl:

  • SHA256: dbb5f857a084d9d54bfa9bd8ae53e2275c51ee011a5fae785950da6939e31dc5
  • MD5: 349ec1d95c284c32cddf0b1dbc77edd2
  • BLAKE2b-256: 1f6ad48865e06a22330f25ab6a61f737a42a327b7049028d66613a2d4e3ab65f

