
cognee-community-tasks-scrapegraph

Custom cognee tasks for scraping web content using ScrapeGraphAI.

Overview

This package provides two async tasks:

  • scrape_urls – scrape a list of URLs with a natural language prompt and return structured results.
  • scrape_and_add – scrape URLs and ingest the content directly into a cognee dataset.

Installation

uv pip install cognee-community-tasks-scrapegraph

Or install locally with all dependencies:

cd packages/task/scrapegraph_tasks
uv sync --all-extras
# OR
poetry install

Requirements

You need two API keys:

Variable       Description
LLM_API_KEY    API key for the LLM provider (e.g. OpenAI) used by cognee
SGAI_API_KEY   ScrapeGraphAI API key

Set them in your environment or in a .env file:

export LLM_API_KEY="sk-..."
export SGAI_API_KEY="sgai-..."
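To fail fast when a key is missing, you can check the environment before running any tasks. This is a small sketch; the package itself does not ship this helper:

```python
import os

REQUIRED_KEYS = ("LLM_API_KEY", "SGAI_API_KEY")

def missing_keys(env=os.environ):
    """Return the names of required API keys that are unset or empty."""
    return [name for name in REQUIRED_KEYS if not env.get(name)]

if __name__ == "__main__":
    absent = missing_keys()
    if absent:
        raise SystemExit(f"Missing API keys: {', '.join(absent)}")
```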

Usage

Scrape only

import asyncio
from cognee_community_tasks_scrapegraph import scrape_urls

results = asyncio.run(
    scrape_urls(
        urls=["https://cognee.ai", "https://docs.cognee.ai"],
        user_prompt="Extract the main content, title, and key information from this page",
    )
)

for item in results:
    # Failed URLs come back with an "error" key and an empty "content" string.
    print(item["url"], item["content"])
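Because failed URLs are returned with an "error" key rather than raising (see API Reference below), it is often useful to separate successes from failures before further processing. A hypothetical helper, not part of the package:

```python
def split_results(results):
    """Partition scrape results into (successes, failures) by the "error" key."""
    successes = [r for r in results if "error" not in r]
    failures = [r for r in results if "error" in r]
    return successes, failures

# Example data in the result shape documented in the API Reference:
sample = [
    {"url": "https://example.com", "content": {"title": "Example"}},
    {"url": "https://bad.invalid", "content": "", "error": "lookup failed"},
]
ok, failed = split_results(sample)
```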

Scrape and add to cognee

import asyncio
from cognee_community_tasks_scrapegraph import scrape_and_add

asyncio.run(
    scrape_and_add(
        urls=["https://cognee.ai"],
        user_prompt="Extract the main content and key information",
        dataset_name="web_scrape",
    )
)

Run the example

cd packages/task/scrapegraph_tasks
uv run python examples/example.py
# OR
poetry run python examples/example.py

API Reference

scrape_urls

async def scrape_urls(
    urls: List[str],
    user_prompt: str = "Extract the main content, title, and key information from this page",
    api_key: Optional[str] = None,
) -> List[dict]

Returns a list of dicts:

[
    {"url": "https://example.com", "content": {...}},           # success
    {"url": "https://bad.invalid", "content": "", "error": "..."}, # failure
]

scrape_and_add

async def scrape_and_add(
    urls: List[str],
    user_prompt: str = "Extract the main content, title, and key information from this page",
    api_key: Optional[str] = None,
    dataset_name: str = "scrapegraph",
) -> Any

Scrapes all URLs, combines the successful results into a single text document, calls cognee.add, and then cognee.cognify. Returns the cognify result.
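The combining step described above can be sketched roughly like this. This is an illustration of the documented behavior, not the package's actual implementation, and the exact text formatting may differ:

```python
def combine_successful(results):
    """Join successful scrape results into one text document, skipping
    any result that carries an "error" key."""
    parts = [
        f"URL: {r['url']}\n{r['content']}"
        for r in results
        if "error" not in r
    ]
    return "\n\n".join(parts)
```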

Download files

Source Distribution

cognee_community_tasks_scrapegraph-0.1.1.tar.gz (341.0 kB)

Built Distribution

cognee_community_tasks_scrapegraph-0.1.1-py3-none-any.whl

File details

Hashes for cognee_community_tasks_scrapegraph-0.1.1.tar.gz:

Algorithm    Hash digest
SHA256       9e90e67c7ed5545fc39c4897fa7fda7ca3a484c2c1ad962db59dbc3a157bb3cf
MD5          eb8bebbe521a117e275f6972a21fea8e
BLAKE2b-256  78e2811dd2e43ee378d99390f8c28854da45ce5b62636c3f16085588fa33e166

Hashes for cognee_community_tasks_scrapegraph-0.1.1-py3-none-any.whl:

Algorithm    Hash digest
SHA256       99dc74db3b8e7b951b980bfc05ad3f948db1e4deb76f2b9992b46d150a07b320
MD5          d7a41c2b628c52569dc458554b076f93
BLAKE2b-256  54e91a90f5a84c614efb0e16f830c8e0732a50e01541dc9a409f4c1e0e832679
