
cognee-community-tasks-scrapegraph

Custom cognee tasks for scraping web content using ScrapeGraphAI.

Overview

This package provides two async tasks:

  • scrape_urls – scrape a list of URLs with a natural language prompt and return structured results.
  • scrape_and_add – scrape URLs and ingest the content directly into a cognee dataset.

Installation

uv pip install cognee-community-tasks-scrapegraph

Or install from a local checkout with all dependencies:

cd packages/task/scrapegraph_tasks
uv sync --all-extras
# OR
poetry install

Requirements

You need two API keys:

Variable      Description
LLM_API_KEY   OpenAI (or other LLM provider) API key used by cognee
SGAI_API_KEY  ScrapeGraphAI API key

Set them in your environment or in a .env file:

export LLM_API_KEY="sk-..."
export SGAI_API_KEY="sgai-..."
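A quick pre-flight check can catch a missing key before any scraping starts. This is a minimal sketch, not part of the package; the key names come from the table above:

```python
import os

def missing_api_keys() -> list[str]:
    """Return the names of required keys that are absent or empty."""
    required = ("LLM_API_KEY", "SGAI_API_KEY")
    return [name for name in required if not os.environ.get(name)]

missing = missing_api_keys()
if missing:
    print("Set these environment variables first:", ", ".join(missing))
```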

Usage

Scrape only

import asyncio
from cognee_community_tasks_scrapegraph import scrape_urls

results = asyncio.run(
    scrape_urls(
        urls=["https://cognee.ai", "https://docs.cognee.ai"],
        user_prompt="Extract the main content, title, and key information from this page",
    )
)

for item in results:
    print(item["url"], item["content"])

Scrape and add to cognee

import asyncio
from cognee_community_tasks_scrapegraph import scrape_and_add

asyncio.run(
    scrape_and_add(
        urls=["https://cognee.ai"],
        user_prompt="Extract the main content and key information",
        dataset_name="web_scrape",
    )
)

Run the example

cd packages/task/scrapegraph_tasks
uv run python examples/example.py
# OR
poetry run python examples/example.py

API Reference

scrape_urls

async def scrape_urls(
    urls: List[str],
    user_prompt: str = "Extract the main content, title, and key information from this page",
    api_key: Optional[str] = None,
) -> List[dict]

Returns a list of dicts:

[
    {"url": "https://example.com", "content": {...}},           # success
    {"url": "https://bad.invalid", "content": "", "error": "..."}, # failure
]
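Since failures carry an "error" key, a caller can partition the returned list on its presence. The data below is hypothetical, shown only to illustrate the documented shape:

```python
# Hypothetical results in the documented shape.
results = [
    {"url": "https://example.com", "content": {"title": "Example"}},
    {"url": "https://bad.invalid", "content": "", "error": "connection refused"},
]

# Split successes from failures on the presence of the "error" key.
succeeded = [r for r in results if "error" not in r]
failed = [r for r in results if "error" in r]

for r in failed:
    print(f"failed to scrape {r['url']}: {r['error']}")
```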

scrape_and_add

async def scrape_and_add(
    urls: List[str],
    user_prompt: str = "Extract the main content, title, and key information from this page",
    api_key: Optional[str] = None,
    dataset_name: str = "scrapegraph",
) -> Any

Scrapes all URLs, combines the successful results into a single text document, calls cognee.add, and then cognee.cognify. Returns the cognify result.
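The combine step can be pictured with a small helper. This is a hypothetical sketch of "skip failures, then join"; the actual formatting inside scrape_and_add may differ:

```python
def combine_results(results: list[dict]) -> str:
    """Join successful scrape results into one text document,
    skipping any entry that carries an "error" key."""
    parts = []
    for r in results:
        if "error" in r:
            continue
        parts.append(f"URL: {r['url']}\n{r['content']}")
    return "\n\n".join(parts)

# Hypothetical input mirroring the shape documented above.
doc = combine_results([
    {"url": "https://example.com", "content": "Example text"},
    {"url": "https://bad.invalid", "content": "", "error": "timeout"},
])
print(doc)
```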
