Library for extracting structured data from websites using ScrapeGraphAI

🕷️🦜 langchain-scrapegraph

Supercharge your LangChain agents with AI-powered web scraping capabilities. LangChain-ScrapeGraph provides a seamless integration between LangChain and ScrapeGraph AI, enabling your agents to extract structured data from websites using natural language.

🔗 ScrapeGraph API & SDKs

If you are looking for a quick way to integrate ScrapeGraph into your system, check out the ScrapeGraph API.

We offer SDKs in both Python and Node.js, making it easy to integrate into your projects. Check them out below:

SDK	Language	GitHub Link
Python SDK	Python	scrapegraph-py
Node.js SDK	Node.js	scrapegraph-js

📦 Installation

pip install langchain-scrapegraph
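
To confirm that the install succeeded, one option is to query the installed version with the standard library. This is a minimal sketch: the helper name `installed_version` is hypothetical and is not part of langchain-scrapegraph.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str = "langchain-scrapegraph") -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the installed version string, or None if the package is missing.
print(installed_version())
```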

🛠️ Available Tools

📝 MarkdownifyTool

Convert any webpage into clean, formatted markdown.

from langchain_scrapegraph.tools import MarkdownifyTool

tool = MarkdownifyTool()
markdown = tool.invoke({"website_url": "https://example.com"})

print(markdown)

🔍 SmartScraperTool

Extract structured data from any webpage using natural language prompts.

from langchain_scrapegraph.tools import SmartScraperTool

# Initialize the tool (uses SGAI_API_KEY from environment)
tool = SmartScraperTool()

# Extract information using natural language
result = tool.invoke({
    "website_url": "https://www.example.com",
    "user_prompt": "Extract the main heading and first paragraph"
})

print(result)

🌐 SearchScraperTool

Search and extract structured information from the web using natural language prompts.

from langchain_scrapegraph.tools import SearchScraperTool

# Initialize the tool (uses SGAI_API_KEY from environment)
tool = SearchScraperTool()

# Search and extract information using natural language
result = tool.invoke({
    "user_prompt": "What are the key features and pricing of ChatGPT Plus?"
})

print(result)
# {
#     "product": {
#         "name": "ChatGPT Plus",
#         "description": "Premium version of ChatGPT..."
#     },
#     "features": [...],
#     "pricing": {...},
#     "reference_urls": [
#         "https://openai.com/chatgpt",
#         ...
#     ]
# }
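
Once a result like the one above comes back, the `reference_urls` list can be post-processed with plain Python. The sketch below assumes the result is a dict shaped like the sample output; the real return type may differ. The helper `reference_domains` and the `example.com` URLs are hypothetical and only serve the illustration.

```python
from typing import Any, Dict, List
from urllib.parse import urlparse


def reference_domains(result: Dict[str, Any]) -> List[str]:
    """Return the sorted, de-duplicated hostnames found in result["reference_urls"]."""
    urls = result.get("reference_urls") or []
    hosts = {urlparse(url).netloc for url in urls if isinstance(url, str)}
    hosts.discard("")  # drop entries that were not absolute URLs
    return sorted(hosts)


# Hand-written sample in the shape shown above (not real API output).
sample = {
    "product": {"name": "ChatGPT Plus"},
    "reference_urls": [
        "https://openai.com/chatgpt",
        "https://example.com/pricing",
        "https://example.com/faq",
    ],
}

print(reference_domains(sample))  # ['example.com', 'openai.com']
```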

🔍 Using Output Schemas with SearchScraperTool

You can define the structure of the output using Pydantic models:

from typing import Any, Dict, List
from pydantic import BaseModel, Field
from langchain_scrapegraph.tools import SearchScraperTool

class ProductInfo(BaseModel):
    name: str = Field(description="Product name")
    features: List[str] = Field(description="List of product features")
    pricing: Dict[str, Any] = Field(description="Pricing information")
    reference_urls: List[str] = Field(description="Source URLs for the information")

# Initialize with schema
tool = SearchScraperTool(llm_output_schema=ProductInfo)

# The output will conform to the ProductInfo schema
result = tool.invoke({
    "user_prompt": "What are the key features and pricing of ChatGPT Plus?"
})

print(result)
# {
#     "name": "ChatGPT Plus",
#     "features": [
#         "GPT-4 access",
#         "Faster response speed",
#         ...
#     ],
#     "pricing": {
#         "amount": 20,
#         "currency": "USD",
#         "period": "monthly"
#     },
#     "reference_urls": [
#         "https://openai.com/chatgpt",
#         ...
#     ]
# }

🌟 Key Features

🐦 LangChain Integration: Seamlessly works with LangChain agents and chains
🔍 AI-Powered Extraction: Use natural language to describe what data to extract
📊 Structured Output: Get clean, structured data ready for your agents
🔄 Flexible Tools: Choose from multiple specialized scraping tools
⚡ Async Support: Built-in support for async operations
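
As an illustration of the async support mentioned above, the sketch below fans out one call per URL with `asyncio.gather`. It assumes the tools expose LangChain's standard `ainvoke` method and accept the same input keys as the SmartScraperTool example. `scrape_many` and `EchoTool` are hypothetical names; `EchoTool` is a stand-in so the sketch runs without an API key, and a real tool instance would take its place.

```python
import asyncio
from typing import Any, Dict, List


async def scrape_many(tool: Any, urls: List[str], prompt: str) -> List[Dict[str, Any]]:
    """Run one tool call per URL concurrently and return results in input order."""
    calls = [
        tool.ainvoke({"website_url": url, "user_prompt": prompt})
        for url in urls
    ]
    return await asyncio.gather(*calls)


class EchoTool:
    """Hypothetical stand-in for a real tool; it only echoes its input."""

    async def ainvoke(self, tool_input: Dict[str, Any]) -> Dict[str, Any]:
        await asyncio.sleep(0)  # yield control, as a real network call would
        return {"url": tool_input["website_url"], "prompt": tool_input["user_prompt"]}


results = asyncio.run(
    scrape_many(
        EchoTool(),
        ["https://example.com", "https://example.org"],
        "Extract the main heading",
    )
)
print(results)
```

Because `asyncio.gather` preserves argument order, each result lines up with the URL at the same index.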

💡 Use Cases

📖 Research Agents: Create agents that gather and analyze web data
📊 Data Collection: Automate structured data extraction from websites
📝 Content Processing: Convert web content into markdown for further processing
🔍 Information Extraction: Extract specific data points using natural language

🤖 Example Agent

from langchain.agents import initialize_agent, AgentType
from langchain_scrapegraph.tools import SmartScraperTool
from langchain_openai import ChatOpenAI

# Initialize tools
tools = [
    SmartScraperTool(),
]

# Create an agent
agent = initialize_agent(
    tools=tools,
    llm=ChatOpenAI(temperature=0),
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)

# Use the agent
response = agent.run("""
    Visit example.com, summarize the content, and extract the main heading and first paragraph
""")

print(response)

⚙️ Configuration

Set your ScrapeGraph API key in your environment:

export SGAI_API_KEY="your-api-key-here"

Or set it programmatically:

import os
os.environ["SGAI_API_KEY"] = "your-api-key-here"
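
Since the tools read `SGAI_API_KEY` from the environment, checking for it at startup can give a clearer error than a failed request later. This is a minimal sketch; `require_api_key` is a hypothetical helper, not part of the library.

```python
import os


def require_api_key(var_name: str = "SGAI_API_KEY") -> str:
    """Return the API key from the environment, or raise a descriptive error."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"{var_name} is not set. Export it in your shell or set "
            f"os.environ[{var_name!r}] before creating any ScrapeGraph tool."
        )
    return key
```

Calling `require_api_key()` once before constructing any tool is enough; the return value can be ignored if only the check is needed.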

📚 Documentation

💬 Support & Feedback

📧 Email: support@scrapegraphai.com
💻 GitHub Issues: Create an issue
🌟 Feature Requests: Request a feature

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

This project is built on top of:

Made with ❤️ by ScrapeGraph AI

Release history

Version	Release date	Notes
1.3.0	Feb 22, 2025	This version
1.3.0b1	Feb 22, 2025	Pre-release
1.2.1b1	Jan 2, 2025	Pre-release
1.2.0	Dec 18, 2024	
1.2.0b1	Dec 18, 2024	Pre-release
1.1.0	Dec 5, 2024	
1.1.0b2	Dec 18, 2024	Pre-release
1.1.0b1	Dec 5, 2024	Pre-release
1.0.0	Dec 5, 2024	
1.0.0b1	Dec 5, 2024	Pre-release

Download files

Source Distribution

langchain_scrapegraph-1.3.0.tar.gz (8.6 kB)

Uploaded Feb 22, 2025 Source

Built Distribution

langchain_scrapegraph-1.3.0-py3-none-any.whl (11.4 kB)

Uploaded Feb 22, 2025 Python 3

File details

Details for the file langchain_scrapegraph-1.3.0.tar.gz.

File metadata

Download URL: langchain_scrapegraph-1.3.0.tar.gz
Upload date: Feb 22, 2025
Size: 8.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.10.16

File hashes

Hashes for langchain_scrapegraph-1.3.0.tar.gz
Algorithm	Hash digest
SHA256	`9aafa15d331cf1458c9403fd9b7a7ecc9b7258b46380d9a9f8651a0ebddc232e`
MD5	`9a43f2bdf6e61ace5d31a45119350673`
BLAKE2b-256	`eb17c7d0519a5bb5fda29ea703193d489247167f00d7c9c2a2997421b4f64c9f`

File details

Details for the file langchain_scrapegraph-1.3.0-py3-none-any.whl.

File metadata

Download URL: langchain_scrapegraph-1.3.0-py3-none-any.whl
Upload date: Feb 22, 2025
Size: 11.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.10.16

File hashes

Hashes for langchain_scrapegraph-1.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9752acc0b0b8e9796c3b133e7375d517d2cf7b9cca9c2602f7d193bd43758d54`
MD5	`33dcb7f0b02f82e4b91b7f014bc09b2a`
BLAKE2b-256	`55c28b07f80ca518585b8ad8607be2cf32c5ee4ec7af63a1659d1028b449a001`
