LangChain integration for Ujeebu Extract API
LangChain Ujeebu Integration
Official LangChain integration for Ujeebu Extract API - Extract clean, structured content from news articles and blog posts for use with Large Language Models (LLMs) and AI applications.
Features
- Easy Integration: Seamlessly integrate Ujeebu Extract API with LangChain agents and chains
- Document Loaders: Load articles as LangChain Documents for use with vector stores and retrievers
- Agent Tools: Use Ujeebu Extract as a tool in LangChain agents
- Rich Metadata: Extract article text, HTML, author, publication date, images, and more
- Quick Mode: Optional fast extraction mode (30-60% faster)
- Type Safe: Full type hints and Pydantic validation
What is Ujeebu Extract?
Ujeebu Extract converts news and blog articles into clean, structured JSON data. It extracts:
- Clean article text and HTML
- Author and publication date
- Title and summary
- Images and media
- RSS feeds
- Site metadata
This makes it a good fit for RAG (Retrieval-Augmented Generation) applications, content analysis, and building LLM training data.
Installation
pip install langchain-ujeebu
Requirements
- Python 3.8 or higher
- LangChain 0.1.0 or higher
- An Ujeebu API key (available from ujeebu.com)
Quick Start
Set up your API key
export UJEEBU_API_KEY="your-api-key"
Or set it programmatically:
import os
os.environ["UJEEBU_API_KEY"] = "your-api-key"
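It can help to check for the key up front so that a missing value fails with a clear message. The following is a minimal sketch; the helper name `require_api_key` is hypothetical and not part of langchain-ujeebu.

```python
import os


def require_api_key(env_var: str = "UJEEBU_API_KEY") -> str:
    """Return the API key from the environment, or raise ValueError if it is missing.

    Hypothetical helper for illustration; not provided by langchain-ujeebu.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise ValueError(f"{env_var} is not set; export it before creating a tool or loader")
    return key
```

The returned value can then be passed explicitly through the documented `api_key` parameter, for example `UjeebuExtractTool(api_key=require_api_key())`.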
Using as an Agent Tool
from langchain_ujeebu import UjeebuExtractTool
from langchain.agents import initialize_agent, AgentType
from langchain_openai import ChatOpenAI
# Initialize the tool
ujeebu_tool = UjeebuExtractTool()
# Create an agent
llm = ChatOpenAI(temperature=0)
agent = initialize_agent(
    tools=[ujeebu_tool],
    llm=llm,
    agent=AgentType.OPENAI_FUNCTIONS,
    verbose=True
)
# Use the agent
response = agent.invoke({
    "input": "Extract the article from https://example.com/article and summarize it"
})
print(response)
Using the Document Loader
from langchain_ujeebu import UjeebuLoader
# FAISS and OpenAIEmbeddings live in the split-out packages in current LangChain releases
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
# Load articles
loader = UjeebuLoader(
    urls=[
        "https://example.com/article1",
        "https://example.com/article2",
        "https://example.com/article3"
    ]
)
documents = loader.load()
# Create a vector store
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(documents, embeddings)
# Query the documents
results = vectorstore.similarity_search("What are the main topics?")
Usage Examples
Basic Article Extraction
from langchain_ujeebu import UjeebuExtractTool
tool = UjeebuExtractTool()
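# invoke() is the public entry point; it validates the arguments before calling _run()
result = tool.invoke({
    "url": "https://example.com/article",
    "text": True,
    "author": True,
    "pub_date": True
})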
print(result)
Extract with Images
from langchain_ujeebu import UjeebuExtractTool
tool = UjeebuExtractTool()
# invoke() is the public entry point; it validates the arguments before calling _run()
result = tool.invoke({
    "url": "https://example.com/article",
    "images": True  # Extract article images
})
Quick Mode for Faster Extraction
from langchain_ujeebu import UjeebuLoader
loader = UjeebuLoader(
    urls=["https://example.com/article"],
    quick_mode=True  # 30-60% faster, slightly less accurate
)
documents = loader.load()
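Since quick mode trades some accuracy for speed, one possible pattern is to try it first and fall back to a full extraction when the result looks too short. The sketch below is an assumption-laden illustration: `extract_with_fallback`, its `min_chars` threshold, and the two callables are hypothetical and not part of langchain-ujeebu.

```python
from typing import Callable


def extract_with_fallback(
    url: str,
    quick: Callable[[str], str],
    full: Callable[[str], str],
    min_chars: int = 500,
) -> str:
    """Return the quick-mode text if it is long enough, otherwise the full-mode text.

    `quick` and `full` are caller-supplied functions that take a URL and return
    the extracted article text (hypothetical wrappers around the loader).
    """
    text = quick(url)
    if len(text.strip()) >= min_chars:
        return text
    # The quick result looks truncated or empty, so pay for the slower extraction.
    return full(url)
```

In practice `quick` and `full` could wrap `UjeebuLoader(urls=[url], quick_mode=True)` and `UjeebuLoader(urls=[url])` respectively and return the first document's text. The right threshold depends on the kind of articles being processed.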
Load with HTML Content
from langchain_ujeebu import UjeebuLoader
loader = UjeebuLoader(
    urls=["https://example.com/article"],
    extract_html=True,  # Include HTML content
    extract_images=True  # Include images
)
documents = loader.load()
# Access metadata
doc = documents[0]
print(f"Title: {doc.metadata['title']}")
print(f"Author: {doc.metadata['author']}")
print(f"Images: {doc.metadata['images']}")
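Not every article has an author or images, and this README does not say whether such fields are omitted or set to an empty value. A defensive sketch that tolerates both cases is shown below; `describe_article` is a hypothetical helper, and the sample dictionary stands in for `doc.metadata`.

```python
from typing import Any, Mapping


def describe_article(metadata: Mapping[str, Any]) -> str:
    """Build a one-line description, tolerating missing or empty metadata fields."""
    title = metadata.get("title") or "(untitled)"
    author = metadata.get("author") or "unknown author"
    images = metadata.get("images") or []
    return f"{title} by {author} ({len(images)} image(s))"


# Stand-in for doc.metadata; the values are made up for illustration.
sample = {"title": "Example", "author": None, "images": ["https://example.com/a.png"]}
print(describe_article(sample))
```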
Build a QA System
from langchain_ujeebu import UjeebuLoader
from langchain.chains import RetrievalQA
# FAISS and OpenAIEmbeddings live in the split-out packages in current LangChain releases
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_openai import ChatOpenAI
# Load articles
loader = UjeebuLoader(
    urls=[
        "https://example.com/article1",
        "https://example.com/article2"
    ]
)
documents = loader.load()
# Create vector store
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(documents, embeddings)
# Create QA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(temperature=0),
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)
# Query
result = qa_chain.invoke({"query": "What are the main points?"})
print(result["result"])
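The "stuff" chain type places every retrieved document into a single prompt, so long articles are usually split into smaller chunks before they are embedded. The sketch below shows the idea with a plain character-based splitter; `split_text` is a hypothetical helper written for illustration, and LangChain's own text splitters are likely a better choice in a real pipeline.

```python
from typing import List


def split_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> List[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters so that a sentence cut at a
    boundary still appears whole in one of the two neighbouring chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be at least 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks
```

Each chunk can then be wrapped in its own Document, carrying over the original metadata, before calling `FAISS.from_documents`.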
API Reference
UjeebuExtractTool
A LangChain tool for extracting article content.
Parameters:
- api_key (str, optional): Ujeebu API key. Defaults to the UJEEBU_API_KEY environment variable.
Tool Parameters:
- url (str, required): URL of the article to extract
- text (bool): Extract article text (default: True)
- html (bool): Extract article HTML (default: False)
- author (bool): Extract article author (default: True)
- pub_date (bool): Extract publication date (default: True)
- images (bool): Extract images (default: False)
- quick_mode (bool): Use quick mode for faster extraction (default: False)
UjeebuLoader
A LangChain document loader for articles.
Parameters:
- urls (List[str], required): List of article URLs to load
- api_key (str, optional): Ujeebu API key
- extract_text (bool): Extract article text (default: True)
- extract_html (bool): Extract article HTML (default: False)
- extract_author (bool): Extract author (default: True)
- extract_pub_date (bool): Extract publication date (default: True)
- extract_images (bool): Extract images (default: False)
- quick_mode (bool): Use quick mode (default: False)
Methods:
- load(): Load all documents
- lazy_load(): Lazy load documents (same as load for this implementation)
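Either method returns something that can be iterated, so downstream work such as embedding can be done in fixed-size batches. A minimal sketch follows; `batched` is a hypothetical helper, not part of langchain-ujeebu. Because lazy_load() is described as equivalent to load() here, batching does not change how the articles are fetched; it only keeps each downstream call small.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive lists of up to `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

For example, `for batch in batched(loader.lazy_load(), 10): vectorstore.add_documents(batch)` would index ten documents at a time into an existing vector store.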
Document Metadata:
- source: Original URL
- url: Resolved URL
- canonical_url: Canonical URL
- title: Article title
- author: Article author
- pub_date: Publication date
- language: Article language
- site_name: Site name
- summary: Article summary
- image: Main image URL
- images: List of all image URLs (if extract_images=True)
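Different input URLs (tracking links, redirects, syndicated copies) can point to the same article, and the source, url, and canonical_url fields make it possible to detect that before indexing. The sketch below is one way to do it under stated assumptions: `Doc` is a hypothetical stand-in for a LangChain Document, and `dedupe_by_url` is not part of langchain-ujeebu.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document (page_content + metadata)."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def dedupe_by_url(docs: Iterable[Doc]) -> List[Doc]:
    """Keep the first document for each canonical_url, falling back to url, then source."""
    seen = set()
    unique: List[Doc] = []
    for doc in docs:
        meta = doc.metadata
        key = meta.get("canonical_url") or meta.get("url") or meta.get("source")
        if key is None:
            unique.append(doc)  # nothing to compare on, so keep it
            continue
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique
```

The function only reads the `metadata` attribute, so it should work unchanged on the documents returned by the loader.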
Advanced Usage
Custom API Endpoint
from langchain_ujeebu import UjeebuLoader
loader = UjeebuLoader(
    urls=["https://example.com/article"],
    base_url="https://custom-api.ujeebu.com/extract"
)
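When the endpoint comes from configuration, a quick sanity check before constructing the loader can catch typos early. This is a minimal sketch; `validate_base_url` is a hypothetical helper, and requiring HTTPS is an assumption made here because the API key travels with each request.

```python
from urllib.parse import urlparse


def validate_base_url(url: str) -> str:
    """Return the URL without a trailing slash, or raise ValueError if it looks wrong."""
    parsed = urlparse(url)
    if parsed.scheme != "https":
        raise ValueError(f"base_url must use https, got: {url!r}")
    if not parsed.netloc:
        raise ValueError(f"base_url has no host: {url!r}")
    return url.rstrip("/")
```

The validated value can then be passed as `base_url=validate_base_url(configured_url)`.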
Error Handling
from langchain_ujeebu import UjeebuLoader
loader = UjeebuLoader(urls=["https://example.com/article"])
try:
    documents = loader.load()
    print(f"Loaded {len(documents)} documents")
except ValueError as e:
    print(f"API key error: {e}")
except Exception as e:
    print(f"Error loading documents: {e}")
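This README does not state whether a single failing URL aborts the whole load() call. If that matters, one option is to load URLs one at a time and collect failures separately. The sketch below assumes nothing about the loader beyond a callable that takes a URL and returns a list of documents; `load_each` is a hypothetical helper.

```python
from typing import Callable, Dict, Iterable, List, Tuple, TypeVar

D = TypeVar("D")


def load_each(
    urls: Iterable[str],
    load_one: Callable[[str], List[D]],
) -> Tuple[List[D], Dict[str, Exception]]:
    """Load each URL independently and return (documents, failures keyed by URL)."""
    loaded: List[D] = []
    failures: Dict[str, Exception] = {}
    for url in urls:
        try:
            loaded.extend(load_one(url))
        except Exception as exc:  # record the error and continue with the next URL
            failures[url] = exc
    return loaded, failures
```

A call such as `load_each(urls, lambda u: UjeebuLoader(urls=[u]).load())` would then return whatever could be loaded, along with a mapping of the URLs that failed and why.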
Testing
Run the test suite:
# Install dev dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Run tests with coverage
pytest --cov=langchain_ujeebu --cov-report=html
# Run type checking
mypy langchain_ujeebu
# Run linting
flake8 langchain_ujeebu
black langchain_ujeebu
Examples
Check out the examples directory for more usage examples:
- agent_example.py - Using Ujeebu with LangChain agents
- document_loader_example.py - Using the document loader with vector stores
Pricing
Ujeebu Extract API pricing is based on usage. Check the pricing page for details.
Support
- Documentation: https://ujeebu.com/docs/extract
- API Reference: https://ujeebu.com/docs
- Support: support@ujeebu.com
- GitHub Issues: Report a bug
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Related Projects
- LangChain - Build applications with LLMs through composability
- Ujeebu API - Web scraping and content extraction API
Changelog
0.1.0 (2024-12-30)
- Initial release
- UjeebuExtractTool for LangChain agents
- UjeebuLoader document loader
- Full test coverage
- Comprehensive documentation