LangChain integration for the Diffbot Knowledge Graph and Extract APIs
Project description
langchain-diffbot
A thin LangChain integration over the official diffbot-python SDK. Every Diffbot API gets the closest LangChain primitive:
| Diffbot API | LangChain class(es) |
|---|---|
| Knowledge Graph (DQL) | DiffbotKnowledgeGraphRetriever, DiffbotKnowledgeGraphTool |
| Web Search | DiffbotWebSearchRetriever, DiffbotWebSearchTool |
| Extract (Analyze) | DiffbotExtractTool, DiffbotExtractLoader |
| NLP entities | DiffbotEntitiesTool |
| Crawl | DiffbotCrawlLoader |
LLM RAG (ask) |
ChatDiffbot (with native streaming) |
Installation
pip install langchain-diffbot
Authentication
Get an API token at https://app.diffbot.com/get-started/ and export it:
export DIFFBOT_API_TOKEN="..."
Every class also accepts diffbot_api_token=... directly, or a pre-built diffbot.Diffbot client via client=... (see Bring-your-own-client below).
Quickstart — Knowledge Graph retriever
from langchain_diffbot import DiffbotKnowledgeGraphRetriever
retriever = DiffbotKnowledgeGraphRetriever(k=5)
docs = retriever.invoke("type:Organization industries:\"Artificial Intelligence\" location.city.name:\"Boston\"")
for d in docs:
print(d.metadata["name"], "—", d.page_content[:120])
The query string is a DQL (Diffbot Query Language) expression.
Shaping the output
Diffbot KG entities and web-search results are large. Dumping them straight into an LLM prompt can blow past per-minute input-token limits in a single call. Both retrievers expose three shaping knobs:
from langchain_core.documents import Document
from langchain_diffbot import DiffbotKnowledgeGraphRetriever
# 1. Project only the top-level fields you care about. Drops everything else
# from `metadata`. Recommended for agent / tool-use scenarios.
retriever = DiffbotKnowledgeGraphRetriever(
k=5,
fields=["id", "type", "name", "homepageUri", "nbEmployees"],
)
# 2. Choose which field becomes `page_content`. First non-empty value wins.
retriever = DiffbotKnowledgeGraphRetriever(
content_fields=["summary", "description", "name"],
)
# 3. For total control, pass a `document_mapper` that turns a raw entity
# dict into whatever Document shape you want.
def mapper(entity: dict) -> Document:
return Document(
page_content=entity.get("summary", ""),
metadata={"id": entity["id"], "name": entity["name"]},
)
retriever = DiffbotKnowledgeGraphRetriever(document_mapper=mapper)
fields and content_fields are ignored when document_mapper is set. The same knobs work on DiffbotWebSearchRetriever.
Web search
from langchain_diffbot import DiffbotWebSearchRetriever
web = DiffbotWebSearchRetriever(k=5, fields=["title", "pageUrl", "score"])
docs = web.invoke("diffbot knowledge graph llm grounding")
Extract a URL
from langchain_diffbot import DiffbotExtractTool, DiffbotExtractLoader
# Single URL
tool = DiffbotExtractTool()
page = tool.invoke({"url": "https://www.diffbot.com/products/extract/"})
# Batch — yields one Document per URL, sync or async
loader = DiffbotExtractLoader(urls=["https://example.com", "https://diffbot.com"])
for doc in loader.lazy_load():
print(doc.metadata["title"], doc.page_content[:200])
DiffbotExtractTool returns a structured {"error": ..., "errorCode": ...} dict when Diffbot reports an extraction failure (200 with errorCode), so agents can react and try another URL instead of catching an exception. Auth / rate-limit errors propagate as diffbot.errors.AuthError / RateLimitError.
ChatDiffbot
from langchain_core.messages import HumanMessage
from langchain_diffbot import ChatDiffbot
llm = ChatDiffbot()
for chunk in llm.stream([HumanMessage(content="What is the Diffbot Knowledge Graph?")]):
print(chunk.content, end="", flush=True)
_stream / _astream are native — no thread-pool fallback. .invoke() aggregates the stream into a single message.
Using a retriever in a chain
The retrievers are standard BaseRetrievers, so they slot into LCEL like any other:
from langchain_anthropic import ChatAnthropic
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_diffbot import DiffbotKnowledgeGraphRetriever
retriever = DiffbotKnowledgeGraphRetriever(
k=5,
fields=["id", "name", "homepageUri", "nbEmployees", "industries"],
)
prompt = ChatPromptTemplate.from_template(
"Answer using only this Diffbot KG context:\n\n{context}\n\nQuestion: {question}"
)
def _format(docs):
return "\n---\n".join(
f"{d.metadata.get('name')} (id={d.metadata.get('id')}): {d.page_content}"
for d in docs
)
chain = (
{"context": retriever | _format, "question": RunnablePassthrough()}
| prompt
| ChatAnthropic(model="claude-sonnet-4-6")
| StrOutputParser()
)
chain.invoke('type:Organization location.city.name:"Boston" industries:"Biotech"')
Bring-your-own-client
Every class accepts a pre-built diffbot.Diffbot (or diffbot.DiffbotAsync) via client / async_client. The package uses it as-is and does not close it — you own the lifecycle. This is the escape hatch for anything the SDK supports that's not re-exposed as a field (custom URLs, transport=, shared connection pools, custom headers).
from diffbot import Diffbot
from langchain_diffbot import DiffbotKnowledgeGraphRetriever
# One client shared across many retriever calls (no per-call httpx pool churn)
shared = Diffbot(token="...", timeout=60.0)
retriever = DiffbotKnowledgeGraphRetriever(client=shared, k=5)
Examples
The examples/ folder has runnable demos:
examples/quickstart/— full tour: every public class, output shaping, async, and a multi-tool research agent.examples/company_research/— the same multi-tool agent as a one-shot CLI:cd examples && python -m company_research "your question". The agent combines KG search + web search + URL extract.
Both need langchain + langchain-anthropic on top of the base package — install the extra:
pip install "langchain-diffbot[examples]"
Development
uv sync --all-groups
uv run pytest tests/unit_tests
Releasing
Tokens are stored in macOS Keychain so the Makefile can pull them automatically — no plaintext on disk, no shell-history leaks. First-time setup:
make set-token-testpypi # prompts; input is hidden as you paste
make set-token-pypi # same, for real PyPI
Both targets read with bash read -rsp (hidden input), overwrite any existing entry, and never put the token in make output or shell history. Re-run either one any time you rotate a token.
Then the release flow per version:
# 1. Bump the version (edits pyproject.toml in place via `uv version --bump`)
make bump-patch # 0.1.0 → 0.1.1
# or: make bump-minor # 0.1.0 → 0.2.0
# or: make bump-major # 0.1.0 → 1.0.0
# 2. Publish to TestPyPI and verify installable
make release-test
make verify-release-test
# 3. Publish to real PyPI (prompts for the version to confirm)
make release
make verify-release
make release-test and make release will refuse to publish if the current pyproject.toml version is already on the target index — so the workflow is "bump → release-test → verify → release → verify".
To rotate: revoke the old token at https://pypi.org/manage/account/token/ (or the TestPyPI equivalent), then run make set-token-pypi / make set-token-testpypi again — it overwrites the existing Keychain entry without prompting.
Integration tests hit the live Diffbot API and require DIFFBOT_API_TOKEN:
uv run pytest tests/integration_tests
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file langchain_diffbot-0.1.0.tar.gz.
File metadata
- Download URL: langchain_diffbot-0.1.0.tar.gz
- Upload date:
- Size: 15.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.5 {"installer":{"name":"uv","version":"0.11.5","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
ae92cdee51e67221118e69569aa0ee4d8d4f4761184cfd3bc460ea65133aaa2b
|
|
| MD5 |
cc48b636a3f1352cf367e594d335314c
|
|
| BLAKE2b-256 |
c99e20c30f132c72cf2ce6ba9b0429043064013634e491d658095d9bfb8850b1
|
File details
Details for the file langchain_diffbot-0.1.0-py3-none-any.whl.
File metadata
- Download URL: langchain_diffbot-0.1.0-py3-none-any.whl
- Upload date:
- Size: 18.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.5 {"installer":{"name":"uv","version":"0.11.5","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
de1e3d1d8457926a41fe5e3cbad1f9196eeeed7798545da9be5ac4b02ce8af35
|
|
| MD5 |
b99a1ab29453c476fbb56692562ed500
|
|
| BLAKE2b-256 |
a04a4819942bd23d18a68b6eb9376ae5a50b1524628f8f84739ab567baca288b
|