llama-index-llms-grok

LlamaIndex integration for xAI's Grok models using the official xai-sdk, with full structured output support and built-in token counting.

This library provides native support for the latest Grok models (including Grok 4 and Grok 4.1 fast models, with and without reasoning) via xAI's modern Chat API rather than the older OpenAI-compatible completions endpoint.

Installation

pip install llama-index-llms-grok

Setup

Get your API key from console.x.ai and set it as an environment variable:

export XAI_API_KEY=your_api_key_here

Quick Start

Basic Chat

from llama_index_llms_grok import Grok
from llama_index.core.llms import ChatMessage

# Initialize with default Grok 4.1 model
llm = Grok(api_key="your_api_key")  # or set XAI_API_KEY env var

# Chat
messages = [
    ChatMessage(role="system", content="You are a helpful assistant."),
    ChatMessage(role="user", content="Explain quantum computing briefly."),
]
response = llm.chat(messages)
print(response.message.content)

Using Grok Fast (Non-Reasoning)

from llama_index_llms_grok import GrokFast

llm = GrokFast()  # Uses grok-4-1-fast-non-reasoning model
response = llm.complete("What is the capital of France?")
print(response.text)

Using Grok with Reasoning Mode

from llama_index_llms_grok import GrokReasoning

# Reasoning models may take longer, so timeout is set to 3600s by default
llm = GrokReasoning(show_reasoning=True)  # Set to True to see thinking process
response = llm.complete("Solve this logic puzzle: ...")
print(response.text)

Using Grok for Code

from llama_index_llms_grok import GrokCode

llm = GrokCode()  # Uses grok-code-fast-1 model
response = llm.complete("Write a Python function to calculate fibonacci numbers.")
print(response.text)

Using Grok Vision

from llama_index_llms_grok import GrokVision

llm = GrokVision()  # Uses grok-2-vision-1212 model
# Vision capabilities for image understanding
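
A minimal sketch of passing an image, assuming this integration accepts LlamaIndex's standard multimodal message blocks (TextBlock and ImageBlock from llama-index-core); the image URL is a placeholder:

from llama_index_llms_grok import GrokVision
from llama_index.core.llms import ChatMessage, ImageBlock, TextBlock

llm = GrokVision()

# A multimodal message: a text prompt plus an image referenced by URL
message = ChatMessage(
    role="user",
    blocks=[
        TextBlock(text="What is shown in this image?"),
        ImageBlock(url="https://example.com/photo.jpg"),
    ],
)
response = llm.chat([message])
print(response.message.content)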

Using Grok 3 Models

from llama_index_llms_grok import Grok3, Grok3Mini

# Full Grok 3 model
llm = Grok3()

# Or lightweight Grok 3 Mini
llm_mini = Grok3Mini()

Streaming

from llama_index_llms_grok import Grok
from llama_index.core.llms import ChatMessage

llm = Grok()
messages = [ChatMessage(role="user", content="Tell me a story about AI.")]

for chunk in llm.stream_chat(messages):
    print(chunk.delta, end="", flush=True)
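
Async streaming is also available (the package lists async/await support under Features below); a sketch using LlamaIndex's standard astream_chat method, assuming this integration implements it like other LlamaIndex LLMs:

import asyncio

from llama_index_llms_grok import Grok
from llama_index.core.llms import ChatMessage

async def main():
    llm = Grok()
    messages = [ChatMessage(role="user", content="Tell me a story about AI.")]

    # astream_chat returns an async generator of chat response chunks
    stream = await llm.astream_chat(messages)
    async for chunk in stream:
        print(chunk.delta, end="", flush=True)

asyncio.run(main())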

Custom Parameters

from llama_index_llms_grok import Grok

llm = Grok(
    model="grok-4-1-fast-reasoning",
    temperature=0.7,
    max_tokens=1024,
    timeout=600,
)

Token Counting

Token usage is available in response.additional_kwargs:

from llama_index_llms_grok import Grok

llm = Grok()
response = llm.complete("Hello, world!")

# Get token counts
print(f"Prompt tokens: {response.additional_kwargs.get('prompt_tokens')}")
print(f"Completion tokens: {response.additional_kwargs.get('completion_tokens')}")
print(f"Total tokens: {response.additional_kwargs.get('total_tokens')}")

See examples/token_counting_example.py for more examples.
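
For multi-turn usage you can accumulate these counts yourself; a minimal sketch built only on the additional_kwargs keys shown above:

from llama_index_llms_grok import Grok

llm = Grok()
total = 0
for question in ["What is an LLM?", "How are LLMs trained?"]:
    response = llm.complete(question)
    # Each response reports its own usage; sum across calls
    total += response.additional_kwargs.get("total_tokens", 0)

print(f"Tokens used across all calls: {total}")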

Structured Outputs

Grok now supports structured outputs with Pydantic models:

from pydantic import BaseModel
from llama_index.core.prompts import PromptTemplate
from llama_index.core.program import LLMTextCompletionProgram
from llama_index_llms_grok import Grok

class Person(BaseModel):
    name: str
    age: int
    occupation: str

llm = Grok()

# Method 1: Using structured_predict
prompt = PromptTemplate("Extract person info: {text}")
person = llm.structured_predict(
    output_cls=Person,
    prompt=prompt,
    text="Alice is a 30-year-old engineer"
)

# Method 2: Using LLMTextCompletionProgram
program = LLMTextCompletionProgram.from_defaults(
    output_cls=Person,
    llm=llm,
    prompt_template_str="Extract person info: {text}"
)
person = program(text="Bob is a 25-year-old designer")

# Method 3: Using as_structured_llm
structured_llm = llm.as_structured_llm(output_cls=Person)
response = structured_llm.complete("Extract: Charlie, 35, doctor")
person = response.raw  # Pydantic model instance
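
Nested Pydantic models work the same way; a short sketch with a hypothetical Company/Employee schema, reusing the llm and PromptTemplate import from above:

from pydantic import BaseModel

class Employee(BaseModel):
    name: str
    role: str

class Company(BaseModel):
    name: str
    employees: list[Employee]

company = llm.structured_predict(
    output_cls=Company,
    prompt=PromptTemplate("Extract company info: {text}"),
    text="Acme employs Alice (engineer) and Bob (designer)",
)
print(company.employees[0].name)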

See examples/structured_outputs_example.py for comprehensive examples.

Streaming Structured Outputs

from pydantic import BaseModel
from llama_index.core.prompts import PromptTemplate
from llama_index_llms_grok import Grok

class Product(BaseModel):
    name: str
    price: float
    in_stock: bool

llm = Grok()
prompt = PromptTemplate("Extract product info: {text}")

# Stream partial structured outputs
for partial_product in llm.stream_structured_predict(
    output_cls=Product,
    prompt=prompt,
    text="iPhone 15 costs $999 and is in stock"
):
    print(f"Update: {partial_product.name}")

Available Models

Language Models

Grok 4.1 (Latest - 2M Context Window)

  • grok-4-1-fast-reasoning - Fast model with reasoning (default)
  • grok-4-1-fast-non-reasoning - Fast model without reasoning (GrokFast)

Grok 4 (2M Context Window)

  • grok-4-fast-reasoning - Fast Grok 4 model with reasoning
  • grok-4-fast-non-reasoning - Fast Grok 4 model without reasoning

Specialized Models

  • grok-code-fast-1 - Optimized for code (256K context) (GrokCode)
  • grok-4-0709 - Date-stamped Grok 4 release (256K context)

Grok 3 (131K Context Window)

  • grok-3 - Standard Grok 3 model (Grok3)
  • grok-3-mini - Lightweight Grok 3 (Grok3Mini)

Grok 2

  • grok-2-1212 - Grok 2 from December 2024 (131K context)
  • grok-2-vision-1212 - Vision-enabled Grok 2 (32K context) (GrokVision)

Image Generation Models

  • grok-2-image-1212 - Image generation (not yet supported in this package)

Features

  • ✅ Native xAI SDK integration using modern Chat API
  • ✅ Support for all Grok models (2, 3, 4, 4.1)
  • ✅ 2M context window support for Grok 4.1 models
  • ✅ Specialized models: Code, Vision
  • ✅ Reasoning and non-reasoning modes
  • ✅ Streaming responses
  • ✅ Structured outputs with LLMTextCompletionProgram and as_structured_llm()
  • ✅ Token counting via response.additional_kwargs
  • ✅ Automatic reasoning content handling
  • ✅ Full LlamaIndex LLM interface compatibility
  • ✅ Type hints and proper error handling
  • ✅ Configurable timeouts for long-running reasoning tasks
  • ✅ Async/await support

Note on TokenCountingHandler

  • ⚠️ TokenCountingHandler may not work perfectly - use response.additional_kwargs for reliable token counts

See COMPATIBILITY_NOTES.md for details.

Advanced Usage

Accessing Reasoning Content

When using reasoning models with show_reasoning=False (default), the thinking process is stripped from the response but accessible via additional_kwargs:

from llama_index_llms_grok import GrokReasoning
from llama_index.core.llms import ChatMessage

llm = GrokReasoning(show_reasoning=False)
response = llm.chat([ChatMessage(role="user", content="Complex question...")])

# Access reasoning if available
if "reasoning_content" in response.message.additional_kwargs:
    print("Thinking:", response.message.additional_kwargs["reasoning_content"])
print("Answer:", response.message.content)

Integration with LlamaIndex

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index_llms_grok import Grok

# Load documents
documents = SimpleDirectoryReader("./data").load_data()

# Create index with Grok
llm = Grok(model="grok-4-1-fast-reasoning")
index = VectorStoreIndex.from_documents(documents)

# Query
query_engine = index.as_query_engine(llm=llm)
response = query_engine.query("What are the key points in these documents?")
print(response)

Examples

The package includes comprehensive examples demonstrating all features:

Available Example Files

  1. examples/basic_usage.py - Basic usage of all Grok models

    • Chat and completion
    • Fast and reasoning models
    • Streaming responses
    • Code generation with GrokCode
    • Grok 3 and Grok 3 Mini usage
    • Vision model information
  2. examples/token_counting_example.py - Token usage tracking

    • Basic token counting from responses
    • Multi-turn conversation tracking
    • Token counting with different models
    • Best practices for token tracking
  3. examples/structured_outputs_example.py - Structured outputs (NEW!)

    • Using structured_predict() method
    • Using LLMTextCompletionProgram
    • Using as_structured_llm()
    • Complex nested Pydantic models
    • Streaming structured outputs
    • Integration with query engines
    • Before/after comparisons

Running Examples

# Set your API key
export XAI_API_KEY=your_api_key_here

# Run basic examples
python examples/basic_usage.py

# Run token counting examples
python examples/token_counting_example.py

# Run structured outputs examples
python examples/structured_outputs_example.py

Requirements

  • Python >=3.10
  • xai-sdk>=1.4.0
  • llama-index-core>=0.14.8

License

MIT

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Comparison with Other Providers

Why Use This Integration?

This integration uses xAI's native SDK instead of OpenAI compatibility mode:

  • Latest Models: Access to newest Grok models immediately
  • Native Features: Full reasoning mode support with <thinking> tags
  • Structured Outputs: Complete LLMTextCompletionProgram support
  • Long Context: 2M context window for Grok 4.1 models
  • Specialized Models: GrokCode, GrokVision, Grok3Mini
  • Token Counting: Built-in token usage tracking
  • Future-Proof: Native SDK ensures compatibility with new xAI features

Feature Comparison

Feature                     This Package (Grok)
Structured Outputs          ✅
LLMTextCompletionProgram    ✅
Streaming                   ✅
Async Support               ✅
Token Counting              ✅
Reasoning Mode              ✅
2M Context Window           ✅
Code-Optimized Model        ✅

Changelog

Version 0.1.2 (2024-11-20)

Fixed

  • CRITICAL: Fixed 'Response' object is not iterable error in production usage with query engines
    • Root cause: LlamaIndex was trying to iterate over raw xai SDK Response object
    • Solution: Convert Response to dict before passing to LlamaIndex
    • Now fully compatible with all LlamaIndex features including query engines

Version 0.1.1 (2024-11-20)

Added

  • Full Structured Output Support: structured_predict(), LLMTextCompletionProgram, as_structured_llm()
  • Streaming Structured Outputs: stream_structured_predict() and async version
  • JSON Schema Generation: Automatic Pydantic model to JSON schema conversion
  • Response Parsing: Automatic markdown stripping and JSON validation

Fixed

  • ✅ Fixed 'Response' object is not iterable error with LLMTextCompletionProgram
  • ✅ Improved structured output reliability with better prompt engineering

Documentation

  • ✅ Added comprehensive structured outputs guide
  • ✅ Added structured outputs examples
  • ✅ Updated README with all new features
  • ✅ Added troubleshooting section

Version 0.1.0 (2024-11-20)

Added

  • New Models: Support for all Grok models (2, 3, 4, 4.1)
  • Convenience Classes: GrokFast, GrokReasoning, GrokCode, GrokVision, Grok3, Grok3Mini
  • Streaming Support: Full streaming for chat, completion, and structured outputs
  • Async Support: All async methods implemented
  • Comprehensive Documentation: 50+ pages of guides and examples

Fixed

  • ✅ Fixed 'Response' object is not iterable error with structured outputs
  • ✅ Fixed Pydantic v2 compatibility for llama-index-core 0.14.8+
  • ✅ Fixed model names to match official xAI API

Features

  • ✅ 2M context window support for Grok 4.1 models
  • ✅ Automatic reasoning content extraction
  • ✅ Dynamic context window detection per model
  • ✅ JSON schema generation from Pydantic models
  • ✅ Automatic response parsing and validation

Troubleshooting

Common Issues

Issue: ValueError: Trying to read the xAI API key from the XAI_API_KEY environment variable but it doesn't exist

Solution: Set your API key:

export XAI_API_KEY=your_api_key_here

Or pass it directly:

llm = Grok(api_key="your_api_key_here")

Issue: Token counting not working

Solution: Use response.additional_kwargs instead of TokenCountingHandler:

response = llm.complete("...")
tokens = response.additional_kwargs.get('total_tokens', 0)

Issue: Structured outputs failing

Solution: Make sure you're using the latest version with structured output support:

pip install --upgrade llama-index-llms-grok

For more issues and solutions, see COMPATIBILITY_NOTES.md.

Best Practices

Choosing the Right Model

  • Fast Responses: Use GrokFast() (grok-4-1-fast-non-reasoning)
  • Complex Reasoning: Use GrokReasoning() (grok-4-1-fast-reasoning)
  • Code Generation: Use GrokCode() (grok-code-fast-1)
  • Budget-Friendly: Use Grok3Mini() (grok-3-mini)
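
Since all of these classes expose the same interface, model choice can be a simple lookup; a small sketch (the task labels are made up for illustration):

from llama_index_llms_grok import GrokFast, GrokReasoning, GrokCode, Grok3Mini

# Hypothetical task labels mapped to the classes suggested above
MODEL_FOR_TASK = {
    "chat": GrokFast,
    "reasoning": GrokReasoning,
    "code": GrokCode,
    "budget": Grok3Mini,
}

llm = MODEL_FOR_TASK["code"]()  # instantiate only the model you need
print(llm.complete("Write a binary search in Python.").text)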

Token Management

# Always check token usage
response = llm.complete("...")
if response.additional_kwargs:
    tokens = response.additional_kwargs.get('total_tokens', 0)
    print(f"Used {tokens} tokens")

Structured Outputs

# Use descriptive field names, types, and constraints
from pydantic import BaseModel, Field

class Person(BaseModel):
    """Person information."""  # the docstring helps the LLM understand the schema
    name: str = Field(description="Full name")
    age: int = Field(description="Age in years", ge=0, le=150)

Error Handling

try:
    response = llm.complete("...")
except Exception as e:
    print(f"Error: {e}")
    # Handle error appropriately
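
For transient failures such as timeouts or rate limits, a simple retry with exponential backoff often helps; this sketch catches a broad Exception because the exact xai-sdk exception types aren't documented here:

import time

from llama_index_llms_grok import Grok

llm = Grok()

def complete_with_retry(prompt: str, retries: int = 3) -> str:
    for attempt in range(retries):
        try:
            return llm.complete(prompt).text
        except Exception as e:  # narrow to SDK-specific errors if you know them
            if attempt == retries - 1:
                raise
            wait = 2 ** attempt  # exponential backoff: 1s, 2s, ...
            print(f"Error: {e}; retrying in {wait}s")
            time.sleep(wait)

print(complete_with_retry("Hello!"))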
