Integrated LLM inference engine with a LangChain-Core-style interface across Kimi, GLM, MiniMax, DeepSeek, OpenAI, Anthropic, Hugging Face, and NVIDIA NIM.

ActLLMInfer

Integrated LLM inference engine with a LangChain-Core-style interface. One small package, one consistent API across the major Chinese and US chat-LLM providers — designed to be the inference layer for the actdecor package.

Supported providers

| Provider        | Class                   | Default model                         | API key env var                        |
|-----------------|-------------------------|---------------------------------------|----------------------------------------|
| Moonshot (Kimi) | ChatMoonshot / ChatKimi | moonshot-v1-8k                        | MOONSHOT_API_KEY                       |
| ZhipuAI (GLM)   | ChatZhipuAI / ChatGLM   | glm-4-plus                            | ZHIPUAI_API_KEY                        |
| MiniMax         | ChatMiniMax             | abab6.5s-chat                         | MINIMAX_API_KEY                        |
| DeepSeek        | ChatDeepSeek            | deepseek-chat                         | DEEPSEEK_API_KEY                       |
| OpenAI          | ChatOpenAI              | gpt-4o-mini                           | OPENAI_API_KEY                         |
| Anthropic       | ChatAnthropic           | claude-sonnet-4-6                     | ANTHROPIC_API_KEY                      |
| Hugging Face    | ChatHuggingFace         | Hythcliff/canadian-address-checker-on | HF_TOKEN (or HUGGINGFACEHUB_API_TOKEN) |
| NVIDIA NIM      | ChatNVIDIA              | meta/llama-3.3-70b-instruct           | NVIDIA_API_KEY (or NGC_API_KEY)        |

Install

pip install -e .

The only required dependency is requests. httpx is optional (for users who want async transports later).

Quick start

Direct invocation

from actllminfer import ChatKimi, HumanMessage, SystemMessage

llm = ChatKimi(model="moonshot-v1-8k", temperature=0.2)
reply = llm.invoke([
    SystemMessage(content="You are a concise assistant."),
    HumanMessage(content="Summarize the theory of relativity in one sentence."),
])
print(reply.content)

Unified completion() endpoint (LiteLLM-style)

For callers that just want a single function with an OpenAI-shaped response, completion() dispatches across every supported provider via a provider/model string. Providers can also be addressed with the legacy provider:model separator, and known prefixes (glm-, deepseek-, claude-, nemotron, …) are inferred automatically.

from actllminfer import completion

resp = completion(
    model="openai/gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize relativity in one line."}],
    temperature=0.2,
)
print(resp.choices[0].message.content)   # attribute access
print(resp["choices"][0]["message"]["content"])  # dict access
print(resp.usage["total_tokens"])
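
The three addressing forms described above (provider/model, provider:model, and a bare model id with a recognizable prefix) can be illustrated with a small resolver. This is only a sketch of the idea, not the package's actual dispatch code: split_model_spec, KNOWN_PROVIDERS, and PREFIX_HINTS are hypothetical names, and the moonshot- and gpt- prefixes are assumptions that do not appear in this README.

```python
# Hypothetical provider keys, mirroring the prefixes used in this README's examples.
KNOWN_PROVIDERS = {"openai", "kimi", "glm", "minimax", "deepseek", "anthropic", "hf", "nvidia"}

# Model-id prefixes mapped to a provider key.
PREFIX_HINTS = (
    ("glm-", "glm"),
    ("deepseek-", "deepseek"),
    ("claude-", "anthropic"),
    ("nemotron", "nvidia"),
    ("abab", "minimax"),
    ("moonshot-", "kimi"),  # assumption, not stated in the README
    ("gpt-", "openai"),     # assumption, not stated in the README
)


def split_model_spec(spec: str) -> tuple[str, str]:
    """Return (provider, model) for 'provider/model', 'provider:model', or a bare id."""
    # Try ':' first so 'nvidia:meta/llama-...' keeps the slash inside the model id.
    for sep in (":", "/"):
        head, found, tail = spec.partition(sep)
        if found and tail and head.lower() in KNOWN_PROVIDERS:
            return head.lower(), tail
    # No explicit provider: fall back to prefix inference on the bare model id.
    lowered = spec.lower()
    for prefix, provider in PREFIX_HINTS:
        if lowered.startswith(prefix):
            return provider, spec
    raise ValueError(f"cannot infer a provider from {spec!r}")


print(split_model_spec("openai/gpt-4o-mini"))
print(split_model_spec("nvidia:meta/llama-3.3-70b-instruct"))
print(split_model_spec("glm-4-plus"))
```

Checking the colon before the slash matters because Hugging Face and NVIDIA model ids contain a slash of their own.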

stream=True returns an iterator of OpenAI-shaped chat.completion.chunk objects:

for chunk in completion(model="kimi/moonshot-v1-8k", messages="Tell me a joke", stream=True):
    delta = chunk.choices[0].delta
    if delta.get("content"):
        print(delta.content, end="", flush=True)
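
When the full reply is needed after streaming, the deltas can be accumulated as they arrive. The sketch below works on plain dicts in the OpenAI chat.completion.chunk shape and uses stand-in chunks so it runs offline; join_stream and fake_chunks are hypothetical names, and the sketch assumes the package's chunk objects support the same dict-style access shown above.

```python
from typing import Iterable, Mapping


def join_stream(chunks: Iterable[Mapping]) -> str:
    """Concatenate the text deltas of the first choice in each chunk."""
    parts = []
    for chunk in chunks:
        choices = chunk.get("choices") or []
        if not choices:
            continue
        text = (choices[0].get("delta") or {}).get("content")
        if text:  # role-only and final chunks carry no content
            parts.append(text)
    return "".join(parts)


# Stand-in chunks in the OpenAI streaming shape (no network call involved).
fake_chunks = [
    {"choices": [{"delta": {"role": "assistant"}}]},
    {"choices": [{"delta": {"content": "Hello"}}]},
    {"choices": [{"delta": {"content": ", world"}}]},
    {"choices": [{"delta": {}, "finish_reason": "stop"}]},
]

print(join_stream(fake_chunks))  # Hello, world
```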

Per-call generation knobs (temperature, max_tokens, tools, response_format, seed, stop, …) are forwarded to the provider. Constructor-time arguments (api_key, base_url, organization, request_timeout, default_headers, extra_body, …) build (and cache) the underlying client.
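
One way to picture the split between per-call knobs and constructor-time arguments is a helper that partitions the keyword arguments and memoizes one client per distinct constructor configuration. This is a minimal sketch under assumptions: CONSTRUCTOR_KEYS lists only the names mentioned above, and split_kwargs, get_client, and the cache layout are hypothetical rather than the package's internals.

```python
# Constructor-time argument names taken from the paragraph above.
CONSTRUCTOR_KEYS = {
    "api_key", "base_url", "organization",
    "request_timeout", "default_headers", "extra_body",
}

_clients = {}  # hypothetical cache: (provider, constructor config) -> client


def split_kwargs(kwargs: dict) -> tuple[dict, dict]:
    """Partition kwargs into (constructor arguments, per-call arguments)."""
    ctor = {k: v for k, v in kwargs.items() if k in CONSTRUCTOR_KEYS}
    call = {k: v for k, v in kwargs.items() if k not in CONSTRUCTOR_KEYS}
    return ctor, call


def get_client(provider: str, factory, **ctor):
    """Build a client once per (provider, constructor config) and reuse it."""
    # repr() of the sorted items gives a hashable key even for dict values.
    key = (provider, repr(sorted(ctor.items())))
    if key not in _clients:
        _clients[key] = factory(**ctor)
    return _clients[key]


ctor, call = split_kwargs({"base_url": "https://example.com/v1", "temperature": 0.2})
first = get_client("openai", dict, **ctor)   # dict stands in for a real client class
second = get_client("openai", dict, **ctor)
print(call, first is second)
```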

An embedding() counterpart and async variants (acompletion, aembedding) are provided too:

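import asyncio

from actllminfer import embedding, acompletion

vecs = embedding(model="openai/text-embedding-3-small", input=["hi", "there"])
print(vecs.data[0].embedding[:4])

# `await` is only valid inside a coroutine, so wrap the async call.
async def main():
    resp = await acompletion(model="anthropic/claude-sonnet-4-6", messages="Hi")
    print(resp.choices[0].message.content)

asyncio.run(main())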
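
A common reason to reach for the async variant is to send several prompts concurrently. The sketch below shows the pattern with asyncio.gather; fake_acompletion is a stand-in coroutine so the example runs offline, and ask_all is a hypothetical helper. In real use the stand-in would be swapped for acompletion.

```python
import asyncio


async def fake_acompletion(model: str, messages: str) -> str:
    """Stand-in for acompletion so the sketch runs without a network call."""
    await asyncio.sleep(0.01)  # simulate network latency
    return f"[{model}] echo: {messages}"


async def ask_all(model: str, prompts: list[str]) -> list[str]:
    # gather() runs the coroutines concurrently and preserves input order.
    return await asyncio.gather(*(fake_acompletion(model, p) for p in prompts))


answers = asyncio.run(ask_all("anthropic/claude-sonnet-4-6", ["Hi", "Bye"]))
print(answers)
```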

A primary-with-fallbacks Router retries on transient errors so a rate-limited primary transparently falls through to the next provider:

from actllminfer import Router

router = Router([
    "openai/gpt-4o-mini",
    "kimi/moonshot-v1-8k",
    "deepseek/deepseek-chat",
])
resp = router.completion(messages=[{"role": "user", "content": "Hi"}])
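
The fallback behaviour can be pictured as a loop over the model list that moves on only when a transient error occurs. This is a sketch of the general pattern, not the Router's actual implementation: TransientError, complete_with_fallbacks, and flaky_call are hypothetical names, and the real Router may also retry the same provider before falling through.

```python
from typing import Callable, Sequence


class TransientError(Exception):
    """Hypothetical marker for retryable failures such as rate limits or timeouts."""


def complete_with_fallbacks(models: Sequence[str], call: Callable[[str], str]) -> str:
    """Try each model in order; fall through to the next one on a transient error."""
    last_error = None
    for model in models:
        try:
            return call(model)
        except TransientError as exc:
            last_error = exc  # remember the failure and try the next provider
    raise RuntimeError("all providers failed") from last_error


def flaky_call(model: str) -> str:
    """Stand-in provider call: the primary is rate limited, the others answer."""
    if model.startswith("openai/"):
        raise TransientError("429 rate limited")
    return f"answered by {model}"


print(complete_with_fallbacks(["openai/gpt-4o-mini", "kimi/moonshot-v1-8k"], flaky_call))
# answered by kimi/moonshot-v1-8k
```

Non-transient errors (for example an invalid API key) are deliberately not caught here, so they surface immediately instead of being masked by a fallback.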

String spec via the factory

from actllminfer import init_chat_model

llm = init_chat_model("kimi:moonshot-v1-8k", temperature=0)
llm = init_chat_model("glm-4-plus")                 # provider inferred
llm = init_chat_model("deepseek-reasoner")          # provider inferred
llm = init_chat_model("anthropic:claude-sonnet-4-6")
llm = init_chat_model("abab6.5s-chat")              # MiniMax inferred
llm = init_chat_model("hf:meta-llama/Llama-3.3-70B-Instruct")
llm = init_chat_model("nvidia:meta/llama-3.3-70b-instruct")

Hugging Face

ChatHuggingFace defaults to the HF Inference Router (https://router.huggingface.co/v1/chat/completions), which is OpenAI-compatible and dispatches to whichever provider currently serves the model id you pass.

from actllminfer import ChatHuggingFace

llm = ChatHuggingFace(model="Qwen/Qwen2.5-72B-Instruct")
print(llm.invoke("Summarize the theory of relativity in one sentence.").content)

For a dedicated Inference Endpoint, a self-hosted TGI server, or any other OpenAI-compatible deployment, just point base_url at the /v1 root:

llm = ChatHuggingFace(
    model="tgi",  # placeholder; the endpoint already targets a single model
    base_url="https://my-endpoint.example.com/v1",
)

The class accepts HF_TOKEN or the older HUGGINGFACEHUB_API_TOKEN env var.
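
The two-variable lookup can be sketched as below. resolve_hf_token is a hypothetical helper, and checking HF_TOKEN before the older variable is an assumption about precedence that this README does not state.

```python
import os
from typing import Mapping, Optional


def resolve_hf_token(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the first non-empty token among the two supported variables."""
    env = os.environ if env is None else env
    # Assumed precedence: the newer HF_TOKEN wins over the older name.
    for name in ("HF_TOKEN", "HUGGINGFACEHUB_API_TOKEN"):
        value = env.get(name)
        if value:
            return value
    return None


print(resolve_hf_token({"HUGGINGFACEHUB_API_TOKEN": "hf_legacy"}))  # hf_legacy
```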

A worked example using Hythcliff/canadian-address-checker-on to validate a batch of Canadian addresses and parse a structured JSON response is in examples/canadian_address_checker.py.

NVIDIA NIM (free serverless inference)

ChatNVIDIA targets NVIDIA's free OpenAI-compatible NIM endpoint at https://integrate.api.nvidia.com/v1/chat/completions. Grab a free nvapi-... key from build.nvidia.com and set NVIDIA_API_KEY (the legacy NGC_API_KEY is also accepted).

from actllminfer import ChatNVIDIA

llm = ChatNVIDIA(model="meta/llama-3.3-70b-instruct", temperature=0.2)
print(llm.invoke("Summarize the theory of relativity in one sentence.").content)

The same key fans out to dozens of hosted models — Llama 3.x, Mixtral, Nemotron, Qwen, DeepSeek, Phi, Gemma, etc. To target a self-hosted NIM microservice instead, point base_url at any OpenAI-compatible /v1 root:

llm = ChatNVIDIA(
    model="meta/llama-3.1-8b-instruct",
    base_url="https://my-nim.example.com/v1",
)
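
Pointing base_url at the /v1 root works because an OpenAI-compatible backend only needs to append /chat/completions and send a bearer token. The sketch below builds such a request without sending it; build_chat_request is a hypothetical helper, the key value is a placeholder, and the package's own request construction may differ.

```python
def build_chat_request(base_url: str, api_key: str, model: str, prompt: str):
    """Assemble (url, headers, payload) for an OpenAI-compatible chat endpoint."""
    url = base_url.rstrip("/") + "/chat/completions"  # tolerate a trailing slash
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return url, headers, payload


url, headers, payload = build_chat_request(
    "https://integrate.api.nvidia.com/v1/",
    "nvapi-placeholder",  # placeholder, not a real key
    "meta/llama-3.3-70b-instruct",
    "Hi",
)
print(url)  # https://integrate.api.nvidia.com/v1/chat/completions
```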

Composable chains (LCEL-style)

from actllminfer import ChatPromptTemplate, StrOutputParser, init_chat_model

llm = init_chat_model("kimi:moonshot-v1-8k")
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful translator."),
    ("user", "Translate to {language}: {text}"),
])
chain = prompt | llm | StrOutputParser()

print(chain.invoke({"language": "French", "text": "Good morning"}))
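
The pipe syntax relies on Python's __or__ operator: each step exposes invoke, and a | b yields a new step that feeds a's output into b. The toy classes below illustrate that mechanism only; Step, fill, fake_llm, and to_text are hypothetical stand-ins, not the package's Runnable classes.

```python
class Step:
    """Toy runnable: wraps a function and composes with other steps via `|`."""

    def __init__(self, fn):
        self.fn = fn

    def invoke(self, value):
        return self.fn(value)

    def __or__(self, other):
        # a | b -> a new Step that runs a, then passes the result to b.
        return Step(lambda value: other.invoke(self.invoke(value)))


fill = Step(lambda d: f"Translate to {d['language']}: {d['text']}")  # prompt stand-in
fake_llm = Step(lambda text: {"content": text.upper()})              # model stand-in
to_text = Step(lambda msg: msg["content"])                           # parser stand-in

toy_chain = fill | fake_llm | to_text
print(toy_chain.invoke({"language": "French", "text": "Good morning"}))
# TRANSLATE TO FRENCH: GOOD MORNING
```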

Streaming

for chunk in llm.stream("Tell me a short story about a robot."):
    print(chunk.text, end="", flush=True)

Tool / function calling (OpenAI-shaped)

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

llm_with_tools = init_chat_model("kimi:moonshot-v1-8k").with_tools(tools)
ai_msg = llm_with_tools.invoke("What's the weather in Beijing?")
for call in ai_msg.tool_calls:
    print(call.name, call.args)
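
Once the model returns tool calls, the caller still has to execute them. A minimal dispatch loop might look like the sketch below, which assumes only the two attributes used above (name and args); ToolCall, REGISTRY, run_tool_calls, and the canned get_weather are hypothetical stand-ins so the example runs offline.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    """Stand-in exposing the two attributes read in the loop above."""
    name: str
    args: dict


def get_weather(city: str) -> str:
    return f"Sunny in {city}"  # canned answer, no real weather lookup


# Map tool names (as declared in the tools spec) to local Python callables.
REGISTRY = {"get_weather": get_weather}


def run_tool_calls(calls):
    """Execute each requested tool and collect the results in order."""
    results = []
    for call in calls:
        fn = REGISTRY.get(call.name)
        if fn is None:
            results.append(f"unknown tool: {call.name}")
        else:
            results.append(fn(**call.args))
    return results


print(run_tool_calls([ToolCall("get_weather", {"city": "Beijing"})]))
# ['Sunny in Beijing']
```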

The same tools=[{"type": "function", ...}] spec is automatically translated to Anthropic's input_schema shape for ChatAnthropic.
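
The translation is mostly a matter of flattening the function wrapper and renaming parameters to input_schema. The sketch below shows the mapping between the two shapes; to_anthropic_tool is a hypothetical helper, and the package's own conversion may handle more fields.

```python
def to_anthropic_tool(tool: dict) -> dict:
    """Convert an OpenAI-style function tool into Anthropic's tool shape."""
    fn = tool["function"]
    return {
        "name": fn["name"],
        "description": fn.get("description", ""),
        # Anthropic expects the JSON Schema under `input_schema`.
        "input_schema": fn.get("parameters", {"type": "object", "properties": {}}),
    }


openai_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

print(to_anthropic_tool(openai_tool))
```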

JSON output

from actllminfer import JsonOutputParser

# Reuses `prompt` and `llm` from the chain example above.
chain = prompt | llm | JsonOutputParser()
data = chain.invoke({"language": "JSON", "text": "Return {\"ok\": true} only."})
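
Models often wrap JSON in a Markdown code fence, so a JSON parser for chat output usually strips the fence before decoding. The sketch below shows one way to do that with the standard library; parse_json_reply is a hypothetical helper, and whether JsonOutputParser applies the same cleanup is an assumption.

```python
import json

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the sketch readable


def parse_json_reply(text: str):
    """Decode JSON from a model reply, tolerating a surrounding code fence."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line (it may carry a language tag such as `json`).
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        # Drop the closing fence if one is present.
        cleaned = cleaned.rsplit(FENCE, 1)[0]
    return json.loads(cleaned)


fenced_reply = FENCE + 'json\n{"ok": true}\n' + FENCE
print(parse_json_reply(fenced_reply))    # {'ok': True}
print(parse_json_reply('{"ok": true}'))  # {'ok': True}
```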

Architecture

actllminfer/
├── messages.py         # BaseMessage, SystemMessage, HumanMessage, AIMessage, ToolMessage, ...
├── outputs.py          # ChatGeneration, ChatResult, ChatGenerationChunk
├── prompts.py          # PromptTemplate, ChatPromptTemplate, MessagesPlaceholder
├── output_parsers.py   # StrOutputParser, JsonOutputParser, CommaSeparatedListOutputParser
├── runnables.py        # Runnable, RunnableLambda, RunnablePassthrough, RunnableSequence
├── callbacks.py        # BaseCallbackHandler, CallbackManager, StdOutCallbackHandler
├── language_models/
│   └── base.py         # BaseChatModel
├── chat_models/
│   ├── _openai_compat.py   # shared OpenAI /v1/chat/completions backend
│   ├── openai.py
│   ├── moonshot.py     # Kimi
│   ├── deepseek.py
│   ├── zhipuai.py      # GLM
│   ├── minimax.py
│   ├── anthropic.py    # Claude (different transport)
│   ├── huggingface.py  # HF Inference Router / TGI / dedicated endpoints
│   └── nvidia.py       # NVIDIA NIM serverless / self-hosted
├── factory.py          # init_chat_model("kimi:moonshot-v1-8k")
└── exceptions.py

Every provider implements the same BaseChatModel contract: invoke, batch, stream, generate, with_tools, bind. That means the actdecor package can keep one code path and switch providers via configuration.
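
The benefit of a shared contract is that calling code depends only on the interface, so the provider becomes a configuration value. The sketch below illustrates this with a toy abstract base class; MiniChatModel, EchoModel, MODEL_REGISTRY, build_model, and summarize are hypothetical stand-ins, and the real BaseChatModel has a richer interface than the two methods shown.

```python
from abc import ABC, abstractmethod


class MiniChatModel(ABC):
    """Toy contract: subclasses supply invoke(); batch() comes for free."""

    @abstractmethod
    def invoke(self, prompt: str) -> str: ...

    def batch(self, prompts):
        return [self.invoke(p) for p in prompts]


class EchoModel(MiniChatModel):
    """Stand-in provider that echoes the prompt with its own tag."""

    def __init__(self, tag: str):
        self.tag = tag

    def invoke(self, prompt: str) -> str:
        return f"{self.tag}: {prompt}"


# Hypothetical registry: provider name from configuration -> model factory.
MODEL_REGISTRY = {
    "kimi": lambda: EchoModel("kimi"),
    "deepseek": lambda: EchoModel("deepseek"),
}


def build_model(config: dict) -> MiniChatModel:
    return MODEL_REGISTRY[config["provider"]]()


def summarize(model: MiniChatModel, text: str) -> str:
    # The single code path: it never needs to know which provider it got.
    return model.invoke(f"Summarize: {text}")


for provider in ("kimi", "deepseek"):
    print(summarize(build_model({"provider": provider}), "relativity"))
```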

License

Apache 2.0.
