Integrated LLM inference engine with a LangChain-Core-style interface across Kimi, GLM, MiniMax, DeepSeek, OpenAI, Anthropic, Hugging Face, and NVIDIA NIM.

ActLLMInfer

Integrated LLM inference engine with a LangChain-Core-style interface. One small package with one consistent API across the major Chinese and US chat-LLM providers, designed to be the inference layer for the actdecor package.

Supported providers

Provider	Class	Default model	API key env var
Moonshot (Kimi)	`ChatMoonshot` / `ChatKimi`	`moonshot-v1-8k`	`MOONSHOT_API_KEY`
ZhipuAI (GLM)	`ChatZhipuAI` / `ChatGLM`	`glm-4-plus`	`ZHIPUAI_API_KEY`
MiniMax	`ChatMiniMax`	`abab6.5s-chat`	`MINIMAX_API_KEY`
DeepSeek	`ChatDeepSeek`	`deepseek-chat`	`DEEPSEEK_API_KEY`
OpenAI	`ChatOpenAI`	`gpt-4o-mini`	`OPENAI_API_KEY`
Anthropic	`ChatAnthropic`	`claude-sonnet-4-6`	`ANTHROPIC_API_KEY`
Hugging Face	`ChatHuggingFace`	`Hythcliff/canadian-address-checker-on`	`HF_TOKEN` (or `HUGGINGFACEHUB_API_TOKEN`)
NVIDIA NIM	`ChatNVIDIA`	`meta/llama-3.3-70b-instruct`	`NVIDIA_API_KEY` (or `NGC_API_KEY`)

Install

pip install actllminfer

From a source checkout, an editable install also works:

pip install -e .

The only required dependency is requests. httpx is optional (for users who want async transports later).

Quick start

Direct invocation

from actllminfer import ChatKimi, HumanMessage, SystemMessage

llm = ChatKimi(model="moonshot-v1-8k", temperature=0.2)
reply = llm.invoke([
    SystemMessage(content="You are a concise assistant."),
    HumanMessage(content="Summarize the theory of relativity in one sentence."),
])
print(reply.content)

Unified `completion()` endpoint (LiteLLM-style)

For callers that just want a single function with an OpenAI-shaped response, completion() dispatches across every supported provider via a provider/model string. Providers can also be addressed with the legacy provider:model separator, and known prefixes (glm-, deepseek-, claude-, nemotron, …) are inferred automatically.

from actllminfer import completion

resp = completion(
    model="openai/gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize relativity in one line."}],
    temperature=0.2,
)
print(resp.choices[0].message.content)   # attribute access
print(resp["choices"][0]["message"]["content"])  # dict access
print(resp.usage["total_tokens"])

stream=True returns an iterator of OpenAI-shaped chat.completion.chunk objects:

for chunk in completion(model="kimi/moonshot-v1-8k", messages="Tell me a joke", stream=True):
    delta = chunk.choices[0].delta
    if delta.get("content"):
        print(delta.content, end="", flush=True)
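
To make the chunk shape concrete, here is a minimal sketch that reassembles streamed text from OpenAI-shaped chunks. It uses plain dicts as stand-ins for the objects completion() yields, and collect_text is a hypothetical helper, not part of actllminfer.

```python
def collect_text(chunks):
    """Join the content deltas of a chat.completion.chunk stream."""
    parts = []
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            # Role-only and final chunks carry no "content" key.
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts)


# Hand-written stand-in for a provider stream (assumed shape).
fake_stream = [
    {"object": "chat.completion.chunk",
     "choices": [{"index": 0, "delta": {"role": "assistant"}}]},
    {"object": "chat.completion.chunk",
     "choices": [{"index": 0, "delta": {"content": "Hel"}}]},
    {"object": "chat.completion.chunk",
     "choices": [{"index": 0, "delta": {"content": "lo"}}]},
    {"object": "chat.completion.chunk",
     "choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]},
]

print(collect_text(fake_stream))  # prints: Hello
```

Skipping empty deltas matters because the first chunk typically carries only the role and the last only a finish reason.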

Per-call generation knobs (temperature, max_tokens, tools, response_format, seed, stop, …) are forwarded to the provider. Constructor-time arguments (api_key, base_url, organization, request_timeout, default_headers, extra_body, …) build (and cache) the underlying client.
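
One way to picture the split between per-call and constructor-time arguments is a cache keyed on the constructor arguments. The sketch below is an assumption about how such caching could work, not actllminfer's actual code; FakeClient and get_client are hypothetical names.

```python
import json

_CLIENT_CACHE = {}


class FakeClient:
    """Stand-in for a provider client; holds constructor-time settings only."""

    def __init__(self, provider, settings):
        self.provider = provider
        self.settings = settings


def get_client(provider, **ctor_kwargs):
    # Serialize the kwargs to a stable string so that dict-valued options
    # such as default_headers can take part in the cache key.
    key = (provider, json.dumps(ctor_kwargs, sort_keys=True, default=str))
    if key not in _CLIENT_CACHE:
        _CLIENT_CACHE[key] = FakeClient(provider, dict(ctor_kwargs))
    return _CLIENT_CACHE[key]


a = get_client("openai", base_url="https://example.com/v1", request_timeout=30)
b = get_client("openai", request_timeout=30, base_url="https://example.com/v1")
c = get_client("openai", base_url="https://example.com/v1", request_timeout=60)
print(a is b, a is c)  # prints: True False
```

Per-call knobs such as temperature stay out of the key, so changing them between calls would reuse the same client.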

An embedding() counterpart and async variants (acompletion, aembedding) are provided too:

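import asyncio
from actllminfer import embedding, acompletion

vecs = embedding(model="openai/text-embedding-3-small", input=["hi", "there"])
print(vecs.data[0].embedding[:4])

async def main():
    # await is only valid inside a coroutine, so wrap the async call
    return await acompletion(model="anthropic/claude-sonnet-4-6", messages="Hi")

resp = asyncio.run(main())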

A primary-with-fallbacks Router retries on transient errors so a rate-limited primary transparently falls through to the next provider:

from actllminfer import Router

router = Router([
    "openai/gpt-4o-mini",
    "kimi/moonshot-v1-8k",
    "deepseek/deepseek-chat",
])
resp = router.completion(messages=[{"role": "user", "content": "Hi"}])
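
The fallback behaviour can be sketched in a few lines. This is an illustration of the primary-with-fallbacks idea only, not the Router implementation; TransientError, complete_with_fallbacks, and fake_call are hypothetical names, and which errors the real Router treats as transient is not specified here.

```python
class TransientError(Exception):
    """Hypothetical marker for retryable failures (rate limits, timeouts)."""


def complete_with_fallbacks(models, call, **kwargs):
    """Try each model in order; move on only when the failure is transient."""
    errors = []
    for model in models:
        try:
            return call(model=model, **kwargs)
        except TransientError as exc:
            errors.append((model, str(exc)))
    raise RuntimeError(f"all providers failed: {errors}")


def fake_call(model, messages):
    # Simulate a rate-limited primary provider.
    if model.startswith("openai/"):
        raise TransientError("429 rate limited")
    return f"{model} answered {messages[-1]['content']!r}"


result = complete_with_fallbacks(
    ["openai/gpt-4o-mini", "kimi/moonshot-v1-8k", "deepseek/deepseek-chat"],
    fake_call,
    messages=[{"role": "user", "content": "Hi"}],
)
print(result)  # prints: kimi/moonshot-v1-8k answered 'Hi'
```

Non-transient exceptions (for example a bad API key) propagate immediately in this sketch, since retrying them on another provider would usually hide a configuration problem.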

String spec via the factory

from actllminfer import init_chat_model

llm = init_chat_model("kimi:moonshot-v1-8k", temperature=0)
llm = init_chat_model("glm-4-plus")                 # provider inferred
llm = init_chat_model("deepseek-reasoner")          # provider inferred
llm = init_chat_model("anthropic:claude-sonnet-4-6")
llm = init_chat_model("abab6.5s-chat")              # MiniMax inferred
llm = init_chat_model("hf:meta-llama/Llama-3.3-70B-Instruct")
llm = init_chat_model("nvidia:meta/llama-3.3-70b-instruct")
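
The spec strings accepted by completion() and init_chat_model() follow three rules: an explicit provider:model, an explicit provider/model, or a bare model id whose prefix implies the provider. A rough sketch of that resolution order follows. The provider set and the prefix table are illustrative assumptions, and split_model_spec is a hypothetical helper rather than the package's own parser.

```python
# Provider names seen in this README; the real registry may differ.
KNOWN_PROVIDERS = {
    "openai", "kimi", "moonshot", "glm", "zhipuai", "minimax",
    "deepseek", "anthropic", "hf", "huggingface", "nvidia",
}

# Prefix -> provider table; the entries are assumptions for illustration.
PREFIXES = (
    ("glm-", "zhipuai"),
    ("deepseek-", "deepseek"),
    ("claude-", "anthropic"),
    ("nemotron", "nvidia"),
    ("abab", "minimax"),
)


def split_model_spec(spec):
    """Return (provider, model) for a spec string."""
    # 1. "provider:model" wins, so ids containing "/" survive intact.
    head, sep, tail = spec.partition(":")
    if sep and head.lower() in KNOWN_PROVIDERS:
        return head.lower(), tail
    # 2. "provider/model", only when the head is a known provider, so that
    #    ids like "meta/llama-..." are not misread as provider "meta".
    head, sep, tail = spec.partition("/")
    if sep and head.lower() in KNOWN_PROVIDERS:
        return head.lower(), tail
    # 3. Fall back to prefix inference on the bare model id.
    lowered = spec.lower()
    for prefix, provider in PREFIXES:
        if lowered.startswith(prefix):
            return provider, spec
    raise ValueError(f"cannot infer a provider for {spec!r}")


for spec in ("openai/gpt-4o-mini", "nvidia:meta/llama-3.3-70b-instruct", "glm-4-plus"):
    print(spec, "->", split_model_spec(spec))
```

Checking the colon form first is what lets a spec such as nvidia:meta/llama-3.3-70b-instruct keep the slash inside the model id.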

Hugging Face

ChatHuggingFace defaults to the HF Inference Router (https://router.huggingface.co/v1/chat/completions), which is OpenAI-compatible and dispatches to whichever provider currently serves the model id you pass.

from actllminfer import ChatHuggingFace

llm = ChatHuggingFace(model="Qwen/Qwen2.5-72B-Instruct")
print(llm.invoke("Summarize the theory of relativity in one sentence.").content)

For a dedicated Inference Endpoint, a self-hosted TGI server, or any other OpenAI-compatible deployment, just point base_url at the /v1 root:

llm = ChatHuggingFace(
    model="tgi",  # placeholder; the endpoint already targets a single model
    base_url="https://my-endpoint.example.com/v1",
)

The class accepts HF_TOKEN or the older HUGGINGFACEHUB_API_TOKEN env var.
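
A minimal sketch of that kind of environment-variable fallback is shown here. resolve_api_key is a hypothetical helper, and the precedence (the newer name first) is an assumption rather than documented behaviour.

```python
import os


def resolve_api_key(names, env=None):
    """Return the first non-empty value among the given variable names."""
    env = os.environ if env is None else env
    for name in names:
        value = env.get(name)
        if value:
            return value
    raise KeyError(f"none of {list(names)} is set")


HF_NAMES = ("HF_TOKEN", "HUGGINGFACEHUB_API_TOKEN")

# A dict stands in for os.environ so the example does not depend on the shell.
print(resolve_api_key(HF_NAMES, env={"HUGGINGFACEHUB_API_TOKEN": "hf_old"}))  # prints: hf_old
```

The same helper would cover the NVIDIA_API_KEY / NGC_API_KEY pair by passing a different tuple of names.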

A worked example using Hythcliff/canadian-address-checker-on to validate a batch of Canadian addresses and parse a structured JSON response is in examples/canadian_address_checker.py.

NVIDIA NIM (free serverless inference)

ChatNVIDIA targets NVIDIA's free OpenAI-compatible NIM endpoint at https://integrate.api.nvidia.com/v1/chat/completions. Grab a free nvapi-... key from build.nvidia.com and set NVIDIA_API_KEY (the legacy NGC_API_KEY is also accepted).

from actllminfer import ChatNVIDIA

llm = ChatNVIDIA(model="meta/llama-3.3-70b-instruct", temperature=0.2)
print(llm.invoke("Summarize the theory of relativity in one sentence.").content)

The same key fans out to dozens of hosted models, including Llama 3.x, Mixtral, Nemotron, Qwen, DeepSeek, Phi, and Gemma. To target a self-hosted NIM microservice instead, point base_url at any OpenAI-compatible /v1 root:

llm = ChatNVIDIA(
    model="meta/llama-3.1-8b-instruct",
    base_url="https://my-nim.example.com/v1",
)
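
Because base_url points at a /v1 root, the chat endpoint is presumably that root plus /chat/completions. The sketch below builds (but does not send) such a request with the standard library. build_chat_request is a hypothetical helper, and the Bearer-token header reflects the usual OpenAI-compatible convention rather than anything confirmed about actllminfer's internals.

```python
import json
import urllib.request


def build_chat_request(base_url, model, messages, api_key, **params):
    """Build a POST request for an OpenAI-compatible /v1 root."""
    url = base_url.rstrip("/") + "/chat/completions"
    body = {"model": model, "messages": messages, **params}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_chat_request(
    "https://my-nim.example.com/v1/",  # a trailing slash is tolerated
    "meta/llama-3.1-8b-instruct",
    [{"role": "user", "content": "Hi"}],
    api_key="nvapi-example",  # placeholder, not a real key
    temperature=0.2,
)
print(req.full_url)  # prints: https://my-nim.example.com/v1/chat/completions
```

Nothing is sent over the network here; passing the request to urllib.request.urlopen would perform the call.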

Composable chains (LCEL-style)

from actllminfer import ChatPromptTemplate, StrOutputParser, init_chat_model

llm = init_chat_model("kimi:moonshot-v1-8k")
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful translator."),
    ("user", "Translate to {language}: {text}"),
])
chain = prompt | llm | StrOutputParser()

print(chain.invoke({"language": "French", "text": "Good morning"}))
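
For readers new to the pipe syntax, the composition can be sketched with a tiny class that overloads the | operator. Step is a hypothetical stand-in for the package's Runnable, and the fake model simply upper-cases the prompt, so this shows the plumbing only.

```python
class Step:
    """Wrap a function so that steps compose left to right with |."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # a | b yields a new step that feeds a's output into b.
        return Step(lambda value: other.invoke(self.invoke(value)))


prompt = Step(lambda v: "Translate to {language}: {text}".format(**v))
fake_llm = Step(lambda text: {"content": text.upper()})  # no network call
parser = Step(lambda message: message["content"])

chain = prompt | fake_llm | parser
print(chain.invoke({"language": "French", "text": "Good morning"}))
# prints: TRANSLATE TO FRENCH: GOOD MORNING
```

Each stage only needs an invoke method, which is why prompts, models, and parsers can be mixed freely in a chain.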

Streaming

for chunk in llm.stream("Tell me a short story about a robot."):
    print(chunk.text, end="", flush=True)

Tool / function calling (OpenAI-shaped)

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

llm_with_tools = init_chat_model("kimi:moonshot-v1-8k").with_tools(tools)
ai_msg = llm_with_tools.invoke("What's the weather in Beijing?")
for call in ai_msg.tool_calls:
    print(call.name, call.args)
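
Once the model returns tool calls, the caller still has to run them. The sketch below dispatches calls by name through a registry. It assumes each call exposes name and a dict-valued args, as the loop above suggests; SimpleNamespace objects stand in for the real tool-call objects, and run_tool_calls is a hypothetical helper.

```python
from types import SimpleNamespace


def get_weather(city):
    return f"Sunny in {city}"  # canned value for the sketch


REGISTRY = {"get_weather": get_weather}


def run_tool_calls(tool_calls, registry):
    """Execute each requested tool and collect (name, result) pairs."""
    results = []
    for call in tool_calls:
        func = registry.get(call.name)
        if func is None:
            # Report unknown tools instead of raising, so the model can recover.
            results.append((call.name, f"unknown tool: {call.name}"))
        else:
            results.append((call.name, func(**call.args)))
    return results


fake_calls = [SimpleNamespace(name="get_weather", args={"city": "Beijing"})]
print(run_tool_calls(fake_calls, REGISTRY))
# prints: [('get_weather', 'Sunny in Beijing')]
```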

The same tools=[{"type": "function", ...}] spec is automatically translated to Anthropic's input_schema shape for ChatAnthropic.
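
The translation itself is small: Anthropic's tool format is flat and names the JSON Schema field input_schema instead of parameters. A minimal sketch of the mapping, with to_anthropic_tool as a hypothetical helper name rather than the package's own function:

```python
def to_anthropic_tool(openai_tool):
    """Convert an OpenAI-style function tool into Anthropic's tool shape."""
    fn = openai_tool["function"]
    return {
        "name": fn["name"],
        "description": fn.get("description", ""),
        # Anthropic expects the JSON Schema under "input_schema".
        "input_schema": fn.get("parameters", {"type": "object", "properties": {}}),
    }


weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

print(to_anthropic_tool(weather_tool)["input_schema"]["required"])  # prints: ['city']
```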

JSON output

from actllminfer import JsonOutputParser

chain = prompt | llm | JsonOutputParser()
data = chain.invoke({"language": "JSON", "text": "Return {\"ok\": true} only."})
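
Models often wrap JSON in a Markdown code fence, so a JSON parser usually has to strip that before decoding. The sketch below shows one way to do it; parse_json_reply is a hypothetical helper, and whether JsonOutputParser handles fences the same way is an assumption.

```python
import json
import re

FENCE = "`" * 3  # built at runtime to keep literal fences out of this example
_FENCED = re.compile(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, re.DOTALL)


def parse_json_reply(text):
    """Decode JSON from a model reply, tolerating a surrounding code fence."""
    match = _FENCED.search(text)
    payload = match.group(1) if match else text.strip()
    return json.loads(payload)


plain = '{"ok": true}'
fenced = "Here you go:\n" + FENCE + 'json\n{"ok": true}\n' + FENCE

print(parse_json_reply(plain))   # prints: {'ok': True}
print(parse_json_reply(fenced))  # prints: {'ok': True}
```

Malformed payloads still raise json.JSONDecodeError, which leaves the retry policy to the caller.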

Architecture

actllminfer/
├── messages.py         # BaseMessage, SystemMessage, HumanMessage, AIMessage, ToolMessage, ...
├── outputs.py          # ChatGeneration, ChatResult, ChatGenerationChunk
├── prompts.py          # PromptTemplate, ChatPromptTemplate, MessagesPlaceholder
├── output_parsers.py   # StrOutputParser, JsonOutputParser, CommaSeparatedListOutputParser
├── runnables.py        # Runnable, RunnableLambda, RunnablePassthrough, RunnableSequence
├── callbacks.py        # BaseCallbackHandler, CallbackManager, StdOutCallbackHandler
├── language_models/
│   └── base.py         # BaseChatModel
├── chat_models/
│   ├── _openai_compat.py   # shared OpenAI /v1/chat/completions backend
│   ├── openai.py
│   ├── moonshot.py     # Kimi
│   ├── deepseek.py
│   ├── zhipuai.py      # GLM
│   ├── minimax.py
│   ├── anthropic.py    # Claude (different transport)
│   ├── huggingface.py  # HF Inference Router / TGI / dedicated endpoints
│   └── nvidia.py       # NVIDIA NIM serverless / self-hosted
├── factory.py          # init_chat_model("kimi:moonshot-v1-8k")
└── exceptions.py

Every provider implements the same BaseChatModel contract: invoke, batch, stream, generate, with_tools, bind. That means the actdecor package can keep one code path and switch providers via configuration.
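
The benefit of a shared contract is easiest to see in miniature. In the sketch below, MiniChatModel is a hypothetical, much-reduced stand-in for BaseChatModel (only invoke, batch, and stream), and EchoModel replaces real providers so the example runs offline.

```python
from abc import ABC, abstractmethod


class MiniChatModel(ABC):
    """Reduced contract: subclasses implement invoke, the rest is derived."""

    @abstractmethod
    def invoke(self, prompt):
        ...

    def batch(self, prompts):
        return [self.invoke(p) for p in prompts]

    def stream(self, prompt):
        # Fallback for providers without native streaming: one chunk.
        yield self.invoke(prompt)


class EchoModel(MiniChatModel):
    def __init__(self, tag):
        self.tag = tag

    def invoke(self, prompt):
        return f"[{self.tag}] {prompt}"


MODELS = {"kimi": EchoModel("kimi"), "deepseek": EchoModel("deepseek")}


def answer(config, prompt):
    # The calling code never branches on the provider; only config changes.
    return MODELS[config["provider"]].invoke(prompt)


print(answer({"provider": "kimi"}, "Hi"))      # prints: [kimi] Hi
print(answer({"provider": "deepseek"}, "Hi"))  # prints: [deepseek] Hi
```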

License

Apache 2.0.

Version 0.2.0, released May 19, 2026

Download files

Source Distribution

actllminfer-0.2.0.tar.gz (49.6 kB)

Uploaded May 19, 2026 Source

Built Distribution

actllminfer-0.2.0-py3-none-any.whl (48.6 kB)

Uploaded May 19, 2026 Python 3

File details

Details for the file actllminfer-0.2.0.tar.gz.

File metadata

Download URL: actllminfer-0.2.0.tar.gz
Upload date: May 19, 2026
Size: 49.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.2

File hashes

Hashes for actllminfer-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`eae603986524cc6330f49b123abc3ed29616acb2908da161240e802ebd066e20`
MD5	`3dbb5dacb0a9415d1e0d0d02b8475949`
BLAKE2b-256	`c122b06cbb1a480c0c5260fb946e518e1e1b6dd9f837e9e58b7eb4436c670a97`
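
A downloaded archive can be checked against the SHA256 digest listed above with the standard library. This is a minimal sketch; sha256_of and verify are hypothetical helper names, and the file path in the usage comment is a placeholder for wherever the download was saved.

```python
import hashlib

# SHA256 digest listed above for actllminfer-0.2.0.tar.gz.
EXPECTED_SHA256 = "eae603986524cc6330f49b123abc3ed29616acb2908da161240e802ebd066e20"


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_SHA256):
    """Return True when the file's digest matches the expected hex string."""
    return sha256_of(path) == expected.lower()


# Usage, once the archive has been downloaded:
# verify("actllminfer-0.2.0.tar.gz")
```

pip can enforce the same check at install time through its hash-checking mode (a requirements file with --hash entries and --require-hashes).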

File details

Details for the file actllminfer-0.2.0-py3-none-any.whl.

File metadata

Download URL: actllminfer-0.2.0-py3-none-any.whl
Upload date: May 19, 2026
Size: 48.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.2

File hashes

Hashes for actllminfer-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`028773c3cc372fb1872a71fef90624151e06549c9ced26d98cd08cb9707bc330`
MD5	`68f73534ca9c32a286b56a88b012dbed`
BLAKE2b-256	`d6995e359f484c03899acf83b428fc6bf4bc10e463ab64391dfc57fcdeafdc10`
