A configurable local chatbot library with lightweight memory indexing.
llama-simple-chat-bot
A small Python library for configurable local chatbots.
llama-simple-chat-bot runs an open-source language model on the local machine and
adds a lightweight persistent memory index. It is designed for Windows and Linux
on amd64 machines, and it does not require a GPU. The default runtime path uses
GGUF models through llama-cpp-python with n_gpu_layers set to 0.
[!NOTE] The library does not call remote OpenAI APIs. Model inference is local, while memory indexing is handled with lightweight files on disk.
Features
- JSON bot profiles: name, description, personality, birthday, skills, species, model settings, and memory directory.
- Local LLM backends: llama_cpp_python, llama_cpp_cli, and a deterministic echo backend for tests.
- Persistent memory: every exchange is logged, split into segments, summarized, indexed, and searched during future conversations.
- Associative recall: related past segments can be injected into the prompt as context before the model answers.
- CLI and Python API.
- No required third-party dependencies for the core package. Local inference is
available through the optional local extra.
Recommended Local Models
These presets are intentionally small enough for local GGUF use, with a few better-quality options for slower but more reliable CPU chat:
- qwen2.5-0.5b-instruct-q4_k_m: about 491 MB, multilingual, good default for Chinese and English.
- qwen2.5-1.5b-instruct-q4_k_m: about 1120 MB, much better than 0.5B for identity stability, memory use, and ordinary chat quality.
- qwen2.5-3b-instruct-q4_k_m: about 2100 MB, a stronger choice for roleplay, Chinese chat, and basic reasoning if you can accept slower CPU inference.
- smollm2-360m-instruct-q4_k_m: about 271 MB, very small and fast for quick experiments.
The project can also use any local GGUF file supported by llama.cpp.
[!TIP] If you care about actual chat quality, start with qwen2.5-1.5b-instruct-q4_k_m. Use qwen2.5-3b-instruct-q4_k_m when role consistency and answer quality matter more than speed. Keep qwen2.5-0.5b-instruct-q4_k_m for lightweight testing, and smollm2-360m-instruct-q4_k_m only for very small experiments.
Setup
Create a virtual environment before installing optional local inference dependencies.
[!IMPORTANT] Install optional dependencies inside a virtual environment. The project does not require modifying your base Python environment.
The recommended CPU-only path installs the core package first. The first real
local-model run then installs the prebuilt CPU llama-cpp-python wheel
automatically if it is missing. On Windows PowerShell:
python -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install -U pip
python -m pip install -e .
On Linux:
python3 -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
python -m pip install -e .
[!TIP] The automatic installer runs python -m pip install -r requirements-local-cpu.txt. That requirements file uses https://abetlen.github.io/llama-cpp-python/whl/cpu plus --only-binary llama-cpp-python. If no compatible wheel exists for your Python version and platform, pip fails instead of starting a slow local build.
If the CPU wheel is unavailable on your machine, either install llama.cpp
separately and set "backend": "llama_cpp_cli" in the config, or use
python -m pip install -e '.[local]' when you intentionally want to build
llama-cpp-python from source.
[!NOTE] The local extra is kept for packaging compatibility, but the documented quick path uses requirements-local-cpu.txt because pip dependency metadata cannot store a custom wheel index URL.
[!WARNING] CPU-only inference is usable with small GGUF models, but it is still slower than GPU inference. Keep model.n_gpu_layers at 0 when the machine has no compatible GPU.
Quick Start
Write an example config:
llama-simple-chat-bot init-config examples/my_bot.json
List model presets:
llama-simple-chat-bot models
Download a small GGUF model:
llama-simple-chat-bot download-model qwen2.5-0.5b-instruct-q4_k_m --models-dir models
For a better local chat model:
llama-simple-chat-bot download-model qwen2.5-1.5b-instruct-q4_k_m --models-dir models
Or a stronger 3B preset:
llama-simple-chat-bot download-model qwen2.5-3b-instruct-q4_k_m --models-dir models
Start chatting:
llama-simple-chat-bot chat --config examples/my_bot.json
Shortest Start
[!TIP] Use this section when you only want the shortest path from a fresh checkout to a running bot.
For a real local model run:
python -m venv .venv
source .venv/bin/activate
python -m pip install -e .
llama-simple-chat-bot init-config bot.json
llama-simple-chat-bot download-model qwen2.5-0.5b-instruct-q4_k_m --models-dir models
llama-simple-chat-bot chat --config bot.json
On Windows PowerShell, use:
python -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install -e .
llama-simple-chat-bot init-config bot.json
llama-simple-chat-bot download-model qwen2.5-0.5b-instruct-q4_k_m --models-dir models
llama-simple-chat-bot chat --config bot.json
If you want a more usable default on CPU, replace the download step with:
llama-simple-chat-bot download-model qwen2.5-1.5b-instruct-q4_k_m --models-dir models
For a dependency-free smoke test that does not load a model:
python -m pip install -e .
llama-simple-chat-bot ask --config examples/echo_config.json "hello"
[!NOTE] The echo backend is only a deterministic smoke-test backend. It verifies the CLI, config loading, and memory plumbing without loading a language model.
Example Profiles
The examples/ directory includes ready-to-edit bot profiles:
- examples/bot_config.json: general local assistant.
- examples/nekomimi_config.json: Chinese-language catgirl chat companion.
- examples/coding_mentor_config.json: pragmatic coding mentor.
- examples/study_partner_config.json: structured study partner.
- examples/storyteller_config.json: collaborative fiction and worldbuilding companion.
- examples/echo_config.json: dependency-free smoke-test bot.
Run any profile with:
llama-simple-chat-bot chat --config examples/nekomimi_config.json
To debug memory retrieval while chatting, add --verbose:
llama-simple-chat-bot chat --config examples/nekomimi_config.json --verbose
Verbose mode prints diagnostics before the model starts generating: the memory index path and detected encoding, the query terms, whether the turn used recent-overview or keyword retrieval, scored candidate segments, selected prompt hits, and the recalled memory block injected into the model context. The same diagnostics are available for one-shot asks:
llama-simple-chat-bot ask --config examples/my_bot.json --verbose "What do you remember about Python packaging?"
Send one message:
llama-simple-chat-bot ask --config examples/my_bot.json "What do you remember about me?"
Search memory without loading the model:
llama-simple-chat-bot memory-search --config examples/my_bot.json "Python packaging"
JSON Config
Important fields:
- name, description, personality, birthday, skills, and species are injected at runtime as authoritative system rules, so the bot knows its configured identity.
- memory_dir controls where index.json and segment .jsonl files are stored.
- model.backend selects llama_cpp_python, llama_cpp_cli, or echo.
- model.model_path points to a local GGUF file.
- Preset downloads are available through llama-simple-chat-bot download-model for qwen2.5-0.5b-instruct-q4_k_m, qwen2.5-1.5b-instruct-q4_k_m, qwen2.5-3b-instruct-q4_k_m, and smollm2-360m-instruct-q4_k_m.
- model.n_gpu_layers defaults to 0, which keeps inference on CPU.
- system_rules is the place to enforce speech style and role constraints, such as asking a catgirl profile to naturally end replies with 喵 ("meow").
- memory.segment_exchange_limit controls when a new memory segment starts.
- memory.summary_mode can be extractive or llm. extractive is faster; llm asks the local model to rewrite the segment summary.
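As a rough illustration of how the fields listed above might fit together, the sketch below builds a profile as a Python dict and writes it as UTF-8 JSON. The field names come from this README, but the nesting, the value types (for example system_rules as a list of strings), the segment limit value, and the write_profile helper are assumptions. Run llama-simple-chat-bot init-config to see the real layout.

```python
import json
from pathlib import Path

# Hypothetical profile. Field names follow the README; the nesting and the
# value types shown here are assumptions, not the library's confirmed schema.
profile = {
    "name": "Mira",
    "description": "A practical local assistant with persistent memory.",
    "personality": "warm, curious, and concise",
    "birthday": "2026-06-02",
    "skills": ["Python", "summarization"],
    "species": "local digital companion",
    "memory_dir": "./memory/mira",
    "system_rules": ["Answer briefly."],  # assumed to be a list of strings
    "model": {
        "backend": "llama_cpp_python",
        "model_path": "./models/qwen2.5-0.5b-instruct-q4_k_m.gguf",
        "n_gpu_layers": 0,  # 0 keeps inference on the CPU
    },
    "memory": {
        "segment_exchange_limit": 8,  # illustrative value only
        "summary_mode": "extractive",
    },
}


def write_profile(path, data):
    """Write the profile as UTF-8 JSON and return the resulting Path."""
    path = Path(path)
    path.write_text(
        json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return path
```

Calling write_profile("bot.json", profile) would produce a file that can then be compared against the output of init-config before use.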
Relative paths inside a config file are resolved relative to that config file.
JSON config files can be encoded as UTF-8, UTF-8 with BOM, GB2312, or GBK.
Memory index.json and segment .jsonl files are read with the same encoding
fallbacks and are written back as UTF-8.
[!NOTE] GB2312 and GBK support is intended for Chinese JSON config files produced by older Windows editors or tooling.
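The path and encoding rules above can be sketched with the standard library alone. This is a minimal illustration, not the library's actual loader: the function names and the fallback order are assumptions. It relies on two facts about Python codecs: "utf-8-sig" accepts UTF-8 with or without a BOM, and GBK is a superset of GB2312, so one GBK attempt covers both.

```python
import json
from pathlib import Path

# Assumed fallback order; the library's real order may differ.
FALLBACK_ENCODINGS = ("utf-8-sig", "gbk")


def load_json_with_fallback(path):
    """Return (data, encoding) using the first encoding that decodes cleanly."""
    raw = Path(path).read_bytes()
    for encoding in FALLBACK_ENCODINGS:
        try:
            return json.loads(raw.decode(encoding)), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError(f"Could not decode {path} with {FALLBACK_ENCODINGS}")


def resolve_relative(config_path, value):
    """Resolve a path from the config relative to the config file's directory."""
    candidate = Path(value)
    if candidate.is_absolute():
        return candidate
    return (Path(config_path).parent / candidate).resolve()
```

Trial decoding is a heuristic: some GBK byte sequences are also valid UTF-8, so a short GBK file can occasionally be misread. Saving configs as UTF-8 avoids the ambiguity.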
Python API
from llama_simple_chat_bot import BotConfig, ChatBot
config = BotConfig.from_file("examples/bot_config.json")
bot = ChatBot(config)
reply = bot.ask("Remember that I prefer SQLite for small apps.")
print(reply)
for hit in bot.search_memory("SQLite"):
    print(hit.summary)
You can also build the config in code:
from llama_simple_chat_bot import BotConfig, ChatBot, MemoryConfig, ModelConfig
config = BotConfig(
    name="Mira",
    description="A practical local assistant with persistent memory.",
    personality="warm, curious, and concise",
    birthday="2026-06-02",
    skills=["Python", "summarization"],
    species="local digital companion",
    memory_dir="./memory/mira",
    model=ModelConfig(
        backend="llama_cpp_python",
        model_path="./models/qwen2.5-0.5b-instruct-q4_k_m.gguf",
        chat_format="chatml",
        n_gpu_layers=0,
    ),
    memory=MemoryConfig(summary_mode="extractive"),
)
bot = ChatBot(config)
print(bot.ask("Hello."))
Memory Layout
The configured memory directory contains:
- index.json: all conversation log entries plus segment metadata, summaries, keywords, and log file references.
- segments/*.jsonl: append-only per-segment logs.
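To make "append-only per-segment logs" concrete, here is a minimal sketch of writing and reading a JSONL segment file. The record shape ({"user": ..., "bot": ...}), the file naming, and both helper names are assumptions for illustration. The library's real on-disk format may differ.

```python
import json
from pathlib import Path


def append_exchange(memory_dir, segment_id, user_text, bot_text):
    """Append one exchange as a JSON line to segments/<segment_id>.jsonl."""
    segment_path = Path(memory_dir) / "segments" / f"{segment_id}.jsonl"
    segment_path.parent.mkdir(parents=True, exist_ok=True)
    record = {"user": user_text, "bot": bot_text}  # assumed record shape
    # Mode "a" only adds lines, so earlier exchanges are never rewritten.
    with segment_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return segment_path


def read_segment(segment_path):
    """Read every exchange back from a segment file, in write order."""
    with Path(segment_path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

One JSON object per line means a segment can be extended without parsing or rewriting what is already on disk.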
At response time, the bot searches the index for direct matches and related associative matches, formats the best hits, and injects them into the local model's system context.
For broad questions like "what did we talk about before?", memory recall uses recent segments as an overview. For questions with a concrete topic, such as "did we talk about SQLite?", it uses keyword retrieval so old but relevant segments can beat newer unrelated chats.
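The two recall modes described above can be sketched as follows. This is a simplified illustration, not the library's scoring code: the hint phrases, the stopword list, the segment dict shape, and the recall function are all assumptions.

```python
import re

# Hypothetical phrases that mark a broad "overview" question.
OVERVIEW_HINTS = ("talk about before", "what do you remember", "last time")
STOPWORDS = {"did", "we", "talk", "about", "the", "a", "what", "do", "you"}


def query_terms(question):
    """Lowercase the question and keep the non-stopword tokens."""
    tokens = re.findall(r"\w+", question.lower())
    return [token for token in tokens if token not in STOPWORDS]


def recall(question, segments, limit=2):
    """Return (mode, hits) for a question.

    `segments` is a list of dicts ordered oldest to newest, each with
    "summary" and "keywords" keys (an assumed shape).
    """
    lowered = question.lower()
    if any(hint in lowered for hint in OVERVIEW_HINTS):
        # Broad question: newest segments first, as an overview.
        return "recent-overview", segments[-limit:][::-1]
    terms = set(query_terms(question))
    scored = []
    for position, segment in enumerate(segments):
        score = len(terms & {k.lower() for k in segment["keywords"]})
        if score > 0:
            scored.append((score, position, segment))
    # Keyword overlap ranks first; recency (position) only breaks ties,
    # so an old but relevant segment beats a newer unrelated one.
    scored.sort(key=lambda item: (item[0], item[1]), reverse=True)
    return "keyword", [segment for _, _, segment in scored[:limit]]
```

With three segments about SQLite, hiking, and films (oldest to newest), the question "Did we talk about SQLite?" selects only the oldest segment, while "What did we talk about before?" returns the two newest.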
[!WARNING] Memory files contain conversation content. Do not commit real user memory directories, downloaded model files, or private chat logs.
Tests
The test suite uses only the built-in unittest module and the echo backend:
python -m unittest
Acknowledgements
This project builds on the local inference ecosystem around llama.cpp, the
Python bindings provided by llama-cpp-python, open GGUF model releases from
the Qwen and SmolLM communities, and the Python standard library.