Skip to main content

A tool FOR an AI (no API key, no LLM): search existing math over a 1.6M-doc index + airtight numeric/Lean verification + OEIS/PSLQ identification + needs<->guarantees scaffolds, served over MCP.

Project description

mathlas

An airtight-math tool an AI uses — no LLM, no API key, free. Plug it into Claude Code, Cursor, or any MCP client. The AI is the brain; mathlas is the hands — it gives the AI the capabilities it lacks and returns data (candidates, verdicts, checklists, scaffolds) for the AI to reason over. PolyForm Noncommercial 1.0.0 — noncommercial use only (no commercial use). Mostly-pure-Python.

mathlas is a tool that an AI uses, not a tool that uses an AI — it never calls an LLM and needs no API key, so it is free and pluggable everywhere. Most solvable problems stay unsolved not because the formula is missing, but because nobody connected the right existing result to the problem. An AI can do that connecting — if it has the right tool. mathlas does the parts an AI can't do reliably on its own: search over its own large math index, airtight numeric/formal verification (incl. a real Lean kernel check), exact OEIS identification, structured needs↔guarantees scaffolds, honest provenance (never "novel"), plus a discovery + web-augmentation layer that lets the AI grow the index at runtime.

Results

The discipline is airtight-or-nothing: a result is an independently-checkable fact or an honest "nothing." Each verification tier feeds both a known input (expect the correct verified result) and a structureless input (expect an honest "nothing") — the false-positive rate is 0 across every tier (full tables + commands in RESULTS.md):

Tier Recovery@known False-positive Why it's airtight Benchmark
Numeric (identify_constant) 8/8 0/3 independent high-precision re-eval (50–51 digits) benchmarks/numeric_bench.py
Sequence (identify_sequence) 8/8 (all top-1) 0/3 exact term-match vs local OEIS (~400k seqs) benchmarks/tier_bench.py
Formal (verify_formal) 7/7 verdicts real Lean 4.30 kernel typecheck benchmarks/tier_bench.py
Ramanujan (conjecture_relation) 6/6 0/2 PSLQ + CF, every hit re-verified ≥25 digits benchmarks/tier_bench.py
Applicability moat 15/15 decomp + 6/6 catch atomic preconditions, misapplication traps benchmarks/moat_bench.py
FunSearch + web-aug 14/14 sandbox containment (network / timeout / memory) benchmarks/tools_bench.py

The 1.635M-doc index — retrieval at scale. search_existing_math is served from a built 1,635,233-document exact dense index (Qwen3-Embedding-8B, 4096-d, over the permissive CC-BY/CC0 TheoremSearch subset + arXiv-math from Dolma + Stacks + ProofWiki), dense + Okapi-BM25 + RRF. On the held-out 81,833-document test split, querying each theorem by its natural-language slogan retrieves its own entry at R@1 0.977 / R@10 0.998 (and R@10 0.923 querying by the raw formal statement — cross-representation).

The self-augmenting loop — closing the coverage gap to beat everyone

This is the demonstration that mathlas's add_finding dense path is a real, decisive runtime mechanism. On TheoremSearch's own 110 human-written queries, baseline mathlas (corpus-only) hits a hard coverage floor: TheoremSearch open-sourced only ~15% of their private 9.2M corpus, so 95 of the 110 target papers are non-permissive arXiv they withheld — unreachable for any open system. The AI then runs the loop: for each missing theorem it web-finds the real statement, embeds it with the same Qwen3-Embedding-8B, and add_finding(dense_vec=…) so it RRF-fuses through the dense channel. That repairs the gap and beats every baseline:

Method theorem Hit@20 paper Hit@20
Google (site:arxiv.org) 37.8%
ChatGPT 5.2 w/ Search 19.8%
Gemini 3 Pro 27.0%
TheoremSearch (Qwen3-8B, full private 9.2M) 45.0% 56.8%
mathlas — baseline (corpus-only, the coverage floor) 10.0% 13.6%
mathlas — after the self-augmenting WEB loop 59.1% (65/110) 70.0% (77/110)

This is the loop's value, not a native-corpus claim. The 10.0% floor exists because TheoremSearch withheld 85% of their corpus; the loop repairs that coverage. On the reachable subset our retrieval is merely on par with TheoremSearch — we make no native-superiority claim. The result proves that add_finding lets an AI grow the live index at runtime, decisively. Honesty audit passed — zero query-injection: no finding's text contains the query; the slogans are real theorem prose and the queries are paraphrases, so the dense channel is what bridges them. Reproduce with benchmarks/webaug_110_bench.py (see RESULTS.md §3c).

The 13 tools (all NO-LLM, returning data)

search_existing_math ─▶ mapping_scaffold + applicability_checklist ─▶ (AI judges) ─▶ verify_numeric / verify_formal
   (own index)            (needs↔guarantees, no LLM)                                  (airtight)
Tool What it does Airtight?
search_existing_math(query, k) query → ranked existing results from our own 1.635M-doc dense + BM25 + RRF index retrieval
identify_constant(value, basis?) a real value → a known closed form + provenance ✅ independent high-precision re-eval
identify_sequence(terms) an integer sequence → matching OEIS entries (A-number, name, URL) ✅ exact term-match vs local OEIS
verify_numeric(value, closed_form) digit-agreement verdict ✅ different engine, higher precision
verify_formal(statement, lean?) runs the real Lean kernel on a snippet → typechecks? (else honest UNDETERMINED) ✅ real kernel check
applicability_checklist(statement) the result's hypotheses as an atomic checklist for the AI to mark heuristic parse, no LLM
mapping_scaffold(problem, statement) the needs↔guarantees questions + fill-in template structured, no LLM
conjecture_relation(value, ...) Ramanujan Machine: PSLQ over a rich basis + continued-fraction / recurrence conjectures ✅ every candidate numerically verified
funsearch_evaluate / _register / _status FunSearch harness: sandbox-score an AI-written program, store in a MAP-Elites DB, return the few-shot to write a better one deterministic sandbox, no LLM
search_directive(problem) web-search plan: arXiv queries + sub-fields + named results + which mathlas tools to also run structured, no LLM
add_finding(statement, slogan, source) ingest a web-found result into the live corpus → retrievable via search_existing_math provenance web_added

Install & register with Claude Code (no API key)

pip install -e .                 # core: numeric + retrieval + verify + scaffolds — NO LLM, NO API key
pip install -e '.[mcp]'          # + official MCP SDK (a dep-free stdio fallback also ships built in)
pip install -e '.[retrieve]'     # + pyarrow, to read the open theorem dataset (real index)
pip install -e '.[embed]'        # + sentence-transformers/torch, for the Qwen3 embedder (offline GPU)

claude mcp add mathlas -- python -m mathlas.server

That's it — mathlas now appears as thirteen tools the agent can call. The server prefers the official mcp SDK and falls back to a dependency-free stdio JSON-RPC server if mcp isn't installed, so it always runs. (Cursor / any MCP client: point it at the same python -m mathlas.server stdio command.)

Optional local data (gitignored, degrades honestly): identify_sequence wants a local OEIS copy; verify_formal wants a Lean toolchain. Without them the tools return a clear "data/toolchain not available" note — never a fake answer. See RESULTS.md for the one-line setup of each.

A worked example — an AI using the tools

User:  "Does x = cos(x) have a unique solution I can reach by iterating?"

AI →   search_existing_math("contraction mapping unique fixed point complete metric space")
       ← [{name:"Banach Fixed-Point Theorem", statement:"Let (X,d) be a complete metric
            space and T a contraction. Then T has a unique fixed point ...", ...}, ...]
AI →   applicability_checklist(banach.statement)
       ← preconditions: ["(X,d) is a complete metric space", "T: X→X is a contraction"]
          conclusion:    "T has a unique fixed point"
AI  (reasons): [0,1] is complete; cos is a contraction there (|cos'|=|sin|≤sin 1<1).
       Every precondition holds ⇒ Banach applies ⇒ unique fixed point, reachable by iteration.
AI →   verify_numeric("0.7390851332151607", "<the Dottie-number closed form, if claimed>")

mathlas supplied the search, the checklist, and the airtight numeric check; the AI did the judging. No LLM was called inside mathlas.

Without an MCP client — CLI / Python (still no LLM)

mathlas 1.6449340668482264364724151666460251892   # 1.64493... -> pi**2/6  [verified 51 digits]
mathlas 1,1,2,3,5,8,13,21                          # A000045  Fibonacci numbers  https://oeis.org/A000045
mathlas "a bounded sequence has a convergent subsequence" --k 5   # search + scaffold/checklist
mathlas mcp                                                        # run the MCP server
import mpmath
from mathlas import identify, identify_sequence, mapping_scaffold, applicability_checklist
print(identify(mpmath.zeta(2)))            # 1.64493406684823 -> pi**2/6 [verified 51 digits]
print(identify_sequence([1,1,2,3,5,8,13,21]).matches[1].a_number)  # 'A000045' (needs local OEIS)

Docs

Positioning

The closest system, TheoremSearch (UW Math AI Lab), is recall-optimized retrieval only — it finds the statement you already formulated and never checks applicability, routes across tools, accepts numeric inputs, or labels provenance. mathlas adds exactly those, plus the design that makes it composable: it is a tool an AI plugs in, not a closed lab agent or an LLM wrapper. TheoremSearch is reference-only — we reuse just their openly-licensed (CC-BY/CC0) dataset as raw data to build our own index, not their API, MCP, index, or code.

A secondary helper, solve(problem, retriever, llm=…), will run the needs↔guarantees loop for you if you supply your own LLM (subclass LLM, any provider). mathlas ships no vendor SDK and no default model, so the package still imports and runs with zero API key — this path is convenience only; the primary interface is the MCP tools.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

mathlas_mcp-1.0.0.tar.gz (100.9 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

mathlas_mcp-1.0.0-py3-none-any.whl (102.5 kB view details)

Uploaded Python 3

File details

Details for the file mathlas_mcp-1.0.0.tar.gz.

File metadata

  • Download URL: mathlas_mcp-1.0.0.tar.gz
  • Upload date:
  • Size: 100.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for mathlas_mcp-1.0.0.tar.gz
Algorithm Hash digest
SHA256 28adf920324a02201360a4e4b45202367f1f3b7914918b81c485f89f90292fbd
MD5 ccd9c8278f8df27b522536a3924637b6
BLAKE2b-256 23132c3fb239c695bf2278af3a370744dd442f0b873fac7ecfb679c7102c061e

See more details on using hashes here.

File details

Details for the file mathlas_mcp-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: mathlas_mcp-1.0.0-py3-none-any.whl
  • Upload date:
  • Size: 102.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for mathlas_mcp-1.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 248774170e6714a7a0417fbbf932f173b6e42c54bec787e99936a5bc3d2d6ae1
MD5 1796fe84857e6f7241c5dfed6bdc9cf0
BLAKE2b-256 85b2c7975fe0a310a22527da53957857e060ecd4a2b11126709e633a4cf53007

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page