Semlib

Data processing and analysis pipelines with LLMs

Semlib is a Python library for building data processing and data analysis pipelines that leverage the power of large language models (LLMs). Semlib provides, as building blocks, familiar functional programming primitives like map, reduce, sort, and filter, but with a twist: Semlib's implementations of these operations are programmed with natural language descriptions rather than code. Under the hood, Semlib handles complexities such as prompting, parsing, concurrency control, caching, and cost tracking.

pip install semlib

📖 API Reference         🤔 Rationale         💡 Examples

>>> from semlib import Bare, find, map, prompt, sort
>>> presidents = await prompt(
...     "Who were the 39th through 42nd presidents of the United States?",
...     return_type=Bare(list[str])
... )

>>> await sort(presidents, by="right-leaning")
['Jimmy Carter', 'Bill Clinton', 'George H. W. Bush', 'Ronald Reagan']

>>> await find(presidents, by="former actor")
'Ronald Reagan'

>>> await map(
...     presidents,
...     "How old was {} when he took office?",
...     return_type=Bare(int),
... )
[52, 69, 64, 46]
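
The primitives also compose with ordinary Python. For example, reusing the map from above, plain Python can pick out the youngest of these presidents at inauguration (a sketch that uses no API beyond the calls already shown):

>>> ages = await map(
...     presidents,
...     "How old was {} when he took office?",
...     return_type=Bare(int),
... )
>>> min(zip(ages, presidents))  # ordinary Python over the LLM-extracted data
(46, 'Bill Clinton')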

Rationale

Large language models are great at natural-language data processing and data analysis tasks, but when you have a large amount of data, you can't get high-quality results by just dumping it all into a long-context LLM and asking the model to complete a complex task in a single shot. Even today's reasoning models and agents fall short with this approach.

This library provides an alternative. You can structure your computation using the building blocks that Semlib provides: functional programming primitives upgraded to handle semantic operations. This approach has a number of benefits.

Quality. By breaking down a sophisticated data processing task into simpler steps that today's LLMs can solve reliably, you can get higher-quality results, even in situations where an LLM could process the data in a single shot but would produce barely acceptable results. (example: analyzing support tickets in Airline Support Report)

Feasibility. Even long-context LLMs have finite context windows (e.g., 1M tokens in today's frontier models), and performance often degrades as inputs grow longer. By breaking the data processing task into smaller steps, you can handle arbitrarily sized data. (example: sorting an arbitrary number of arXiv papers in arXiv Paper Recommendations)

Latency. By breaking the computation into smaller pieces and structuring it with functional programming primitives like map and reduce, independent parts can run concurrently, reducing the latency of the overall computation. (example: tree reduce with O(log n) computation depth in Disneyland Reviews Synthesis)
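
To make the depth argument concrete, here is a minimal sketch of a tree reduce in plain asyncio (this illustrates the idea, not Semlib's internals; combine stands in for an LLM call):

import asyncio

async def tree_reduce(items, combine):
    # Each round merges adjacent pairs concurrently, so n items take
    # O(log n) sequential rounds instead of the O(n) of a left fold.
    items = list(items)
    while len(items) > 1:
        merged = await asyncio.gather(
            *(combine(a, b) for a, b in zip(items[0::2], items[1::2]))
        )
        if len(items) % 2 == 1:  # carry an unpaired item into the next round
            merged.append(items[-1])
        items = merged
    return items[0]

async def combine(a, b):  # stand-in for an LLM summarization call
    await asyncio.sleep(1)
    return f"({a}+{b})"

# 8 leaves finish in ~3 simulated seconds (3 rounds) instead of ~7 (a fold).
print(asyncio.run(tree_reduce(list("abcdefgh"), combine)))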

Cost. By breaking down the computation into simpler sub-tasks, you can use smaller and cheaper models that are capable of solving those sub-tasks, reducing data processing costs. You can also choose the model on a per-subtask basis, optimizing cost even further. (example: using gpt-4.1-nano for the pre-filtering step in arXiv Paper Recommendations)
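
As an illustration of per-subtask model choice, here is a hedged sketch: the model= keyword below is a hypothetical stand-in for whatever per-operation model configuration the API Reference documents, and the model names are only examples:

>>> # hypothetical model= argument: a cheap model for the easy extraction...
>>> ages = await map(
...     presidents,
...     "How old was {} when he took office?",
...     return_type=Bare(int),
...     model="openai/gpt-4.1-nano",
... )
>>> # ...and a stronger model for the harder comparison-based sort
>>> await sort(presidents, by="right-leaning", model="openai/gpt-4o")
['Jimmy Carter', 'Bill Clinton', 'George H. W. Bush', 'Ronald Reagan']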

Security. By breaking down the computation into tasks that simpler models can handle, you can use open models that you host yourself, allowing you to process sensitive data without having to trust a third party. (example: using gpt-oss and qwen3 in Resume Filtering)

Flexibility. LLMs are great at certain tasks, like natural-language processing. They're not so great at other tasks, like multiplying numbers. Using Semlib, you can break your data processing task into multiple steps, some of which use LLMs and others of which use regular old Python code, getting the best of both worlds. (example: Python code for filtering in Resume Filtering)
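
As a sketch of that mix: exact string matching is a job for a list comprehension, and only the natural-language question goes to the LLM (Bare(str) is assumed here, following the pattern of the examples above):

>>> others = [p for p in presidents if "Bush" not in p]  # plain Python filter
>>> await map(others, "What party did {} belong to?", return_type=Bare(str))
['Democratic', 'Republican', 'Democratic']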

Read more about the rationale, the story behind this library, and related work in the blog post.

Citation

If you use Semlib in any way in academic work, please cite the following:

@misc{athalye:semlib,
  author = {Anish Athalye},
  title = {{Semlib}: LLM-powered data processing for {Python}},
  year = {2025},
  howpublished = {\url{https://github.com/anishathalye/semlib}},
}

License

Copyright (c) Anish Athalye. Released under the MIT License. See LICENSE.md for details.
