Integrating LLMs into structured NLP pipelines

Project description

spacy-llm
Structured NLP with LLMs



This package integrates Large Language Models (LLMs) into spaCy. It features a modular system for fast prototyping and prompting, and for turning unstructured responses into robust outputs for various NLP tasks, with no training data required.

🧠 Motivation

Large Language Models (LLMs) feature powerful natural language understanding capabilities. With only a few (and sometimes no) examples, an LLM can be prompted to perform custom NLP tasks such as text categorization, named entity recognition, coreference resolution, information extraction and more.

spaCy is a well-established library for building systems that need to work with language in various ways. spaCy's built-in components are generally powered by supervised learning or rule-based approaches.

Supervised learning is much worse than LLM prompting for prototyping, but for many tasks it's much better for production. A transformer model that runs comfortably on a single GPU is extremely powerful, and it's likely to be a better choice for any task for which you have a well-defined output. You train the model with anything from a few hundred to a few thousand labelled examples, and it will learn to do exactly that. Efficiency, reliability and control are all better with supervised learning, and accuracy will generally be higher than LLM prompting as well.

spacy-llm lets you have the best of both worlds. You can quickly initialize a pipeline with components powered by LLM prompts, and freely mix in components powered by other approaches. As your project progresses, you can look at replacing some or all of the LLM-powered components as you require.

Of course, there can be components in your system for which the power of an LLM is fully justified. If you want a system that can synthesize information from multiple documents in subtle ways and generate a nuanced summary for you, bigger is better. However, even if your production system needs an LLM for some of the task, that doesn't mean you need an LLM for all of it. Maybe you want to use a cheap text classification model to help you find the texts to summarize, or maybe you want to add a rule-based system to sanity check the output of the summary. These before-and-after tasks are much easier with a mature and well-thought-out library, which is exactly what spaCy provides.

⏳ Install

spacy-llm will be installed automatically in future spaCy versions. For now, you can run the following in the same virtual environment where you already have spacy installed.

python -m pip install spacy-llm

⚠️ This package is still experimental, and changes to the interface may be breaking even in minor version updates.

🐍 Quickstart

Let's run some text classification using a GPT model from OpenAI.

Create a new API key from openai.com or fetch an existing one, and ensure the key is set as an environment variable. For more background information, see the documentation around setting API keys.
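Before running the examples, it can help to confirm that the key is actually visible to the Python process. The following is a minimal sketch, assuming the OpenAI-backed models read the key from an environment variable named OPENAI_API_KEY; the helper missing_env_vars is hypothetical and not part of spacy-llm.

```python
import os

# Assumption: the OpenAI-backed models read the key from OPENAI_API_KEY.
REQUIRED_VARS = ("OPENAI_API_KEY",)


def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Set these environment variables first:", ", ".join(missing))
    else:
        print("All required API keys are set.")
```

Passing a plain dict as environ makes the check easy to try without touching the real environment.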

In Python code

To do some quick experiments, from 0.5.0 onwards you can run:

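import spacy

# Requires spacy-llm to be installed and an OpenAI API key in the environment.
nlp = spacy.blank("en")
# Add the LLM-powered text classification component to the blank pipeline.
llm = nlp.add_pipe("llm_textcat")
# Register the labels the model should choose between.
llm.add_label("INSULT")
llm.add_label("COMPLIMENT")
# Running the pipeline prompts the LLM and stores the scores on doc.cats.
doc = nlp("You look gorgeous!")
print(doc.cats)
# {"COMPLIMENT": 1.0, "INSULT": 0.0}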

By using the llm_textcat factory, the latest version of the built-in textcat task is used, as well as the default GPT-3.5 model from OpenAI.
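Since doc.cats is a plain dictionary mapping each label to a score, downstream code can consume it without any LLM-specific logic. As a minimal sketch, the hypothetical helper top_category below (not part of spaCy or spacy-llm) picks the best label and falls back to None when no score reaches a threshold; the 0.5 default is an arbitrary assumption.

```python
def top_category(cats, threshold=0.5):
    """Return the highest-scoring label, or None if no score reaches the threshold."""
    if not cats:
        return None
    label = max(cats, key=cats.get)
    return label if cats[label] >= threshold else None


# Scores in the shape shown in the quickstart output above.
cats = {"COMPLIMENT": 1.0, "INSULT": 0.0}
print(top_category(cats))
# COMPLIMENT
```

In a real pipeline you would call top_category(doc.cats) after processing a text.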

Using a config file

To control the various parameters of the llm pipeline, we can use spaCy's config system. To start, create a config file config.cfg containing at least the following (or see the full example here):

[nlp]
lang = "en"
pipeline = ["llm"]

[components]

[components.llm]
factory = "llm"

[components.llm.task]
@llm_tasks = "spacy.TextCat.v3"
labels = ["COMPLIMENT", "INSULT"]

[components.llm.model]
@llm_models = "spacy.GPT-4.v2"

Now run:

from spacy_llm.util import assemble

nlp = assemble("config.cfg")
doc = nlp("You look gorgeous!")
print(doc.cats)
# {"COMPLIMENT": 1.0, "INSULT": 0.0}

That's it! There are many other features: prompt templating, more tasks, logging, etc. For more information on how to use those, check out https://spacy.io/api/large-language-models.
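If you experiment with several label sets, the config file shown above can be generated from Python rather than edited by hand. The sketch below is one possible approach using only the standard library: render_config and read_labels are hypothetical helpers, and configparser only approximates spaCy's own config parser, so the read-back step is a sanity check rather than full validation. It assumes config values are JSON-formatted, as they are in the example.

```python
import configparser
import json

# Same structure as the example config.cfg; only the labels are templated.
CONFIG_TEMPLATE = """\
[nlp]
lang = "en"
pipeline = ["llm"]

[components]

[components.llm]
factory = "llm"

[components.llm.task]
@llm_tasks = "spacy.TextCat.v3"
labels = {labels}

[components.llm.model]
@llm_models = "spacy.GPT-4.v2"
"""


def render_config(labels):
    """Fill the template with a sorted, JSON-formatted list of labels."""
    return CONFIG_TEMPLATE.format(labels=json.dumps(sorted(labels)))


def read_labels(config_text):
    """Parse the text back and return the labels as a Python list."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    return json.loads(parser["components.llm.task"]["labels"])


if __name__ == "__main__":
    text = render_config(["INSULT", "COMPLIMENT"])
    # Write the file that assemble("config.cfg") would then load.
    with open("config.cfg", "w", encoding="utf-8") as f:
        f.write(text)
    print(read_labels(text))
```

Keeping the template as a single string makes it easy to compare against the hand-written config when something does not load as expected.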

🚀 Ongoing work

In the near future, we will

  • Add more example tasks
  • Support a broader range of models
  • Provide more example use-cases and tutorials

PRs are always welcome!

📝️ Reporting issues

If you have questions regarding the usage of spacy-llm, or want to give us feedback after giving it a spin, please use the discussion board. Bug reports can be filed on the spaCy issue tracker. Thank you!

Migration guides

Please refer to our migration guide.

Download files

Source Distribution

spacy_llm-0.7.4.tar.gz (150.2 kB)

Uploaded Source

Built Distribution

spacy_llm-0.7.4-py2.py3-none-any.whl (256.1 kB)

Uploaded Python 2, Python 3

File details

Details for the file spacy_llm-0.7.4.tar.gz.

File metadata

  • Download URL: spacy_llm-0.7.4.tar.gz
  • Upload date:
  • Size: 150.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for spacy_llm-0.7.4.tar.gz
Algorithm Hash digest
SHA256 ced03732d180040ee4693c1193981ba316f28eb667d5e7d22dd14176871ad365
MD5 3678f6266756ba4b00b3ed51d8cefbdc
BLAKE2b-256 929b2c7d0c7cf024c3968d3510d9fc099696a52cd373afa14fca9d88367116bd

Provenance

The following attestation bundles were made for spacy_llm-0.7.4.tar.gz:

Publisher: publish.yml on explosion/wheelmonger

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file spacy_llm-0.7.4-py2.py3-none-any.whl.

File metadata

  • Download URL: spacy_llm-0.7.4-py2.py3-none-any.whl
  • Upload date:
  • Size: 256.1 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for spacy_llm-0.7.4-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 65e1a710ebfec9ae27c69e579c7d195b6c83f48bb8204f16e6f0cf8e88341e6e
MD5 a85fdc02c731b83d4d2ef8b8d7a67d07
BLAKE2b-256 1d72fa857ae593da8fd6b4e8fbd59359e7c8adb77ac4d28a5c6e16aec89a8fc7

Provenance

The following attestation bundles were made for spacy_llm-0.7.4-py2.py3-none-any.whl:

Publisher: publish.yml on explosion/wheelmonger

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
