🚀 Unified NLP Pipelines for Language Models

Langformers

Langformers is a flexible and user-friendly library that unifies NLP pipelines for both Large Language Models (LLMs) and Masked Language Models (MLMs) into one simple API.

Why Langformers? Chat, build, train, label, and embed — faster than ever.

Whether you're generating text, training classifiers, labelling data, embedding sentences, or building a semantic search index... the API stays consistent:

from langformers import tasks

component = tasks.create_<something>(...)
component.<do_something>()

No need to juggle different frameworks — Langformers wraps Hugging Face Transformers, SentenceTransformers, Ollama, FAISS, ChromaDB, Pinecone, and more under one unified interface.

Use the same pattern everywhere:

tasks.create_generator(...)  # Chatting with LLMs
tasks.create_labeller(...)   # Data labelling using LLMs
tasks.create_embedder(...)   # Embedding Sentences
tasks.create_classifier(...) # Training a Text Classifier
tasks.create_tokenizer(...)  # Training a Custom Tokenizer
tasks.create_mlm(...)        # Pretraining an MLM
tasks.create_searcher(...)   # Vector Database search
tasks.create_mimicker(...)   # Knowledge Distillation

Supported Tasks

Generative LLMs (e.g., Llama, Mistral, DeepSeek)

Seamless Chat with LLMs
LLM Inference via API
Data Labelling with LLMs

Masked Language Models (e.g., RoBERTa)

Train Text Classifiers
Pretrain MLMs from scratch
Continue Pretraining on custom data

Embeddings & Search (e.g., Sentence Transformers, FAISS, Pinecone)

Embed Sentences
Semantic Search
Mimic a Pretrained Model (Knowledge Distillation)

Installation

pip install -U langformers
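
After installing, it can be handy to confirm which version ended up in the active environment. The snippet below is a minimal sketch that uses only the Python standard library (`importlib.metadata`, Python 3.8+); the helper name `installed_version` is made up for this example and is not part of Langformers.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Prints e.g. "langformers: <version>" or "langformers: None".
    print("langformers:", installed_version("langformers"))
```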

Quick Start

For more advanced use, refer to the documentation at https://langformers.com.

To get started quickly, below are some example use cases of Langformers.

Conversational AI

Run this code as a Python script (e.g., chat.py).

# Import langformers
from langformers import tasks

# Create a generator
generator = tasks.create_generator(provider="ollama", model_name="llama3.1:8b")

# Run the generator
generator.run(host="0.0.0.0", port=8000)

Open http://localhost:8000 in your browser (binding to 0.0.0.0 makes the server listen on all interfaces), or use the specific host and port you provided, to chat with the LLM.

To perform LLM inference through a REST API instead of the chat interface, send a POST request to the host:port/api/generate endpoint. This is useful when you're building your own application.

The host:port/api/generate endpoint accepts a JSON payload with the following fields:

{
  "system_prompt": "You are an Aussie AI assistant, reply in an Aussie way.",
  "memory_k": 10,
  "temperature": 0.5,
  "top_p": 1,
  "max_length": 5000,
  "prompt": "Hi"
}
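
As a rough illustration of calling the endpoint from Python, the sketch below builds the POST request with the standard library only. It assumes the server is reachable over plain `http://` and expects a JSON body with a `Content-Type: application/json` header. The helper name `build_generate_request` is hypothetical, and the shape of the response is not described here, so the sending step is left commented out.

```python
import json
import urllib.request


def build_generate_request(host: str, port: int, prompt: str, **options) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /api/generate endpoint."""
    payload = {"prompt": prompt, **options}
    return urllib.request.Request(
        url=f"http://{host}:{port}/api/generate",  # assumption: plain HTTP
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumption: JSON body
        method="POST",
    )


request = build_generate_request(
    "localhost", 8000, "Hi",
    system_prompt="You are an Aussie AI assistant, reply in an Aussie way.",
    memory_k=10, temperature=0.5, top_p=1, max_length=5000,
)

# To send it while the generator is running:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```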

Data Labelling with LLMs

Generative LLMs are highly effective for data labelling, not just conversation. Langformers offers a simple way to define labels and conditions for labelling texts with LLMs.

# Import langformers
from langformers import tasks

# Load an LLM as a data labeller
labeller = tasks.create_labeller(provider="huggingface", model_name="meta-llama/Meta-Llama-3-8B-Instruct", multi_label=False)

# Provide labels and conditions
conditions = {
    "Positive": "The text expresses a positive sentiment.",
    "Negative": "The text expresses a negative sentiment.",
    "Neutral": "The text does not express any emotions."
}

# Label a text
text = "No doubt, The Shawshank Redemption is a cinematic masterpiece."
labeller.label(text, conditions)
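
To label more than one text, the same call can simply be repeated. The helper below is a minimal sketch: `label_many` is a name invented for this example, and it relies only on the `labeller.label(text, conditions)` call shown above. What `label()` returns is not specified here, so the helper passes the result through unchanged.

```python
from typing import Any, Dict, Iterable, List, Tuple


def label_many(labeller: Any, texts: Iterable[str], conditions: Dict[str, str]) -> List[Tuple[str, Any]]:
    """Call labeller.label(text, conditions) once per text and pair each text with its result."""
    return [(text, labeller.label(text, conditions)) for text in texts]


# Usage, assuming `labeller` and `conditions` are defined as in the example above:
# results = label_many(labeller, ["First text.", "Second text."], conditions)
```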

Training a Text Classifier

Training text classifiers with Langformers is quite straightforward.

First, we define the training configuration, prepare the dataset, and select the MLM we would like to fine-tune for the classification task. All of this takes only a few lines of code and remains fully customizable.

# Import langformers
from langformers import tasks

# Define training configuration
training_config = {
    "max_length": 80,
    "num_train_epochs": 1,
    "report_to": ['tensorboard'],
    "logging_steps": 20,
    "save_steps": 20,
    # ...
}

# Initialize the model
model = tasks.create_classifier(
    model_name="roberta-base",          # model from Hugging Face or a local path
    csv_path="/path/to/dataset.csv",    # csv dataset
    text_column="text",                 # text column name
    label_column="label",               # label/class column name
    training_config=training_config
)

# Start fine-tuning
model.train()
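
The classifier reads a CSV file whose column names are passed as `text_column` and `label_column`. The sketch below writes a tiny file in that shape, assuming a header row named `text,label`. The rows are invented for illustration, and whether labels should be strings or integers is not stated here, so strings are used as a guess.

```python
import csv
import os
import tempfile

# Invented example rows; a real dataset would be much larger.
EXAMPLE_ROWS = [
    {"text": "No doubt, The Shawshank Redemption is a cinematic masterpiece.", "label": "Positive"},
    {"text": "The plot was dull and the acting was worse.", "label": "Negative"},
    {"text": "The film runs for about two hours.", "label": "Neutral"},
]


def write_dataset(path: str, rows: list) -> None:
    """Write rows to a CSV file with a `text,label` header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["text", "label"])
        writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "dataset.csv")
    write_dataset(path, EXAMPLE_ROWS)
    print("Wrote", path)  # pass this path as csv_path
```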

Training a Custom Tokenizer

Before pretraining an MLM, you need to create a tokenizer (if you don't already have one) and tokenize your dataset.

# Import langformers
from langformers import tasks

# Define configuration for the tokenizer
tokenizer_config = {
    "vocab_size": 50_265,
    "min_frequency": 2,
    "max_length": 512,
    # ...
}

# Train the tokenizer and tokenize the dataset
tokenizer = tasks.create_tokenizer(data_path="data.txt", tokenizer_config=tokenizer_config)
tokenizer.train()

Pretraining an MLM

With a tokenizer and tokenized dataset ready, pretraining an MLM is straightforward with Langformers.

# Import langformers
from langformers import tasks

# Define model architecture
model_config = {
    "vocab_size": 50_265,            # Size of the vocabulary (must match tokenizer's `vocab_size`)
    "max_position_embeddings": 512,  # Maximum sequence length (must match tokenizer's `max_length`)
    "num_attention_heads": 12,       # Number of attention heads
    "num_hidden_layers": 12,         # Number of hidden layers
    "hidden_size": 768,              # Size of the hidden layers
    "intermediate_size": 3072,       # Size of the intermediate layer in the Transformer
    # ...
}

# Define training configuration
training_config = {
    "num_train_epochs": 2,           # Number of training epochs
    "save_total_limit": 1,           # Maximum number of checkpoints to save
    "learning_rate": 2e-4,           # Learning rate for optimization
    # ...
}

# Initialize the training
model = tasks.create_mlm(
    tokenizer="tokenizer",
    tokenized_dataset="tokenized_dataset",
    training_config=training_config,
    model_config=model_config
)

# To continue pretraining an existing MLM such as RoBERTa,
# provide `checkpoint_path` to tasks.create_mlm() instead of `model_config`.

# Start the training
model.train()
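
The comments in `model_config` say that `vocab_size` and `max_position_embeddings` must match the tokenizer's `vocab_size` and `max_length`. A small pre-flight check can catch a mismatch before a long training run starts. This is a sketch only: `check_mlm_configs` is a hypothetical helper, not a Langformers function. The divisibility check reflects how BERT/RoBERTa-style attention in Hugging Face Transformers splits `hidden_size` across heads, and assumes the MLM uses such an architecture.

```python
def check_mlm_configs(tokenizer_config: dict, model_config: dict) -> list:
    """Return a list of human-readable problems; an empty list means the configs agree."""
    problems = []
    if model_config["vocab_size"] != tokenizer_config["vocab_size"]:
        problems.append("vocab_size differs between tokenizer and model")
    if model_config["max_position_embeddings"] != tokenizer_config["max_length"]:
        problems.append("max_position_embeddings differs from tokenizer max_length")
    if model_config["hidden_size"] % model_config["num_attention_heads"] != 0:
        problems.append("hidden_size is not divisible by num_attention_heads")
    return problems


if __name__ == "__main__":
    tokenizer_config = {"vocab_size": 50_265, "min_frequency": 2, "max_length": 512}
    model_config = {
        "vocab_size": 50_265,
        "max_position_embeddings": 512,
        "num_attention_heads": 12,
        "hidden_size": 768,
    }
    print(check_mlm_configs(tokenizer_config, model_config) or "configs look consistent")
```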

Embed Sentences

Using state-of-the-art embedding models for vectorizing your sentences takes just two steps with Langformers.

# Import langformers
from langformers import tasks

# Create an embedder
embedder = tasks.create_embedder(provider="huggingface", model_name="sentence-transformers/all-MiniLM-L6-v2")

# Get your sentence embeddings
embeddings = embedder.embed(["I am hungry.", "I want to eat something."])
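
A common next step is to compare two embeddings with cosine similarity. The function below is a dependency-free sketch. It assumes each embedding can be treated as a sequence of floats; the exact return type of `embedder.embed()` is not described here, so a conversion (for example to a list) may be needed first.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Usage, assuming `embeddings` holds two float sequences as in the example above:
# print(cosine_similarity(embeddings[0], embeddings[1]))
```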

Semantic Search

Langformers can help you quickly set up a semantic search engine for vectorized text retrieval. All you need to do is specify an embedding model, the type of database (FAISS, ChromaDB, or Pinecone), and an index type (if required).

# Import langformers
from langformers import tasks

# Initialize a searcher
searcher = tasks.create_searcher(embedder="sentence-transformers/all-MiniLM-L12-v2", database="faiss", index_type="HNSW")

'''
For other vector databases:

ChromaDB
searcher = tasks.create_searcher(embedder="sentence-transformers/all-MiniLM-L12-v2", database="chromadb")

Pinecone
searcher = tasks.create_searcher(embedder="sentence-transformers/all-MiniLM-L12-v2", database="pinecone", api_key="your-api-key-here")
'''

# Sentences to add to the vector database
sentences = [
    "He is learning Python programming.",
    "The coffee shop opens at 8 AM.",
    "She bought a new laptop yesterday.",
    "He loves to play basketball with friends.",
    "Artificial Intelligence is evolving rapidly.",
    "He studies CS at the University of Melbourne."
]

# Metadata for the respective sentences
metadata = [
    {"action": "learning", "category": "education"},
    {"action": "opens", "category": "business"},
    {"action": "bought", "category": "shopping"},
    {"action": "loves", "category": "sports"},
    {"action": "evolving", "category": "technology"},
    {"action": "studies", "category": "education"}
]

# Add the sentences
searcher.add(texts=sentences, metadata=metadata)

# Define a search query
query_sentence = "computer science"

# Query the vector database
results = searcher.query(query=query_sentence, items=2, include_metadata=True)
print(results)
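
To make the idea behind the query concrete, here is a toy, in-memory version of vector search. It is not how Langformers, FAISS, ChromaDB, or Pinecone are implemented: it scores every stored vector exhaustively, whereas an index such as HNSW searches approximately. The 2-D vectors are invented by hand, and `top_k` and `TOY_INDEX` are names made up for this sketch. Because the vectors have unit length, the dot product equals cosine similarity.

```python
import heapq
from typing import Dict, List, Sequence, Tuple

# One index entry: (text, embedding vector, metadata)
Entry = Tuple[str, Sequence[float], Dict[str, str]]


def top_k(query_vec: Sequence[float], index: List[Entry], k: int) -> List[Tuple[float, str, Dict[str, str]]]:
    """Score every entry by dot product with the query and return the k best, highest first."""
    scored = [
        (sum(q * v for q, v in zip(query_vec, vec)), text, meta)
        for text, vec, meta in index
    ]
    return heapq.nlargest(k, scored, key=lambda item: item[0])


# Hand-made 2-D unit vectors, invented purely for illustration.
TOY_INDEX: List[Entry] = [
    ("He is learning Python programming.", (0.8, 0.6), {"category": "education"}),
    ("The coffee shop opens at 8 AM.", (0.0, 1.0), {"category": "business"}),
    ("He studies CS at the University of Melbourne.", (1.0, 0.0), {"category": "education"}),
]

results = top_k((1.0, 0.0), TOY_INDEX, k=2)
for score, text, meta in results:
    print(f"{score:.2f}  {text}  {meta}")
```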

Knowledge Distillation (Mimicking a pretrained model)

Langformers can train a custom model to replicate the embedding space of a pretrained teacher model.

# Load a text corpus
# In this example, we use all the sentences from the `allnli` dataset.
from datasets import load_dataset
data = load_dataset("langformers/allnli-mimic-embedding")

# Import langformers
from langformers import tasks

# Define the architecture of your student model
student_config = {
    "max_position_embeddings": 130,
    "num_attention_heads": 8,
    "num_hidden_layers": 8,
    "hidden_size": 128,
    "intermediate_size": 256,
    # ...
}

# Define the training configurations
training_config = {
    "num_train_epochs": 10,
    "learning_rate": 5e-5,
    "batch_size": 128,                          # use large batch
    "dataset_path": data['train']['sentence'],  # `list` of sentences or `path` to a text corpus
    "logging_steps": 100,
    # ...
}

# Create a mimicker
mimicker = tasks.create_mimicker(teacher_model="roberta-base", student_config=student_config, training_config=training_config)

# Start training
mimicker.train()
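
Replicating an embedding space means driving the student's sentence vectors towards the teacher's. Which loss Langformers uses is not stated here. The sketch below shows mean squared error, one common choice for this kind of distillation, in plain Python. It assumes both vectors already have the same length; how the student's output is matched to the teacher's dimensionality is not covered in this section.

```python
from typing import Sequence


def mse_loss(student: Sequence[float], teacher: Sequence[float]) -> float:
    """Mean squared error between a student embedding and a teacher embedding."""
    if len(student) != len(teacher):
        raise ValueError("student and teacher embeddings must have the same length")
    if len(student) == 0:
        raise ValueError("embeddings must not be empty")
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)


if __name__ == "__main__":
    # A lower value means the student is closer to the teacher for this sentence.
    print(mse_loss([1.0, 2.0], [1.0, 4.0]))
```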

Documentation

For full documentation, API reference, and advanced usage, visit: https://langformers.com

License

Langformers is released under the Apache License 2.0.

Contributing

We welcome contributions! Please see our contribution guidelines for details.

Built with ❤️ for the future of language AI.

Release history

0.5.0: May 4, 2025
0.4.0: Apr 17, 2025
0.3.1: Apr 16, 2025
0.3.0: Apr 14, 2025
0.2.0: Apr 10, 2025
0.1.1: Apr 8, 2025 (this version)
0.1.0: Apr 8, 2025

Download files

Source Distribution

langformers-0.1.1.tar.gz (777.4 kB view details)

Uploaded Apr 8, 2025 Source

Built Distribution

langformers-0.1.1-py3-none-any.whl (794.5 kB view details)

Uploaded Apr 8, 2025 Python 3

File details

Details for the file langformers-0.1.1.tar.gz.

File metadata

Download URL: langformers-0.1.1.tar.gz
Upload date: Apr 8, 2025
Size: 777.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.9.10

File hashes

Hashes for langformers-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`44ff447bd788108b429382e57c91cbc729df18d326713e20934a0b13a0d9b457`
MD5	`523675baf38b2eff5ec6f4486d16ca98`
BLAKE2b-256	`1556d3038dddade96aa029467559eb8bdb830451a5b070f031468138d992cca1`
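
A downloaded file can be checked against the SHA256 digest listed above. The sketch below uses only the standard library `hashlib`; the helper name `sha256_of_file` and the local file name in the commented usage are assumptions for this example.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# SHA256 digest listed for langformers-0.1.1.tar.gz
EXPECTED_SHA256 = "44ff447bd788108b429382e57c91cbc729df18d326713e20934a0b13a0d9b457"

# Usage, assuming the archive was downloaded to the current directory:
# print(sha256_of_file("langformers-0.1.1.tar.gz") == EXPECTED_SHA256)
```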

File details

Details for the file langformers-0.1.1-py3-none-any.whl.

File metadata

Download URL: langformers-0.1.1-py3-none-any.whl
Upload date: Apr 8, 2025
Size: 794.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.9.10

File hashes

Hashes for langformers-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`3da4c1fe0ed73fb82313490fa1e62f0f7f62fe4c93e10f87a8e9c605b91cd2b7`
MD5	`de11b320758cb0db7b23cec5af6e5aaf`
BLAKE2b-256	`2ca40bc4e44a9a8ee06b9e96fe9686bed27bbe7f2e39a0b5dc369ea43d477ecd`
