🔄 Transformer Cloner

Clone and prune transformer models with new tokenizers. Create smaller, more efficient models by mapping vocabularies, reducing dimensions, and pruning layers.

Use Cases

  • 🌍 Language Adaptation: Use a custom tokenizer optimized for your language
  • 📉 Model Compression: Create smaller models for edge deployment
  • 🎓 Knowledge Distillation: Generate student models from teacher models
  • 🔬 Research: Experiment with different model architectures

📦 Installation

pip install transformer-cloner

Requirements:

  • Python 3.10+
  • PyTorch 2.0+
  • Transformers 4.40+

📖 Complete API Reference

TransformerCloner

The main class for cloning and pruning transformer models.

from transformer_cloner import TransformerCloner

cloner = TransformerCloner(
    org_model_id: str,         # HuggingFace model ID or local path to original model
    target_tokenizer_id: str,  # HuggingFace tokenizer ID or local path to target tokenizer
)

Attributes after initialization:

  • cloner.org_model - The loaded original model
  • cloner.org_tokenizer - The original model's tokenizer
  • cloner.target_tokenizer - The target tokenizer
  • cloner.token_id_map - Dictionary mapping target token IDs to source token IDs (empty until build_token_id_map() is called)

Method: build_token_id_map()

Build a mapping from target tokenizer IDs to original tokenizer IDs.

token_map = cloner.build_token_id_map(
    batch_size: int = 5000,   # Number of tokens to process per batch (higher = faster but more memory)
    verbose: bool = True,     # Whether to print progress
) -> dict[int, list[int]]     # Returns {target_token_id: [source_token_ids]}

Example:

cloner = TransformerCloner(
    org_model_id="google/gemma-3-270m-it",
    target_tokenizer_id="alibayram/turkish-tokenizer",
)

# Build the token mapping
token_map = cloner.build_token_id_map(batch_size=10000, verbose=True)
# Output: Building token ID map for 65536 tokens...
#         Processed 10000/65536 tokens
#         ...
#         Token ID map built with 65536 entries

print(token_map[100])  # [234, 567] - target token 100 maps to source tokens 234, 567
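
The shape of this mapping can be pictured with a toy example. The sketch below is not the library's implementation: it assumes each target token's text is re-tokenized with the source tokenizer, and it uses a hypothetical greedy longest-match tokenizer (`toy_encode`) over made-up vocabularies so that it runs without downloading a model.

```python
def toy_encode(text: str, vocab: dict[str, int]) -> list[int]:
    """Greedy longest-match tokenizer over a toy vocabulary (illustrative only)."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest remaining substring first, then shrink it.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"cannot encode {text[i]!r} with the toy vocabulary")
    return ids


def build_toy_token_id_map(
    target_vocab: dict[str, int], source_vocab: dict[str, int]
) -> dict[int, list[int]]:
    """Map each target token ID to the source IDs that spell the same text."""
    return {tid: toy_encode(tok, source_vocab) for tok, tid in target_vocab.items()}


# Made-up vocabularies: the source only knows short pieces.
source_vocab = {"hel": 0, "lo": 1, "h": 2, "e": 3, "l": 4, "o": 5, "w": 6}
target_vocab = {"hello": 0, "low": 1, "he": 2}

toy_token_map = build_toy_token_id_map(target_vocab, source_vocab)
print(toy_token_map)
```

A target token that the source tokenizer splits into several pieces ends up with several source IDs, which is the case the embedding strategies are meant to handle.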

Method: clone()

Clone the model with a new tokenizer, mapping embeddings from the original model.

model = cloner.clone(
    strategy: EmbeddingStrategy = EmbeddingStrategy.MEAN,  # How to combine multiple source embeddings
    verbose: bool = True,                                   # Whether to print progress
) -> AutoModelForCausalLM                                   # Returns the cloned model

Example:

from transformer_cloner import TransformerCloner, EmbeddingStrategy

cloner = TransformerCloner(
    org_model_id="google/gemma-3-270m-it",
    target_tokenizer_id="alibayram/turkish-tokenizer",
)

# Clone with mean embedding strategy
model = cloner.clone(strategy=EmbeddingStrategy.MEAN, verbose=True)
# Output: Building token ID map for 65536 tokens...
#         Cloning model with strategy: mean
#         Model vocab size: 65536, Tokenizer vocab size: 65536
#         Copying weights from original model...
#         Mapping embeddings...
#         Mapped 1000/65536 embeddings
#         ...
#         Model cloning complete!

# Save the cloned model
model.save_pretrained("./cloned-model")

Method: clone_with_lm_head()

Clone the model including the language modeling head (for models with untied weights).

model = cloner.clone_with_lm_head(
    strategy: EmbeddingStrategy = EmbeddingStrategy.MEAN,  # How to combine embeddings
    verbose: bool = True,                                   # Whether to print progress
) -> AutoModelForCausalLM                                   # Returns the cloned model

Example:

# For models where lm_head is NOT tied to embeddings
model = cloner.clone_with_lm_head(strategy=EmbeddingStrategy.WEIGHTED)
# Output: ... (same as clone)
#         Mapping lm_head weights...
#         lm_head mapping complete!

model.save_pretrained("./cloned-with-lm-head")
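
Whether the extra lm_head mapping is needed depends on whether the model ties its output projection to its input embeddings. A minimal sketch of that decision follows, assuming the model config exposes the standard Transformers `tie_word_embeddings` flag. The helper name `pick_clone_method` is hypothetical and not part of transformer_cloner.

```python
from types import SimpleNamespace


def pick_clone_method(config) -> str:
    """Return the name of the cloner method that suits this config.

    Assumption of this sketch: a missing flag is treated as tied weights.
    """
    tied = getattr(config, "tie_word_embeddings", True)
    return "clone" if tied else "clone_with_lm_head"


# Stand-in configs so the sketch runs without loading a model.
tied_config = SimpleNamespace(tie_word_embeddings=True)
untied_config = SimpleNamespace(tie_word_embeddings=False)

print(pick_clone_method(tied_config))
print(pick_clone_method(untied_config))

# With a real cloner this could be used as:
#   method = getattr(cloner, pick_clone_method(cloner.org_model.config))
#   model = method(strategy=EmbeddingStrategy.MEAN)
```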

Method: clone_pruned()

Clone the model with architecture pruning (smaller hidden size, fewer layers, etc.).

model = cloner.clone_pruned(
    pruning_config: PruningConfig,                          # Configuration for pruned dimensions
    strategy: EmbeddingStrategy = EmbeddingStrategy.MEAN,   # How to combine embeddings
    verbose: bool = True,                                    # Whether to print progress
) -> AutoModelForCausalLM                                    # Returns the pruned model

Example:

from transformer_cloner import TransformerCloner, PruningConfig, EmbeddingStrategy

cloner = TransformerCloner(
    org_model_id="google/gemma-3-270m-it",
    target_tokenizer_id="alibayram/turkish-tokenizer",
)

# Create a smaller model
pruning_config = PruningConfig(
    hidden_size=320,           # Reduce from 640 to 320
    num_hidden_layers=9,       # Reduce from 18 to 9
    intermediate_size=1024,    # Reduce from 2048 to 1024
    num_attention_heads=2,     # Reduce from 4 to 2
    num_key_value_heads=1,     # Keep at 1
)

model = cloner.clone_pruned(
    pruning_config=pruning_config,
    strategy=EmbeddingStrategy.MEAN,
    verbose=True,
)
# Output: Original: hidden=640, layers=18, intermediate=2048, heads=4, kv_heads=1
#         Pruned:   hidden=320, layers=9, intermediate=1024, heads=2, kv_heads=1
#         Model vocab size: 65536
#         Copying and pruning weights from original model...
#         Mapping embeddings with pruning...
#         Pruned model cloning complete!

model.save_pretrained("./pruned-model")
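
To make the effect of the pruning parameters concrete, the sketch below shrinks a toy weight matrix and a toy layer list by keeping the leading rows, columns, and layers. This is only one simple selection rule, assumed here for illustration. This README does not state which rows or layers transformer_cloner keeps, and the helpers `prune_matrix` and `prune_layers` are hypothetical.

```python
def prune_matrix(weight: list[list[float]], out_dim: int, in_dim: int) -> list[list[float]]:
    """Keep the first out_dim rows and the first in_dim columns."""
    return [row[:in_dim] for row in weight[:out_dim]]


def prune_layers(layers: list, num_layers: int) -> list:
    """Keep the first num_layers layers."""
    return layers[:num_layers]


# A 4x4 toy weight whose entries encode their own position (row * 10 + column).
toy_weight = [[r * 10 + c for c in range(4)] for r in range(4)]
small_weight = prune_matrix(toy_weight, out_dim=2, in_dim=2)

# 18 toy "layers", reduced to 9 as in the example above.
kept_layers = prune_layers(list(range(18)), num_layers=9)

print(small_weight)
print(kept_layers)
```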

Method: clone_with_vocab_pruning()

Clone model with a reduced embedding table (fewer tokens).

model, tokenizer, id_mapping = cloner.clone_with_vocab_pruning(
    keep_token_ids: Optional[list[int]] = None,   # Specific token IDs to keep
    vocab_size: Optional[int] = None,             # Keep first N tokens (ignored if keep_token_ids provided)
    pruning_config: Optional[PruningConfig] = None,  # Optional architecture pruning
    verbose: bool = True,                          # Whether to print progress
) -> tuple[AutoModelForCausalLM, AutoTokenizer, dict[int, int]]
# Returns: (model, original_tokenizer, id_mapping)
# id_mapping: {old_token_id: new_embedding_index}

Note: The original tokenizer is returned unchanged because modifying SentencePiece/BPE vocabularies breaks them. Use id_mapping to convert token IDs to embedding indices.
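
Because the tokenizer still emits the old token IDs, they have to be translated before they reach the smaller embedding table. The sketch below shows that translation with a plain dictionary. The helper `remap_token_ids` is hypothetical, and sending unmapped IDs to a fallback index is an assumption of this sketch rather than documented library behavior.

```python
from typing import Optional


def remap_token_ids(
    token_ids: list[int],
    id_mapping: dict[int, int],
    fallback_index: Optional[int] = None,
) -> list[int]:
    """Translate original token IDs into indices of the pruned embedding table."""
    remapped = []
    for tid in token_ids:
        if tid in id_mapping:
            remapped.append(id_mapping[tid])
        elif fallback_index is not None:
            # Assumption: tokens that were pruned away share one fallback slot.
            remapped.append(fallback_index)
        else:
            raise KeyError(f"token ID {tid} was pruned and no fallback was given")
    return remapped


# A mapping of the form {old_token_id: new_embedding_index}.
toy_id_mapping = {0: 0, 1: 1, 2: 2, 100: 3, 200: 4}

print(remap_token_ids([2, 100, 200, 1], toy_id_mapping))
print(remap_token_ids([2, 999], toy_id_mapping, fallback_index=0))
```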

Example 1: Keep first N tokens

from transformer_cloner import TransformerCloner

cloner = TransformerCloner(
    org_model_id="google/gemma-3-270m-it",
    target_tokenizer_id="google/gemma-3-270m-it",  # Same tokenizer
)

# Keep only first 8000 tokens
model, tokenizer, id_mapping = cloner.clone_with_vocab_pruning(
    vocab_size=8000,
    verbose=True,
)
# Output: Cloning with vocab pruning: 8000 tokens
#         New vocab size: 8000
#         Creating model with vocab_size=8000
#         Copying weights from original model...
#         Mapping embeddings (direct 1:1)...
#         Mapped 8000 embeddings directly
#         Vocab-pruned model cloning complete!

model.save_pretrained("./vocab-pruned-model")

# Use id_mapping to convert token IDs
print(id_mapping)  # {0: 0, 1: 1, 2: 2, ..., 7999: 7999}

Example 2: Keep specific tokens

# Keep only specific token IDs
important_tokens = [0, 1, 2, 100, 200, 500, 1000, 2000, 5000]

model, tokenizer, id_mapping = cloner.clone_with_vocab_pruning(
    keep_token_ids=important_tokens,
    verbose=True,
)

print(id_mapping)
# {0: 0, 1: 1, 2: 2, 100: 3, 200: 4, 500: 5, 1000: 6, 2000: 7, 5000: 8}

Example 3: Combined vocab + architecture pruning

from transformer_cloner import TransformerCloner, PruningConfig

cloner = TransformerCloner(
    org_model_id="google/gemma-3-270m-it",
    target_tokenizer_id="google/gemma-3-270m-it",
)

# Combine vocab pruning with architecture pruning
model, tokenizer, id_mapping = cloner.clone_with_vocab_pruning(
    vocab_size=8000,
    pruning_config=PruningConfig(
        hidden_size=320,
        num_hidden_layers=6,
        intermediate_size=1024,
    ),
    verbose=True,
)

# Result: Tiny model with 8000 tokens and smaller architecture
model.save_pretrained("./tiny-model")

Method: get_token_info()

Get information about how a specific token is mapped.

info = cloner.get_token_info(
    token: str,  # The token string to look up
) -> dict      # Returns token mapping information

Example:

cloner.build_token_id_map()

info = cloner.get_token_info("hello")
print(info)
# {
#     'token': 'hello',
#     'target_id': 1234,
#     'source_ids': [567, 890],
#     'source_tokens': ['hel', 'lo']
# }

# Token not found
info = cloner.get_token_info("xyz123")
# {'error': "Token 'xyz123' not found in target tokenizer"}

Method: print_vocab_samples()

Print sample vocabulary entries from both tokenizers.

cloner.print_vocab_samples(
    n: int = 10,  # Number of samples to print
) -> None

Example:

cloner.print_vocab_samples(n=5)
# Output:
# Original tokenizer samples:
#   0: '<bos>'   0
#   1: '<eos>'   1
#   2: '<pad>'   2
#   3: '▁'       3
#   4: '▁the'    4
# Total: 262144 tokens
#
# Target tokenizer samples:
#   0: '<bos>'   0
#   1: '<eos>'   1
#   2: '<pad>'   2
#   3: '▁'       3
#   4: '▁ve'     4
# Total: 65536 tokens

🎯 EmbeddingStrategy

When a target token maps to multiple source tokens, choose how to combine their embeddings:

from transformer_cloner import EmbeddingStrategy

Strategy  Value       Description                                    Best For
MEAN      "mean"      Average of all embeddings                      Default, balanced representation
SUM       "sum"       Sum of all embeddings                          Preserving total magnitude
FIRST     "first"     First token's embedding only                   Prefix-focused tokens
LAST      "last"      Last token's embedding only                    Suffix-focused tokens
WEIGHTED  "weighted"  Weighted average (first tokens weighted more)  Morphological priority
MAX       "max"       Element-wise maximum                           Preserving dominant features
MIN       "min"       Element-wise minimum                           Preserving minimal features

Example:

# Use different strategies
model = cloner.clone(strategy=EmbeddingStrategy.MEAN)     # Average
model = cloner.clone(strategy=EmbeddingStrategy.WEIGHTED) # First tokens matter more
model = cloner.clone(strategy=EmbeddingStrategy.FIRST)    # Only first token
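
The arithmetic behind each strategy can be sketched in plain Python on two tiny vectors. This is an illustration, not the library's code. In particular, the weights used for "weighted" (linearly decreasing, so earlier tokens count more) are an assumption, since the exact weighting is not specified here.

```python
def combine(vectors: list[list[float]], strategy: str) -> list[float]:
    """Combine several source embeddings into one target embedding."""
    n = len(vectors)
    columns = list(zip(*vectors))  # one tuple per embedding dimension
    if strategy == "mean":
        return [sum(col) / n for col in columns]
    if strategy == "sum":
        return [sum(col) for col in columns]
    if strategy == "first":
        return list(vectors[0])
    if strategy == "last":
        return list(vectors[-1])
    if strategy == "max":
        return [max(col) for col in columns]
    if strategy == "min":
        return [min(col) for col in columns]
    if strategy == "weighted":
        # Assumed weighting: n, n-1, ..., 1, normalized to sum to 1.
        weights = [n - i for i in range(n)]
        total = sum(weights)
        return [sum(w * x for w, x in zip(weights, col)) / total for col in columns]
    raise ValueError(f"unknown strategy: {strategy}")


# Two source embeddings of dimension 2.
source_embeddings = [[1.0, 4.0], [3.0, 2.0]]

for name in ("mean", "sum", "first", "last", "max", "min", "weighted"):
    print(name, combine(source_embeddings, name))
```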

⚙️ PruningConfig

Configuration dataclass for model architecture pruning.

from transformer_cloner import PruningConfig

config = PruningConfig(
    hidden_size: Optional[int] = None,           # Embedding dimension
    num_hidden_layers: Optional[int] = None,     # Number of transformer layers
    intermediate_size: Optional[int] = None,     # FFN intermediate dimension
    num_attention_heads: Optional[int] = None,   # Number of attention heads
    num_key_value_heads: Optional[int] = None,   # Number of KV heads (for GQA)
    head_dim: Optional[int] = None,              # Dimension per attention head
)

Set any value to None to keep the original model's value.
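
The "None keeps the original" rule can be sketched with a stand-in dataclass. `ToyPruningConfig` and `resolve` are hypothetical names used only for this illustration. They mirror a subset of the fields listed above and are not the library's classes.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ToyPruningConfig:
    """Stand-in with a subset of the PruningConfig fields."""
    hidden_size: Optional[int] = None
    num_hidden_layers: Optional[int] = None
    intermediate_size: Optional[int] = None


def resolve(pruning: ToyPruningConfig, original: dict[str, int]) -> dict[str, int]:
    """Fill every field left as None with the original model's value."""
    resolved = {}
    for f in fields(pruning):
        value = getattr(pruning, f.name)
        resolved[f.name] = original[f.name] if value is None else value
    return resolved


original_values = {"hidden_size": 640, "num_hidden_layers": 18, "intermediate_size": 2048}
resolved = resolve(ToyPruningConfig(num_hidden_layers=9), original_values)
print(resolved)
```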

Example configurations:

# Halve the layers only
config = PruningConfig(num_hidden_layers=9)  # 18 -> 9

# Halve all dimensions
config = PruningConfig(
    hidden_size=320,          # 640 -> 320
    num_hidden_layers=9,      # 18 -> 9
    intermediate_size=1024,   # 2048 -> 1024
)

# Tiny model
config = PruningConfig(
    hidden_size=128,
    num_hidden_layers=3,
    intermediate_size=512,
    num_attention_heads=2,
    num_key_value_heads=1,
)

Validation

Use validate() to check if your config is valid before cloning:

errors = config.validate(cloner.org_model.config)
if errors:
    print("Validation errors:", errors)
else:
    print("Config is valid!")

Validation checks:

  • ✅ Dimensions don't exceed original model
  • ✅ All values are positive
  • ✅ num_attention_heads is divisible by num_key_value_heads
  • ✅ hidden_size is compatible with attention configuration
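
The first three checks in this list can be sketched as a standalone function over plain dictionaries. `check_pruned` is a hypothetical helper, not the library's validate(). The hidden_size compatibility check is left out because its exact rule is not stated here.

```python
from typing import Optional


def check_pruned(pruned: dict[str, Optional[int]], original: dict[str, int]) -> list[str]:
    """Return a list of error messages; an empty list means the checks passed."""
    errors = []
    for name, value in pruned.items():
        if value is None:
            continue  # None means "keep the original value"
        if value <= 0:
            errors.append(f"{name} must be positive, got {value}")
        elif value > original[name]:
            errors.append(f"{name}={value} exceeds original value {original[name]}")

    def effective(name: str) -> int:
        value = pruned.get(name)
        return original[name] if value is None else value

    heads = effective("num_attention_heads")
    kv_heads = effective("num_key_value_heads")
    if kv_heads > 0 and heads % kv_heads != 0:
        errors.append(
            f"num_attention_heads={heads} is not divisible by num_key_value_heads={kv_heads}"
        )
    return errors


gemma_like = {
    "hidden_size": 640,
    "num_hidden_layers": 18,
    "num_attention_heads": 4,
    "num_key_value_heads": 1,
}

print(check_pruned({"hidden_size": 320, "num_hidden_layers": 9}, gemma_like))
print(check_pruned({"hidden_size": 1280}, gemma_like))
```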

📊 Gemma-3-270m Architecture Reference

For google/gemma-3-270m-it:

Parameter            Original Value
hidden_size          640
num_hidden_layers    18
intermediate_size    2048
num_attention_heads  4
num_key_value_heads  1
head_dim             256
vocab_size           262144

Example pruned configs:

# Half the layers (9 of 18), hidden size unchanged
PruningConfig(num_hidden_layers=9)

# Half of hidden_size, num_hidden_layers, and intermediate_size
PruningConfig(
    hidden_size=320,
    num_hidden_layers=9,
    intermediate_size=1024,
)

# Tiny: a quarter of hidden_size and intermediate_size, a third of the layers
PruningConfig(
    hidden_size=160,
    num_hidden_layers=6,
    intermediate_size=512,
    num_attention_heads=2,
)
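
To compare configurations by size, a rough parameter count can be computed from the values in the table above. The sketch below is an estimate under stated assumptions: tied input and output embeddings, four attention projections (query, key, value, output), a gated feed-forward block with three projection matrices, and no norm weights or biases counted. `estimate_params` is a hypothetical helper, not part of transformer_cloner.

```python
def estimate_params(
    vocab_size: int,
    hidden_size: int,
    num_hidden_layers: int,
    intermediate_size: int,
    num_attention_heads: int,
    num_key_value_heads: int,
    head_dim: int,
) -> int:
    """Approximate parameter count; norms and biases are ignored."""
    embedding = vocab_size * hidden_size  # shared with the output head (assumed tied)
    q_and_o = 2 * hidden_size * num_attention_heads * head_dim
    k_and_v = 2 * hidden_size * num_key_value_heads * head_dim
    mlp = 3 * hidden_size * intermediate_size  # gate, up, and down projections
    return embedding + num_hidden_layers * (q_and_o + k_and_v + mlp)


original = dict(
    vocab_size=262144,
    hidden_size=640,
    num_hidden_layers=18,
    intermediate_size=2048,
    num_attention_heads=4,
    num_key_value_heads=1,
    head_dim=256,
)

full = estimate_params(**original)
half_layers = estimate_params(**{**original, "num_hidden_layers": 9})

print(f"original: {full:,}")
print(f"9 layers: {half_layers:,} ({half_layers / full:.0%} of original)")
```

Under these assumptions the embedding table accounts for roughly 168M of about 268M parameters, so the final size depends as much on the vocabulary size as on the layer count.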

🔧 Complete Workflow Example

from transformer_cloner import TransformerCloner, PruningConfig, EmbeddingStrategy

# 1. Initialize
cloner = TransformerCloner(
    org_model_id="google/gemma-3-270m-it",
    target_tokenizer_id="my-org/turkish-gemma-tokenizer",
)

# 2. Explore the vocabularies
cloner.print_vocab_samples(n=5)

# 3. Build token mapping
token_map = cloner.build_token_id_map()

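# 4. Check a specific token (get_token_info returns an 'error' key if it is missing)
info = cloner.get_token_info("merhaba")
if "error" in info:
    print(info["error"])
else:
    print(f"'merhaba' maps to source tokens: {info['source_tokens']}")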

# 5. Create pruned model
pruning_config = PruningConfig(
    hidden_size=320,
    num_hidden_layers=9,
    intermediate_size=1024,
)

# Validate first
errors = pruning_config.validate(cloner.org_model.config)
if errors:
    raise ValueError(f"Invalid config: {errors}")

# 6. Clone with pruning
model = cloner.clone_pruned(
    pruning_config=pruning_config,
    strategy=EmbeddingStrategy.MEAN,
)

# 7. Save
model.save_pretrained("./turkish-gemma-small")
cloner.target_tokenizer.save_pretrained("./turkish-gemma-small")

print("Done! Model saved to ./turkish-gemma-small")

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.


📄 License

MIT License - see LICENSE for details.


🙏 Acknowledgments

  • Built on top of 🤗 Transformers
  • Inspired by vocabulary adaptation research in multilingual NLP
