Experimental compact UTF-8 string type for CPython
py-ministring
Experimental compact UTF-8 string type for CPython as a C-extension.
Description
py-ministring implements a new string-like type Utf8String with efficient Unicode indexing and slicing. This prototype is designed to reduce memory footprint when working with texts containing predominantly ASCII characters with occasional multi-byte characters (like emojis).
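To make the memory argument concrete: CPython's built-in str stores every character at a fixed width of 1, 2, or 4 bytes, chosen by the widest code point in the string (PEP 393), so a single emoji widens a whole ASCII text to 4 bytes per character. The sketch below is only a rough measurement aid, not part of py-ministring. The helper name `footprint` is hypothetical, and the UTF-8 figure excludes any object header or index structure a real type would add.

```python
import sys


def footprint(text: str) -> dict:
    """Compare a str's in-memory size with the size of its UTF-8 encoding."""
    encoded = text.encode("utf-8")
    return {
        "codepoints": len(text),
        # Size of the whole str object as reported by the interpreter
        # (header included, implementation-specific).
        "str_object_bytes": sys.getsizeof(text),
        # Raw payload a UTF-8 based type would have to store.
        "utf8_payload_bytes": len(encoded),
    }


ascii_only = footprint("hello world")
with_emoji = footprint("hello 😃 world")
print(ascii_only)
print(with_emoji)
```

The `sys.getsizeof` values depend on the interpreter version, so no expected output is shown. The UTF-8 payload for the second string is 16 bytes: 12 ASCII bytes plus 4 for the emoji.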
Why py-ministring?
- Compact Storage: Stores original UTF-8 bytes instead of wide characters
- O(1) Indexing: Uses offset table for fast character access
- Hash Caching: Speeds up comparison operations and dictionary usage
- Protocol Compatibility: Implements core Python string protocols (indexing, slicing, equality, hashing)
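The offset-table idea behind O(1) indexing can be sketched in pure Python. This is an illustrative model, not the extension's C code: `build_offsets` and `char_at` are hypothetical names, and the input is assumed to be valid UTF-8.

```python
def build_offsets(data: bytes) -> list[int]:
    """Return the byte offset at which each code point starts.

    In valid UTF-8, continuation bytes have the form 0b10xxxxxx, so every
    other byte begins a new code point.
    """
    return [i for i, byte in enumerate(data) if (byte & 0xC0) != 0x80]


def char_at(data: bytes, offsets: list[int], index: int) -> str:
    """Decode the code point at `index` using two table lookups."""
    if index < 0:
        index += len(offsets)
    if not 0 <= index < len(offsets):
        raise IndexError("string index out of range")
    start = offsets[index]
    # The character ends where the next one starts, or at the end of the data.
    end = offsets[index + 1] if index + 1 < len(offsets) else len(data)
    return data[start:end].decode("utf-8")


data = "hello 😃 world".encode("utf-8")
offsets = build_offsets(data)
print(len(offsets))                # 13
print(char_at(data, offsets, 6))   # 😃
print(char_at(data, offsets, -1))  # d
```

Building the table is a single O(n) pass. After that, each lookup costs two list accesses and one short decode, regardless of where the character sits in the string.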
Installation
git clone https://github.com/AI-Stratov/py-ministring
cd py-ministring
python setup.py build_ext --inplace
Usage
from ministring import ministr
# Create a string
s = ministr("hello 😃 world")
# Length in codepoints
print(len(s)) # 13
# Indexing
print(s[6]) # "😃"
print(s[0]) # "h"
print(s[-1]) # "d"
# Slicing
print(str(s[0:5])) # "hello"
print(str(s[6:7])) # "😃"
print(str(s[8:])) # "world"
# Convert to regular string
print(str(s)) # "hello 😃 world"
# Comparison
assert s == "hello 😃 world"
assert "hello 😃 world" == s
# Hashing (can use in dict/set)
d = {s: "value"}
s2 = ministr("hello 😃 world")
print(d[s2]) # "value"
API
Constructor
- ministr(obj) - creates a new Utf8String object from a string or any object convertible with str()
Methods
- len(s) - returns the number of Unicode codepoints
- s[i] - returns the character at index i as a regular Python string
- s[start:stop] - returns a new Utf8String containing the slice
- str(s) - converts to a regular Python string
- repr(s) - string representation for debugging
- hash(s) - hash value (cached)
- s == other - comparison with another Utf8String or a regular string
Data Structure
typedef struct {
PyObject_HEAD
char *utf8_data; // UTF-8 bytes
Py_ssize_t utf8_size; // size in bytes
int32_t *offsets; // offset table: codepoint → byte
Py_ssize_t length; // number of codepoints
Py_hash_t hash; // cached hash
} Utf8StringObject;
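As a reading aid, the struct can be mirrored by a small pure-Python model. This is a sketch under assumptions, not the extension's implementation: the class name `Utf8StringModel`, the use of -1 as the "hash not computed" sentinel, the lazy table build, and the trailing sentinel offset are all choices made for illustration.

```python
class Utf8StringModel:
    """Hypothetical pure-Python mirror of the Utf8StringObject fields."""

    __slots__ = ("utf8_data", "offsets", "length", "hash")

    def __init__(self, text: str):
        self.utf8_data = text.encode("utf-8")  # UTF-8 bytes
        self.offsets = None                    # offset table, built on demand
        self.length = len(text)                # number of code points
        self.hash = -1                         # -1 marks "not computed yet"

    @property
    def utf8_size(self) -> int:
        return len(self.utf8_data)             # size in bytes

    def _ensure_offsets(self) -> list:
        if self.offsets is None:
            # A byte starts a code point unless it is a 0b10xxxxxx continuation.
            table = [i for i, b in enumerate(self.utf8_data)
                     if (b & 0xC0) != 0x80]
            # Extra end marker so table[i + 1] works for the last code point.
            table.append(len(self.utf8_data))
            self.offsets = table
        return self.offsets

    def __len__(self) -> int:
        return self.length

    def __getitem__(self, index: int) -> str:
        if index < 0:
            index += self.length
        if not 0 <= index < self.length:
            raise IndexError("index out of range")
        table = self._ensure_offsets()
        return self.utf8_data[table[index]:table[index + 1]].decode("utf-8")

    def __hash__(self) -> int:
        if self.hash == -1:
            # Hash the decoded text so the model hashes like the equal str.
            self.hash = hash(self.utf8_data.decode("utf-8"))
        return self.hash

    def __eq__(self, other) -> bool:
        if isinstance(other, Utf8StringModel):
            return self.utf8_data == other.utf8_data
        if isinstance(other, str):
            return self.utf8_data == other.encode("utf-8")
        return NotImplemented


m = Utf8StringModel("hello 😃 world")
print(len(m), m.utf8_size)  # 13 16
print(m[6])                 # 😃
print({m: "value"}[Utf8StringModel("hello 😃 world")])  # value
```

In this model the offset table costs nothing until the first index operation, and the hash is computed at most once per object.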
Testing
Run tests with pytest:
pip install pytest
pytest -v
Limitations
⚠️ WARNING: This is an experimental prototype, not intended for production use!
- Missing support for many string methods (find, replace, etc.)
- May be slower than regular strings for some operations
- No support for step slicing (s[::2])
- Limited handling of invalid UTF-8
- No optimizations for very long strings
Technical Details
C API
Core functions for working with Utf8String:
- Utf8String_FromUTF8(data, size) - create from UTF-8 data
- utf8_codepoint_count(data, size) - count codepoints
- build_offset_table(self) - build offset table
- utf8_char_length(first_byte) - determine UTF-8 character length
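What two of these helpers compute can be shown with Python equivalents. The function names are borrowed from the list above, but the bodies are a sketch based on the UTF-8 encoding rules rather than a port of the C source. In particular, returning 0 for an invalid lead byte and raising ValueError are assumptions about error handling.

```python
def utf8_char_length(first_byte: int) -> int:
    """Return the sequence length implied by a UTF-8 lead byte, or 0 if invalid."""
    if first_byte < 0x80:            # 0xxxxxxx: ASCII
        return 1
    if 0xC2 <= first_byte <= 0xDF:   # 110xxxxx: two bytes (0xC0/0xC1 are overlong)
        return 2
    if 0xE0 <= first_byte <= 0xEF:   # 1110xxxx: three bytes
        return 3
    if 0xF0 <= first_byte <= 0xF4:   # 11110xxx: four bytes, up to U+10FFFF
        return 4
    return 0                         # continuation byte or invalid lead byte


def utf8_codepoint_count(data: bytes) -> int:
    """Count code points by hopping from lead byte to lead byte."""
    count = 0
    pos = 0
    while pos < len(data):
        step = utf8_char_length(data[pos])
        if step == 0 or pos + step > len(data):
            raise ValueError(f"invalid UTF-8 at byte {pos}")
        pos += step
        count += 1
    return count


print(utf8_char_length(ord("h")))                            # 1
print(utf8_char_length("😃".encode("utf-8")[0]))             # 4
print(utf8_codepoint_count("Hello 世界 🌍 Мир".encode("utf-8")))  # 14
```

This version only inspects lead bytes. It does not check that the following bytes are well-formed continuation bytes, which a stricter validator would need to do.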
Architecture
- Data Storage: Original UTF-8 bytes are preserved unchanged
- Indexing: Offset table built on-demand for O(1) access
- Caching: Hash values cached for faster comparisons
- Compatibility: Supports the sequence and mapping protocols for length, indexing, and slicing (step slicing excluded, see Limitations)
Usage Examples
Working with Emojis
s = ministr("Hello 👋 world 🌍!")
print(f"Length: {len(s)}") # Length: 16
print(f"Emojis: {s[6]}, {s[14]}") # Emojis: 👋, 🌍
Multi-language Text Processing
s = ministr("Hello 世界 🌍 Мир")
print(f"English: {str(s[0:5])}") # Hello
print(f"Chinese: {str(s[6:8])}") # 世界
print(f"Emoji: {s[9]}") # 🌍
print(f"Russian: {str(s[11:14])}") # Мир
Performance
# Creating many strings with emojis
texts = [ministr(f"Text {i} 😀") for i in range(1000)]
text_set = set(texts) # Each hash is computed once, then reused on later lookups
License
Experimental code for educational purposes.
File details
Details for the file py_ministring-0.1.1.tar.gz.
File metadata
- Download URL: py_ministring-0.1.1.tar.gz
- Upload date:
- Size: 12.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6ea9dc2fb7ca1c05bdd480b30459332dc524cd6c6a45e7a36b4e17e94b41c81b |
| MD5 | be5312e79aacbc85413eff6a65f4f38c |
| BLAKE2b-256 | ae0321da5c48f58cf036996380c603b53183514b352f4897b4b8723d89305a70 |
File details
Details for the file py_ministring-0.1.1-cp313-cp313-win_amd64.whl.
File metadata
- Download URL: py_ministring-0.1.1-cp313-cp313-win_amd64.whl
- Upload date:
- Size: 10.8 kB
- Tags: CPython 3.13, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2dbcf958c875227011f686caef0b57102a8c8fb7b08dfbea155b563895e2f916 |
| MD5 | 391800eb532cab151cc39264cd0ef15b |
| BLAKE2b-256 | 5ea43d0bcfea66f16358a5852e2ead590c3bfb4d472dda275035c2dee358663a |