Experimental compact UTF-8 string type for CPython

py-ministring

Experimental compact UTF-8 string type for CPython, implemented as a C extension.

Description

py-ministring implements a new string-like type, Utf8String, with efficient Unicode indexing and slicing. This prototype is designed to reduce the memory footprint of text that is predominantly ASCII with occasional multi-byte characters (such as emoji).
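To illustrate the motivation, the sketch below compares the size of a built-in str with the size of the same text encoded as UTF-8. It uses only the standard library and does not touch py-ministring itself; the sample text is an arbitrary choice, and the comparison covers the raw payload only (no object header, no offset table).

```python
import sys

# Mostly-ASCII text with a single non-BMP character (an emoji).
text = "hello world " * 100 + "😃"

# CPython's built-in str uses one fixed width for the whole string (PEP 393),
# so a single emoji forces 4 bytes per character.
str_bytes = sys.getsizeof(text)

# UTF-8 needs 1 byte per ASCII character and 4 bytes for the emoji.
utf8_bytes = len(text.encode("utf-8"))

print(f"str object:    {str_bytes} bytes")
print(f"UTF-8 payload: {utf8_bytes} bytes")
```

The exact str size depends on the CPython version, so no figure is quoted for it here.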

Why py-ministring?

  • Compact Storage: Stores original UTF-8 bytes instead of wide characters
  • O(1) Indexing: Uses offset table for fast character access
  • Hash Caching: Speeds up comparison operations and dictionary usage
  • Protocol Compatibility: Implements core Python string protocols (indexing, slicing, equality, hashing)
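The offset-table idea can be sketched in pure Python. This is an illustrative model, not the extension's C code: the helper names build_offsets and char_at are hypothetical, and the input is assumed to be valid UTF-8.

```python
def build_offsets(data: bytes) -> list[int]:
    """Return the byte offset at which each codepoint starts."""
    # A byte starts a codepoint unless it is a continuation byte (0b10xxxxxx).
    return [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]


def char_at(data: bytes, offsets: list[int], index: int) -> str:
    """Decode the codepoint at `index` with a constant number of table lookups."""
    if index < 0:
        index += len(offsets)
    if not 0 <= index < len(offsets):
        raise IndexError("index out of range")
    start = offsets[index]
    # The codepoint ends where the next one starts, or at the end of the data.
    end = offsets[index + 1] if index + 1 < len(offsets) else len(data)
    return data[start:end].decode("utf-8")


data = "hello 😃 world".encode("utf-8")
offsets = build_offsets(data)

print(len(offsets))                 # 13
print(char_at(data, offsets, 6))    # 😃
print(char_at(data, offsets, -1))   # d
```

Building the table costs one pass over the bytes; after that, each lookup no longer depends on how many multi-byte characters precede the index.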

Installation

git clone https://github.com/AI-Stratov/py-ministring
cd py-ministring
python setup.py build_ext --inplace

Usage

from ministring import ministr

# Create a string
s = ministr("hello 😃 world")

# Length in codepoints
print(len(s))      # 13

# Indexing
print(s[6])        # "😃"
print(s[0])        # "h"
print(s[-1])       # "d"

# Slicing
print(str(s[0:5]))    # "hello"
print(str(s[6:7]))    # "😃"
print(str(s[8:]))     # "world"

# Convert to regular string
print(str(s))      # "hello 😃 world"

# Comparison
assert s == "hello 😃 world"
assert "hello 😃 world" == s

# Hashing (can use in dict/set)
d = {s: "value"}
s2 = ministr("hello 😃 world")
print(d[s2])       # "value"

API

Constructor

  • ministr(obj) - creates a new Utf8String object from a string or str()-convertible object

Methods

  • len(s) - returns the number of Unicode codepoints
  • s[i] - returns the character at index i as a regular Python string (negative indices are supported)
  • s[start:stop] - returns a new Utf8String containing the slice; step slicing is not supported
  • str(s) - converts to regular Python string
  • repr(s) - string representation for debugging
  • hash(s) - hash value (cached)
  • s == other - comparison with other Utf8String or regular strings

Data Structure

typedef struct {
    PyObject_HEAD
    char *utf8_data;        // UTF-8 bytes
    Py_ssize_t utf8_size;   // size in bytes
    int32_t *offsets;       // offset table: codepoint → byte
    Py_ssize_t length;      // number of codepoints
    Py_hash_t hash;         // cached hash
} Utf8StringObject;
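As a rough illustration of how these fields relate, the sketch below computes the values they would hold for the string from the Usage section. It assumes the offset table stores exactly one int32_t entry per codepoint, which the struct suggests but does not confirm; the variable names mirror the struct fields for readability only.

```python
INT32_SIZE = 4  # bytes per int32_t offset entry

text = "hello 😃 world"
utf8_data = text.encode("utf-8")

utf8_size = len(utf8_data)          # size in bytes
length = len(text)                  # number of codepoints
offsets_bytes = INT32_SIZE * length # assumed: one entry per codepoint

print(utf8_size)      # 16
print(length)         # 13
print(offsets_bytes)  # 52
```

Under that assumption the table takes 4 bytes per codepoint once it exists.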

Testing

Run tests with pytest:

pip install pytest
pytest -v

Limitations

⚠️ WARNING: This is an experimental prototype, not intended for production use!

  • Missing support for many string methods (find, replace, etc.)
  • May be slower than regular strings for some operations
  • No support for step slicing (s[::2])
  • Limited handling of invalid UTF-8
  • No optimizations for very long strings

Technical Details

C API

Core functions for working with Utf8String:

  • Utf8String_FromUTF8(data, size) - create from UTF-8 data
  • utf8_codepoint_count(data, size) - count codepoints
  • build_offset_table(self) - build offset table
  • utf8_char_length(first_byte) - determine UTF-8 character length
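For readers unfamiliar with UTF-8 framing, here is a Python sketch of what the last two kinds of helper typically do. The functions reuse the C names for readability, but they are hypothetical re-implementations; the actual C code may differ, in particular in how it treats invalid input.

```python
def utf8_char_length(first_byte: int) -> int:
    """Length in bytes of the UTF-8 sequence that starts with first_byte."""
    if first_byte < 0x80:            # 0xxxxxxx: ASCII
        return 1
    if first_byte >> 5 == 0b110:     # 110xxxxx: 2-byte sequence
        return 2
    if first_byte >> 4 == 0b1110:    # 1110xxxx: 3-byte sequence
        return 3
    if first_byte >> 3 == 0b11110:   # 11110xxx: 4-byte sequence
        return 4
    raise ValueError(f"invalid UTF-8 lead byte: {first_byte:#04x}")


def utf8_codepoint_count(data: bytes) -> int:
    """Count codepoints by hopping from lead byte to lead byte."""
    count = 0
    i = 0
    while i < len(data):
        i += utf8_char_length(data[i])
        count += 1
    return count


sample = "Hello 世界 🌍 Мир".encode("utf-8")
print(utf8_codepoint_count(sample))   # 14
```

This sketch assumes well-formed input and only rejects bytes that cannot start a sequence.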

Architecture

  1. Data Storage: Original UTF-8 bytes are preserved unchanged
  2. Indexing: Offset table built on-demand for O(1) access
  3. Caching: Hash values cached for faster comparisons
  4. Compatibility: Implements the sequence and mapping protocols needed for len(), indexing, and slicing (step slicing is not supported)
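The points above can be modeled in a small pure-Python class. This is a sketch under stated assumptions, not the extension's implementation: the class name Utf8Sketch is hypothetical, only integer indexing is covered, and reusing the hash of the equivalent str is an assumption made so that equal values hash alike.

```python
class Utf8Sketch:
    """Toy model: stored UTF-8 bytes, lazy offset table, cached hash."""

    def __init__(self, text: str) -> None:
        self._data = text.encode("utf-8")  # 1. UTF-8 bytes kept as-is
        self._offsets = None               # 2. built on first use
        self._hash = None                  # 3. computed on first hash()

    def _table(self) -> list:
        if self._offsets is None:
            # Start offset of every codepoint, plus an end sentinel.
            self._offsets = [
                i for i, b in enumerate(self._data) if (b & 0xC0) != 0x80
            ]
            self._offsets.append(len(self._data))
        return self._offsets

    def __len__(self) -> int:
        return len(self._table()) - 1

    def __getitem__(self, index: int) -> str:
        table = self._table()
        n = len(table) - 1
        if index < 0:
            index += n
        if not 0 <= index < n:
            raise IndexError("index out of range")
        return self._data[table[index]:table[index + 1]].decode("utf-8")

    def __hash__(self) -> int:
        if self._hash is None:
            # Assumption: match the hash of the equivalent str.
            self._hash = hash(self._data.decode("utf-8"))
        return self._hash

    def __eq__(self, other: object) -> bool:
        if isinstance(other, Utf8Sketch):
            return self._data == other._data
        if isinstance(other, str):
            return self._data.decode("utf-8") == other
        return NotImplemented


s = Utf8Sketch("hello 😃 world")
print(s[6])                                      # 😃
print({s: "value"}[Utf8Sketch("hello 😃 world")])  # value
```

Deferring the table means a string that is only stored, compared, or hashed never pays for it.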

Usage Examples

Working with Emojis

from ministring import ministr

s = ministr("Hello 👋 world 🌍!")
print(f"Length: {len(s)}")           # Length: 16
print(f"Emojis: {s[6]}, {s[14]}")    # Emojis: 👋, 🌍

Multi-language Text Processing

s = ministr("Hello 世界 🌍 Мир")
print(f"English: {str(s[0:5])}")     # Hello
print(f"Chinese: {str(s[6:8])}")     # 世界
print(f"Emoji: {s[9]}")              # 🌍
print(f"Russian: {str(s[11:14])}")   # Мир

Performance

# Creating many strings with emojis
texts = [ministr(f"Text {i} 😀") for i in range(1000)]
text_set = set(texts)  # Each hash is computed once, then cached for later lookups

License

Experimental code for educational purposes.

Download files

Source Distribution

py_ministring-0.1.1.tar.gz (12.3 kB)

Built Distribution

py_ministring-0.1.1-cp313-cp313-win_amd64.whl (10.8 kB, CPython 3.13, Windows x86-64)
