py-ministring

Experimental compact UTF-8 string type for CPython as a C-extension.

Description

py-ministring implements a new string-like type, Utf8String (exposed in Python as ministr), with efficient Unicode indexing and slicing. The prototype aims to reduce the memory footprint of text that is predominantly ASCII with occasional multi-byte characters (such as emoji).

Why py-ministring?

  • Compact Storage: Stores original UTF-8 bytes instead of wide characters
  • O(1) Indexing: Uses offset table for fast character access
  • Hash Caching: Speeds up comparison operations and dictionary usage
  • Protocol Compatibility: Implements core Python string protocols (indexing, slicing, equality, hashing)
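
To make the memory argument concrete, here is a small sketch that compares a regular str with the size of the same text encoded as UTF-8. It uses only the standard library and assumes CPython's flexible string representation (PEP 393); it does not measure ministr itself, and the variable names are illustrative.

```python
import sys

# Mostly ASCII text with a single 4-byte emoji at the end.
text = "a" * 1000 + "😃"

# CPython picks one width (1, 2 or 4 bytes) for every character in a str.
# A single character above U+FFFF forces the 4-byte layout for all of them.
str_size = sys.getsizeof(text)

# UTF-8 spends 1 byte per ASCII character and 4 bytes on the emoji.
utf8_size = len(text.encode("utf-8"))

print(f"str object:    {str_size} bytes")
print(f"UTF-8 payload: {utf8_size} bytes")
```

This counts only the UTF-8 payload. Once an offset table with int32_t entries is built, it adds roughly 4 bytes per codepoint, so the saving applies mainly while the table is absent.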

Installation

git clone https://github.com/AI-Stratov/py-ministring
cd py-ministring
python setup.py build_ext --inplace

Usage

from ministring import ministr

# Create a string
s = ministr("hello 😃 world")

# Length in codepoints
print(len(s))      # 13

# Indexing
print(s[6])        # "😃"
print(s[0])        # "h"
print(s[-1])       # "d"

# Slicing
print(str(s[0:5]))    # "hello"
print(str(s[6:7]))    # "😃"
print(str(s[8:]))     # "world"

# Convert to regular string
print(str(s))      # "hello 😃 world"

# Comparison
assert s == "hello 😃 world"
assert "hello 😃 world" == s

# Hashing (can use in dict/set)
d = {s: "value"}
s2 = ministr("hello 😃 world")
print(d[s2])       # "value"

API

Constructor

  • ministr(obj) - creates a new Utf8String object from a string or str()-convertible object

Supported operations

  • len(s) - returns the number of Unicode codepoints
  • s[i] - returns character at index as a regular Python string
  • s[start:stop] - returns a new Utf8String with slice
  • str(s) - converts to regular Python string
  • repr(s) - string representation for debugging
  • hash(s) - hash value (cached)
  • s == other - comparison with other Utf8String or regular strings
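
As a rough illustration of how s[start:stop] can work on UTF-8 data, the sketch below maps codepoint indices to byte positions and then slices the bytes. The helper slice_utf8 is hypothetical and written in pure Python; it is not part of the ministring API, and the extension may handle edge cases differently.

```python
def slice_utf8(data: bytes, start=None, stop=None) -> bytes:
    """Return the UTF-8 bytes for codepoints [start:stop] of `data`."""
    # Every byte that is not a continuation byte (0b10xxxxxx) starts a codepoint.
    offsets = [i for i, b in enumerate(data) if b & 0xC0 != 0x80]
    offsets.append(len(data))  # sentinel: end of the last codepoint

    # slice.indices() clamps bounds and resolves negative indices
    # the same way built-in sequences do.
    lo, hi, _ = slice(start, stop).indices(len(offsets) - 1)
    if hi <= lo:
        return b""
    return data[offsets[lo]:offsets[hi]]


data = "hello 😃 world".encode("utf-8")
print(slice_utf8(data, 0, 5).decode("utf-8"))   # hello
print(slice_utf8(data, 6, 7).decode("utf-8"))   # 😃
print(slice_utf8(data, 8).decode("utf-8"))      # world
```

Because the slice is cut on codepoint boundaries, the result is always valid UTF-8 when the input is.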

Data Structure

typedef struct {
    PyObject_HEAD
    char *utf8_data;        // UTF-8 bytes
    Py_ssize_t utf8_size;   // size in bytes
    int32_t *offsets;       // offset table: codepoint → byte
    Py_ssize_t length;      // number of codepoints
    Py_hash_t hash;         // cached hash
} Utf8StringObject;
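
The struct can be modelled in a few lines of Python to show how the offset table turns a codepoint index into a byte position. This is a sketch under assumptions: build_offsets and char_at are hypothetical names, and array("i") merely stands in for the int32_t array.

```python
from array import array


def build_offsets(utf8_data: bytes) -> array:
    """Return the byte offset at which each codepoint starts."""
    offsets = array("i")  # signed int, standing in for int32_t
    for pos, byte in enumerate(utf8_data):
        # Continuation bytes look like 0b10xxxxxx; any other byte starts a codepoint.
        if byte & 0xC0 != 0x80:
            offsets.append(pos)
    return offsets


def char_at(utf8_data: bytes, offsets: array, index: int) -> str:
    """Decode the codepoint at `index` using two table lookups."""
    if index < 0:
        index += len(offsets)
    if not 0 <= index < len(offsets):
        raise IndexError("index out of range")
    start = offsets[index]
    # The codepoint ends where the next one starts, or at the end of the data.
    end = offsets[index + 1] if index + 1 < len(offsets) else len(utf8_data)
    return utf8_data[start:end].decode("utf-8")


data = "hello 😃 world".encode("utf-8")
table = build_offsets(data)
print(len(table))                # 13
print(char_at(data, table, 6))   # 😃
print(char_at(data, table, -1))  # d
```

Building the table is a single O(n) pass over the bytes; each lookup afterwards takes constant time regardless of how many multi-byte characters precede the index.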

Testing

Run tests with pytest:

pip install pytest
pytest -v

Limitations

⚠️ WARNING: This is an experimental prototype, not intended for production use!

  • Missing support for many string methods (find, replace, etc.)
  • May be slower than regular strings for some operations
  • No support for step slicing (s[::2])
  • Limited handling of invalid UTF-8
  • No optimizations for very long strings

Technical Details

C API

Core functions for working with Utf8String:

  • Utf8String_FromUTF8(data, size) - create from UTF-8 data
  • utf8_codepoint_count(data, size) - count codepoints
  • build_offset_table(self) - build offset table
  • utf8_char_length(first_byte) - determine UTF-8 character length
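
For readers who do not want to dig into the C source, here is a Python sketch of what the two UTF-8 helpers plausibly compute. The names mirror the C functions, but the bodies are an assumption based on the UTF-8 encoding rules, not a translation of the extension's code. In particular, raising ValueError on a bad lead byte is a choice made for this sketch.

```python
def utf8_char_length(first_byte: int) -> int:
    """Length in bytes of the UTF-8 sequence introduced by `first_byte`."""
    if first_byte < 0x80:            # 0xxxxxxx: ASCII
        return 1
    if first_byte & 0xE0 == 0xC0:    # 110xxxxx: 2-byte sequence
        return 2
    if first_byte & 0xF0 == 0xE0:    # 1110xxxx: 3-byte sequence
        return 3
    if first_byte & 0xF8 == 0xF0:    # 11110xxx: 4-byte sequence
        return 4
    raise ValueError(f"invalid UTF-8 lead byte: {first_byte:#04x}")


def utf8_codepoint_count(data: bytes) -> int:
    """Count codepoints by hopping from lead byte to lead byte."""
    count = 0
    pos = 0
    while pos < len(data):
        pos += utf8_char_length(data[pos])
        count += 1
    return count


print(utf8_codepoint_count("Hello 世界 🌍 Мир".encode("utf-8")))  # 14
```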

Architecture

  1. Data Storage: Original UTF-8 bytes are preserved unchanged
  2. Indexing: Offset table built on-demand for O(1) access
  3. Caching: Hash values cached for faster comparisons
  4. Compatibility: Implements the sequence and mapping protocols needed for len(), indexing, and slicing
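
The on-demand table and the cached hash can be sketched together in pure Python. LazyUtf8 is a hypothetical toy class, not the extension's implementation. Reusing hash() of the decoded str is an assumption made here so that equal objects hash equally.

```python
class LazyUtf8:
    """Toy model: UTF-8 bytes, a lazily built offset table, a cached hash."""

    def __init__(self, text: str):
        self._data = text.encode("utf-8")
        self._offsets = None   # built on first use
        self._hash = None      # computed on first hash() call

    def _ensure_offsets(self):
        if self._offsets is None:
            # Every non-continuation byte starts a codepoint.
            self._offsets = [i for i, b in enumerate(self._data)
                             if b & 0xC0 != 0x80]
        return self._offsets

    def __len__(self):
        return len(self._ensure_offsets())

    def __hash__(self):
        if self._hash is None:
            self._hash = hash(self._data.decode("utf-8"))
        return self._hash

    def __eq__(self, other):
        if isinstance(other, LazyUtf8):
            return self._data == other._data
        if isinstance(other, str):
            return self._data == other.encode("utf-8", "surrogatepass")
        return NotImplemented


s = LazyUtf8("hello 😃 world")
d = {s: "value"}                        # hashing does not need the offset table
print(s._offsets is None)               # True
print(len(s))                           # 13
print(d[LazyUtf8("hello 😃 world")])    # value
```

Deferring the table means strings that are only stored, compared, or hashed never pay for it.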

Usage Examples

Working with Emojis

s = ministr("Hello 👋 world 🌍!")
print(f"Length: {len(s)}")           # Length: 16
print(f"Emojis: {s[6]}, {s[14]}")    # Emojis: 👋, 🌍

Multi-language Text Processing

s = ministr("Hello 世界 🌍 Мир")
print(f"English: {str(s[0:5])}")     # Hello
print(f"Chinese: {str(s[6:8])}")     # 世界
print(f"Emoji: {s[9]}")              # 🌍
print(f"Russian: {str(s[11:14])}")   # Мир

Performance

# Creating many strings with emojis
texts = [ministr(f"Text {i} 😀") for i in range(1000)]
text_set = set(texts)  # Each hash is computed once, then cached for later lookups

License

Experimental code for educational purposes.
