Pure-Python Reed Solomon encoder/decoder

## Project description

A pure-python universal errors-and-erasures Reed-Solomon Codec , based on the wonderful tutorial at wikiversity, written by “Bobmath” and “LRQ3000”.

<nav class="contents local" id="table-of-contents" role="doc-toc">

</nav>

## Installation

pip install --upgrade reedsolo

## Usage

### Basic usage with high-level RSCodec class

# Initialization
>>> from reedsolo import RSCodec
>>> rsc = RSCodec(10)  # 10 ecc symbols

# Encoding
>>> rsc.encode([1,2,3,4])
b'\x01\x02\x03\x04,\x9d\x1c+=\xf8h\xfa\x98M'
>>> rsc.encode(bytearray([1,2,3,4]))
bytearray(b'\x01\x02\x03\x04,\x9d\x1c+=\xf8h\xfa\x98M')
>>> rsc.encode(b'hello world')
b'hello world\xed%T\xc4\xfd\xfd\x89\xf3\xa8\xaa'
# Note that chunking is supported transparently to encode any string length.

# Decoding (repairing)
>>> rsc.decode(b'hello world\xed%T\xc4\xfd\xfd\x89\xf3\xa8\xaa')[0]
b'hello world'
>>> rsc.decode(b'heXlo worXd\xed%T\xc4\xfdX\x89\xf3\xa8\xaa')[0]     # 3 errors
b'hello world'
>>> rsc.decode(b'hXXXo worXd\xed%T\xc4\xfdX\x89\xf3\xa8\xaa')[0]     # 5 errors
b'hello world'
>>> rsc.decode(b'hXXXo worXd\xed%T\xc4\xfdXX\xf3\xa8\xaa')[0]        # 6 errors - fail
Traceback (most recent call last):
...
reedsolo.ReedSolomonError: Too many (or few) errors found by Chien Search for the errata locator polynomial!

Important upgrade notice for pre-1.0 users: Note that RSCodec.decode() returns 3 variables:

1. the decoded (corrected) message

2. the decoded message and error correction code (which is itself also corrected)

3. and the list of positions of the errata (errors and erasures)

Here is an example:

>>> tampered_msg = b'heXlo worXd\xed%T\xc4\xfdX\x89\xf3\xa8\xaa'
>>> decoded_msg, decoded_msgecc, errata_pos = rsc.decode(tampered_msg)
>>> print(decoded_msg)  # decoded/corrected message
bytearray(b'hello world')
>>> print(decoded_msgecc)  # decoded/corrected message and ecc symbols
bytearray(b'hello world\xed%T\xc4\xfd\xfd\x89\xf3\xa8\xaa')
>>> print(errata_pos)  # errata_pos is returned as a bytearray, hardly intelligible
bytearray(b'\x10\t\x02')
>>> print(list(errata_pos))  # convert to a list to get the errata positions as integer indices
[16, 9, 2]

Since we failed to decode with 6 errors with a codec set to 10 error correction code (ecc) symbols, let’s try to use a bigger codec, with 12 ecc symbols.

>>> rsc = RSCodec(12)  # using 2 more ecc symbols (to correct max 6 errors or 12 erasures)
>>> rsc.encode(b'hello world')
b'hello world?Ay\xb2\xbc\xdc\x01q\xb9\xe3\xe2='
>>> rsc.decode(b'hello worXXXXy\xb2XX\x01q\xb9\xe3\xe2=')[0]         # 6 errors - ok, but any more would fail
b'hello world'
>>> rsc.decode(b'helXXXXXXXXXXy\xb2XX\x01q\xb9\xe3\xe2=', erase_pos=[3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 15, 16])[0]  # 12 erasures - OK
b'hello world'

This shows that we can decode twice as many erasures (where we provide the location of errors ourselves) than errors (with unknown locations). This is the cost of error correction compared to erasure correction.

To get the maximum number of errors OR erasures that can be independently corrected (ie, not simultaneously):

>>> maxerrors, maxerasures = rsc.maxerrata(verbose=True)
This codec can correct up to 6 errors and 12 erasures independently
>>> print(maxerrors, maxerasures)
6 12

To get the maximum number of errors AND erasures that can be simultaneously corrected, you need to specify the number of errors or erasures you expect:

>>> maxerrors, maxerasures = rsc.maxerrata(erasures=6, verbose=True)  # we know the number of erasures, will calculate how many errors we can afford
This codec can correct up to 3 errors and 6 erasures simultaneously
>>> print(maxerrors, maxerasures)
3 6
>>> maxerrors, maxerasures = rsc.maxerrata(errors=5, verbose=True)  # we know the number of errors, will calculate how many erasures we can afford
This codec can correct up to 5 errors and 2 erasures simultaneously
>>> print(maxerrors, maxerasures)
5 2

Note that if a chunk has more errors and erasures than the Singleton Bound as calculated by the maxerrata() method, the codec will try to raise a ReedSolomonError exception, but may very well not detect any error either (this is a theoretical limitation of error correction codes). In other words, error correction codes are unreliable to detect if a chunk of a message is corrupted beyond the Singleton Bound. If you want more reliability in errata detection, use a checksum or hash such as SHA or MD5 on your message, these are much more reliable and have no bounds on the number of errata (the only potential issue is with collision but the probability is very very low).

To check if a message is tampered given its error correction symbols, without decoding, use the check() method:

# Checking
>> rsc.check(b'hello worXXXXy\xb2XX\x01q\xb9\xe3\xe2=')  # Tampered message will return False
[False]
>> rmes, rmesecc, errata_pos = rsc.decode(b'hello worXXXXy\xb2XX\x01q\xb9\xe3\xe2=')
>> rsc.check(rmesecc)  # Corrected or untampered message will return True
[True]
>> print('Number of detected errors and erasures: %i, their positions: %s' % (len(errata_pos), list(errata_pos)))
Number of detected errors and erasures: 6, their positions: [16, 15, 12, 11, 10, 9]

By default, most Reed-Solomon codecs are limited to characters that can be encoded in 256 bits and with a length of maximum 256 characters. But this codec is universal, you can reduce or increase the length and maximum character value by increasing the Galois Field:

# To use longer chunks or bigger values than 255 (may be very slow)
>> rsc = RSCodec(12, nsize=4095)  # always use a power of 2 minus 1
>> rsc = RSCodec(12, c_exp=12)  # alternative way to set nsize=4095
>> mes = 'a' * (4095-12)
>> mesecc = rsc.encode(mes)
>> mesecc[2] = 1
>> mesecc[-1] = 1
>> rmes, rmesecc, errata_pos = rsc.decode(mesecc)
>> rsc.check(mesecc)
[False]
>> rsc.check(rmesecc)
[True]

Note that the RSCodec class supports transparent chunking, so you don’t need to increase the Galois Field to support longer messages, but characters will still be limited to 256 bits (or whatever field you set with c_exp).

If you want full control, you can skip the API and directly use the library as-is. Here’s how:

First you need to init the precomputed tables:

>> import reedsolo as rs
>> rs.init_tables(0x11d)

Pro tip: if you get the error: ValueError: byte must be in range(0, 256), please check that your prime polynomial is correct for your field. Pro tip2: by default, you can only encode messages of max length and max symbol value = 256. If you want to encode bigger messages, please use the following (where c_exp is the exponent of your Galois Field, eg, 12 = max length 2^12 = 4096):

>> prim = rs.find_prime_polys(c_exp=12, fast_primes=True, single=True)
>> rs.init_tables(c_exp=12, prim=prim)

Let’s define our RS message and ecc size:

>> n = 255  # length of total message+ecc
>> nsym = 12  # length of ecc
>> mes = "a" * (n-nsym)  # generate a sample message

To optimize, you can precompute the generator polynomial:

>> gen = rs.rs_generator_poly_all(n)

Then to encode:

>> mesecc = rs.rs_encode_msg(mes, nsym, gen=gen[nsym])

Let’s tamper our message:

>> mesecc[1] = 0

To decode:

>> rmes, recc, errata_pos = rs.rs_correct_msg(mesecc, nsym, erase_pos=erase_pos)

Note that both the message and the ecc are corrected (if possible of course). Pro tip: if you know a few erasures positions, you can specify them in a list erase_pos to double the repair power. But you can also just specify an empty list.

You can check how many errors and/or erasures were corrected, which can be useful to design adaptive bitrate algorithms:

>> print('A total of %i errata were corrected over all chunks of this message.' % len(errata_pos))

If the decoding fails, it will normally automatically check and raise a ReedSolomonError exception that you can handle. However if you want to manually check if the repaired message is correct, you can do so:

>> rs.rs_check(rmes + recc, nsym)

Note: if you want to use multiple reedsolomon with different parameters, you need to backup the globals and restore them before calling reedsolo functions:

>> rs.init_tables()
>> global gf_log, gf_exp, field_charac
>> bak_gf_log, bak_gf_exp, bak_field_charac = gf_log, gf_exp, field_charac

Then at anytime, you can do:

>> global gf_log, gf_exp, field_charac
>> gf_log, gf_exp, field_charac = bak_gf_log, bak_gf_exp, bak_field_charac
>> mesecc = rs.rs_encode_msg(mes, nsym)
>> rmes, recc, errata_pos = rs.rs_correct_msg(mesecc, nsym)

The globals backup is not necessary if you use RSCodec, it will be automatically managed.

Read the sourcecode’s comments for more info about how it works, and for the various parameters you can setup if you need to interface with other RS codecs.

## Extended description

The code of wikiversity is here consolidated into a nice API with exceptions handling. The algorithm can correct up to 2*e+v <= nsym, where e is the number of errors, v the number of erasures and nsym = n-k = the number of ECC (error correction code) symbols. This means that you can either correct exactly floor(nsym/2) errors, or nsym erasures (errors where you know the position), and a combination of both errors and erasures. This is called the Singleton Bound, and is the maximum/optimal theoretical number of erasures and errors any error correction algorithm can correct (although there are experimental approaches to go a bit further, named list decoding, not implemented here, but feel free to do pull request!). The code should work on pretty much any reasonable version of python (2.4-3.7), but I’m only testing on 2.7 and 3.7. Python 3.8 should work except for Cython which is currently incompatible with this version.

The codec has quite reasonable performances if you either use PyPy on the pure-python implementation (reedsolo.py) or either if you compile the Cython extension creedsolo.pyx (which is about 2x faster than PyPy). You can expect encoding rates of several MB/s.

This library is also thoroughly unit tested so that nearly any encoding/decoding case should be covered.

The codec is universal, meaning that it can decode any message encoded by another RS encoder as long as you provide the correct parameters. Note however that if you use higher fields (ie, bigger c_exp), the algorithms will be slower, first because we cannot then use the optimized bytearray() structure but only array.array(‘i’, …), and also because Reed-Solomon’s complexity is quadratic (both in encoding and decoding), so this means that the longer your messages, the longer it will take to encode/decode (quadratically!).

The algorithm itself can handle messages of a length up to (2^c_exp)-1 symbols per message (or chunk), including the ECC symbols, and each symbol can have a value of up to (2^c_exp)-1 (indeed, both the message length and the maximum value for one character is constrained by the same mathematical reason). By default, we use the field GF(2^8), which means that you are limited to values between 0 and 255 (perfect to represent a single hexadecimal symbol on computers, so you can encode any binary stream) and limited to messages+ecc of maximum length 255. However, you can “chunk” longer messages to fit them into the message length limit. The RSCodec class will automatically apply chunking, by splitting longer messages into chunks and encode/decode them separately; it shouldn’t make a difference from an API perspective (ie, from your POV).

To use the Cython implementation, you need to pip install cython and a C++ compiler (Microsoft Visual C++ 14.0 for Windows and Python 3.7). Then you can simply cd to the root of the folder where creedsolo.pyx is, and type python setup.py build_ext --inplace. Alternatively, you can generate just the C++ code by typing cython -3 creedsolo.pyx. When building a distributable egg or installing the module from source, the Cython module will be automatically transpiled and compiled if both Cython and a C compiler are installed. This behavior can be modified using the --nocython and --native-compile arguments for setup.py.

## Authors

This module was conceived and developed by Tomer Filiba.

It was further extended and is currently maintained by Stephen Karl Larroque.

This software is released to the Public Domain.

If the Public Domain is not adequate for your purpose, you can instead consider this module under the MIT License as you prefer.

## Project details

Uploaded source
Uploaded cp37