Python implementation of the Codex decoder.

Development Status
- 5 - Production/Stable
Operating System
- OS Independent
Programming Language
- Python :: 3
Topic
- File Formats
- System :: Archiving :: Compression

Project description

Codex

Codex is an archive format for storing and distributing large-scale collections of articles. It was designed for the Omnipedia app, an offline Wikipedia reader for iOS.

The Codex format has four notable features:

- Block compression. Articles are grouped together and compressed in "blocks" to balance compression ratio with random access speed.
- File sharding. Archives can be split across many shards (separate files) to allow for partial downloading over unstable connections.
- Isolated title index. Article titles are stored separately from articles to allow for rapid title searches without decompressing the articles themselves.
- Incremental updates. Infrastructure is in place to permit incremental updates to articles.
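
To make the block compression trade-off concrete, here is a minimal sketch using Python's zlib module. It is not the Codex encoder: the function names (`compress_in_blocks`, `read_article`), the null-byte separator between articles, and the block size are hypothetical choices made purely for illustration.

```python
import zlib

def compress_in_blocks(articles, block_size):
    """Group articles into blocks of `block_size` and compress each block separately."""
    blocks = []
    for start in range(0, len(articles), block_size):
        group = articles[start:start + block_size]
        # Hypothetical layout: articles within a block are separated by a null byte.
        payload = b"\x00".join(a.encode("utf-8") for a in group)
        blocks.append(zlib.compress(payload))
    return blocks

def read_article(blocks, block_size, article_number):
    """Decompress only the block that holds the requested article."""
    block_number, position = divmod(article_number, block_size)
    payload = zlib.decompress(blocks[block_number])
    return payload.split(b"\x00")[position].decode("utf-8")

articles = [f"Article {i}: " + "lorem ipsum " * 20 for i in range(100)]
blocks = compress_in_blocks(articles, block_size=10)
print(read_article(blocks, 10, 42)[:11])
```

Larger blocks give the compressor more shared context and usually a better ratio, but every lookup then has to decompress more data to reach a single article.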

This repository provides a reference implementation of the Codex decoder (in Python) and describes the Codex format.

CodexPyDec

CodexPyDec requires Python 3.10 or greater. It has no required dependencies unless you want to access an archive compressed with LZFSE, in which case pyliblzfse is required.

Installation

CodexPyDec is available on the Python Package Index and can be installed using pip:

pip install codexpydec

To install alongside pyliblzfse for LZFSE compression support, use:

pip install codexpydec[lzfse]

Usage example

In your script or Python shell, import CodexDecoder from the codexpydec package:

from codexpydec import CodexDecoder

Load an archive by specifying the path to the Codex archive on your machine. Do not include a shard number or file extension – these are appended automatically.

my_archive = CodexDecoder("path/to/my_archive")

If the archive loaded successfully, you will be able to read the library metadata:

print(my_archive.library_id)
print(my_archive.library_name)
print(my_archive.library_license)
print(my_archive.library_version)
print(my_archive.n_catalog_entries)
print(my_archive.n_library_entries)

Catalog search

Use the search_catalog() method to perform a search of the article titles:

search_results = my_archive.search_catalog("united nations")
print(search_results)

search_catalog() returns a list of tuples. Each tuple holds the catalog entry title (which may include redirect information) and an entry number that points to the article content. For example:

[
	('United Nations', 1231),
	('United nations\x00United Nations', 1231),
	...
	('United Nations Trusteeship Council', 135068)
]

The results above tell us that there is an article titled "United Nations" located at entry number 1231, an article titled "United Nations Trusteeship Council" located at entry number 135068, and a catalog redirect for "United nations" (with a lowercase N) that redirects to the "United Nations" article.

Optionally, you can cap the number of results and remove redirects using the max_results and include_redirects arguments:

search_results = my_archive.search_catalog("united nations", max_results=10, include_redirects=False)
print(search_results)
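
If you keep redirects in the results, you may want to separate them from direct entries. The following is a minimal sketch that assumes the result format shown above, where a redirect title contains a null character between the redirect-from and redirect-to titles. `split_results` is a hypothetical helper, not part of CodexPyDec.

```python
def split_results(search_results):
    """Separate direct catalog entries from redirects."""
    articles, redirects = [], []
    for title, entry_number in search_results:
        if "\x00" in title:
            # Redirect entry: "from-title\x00to-title"
            redirect_from, redirect_to = title.split("\x00", 1)
            redirects.append((redirect_from, redirect_to, entry_number))
        else:
            articles.append((title, entry_number))
    return articles, redirects

# Sample data copied from the example output above.
sample = [
    ("United Nations", 1231),
    ("United nations\x00United Nations", 1231),
    ("United Nations Trusteeship Council", 135068),
]
articles, redirects = split_results(sample)
print(articles)
print(redirects)
```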

Retrieving articles

There are three methods to retrieve an article; which one to use depends on what information you have in advance. If you only know the title, you can use the get_article_by_title() method:

article = my_archive.get_article_by_title("United Nations")
print(article)

If you only know the entry number, you can use the get_article_by_entry_number() method:

article = my_archive.get_article_by_entry_number(1231)
print(article)

If you know both the title and entry number, you can use the get_article() method:

article = my_archive.get_article("United Nations", 1231)
print(article)

get_article() and get_article_by_title() return the full article, including the title and footer, whereas get_article_by_entry_number() returns only the main article body. get_article_by_title() is slower because it must first search the catalog to establish the entry number. get_article_by_entry_number() is faster but omits the title and footer. get_article() returns the full article, but you need to know both the title and the entry number in advance.

Exporting articles

Once you've extracted an article, you can save it as a Markdown file like so:

with open("output_directory/article.md", "w", encoding="utf-8") as file:
	file.write(article)

To extract all articles from the archive, you can iterate over the archive and save each article like so:

for entry_number, article in my_archive:
	with open(f"output_directory/{entry_number}.md", "w", encoding="utf-8") as file:
		file.write(article)

The complete set of decompressed articles will typically be around three times larger than the compressed archive. Extracting millions of articles is likely to take multiple hours.

To extract all articles based on some search query, you can use a script such as the following:

search_results = my_archive.search_catalog("united nations", max_results=10, include_redirects=False)
for (article_title, entry_number) in search_results:
	article = my_archive.get_article(article_title, entry_number)
	if article is not None:
		with open(f"output_directory/{entry_number}.md", "w", encoding="utf-8") as file:
			file.write(article)

The Codex Format

A Codex archive consists of three main parts: the header, the catalog, and the library. The header can be further broken down into four parts: the header proper, the inventory, the catalog index, and the library index. The catalog and library consist of some number of catalog blocks and library blocks. Each catalog and library block consists of a block index and a block payload.

Archives can be split across multiple files and each file must be named with its shard number (e.g. my_archive.017.codex). Shard numbers (as presented in the filename) are always three digits and left-padded with zeros. Shard 000 is always the "header shard" – the shard that contains the header. Optionally, the catalog and library can also be placed in the header shard resulting in a single file.
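
The naming rule can be sketched as follows. `shard_path` is a hypothetical helper written for this example, not a CodexPyDec function. The upper bound of 255 follows from shard numbers being 8-bit integers.

```python
from pathlib import Path

def shard_path(archive_path, shard_number):
    """Return the file path of a shard, e.g. my_archive.017.codex."""
    if not 0 <= shard_number <= 255:
        raise ValueError("shard number must be between 0 and 255")
    # Three digits, left-padded with zeros, followed by the .codex extension.
    return Path(f"{archive_path}.{shard_number:03d}.codex")

print(shard_path("path/to/my_archive", 0).name)
print(shard_path("path/to/my_archive", 17).name)
```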

Header

The header contains general metadata about the contents of the archive as well as various counts, pointers, and indexes for navigating the rest of the archive.

The header proper is 256 bytes in length and consists of a mix of strings and integers as described in the table below. Strings are always UTF-8 encoded and right-padded with null bytes. Integers are always unsigned and little-endian, but they vary in bit-length. The ranges and lengths in the following table are expressed in bytes.

Field	Length	Range	Notes
File signature	4	0–4	String. Always set to `CODX` (hex: `43 4f 44 58`).
Major schema version	1	4–5	8-bit integer. Major schema versions indicate breaking changes to the schema.
Minor schema version	1	5–6	8-bit integer. Minor schema versions indicate backward-compatible changes to the schema.
Compression algorithm	4	6–10	String. Compression algorithm used to compress blocks, typically set to `ZLIB`, `LZ4`, `LZMA`, or `LZFS`. Codex does not mandate any particular compression format; however, the Python decoder only decodes ZLIB-compressed archives by default, plus LZFSE-compressed archives when pyliblzfse is installed.
Library ID	8	10–18	String. Unique identifier for the library that remains fixed across library versions.
Library name	64	18–82	String. Descriptive library name.
Library license	128	82–210	String. Copyright and licensing information.
Library version	4	210–214	32-bit integer. Library version number, typically set to the snapshot date in YYYYMMDD format.
Patched version	4	214–218	32-bit integer. If set to 0, the archive is a regular archive with full articles. If greater than 0, the archive contains diffs patching the specified version number.
N catalog entries	4	218–222	32-bit integer. Number of entries contained in the catalog. Must be > 0 and is limited to ~4.29 billion.
N library entries	4	222–226	32-bit integer. Number of articles contained in the library. Must be > 0 and is limited to ~4.29 billion.
N catalog blocks	2	226–228	16-bit integer. Number of compression blocks that the catalog is divided into. Must be > 0 and is limited to 65,535.
N library blocks	2	228–230	16-bit integer. Number of compression blocks that the library is divided into. Must be > 0 and is limited to 65,535.
N catalog shards	1	230–231	8-bit integer. Number of shards that the catalog blocks are distributed over. If 0, the catalog is contained in the header shard. Otherwise, the decoder expects to find additional files with a zero-padded shard number at the end of the file name (e.g. `my_archive.001.codex`). Catalog shards (if any) are numbered from 001 to 00N in the filename (since 000 is reserved for the header shard).
N library shards	1	231–232	8-bit integer. Number of shards that the library blocks are distributed over. If 0, the library is contained in the header shard. Otherwise, the decoder expects to find additional files with a zero-padded shard number at the end of the file name (e.g. `my_archive.002.codex`). Library shards (if any) are numbered sequentially after the catalog shard numbers. The total number of shards – header plus catalog plus library – cannot exceed 256.
Inventory pointer	8	232–240	64-bit integer. Byte offset of the inventory.
Catalog index pointer	8	240–248	64-bit integer. Byte offset of the catalog index.
Library index pointer	8	248–256	64-bit integer. Byte offset of the library index.
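
As a rough illustration of the layout in the table, the header proper can be unpacked with Python's struct module. This is a sketch, not the CodexPyDec implementation: `HEADER_FORMAT`, `HEADER_FIELDS`, and `parse_header` are hypothetical names, and the three pointers are read as 8-byte unsigned integers because the table gives each of them a length of 8 bytes.

```python
import struct

# "<" means little-endian with no padding; field order follows the header table.
HEADER_FORMAT = "<4sBB4s8s64s128sIIIIHHBBQQQ"
HEADER_FIELDS = (
    "signature", "major_version", "minor_version", "compression",
    "library_id", "library_name", "library_license",
    "library_version", "patched_version",
    "n_catalog_entries", "n_library_entries",
    "n_catalog_blocks", "n_library_blocks",
    "n_catalog_shards", "n_library_shards",
    "inventory_pointer", "catalog_index_pointer", "library_index_pointer",
)

def parse_header(data):
    """Unpack the first 256 bytes of a header shard into a dict."""
    if len(data) < 256:
        raise ValueError("header must be at least 256 bytes")
    values = struct.unpack(HEADER_FORMAT, data[:256])
    if values[0] != b"CODX":
        raise ValueError("not a Codex archive")
    # Strings are UTF-8 and right-padded with null bytes.
    return {
        name: value.rstrip(b"\x00").decode("utf-8") if isinstance(value, bytes) else value
        for name, value in zip(HEADER_FIELDS, values)
    }

# Synthetic header built for this example only (all values are made up).
sample = struct.pack(
    HEADER_FORMAT, b"CODX", 1, 0, b"ZLIB", b"example", b"Example Library",
    b"CC BY-SA", 20250927, 0, 2, 1, 1, 1, 0, 0, 256, 300, 310,
)
header = parse_header(sample)
print(header["library_name"], header["library_version"])
```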

Inventory

The inventory is a chunk of compressed data of variable length stored in the header shard immediately after the 256 bytes described above. It holds a list of persistent article IDs (32-bit integers) that are only used during archive updates. The inventory is immediately followed by the catalog index.

Catalog index

The catalog index is a chunk of uncompressed data stored in the header shard immediately after the inventory. It specifies the location of each catalog block using 5 bytes. The first byte is the shard number (8-bit integer) and the remaining 4 bytes are the file offset within the shard (32-bit integer). The catalog index consists of N + 1 items, where N is the number of catalog blocks, and is therefore (N + 1) × 5 bytes in length. The extra item at the end of the catalog index specifies the byte position of the end of the last catalog block. Thus, any two consecutive index items specify the start and end position of a catalog block.
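
The index layout described above could be read along these lines. `parse_block_index` is a hypothetical helper. How the end of a block is determined when the next index item points into a different shard is not stated here, so this sketch simply reports `None` in that case, which is an assumption.

```python
import struct

def parse_block_index(data, n_blocks):
    """Turn (n_blocks + 1) five-byte items into (shard, start, end) tuples, one per block."""
    expected = (n_blocks + 1) * 5
    if len(data) < expected:
        raise ValueError(f"index must be at least {expected} bytes")
    # Each item: 1-byte shard number followed by a 4-byte little-endian offset.
    items = [struct.unpack_from("<BI", data, i * 5) for i in range(n_blocks + 1)]
    blocks = []
    for (shard, start), (next_shard, next_start) in zip(items, items[1:]):
        # Assumption: the next item marks the end only if it is in the same shard.
        end = next_start if next_shard == shard else None
        blocks.append((shard, start, end))
    return blocks

# Synthetic index for two blocks in shard 0 (offsets are made up).
index = b"".join(struct.pack("<BI", s, o) for s, o in [(0, 300), (0, 450), (0, 700)])
block_locations = parse_block_index(index, 2)
print(block_locations)
```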

Library index

The library index has the same format as the catalog index, with N = the number of library blocks.

Catalog and catalog blocks

The catalog is a concatenation of N catalog blocks, which may be spread across multiple shards. Splitting across shards always occurs at block boundaries.

Each catalog block is a chunk of compressed data of variable length. Once decompressed, the catalog block consists of two parts: the block index and the block payload. The first 4 bytes of the block index state how many entries are contained in the block – the entry count, N. The remainder of the index, which will be N × 4 bytes in length, gives the start position of each block entry.
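
A block with this layout could be split into its entries roughly as follows. This sketch assumes a ZLIB-compressed block, little-endian 32-bit integers as in the header, and start positions that are measured from the beginning of the payload; that last point is an assumption, since the text above does not say what the positions are relative to. `split_block` is a hypothetical name.

```python
import struct
import zlib

def split_block(compressed):
    """Decompress a block and slice its payload into individual entries."""
    raw = zlib.decompress(compressed)
    (n_entries,) = struct.unpack_from("<I", raw, 0)       # entry count N
    starts = struct.unpack_from(f"<{n_entries}I", raw, 4)  # N start positions
    payload = raw[4 + 4 * n_entries:]
    # Assumption: each entry runs from its start to the next start (or the payload end).
    ends = starts[1:] + (len(payload),)
    return [payload[s:e] for s, e in zip(starts, ends)]

# Synthetic block built for this example only.
entries = [b"alpha", b"be", b"gamma!"]
raw_block = struct.pack("<I", 3) + struct.pack("<3I", 0, 5, 7) + b"".join(entries)
block = zlib.compress(raw_block)
print(split_block(block))
```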

A catalog entry, which is variable in length, consists of two parts. The first 4 bytes are a 32-bit integer (the "entry number"). This entry number can be translated into a shard–block–article address for lookup of the associated article. The remainder of the catalog entry is a UTF-8 encoded string (typically an article title). If the string contains a null byte, the entry is a "redirection entry." The string should be split at the null byte to yield two substrings: a redirect-from title and a redirect-to title. Redirections make it possible for multiple catalog entries (e.g. "UN" and "United Nations") to point to the same article.
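
Decoding a single catalog entry might look like the sketch below. `parse_catalog_entry` is a hypothetical helper, and the entry number is assumed to be little-endian like the integers in the header.

```python
import struct

def parse_catalog_entry(entry):
    """Return (entry_number, title, redirect_target) for one catalog entry."""
    (entry_number,) = struct.unpack_from("<I", entry, 0)
    text = entry[4:].decode("utf-8")
    # A null byte separates the redirect-from title from the redirect-to title.
    title, _, target = text.partition("\x00")
    return entry_number, title, target or None

# Synthetic entries built for this example only.
plain = struct.pack("<I", 1231) + "United Nations".encode("utf-8")
redirect = struct.pack("<I", 1231) + "UN\x00United Nations".encode("utf-8")
print(parse_catalog_entry(plain))
print(parse_catalog_entry(redirect))
```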

Catalog entries are arranged in case-insensitive lexicographic order.
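
Because the entries are sorted, a title prefix can be located with a binary search rather than a linear scan. The sketch below shows the idea on an in-memory list. It is not how CodexPyDec necessarily searches, and `str.casefold` is only an assumed stand-in for whatever case-insensitive comparison the format actually uses. `prefix_search` is a hypothetical name.

```python
from bisect import bisect_left

def prefix_search(sorted_titles, query, max_results=10):
    """Find titles starting with `query` in a case-insensitively sorted list."""
    folded_titles = [title.casefold() for title in sorted_titles]
    folded_query = query.casefold()
    # Binary search for the first title that is not less than the query.
    start = bisect_left(folded_titles, folded_query)
    results = []
    for position in range(start, len(sorted_titles)):
        if len(results) >= max_results:
            break
        if not folded_titles[position].startswith(folded_query):
            break  # sorted order means no later title can match
        results.append(sorted_titles[position])
    return results

titles = sorted(
    ["Uruguay", "United Nations Trusteeship Council", "UN",
     "united nations", "United Arab Emirates"],
    key=str.casefold,
)
matches = prefix_search(titles, "United Nations")
print(matches)
```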

Library and library blocks

The library has the same basic structure as the catalog, except that the payload of each library block is a concatenation of articles. Articles are arranged in arbitrary order. Article 0 – the first article in the first block of the first shard – is special. It is the footer text that is automatically appended to the bottom of each article.

License

© 2025 Recursive Ink Ltd. CodexPyDec is licensed under the terms of the GNU General Public License version 3 (GPLv3). By submitting a pull request you represent that your contribution can be licensed under GPLv3.

Release history

- 1.2.0: May 21, 2026
- 1.1.2: Oct 11, 2025
- 1.1.1: Oct 7, 2025
- 1.1.0: Sep 27, 2025 (this version)
- 1.0: Jun 16, 2025
