CollateX is software to
- read multiple (≥ 2) versions of a text, splitting each version into parts (tokens) to be compared,
- identify similarities of and differences between the versions (including moved/transposed segments) by aligning tokens, and
- output the alignment results in a variety of formats for further processing, for instance to support the production of a critical apparatus or the stemmatical analysis of a text’s genesis.
Features:
- Non-progressive multiple sequence alignment
- Multiple output formats: alignment table, variant graph
- Near matching (optional)
- Supports Python 3
- Supports Unicode (Python 3 only)
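As a rough illustration of the tokenization step described above, the following sketch splits a witness on whitespace and keeps two values per token: the original text `t` (trailing whitespace retained) and a normalized form `n` (whitespace stripped). The function name `simple_tokenize` is hypothetical, and this is not CollateX's own tokenizer; it only mirrors the t/n behaviour mentioned in the release notes.

```python
import re


def simple_tokenize(text):
    """Split text into tokens with an original ("t") and a normalized ("n") form.

    Hypothetical helper for illustration only, not part of the CollateX API.
    """
    # \S+ matches a run of non-whitespace characters;
    # \s* keeps any whitespace that follows it in the same token.
    return [{"t": piece, "n": piece.strip()} for piece in re.findall(r"\S+\s*", text)]


tokens = simple_tokenize("The quick brown fox")
print([token["n"] for token in tokens])
```

Comparing on `n` while keeping `t` means that differences in spacing do not count as variants, while the original text remains available for output.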
How to install:
sudo pip3 install collatex
If you don’t have pip installed, install it first with:
sudo easy_install3 pip
Near matching requires the python-levenshtein C library.
On Mac OS X and Linux, install it with:
sudo pip3 install python-levenshtein
Windows users need a precompiled binary distribution of this library if they want to use near matching.
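Near matching pairs tokens that are similar but not identical. The idea can be sketched with a pure-Python edit distance, used here in place of the python-levenshtein library. The helpers `edit_distance` and `nearest` are hypothetical names for illustration, not CollateX API, and CollateX's own near matching may weigh candidates differently.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions each cost 1."""
    # prev holds the distances between the previous prefix of a and every prefix of b.
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,         # delete char_a
                curr[j - 1] + 1,     # insert char_b
                prev[j - 1] + cost,  # substitute, or keep on a match
            ))
        prev = curr
    return prev[-1]


def nearest(token, candidates):
    """Return the candidate with the smallest edit distance to token."""
    return min(candidates, key=lambda candidate: edit_distance(token, candidate))


print(nearest("colour", ["color", "collar", "dog"]))
```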
from collatex import *
collation = Collation()
collation.add_plain_witness("A", "The quick brown fox jumps over the dog.")
collation.add_plain_witness("B", "The brown fox jumps over the lazy dog.")
alignment_table = collate(collation)
Call print(alignment_table) to show the results.
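The example above can be wrapped in a small helper that also selects an output format. This is a sketch under assumptions: `collate_witnesses` and `KNOWN_OUTPUTS` are hypothetical names, and only "svg", "svg_simple" and "html2" are confirmed as output values by the release notes; the remaining names are assumed to be what the collate function accepts.

```python
# Output names: "svg", "svg_simple" and "html2" appear in the release notes;
# the others are assumptions about the values collate() accepts.
KNOWN_OUTPUTS = ("table", "json", "html", "html2", "svg", "svg_simple", "xml", "tei")


def collate_witnesses(witnesses, output="table"):
    """Collate a mapping of sigil -> plain text and return the result.

    Hypothetical convenience wrapper, not part of the CollateX API.
    """
    if output not in KNOWN_OUTPUTS:
        raise ValueError("unknown output format: %r" % (output,))
    # Imported lazily so the module can be loaded without collatex installed.
    from collatex import Collation, collate

    collation = Collation()
    for sigil, text in witnesses.items():
        collation.add_plain_witness(sigil, text)
    return collate(collation, output=output)


# Example call (requires collatex to be installed):
# print(collate_witnesses({
#     "A": "The quick brown fox jumps over the dog.",
#     "B": "The brown fox jumps over the lazy dog.",
# }))
```

What collate returns depends on the chosen output, so check the result before processing it further. The SVG outputs additionally need the graphviz Python library, as noted in the release notes.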
When running from the command shell run the example script with:
- Added the ability to use output="svg_simple" next to output="svg". The "svg_simple" option gets you the n-property-based graph, so just the normalized version of the tokens, which will hide any variation in the t-property. Thanks to Joris van Zundert for the patch.
- Changed the colour scheme of the “html2” output option, to aid those with red/green colour-blindness. Thanks to Melodee H. Beals for the patch.
- Bug fix release
- Fixed a bug in the new near match functionality that would cause tokens to go missing in the alignment table. Thanks to Torsten Hiltmann for reporting it and providing a test case.
- Official release for the Dixit code and collation workshop in Amsterdam
- This release contains the new near match functionality implemented as a post process after alignment. Same as RC1.
- It also contains the multiple short witnesses bugfix done in 2.0.1
- Bug fix release for the Dixit code and collation workshop in Amsterdam
- Fixed index out of range bug when multiple very short witnesses (= one token) were collated
- Disabled debug statements for near matching
- New near match functionality, implemented as a post process after alignment.
- First official release for the Dixit code and collation workshop in Amsterdam
- Added XML as an output format
- Added TEI parallel segmentation as an output format
- Tokenizer: retain whitespace in the t-property of preceding token
- Witness: added normalization: strips whitespace
- Merged old collate_pretokenized_json() function into collate() function
- JSON output contains full JSON representation of the tokens
- Enabled segmentation support for all input formats, and for SVG output
- Enhanced SVG output to include “n” value and all “t” values of JSON input
- JSON output is raw Unicode, instead of escaped characters
- Test suite updated
- Rename of TokenIndex.py was not in effect in the uploaded files. Fixed now.
- Renamed TokenIndex.py module to tokenindex.py to follow conventions.
- Moved all the block, suffix, and LCP interval code to the new class TokenIndex.
- Added output option ‘html2’ for colored alignment table rendering.
- Fixed a bug caused by storing a dash in empty cells of the AlignmentTable. None is now stored instead (this resolved a TODO). Plain text and HTML rendering of the table render a dash for empty cells; JSON output now returns null for empty cells. This fixes a bug where a token containing a dash broke the rendering of the alignment table (it caused off-by-one errors).
- Further improved blockification of witnesses.
- Added properties_filter option to enable users to influence matching based on properties of tokens.
- Improved blockification of witnesses.
- Added SVG output option to the collate function. For this functionality to work the graphviz python library needs to be installed.
- Bug fix: the collate_pretokenized_json function should not re-tokenize the content. Thanks to Tara L. Andrews.
- Allow near-matching for plain as well as for pre-tokenized content. Thanks to Tara L. Andrews.
- Added HTML option to collate function for the output as an alignment table represented as HTML.
- Added support for Unicode character encoding
- Ported codebase from Python 2 to Python 3
- Separated IPython display logic from functional logic. The collate function no longer tries to determine whether you are running in an environment that is capable of displaying HTML or SVG.
- Added near matching option to collate function.
- Added variant or invariant status to columns in alignment table object and JSON output.
- Added experimental A* decision graph search optimization.
- Added WordPunctuationTokenizer (treats punctuation as separate tokens).
- Combined suffix array and edit graph aligner approaches into one collation algorithm.
- Fixed handling of segmentation parameter in pretokenized JSON function.
- Added Windows support. Thanks to David J. Birnbaum.
- Fixed handling of IPython imports.
- Added JSON output to collate method.
- Added option to collate method to enable or disable parallel segmentation.
- Added table output to collate_pretokenized_json method, next to the already existing JSON output.
- Cached the suffix and LCP arrays to prevent unnecessary recalculation
- Fixed handling of empty cells in JSON output of pretokenized JSON.
- Fixed compatibility issue when rendering HTML or SVG with IPython 2.1 instead of IPython 0.13.
- Corrected RST syntax in package info description.
- Added pretokenized JSON support.
- Added JSON visualization for the alignment table.
- Fixed imports in __init__.py, “from collatex import *” now works correctly.
- Added IPython HTML support for alignment table.
- Added IPython SVG support for variant graph.
- Added convenience constructors on Collation object.
- Added horizontal layout for the alignment table visualization, next to vertical one.
- Removed the limit of 6 witnesses in the aligner; any number of witnesses can now be aligned.
- Added transposition detection.
- Added alignment table plus plain text visualization.
- Added collate convenience function.
- First release on PyPI.
- First pure Python development release of CollateX.
- New collation algorithm, which does non-progressive multiple witness alignment.