CollateX is a collation tool. It is software designed to
- read multiple (>= 2) versions of a text, splitting each version into parts (tokens) to be compared,
- identify similarities of and differences between the versions (including moved/transposed segments) by aligning tokens, and
- output the alignment results in a variety of formats for further processing, for instance to support the production of a critical apparatus or the stemmatic analysis of a text’s genesis.
- Free software: GPLv3 license
- Documentation: http://interedition.github.io/collatex/pythonport.html
- Partially non-progressive multiple-sequence alignment
- Multiple output formats: alignment table, variant graph
- Near matching (optional)
- Supports Python 3
- Supports Unicode (Python 3 only)
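The three steps listed at the top (tokenize, align, output) can be illustrated with a small standard-library sketch. It is not CollateX's algorithm: the alignment here is a pairwise `difflib.SequenceMatcher` comparison swapped in for illustration, it handles only two witnesses, and it does not detect transpositions. The function names `tokenize`, `normalize`, `align_pair` and `render` are hypothetical.

```python
import re
from difflib import SequenceMatcher


def tokenize(text):
    """Split text into word tokens (keeping trailing whitespace) and punctuation runs."""
    return re.findall(r"\w+\s*|\W+", text)


def normalize(token):
    """Strip surrounding whitespace so that 'fox ' and 'fox' compare equal."""
    return token.strip()


def align_pair(text_a, text_b):
    """Align two witnesses and return a list of (token_a, token_b) columns.

    None marks an empty cell. Illustration only: this uses difflib,
    not the CollateX aligner.
    """
    a = [normalize(t) for t in tokenize(text_a)]
    b = [normalize(t) for t in tokenize(text_b)]
    columns = []
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            columns.extend(zip(a[i1:i2], b[j1:j2]))
        else:
            # Pad the shorter side with None so both rows stay the same length.
            width = max(i2 - i1, j2 - j1)
            left = a[i1:i2] + [None] * (width - (i2 - i1))
            right = b[j1:j2] + [None] * (width - (j2 - j1))
            columns.extend(zip(left, right))
    return columns


def render(columns, sigla=("A", "B")):
    """Render the columns as a plain-text table, using '-' for empty cells."""
    rows = []
    for row_index, siglum in enumerate(sigla):
        cells = [col[row_index] if col[row_index] is not None else "-" for col in columns]
        rows.append(siglum + " | " + " | ".join(cells))
    return "\n".join(rows)


columns = align_pair(
    "The quick brown fox jumps over the dog.",
    "The brown fox jumps over the lazy dog.",
)
print(render(columns))
```

Each column pairs one token from each witness; a column with an empty cell on one side marks an addition or omission.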
How to install:
Mac/Linux: pip install collatex
If you don’t have pip installed, install it first with: easy_install pip
Near matching requires the python-levenshtein C library. On Mac OS X and Linux, install it with: pip install python-levenshtein. Windows users who want to use near matching may need a precompiled binary distribution of this library.
    from collatex import *

    collation = Collation()
    collation.add_plain_witness("A", "The quick brown fox jumps over the dog.")
    collation.add_plain_witness("B", "The brown fox jumps over the lazy dog.")
    alignment_table = collate(collation)
    print(alignment_table)
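Besides plain-text witnesses, the release notes further down mention pre-tokenized JSON input in which every token carries a “t” (original reading) and an “n” (normalized reading) value. The sketch below only builds such a structure with the standard library. The helper `make_witness` and its normalization rule (lower-casing and stripping punctuation) are hypothetical, and the exact shape that `collate()` accepts is an assumption to verify against the documentation linked above.

```python
import json


def make_witness(siglum, text):
    """Build one witness dict with pre-tokenized content (hypothetical helper)."""
    tokens = []
    for word in text.split():
        tokens.append({
            "t": word + " ",                    # original reading, trailing space kept
            "n": word.lower().strip(".,;:!?"),  # illustrative normalization only
        })
    return {"id": siglum, "tokens": tokens}


# Assumed layout: a "witnesses" list, each entry with an "id" and a "tokens" list.
data = {
    "witnesses": [
        make_witness("A", "The quick brown fox."),
        make_witness("B", "the brown Fox."),
    ]
}

# ensure_ascii=False keeps non-ASCII characters unescaped in the output.
print(json.dumps(data, indent=2, ensure_ascii=False))
```

With normalization supplied in “n”, readings such as “The” and “the” can match during alignment while the “t” values keep the original spelling for display.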
When running from the command shell, run the example script with:
- Create documentation for CollateX Python
- TEI output writes “t” values instead of “n” values
- TEI output uses minidom instead of etree
- TEI output uses same namespaces and wrapper as CollateX Java
- Add “csv” and “tsv” output options
- Use graphviz Python bindings instead of PyGraphviz for Windows compatibility
- Update networkx compatibility from 1.11 to 2.1
- Replace pygraphviz bindings with graphviz for Windows compatibility
- Update near matching to add near-matching edges and adjust rank in SVG output
- New version of the alignment algorithm (which we call the MatchCube approach) to reduce order effects during multiple witness alignment.
- Added the new SVG renderer which uses the pygraphviz bindings instead of the graphviz bindings.
- Thanks to David J. Birnbaum for the patch.
- Fixed the CalledProcessError bug that the previous renderer caused when used with Python 3.
- Added the ability to use output="svg_simple" next to output="svg". The "svg_simple" option gets you the n-property-based graph, so just the normalized version of the tokens, which will hide any variation in the t-property.
- Thanks to Joris van Zundert for the patch.
- Changed the colour scheme of the “html2” output option, to aid those with Red/Green colour-blindness.
- Thanks to Melodee H. Beals for the patch.
- Bug fix release
- Fixed a bug in the new near match functionality that would cause tokens to go missing in the alignment table.
- Thanks to Torsten Hiltmann for reporting it and providing a test case.
- Official release for the Dixit code and collation workshop in Amsterdam
- This release contains the new near match functionality implemented as a post process after alignment. Same as RC1.
- It also contains the multiple short witnesses bugfix done in 2.0.1
- Bug fix release for the Dixit code and collation workshop in Amsterdam
- Fixed index out of range bug when multiple very short witnesses (= one token) were collated
- Disabled debug statements for near matching
- New near match functionality, implemented as a post process after alignment.
- First official release for the Dixit code and collation workshop in Amsterdam
- Added XML as an output format
- Added TEI parallel segmentation as an output format
- Tokenizer: retain whitespace in the t-property of preceding token
- Witness: added normalization: strips whitespace
- Merged old collate_pretokenized_json() function into collate() function
- JSON output contains full JSON representation of the tokens
- Enabled segmentation support for all input formats, and for SVG output
- Enhanced SVG output to include “n” value and all “t” values of JSON input
- JSON output is raw Unicode, instead of escaped characters
- Test suite updated
- Rename of TokenIndex.py was not in effect in the uploaded files. Fixed now.
- Renamed TokenIndex.py module to tokenindex.py to follow conventions.
- Moved all the block and suffix, LCP interval code to new class TokenIndex.
- Added output option ‘html2’ for colored alignment table rendering.
- Fixed a bug caused by a dash being stored in empty cells of the AlignmentTable. None is now stored instead (this resolved a TODO). Plain text and HTML renderings of the table show a dash for empty cells; JSON output now returns null for empty cells. This fixes a bug in which a token with a dash in its content broke the rendering of the alignment table (by causing off-by-one errors).
- Further improved blockification of witnesses.
- Added properties_filter option to enable users to influence matching based on properties of tokens.
- Improved blockification of witnesses.
- Added SVG output option to the collate function. For this functionality to work the graphviz python library needs to be installed.
- Bug fix: the collate_pretokenized_json function should not re-tokenize the content. Thanks to Tara L. Andrews.
- Allow near-matching for plain as well as for pre-tokenized content. Thanks to Tara L. Andrews.
- Added HTML option to collate function for the output as an alignment table represented as HTML.
- Added support for Unicode character encoding
- Ported codebase from Python 2 to Python 3
- Separated IPython display logic from functional logic. The collate function no longer tries to determine whether you are running in an environment that is capable of displaying HTML or SVG.
- Added near matching option to collate function.
- Added variant or invariant status to columns in alignment table object and JSON output.
- Added experimental A* decision graph search optimization.
- Added WordPunctuationTokenizer (treats punctuation as separate tokens).
- Combined suffix array and edit graph aligner approaches into one collation algorithm.
- Fixed handling of segmentation parameter in pretokenized JSON function.
- Added Windows support. Thanks to David J. Birnbaum.
- Fixed handling of IPython imports.
- Added JSON output to collate method.
- Added option to collate method to enable or disable parallel segmentation.
- Added table output to collate_pretokenized_json method, next to the already existing JSON output.
- Cached the suffix and LCP arrays to prevent unnecessary recalculation
- Fixed handling of empty cells in JSON output of pretokenized JSON.
- Fixed compatibility issue when rendering HTML or SVG with IPython 2.1 instead of IPython 0.13.
- Corrected RST syntax in package info description.
- Added pretokenized JSON support.
- Added JSON visualization for the alignment table.
- Fixed imports in __init__.py, “from collatex import *” now works correctly.
- Added IPython HTML support for alignment table.
- Added IPython SVG support for variant graph.
- Added convenience constructors on Collation object.
- Added horizontal layout for the alignment table visualization, next to vertical one.
- Removed the limit of 6 witnesses in the aligner; any number of witnesses can now be aligned.
- Added transposition detection.
- Added alignment table plus plain text visualization.
- Added collate convenience function.
- First release on PyPI.
- First pure Python development release of CollateX.
- New collation algorithm, which does non progressive multiple witness alignment.
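Several entries above refer to near matching, which the installation notes tie to the python-levenshtein library. As a rough illustration of the idea only, the sketch below computes an edit distance in pure Python and picks the closest candidate reading. The `nearest` helper and its `max_distance` threshold are hypothetical, and this is not how CollateX selects near matches internally.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute each cost 1)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def nearest(token, candidates, max_distance=2):
    """Return the candidate closest to token, or None if none is close enough.

    max_distance is an illustrative threshold, not a CollateX setting.
    """
    best = min(candidates, key=lambda c: levenshtein(token, c), default=None)
    if best is None or levenshtein(token, best) > max_distance:
        return None
    return best


print(nearest("colour", ["color", "collar", "dog"]))
```

A token that has no exact counterpart in another witness can then be placed in the column of its closest reading instead of being treated as an unrelated addition.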