CollateX is a collation tool.

Project description

CollateX is software that can

  • read multiple (≥ 2) versions of a text, splitting each version into parts (tokens) to be compared,

  • identify similarities of and differences between the versions (including moved/transposed segments) by aligning tokens, and

  • output the alignment results in a variety of formats for further processing, for instance to support the production of a critical apparatus or the stemmatical analysis of a text’s genesis.
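
The three steps above (tokenize, align, output) can be illustrated with a minimal standard-library sketch. It is not CollateX's algorithm: it aligns only two witnesses pairwise with `difflib.SequenceMatcher`, does not detect transpositions, and the helper names `tokenize` and `align_pair` are hypothetical, introduced here for illustration only.

```python
import re
from difflib import SequenceMatcher


def tokenize(text):
    """Split a witness into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def align_pair(a_tokens, b_tokens):
    """Align two token lists; return rows of (tag, a_segment, b_segment).

    tag is one of difflib's opcodes: 'equal', 'delete', 'insert', 'replace'.
    """
    matcher = SequenceMatcher(a=a_tokens, b=b_tokens, autojunk=False)
    return [
        (tag, a_tokens[i1:i2], b_tokens[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    ]


witness_a = tokenize("The quick brown fox jumps over the dog.")
witness_b = tokenize("The brown fox jumps over the lazy dog.")
rows = align_pair(witness_a, witness_b)

# Print a simple two-column alignment; "-" marks a gap in one witness.
for tag, a_seg, b_seg in rows:
    left = " ".join(a_seg) or "-"
    right = " ".join(b_seg) or "-"
    print(f"{tag:8} {left:28} {right}")
```

Each row pairs a segment of witness A with the corresponding segment of witness B, which is roughly the information an alignment table conveys for two witnesses. Aligning more than two witnesses at once, as CollateX does, needs more than repeated pairwise comparison.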

Features

  • Non-progressive multiple sequence alignment

  • Multiple output formats: alignment table, variant graph

  • Near matching (optional)

  • Supports Python 3

  • Supports Unicode (Python 3 only)

How to install:

Mac/Linux: sudo pip3 install --pre collatex

If you don’t have pip installed, install it first with: sudo easy_install3 pip

Near matching requires the python-levenshtein C library.

Install it with (on Mac OS X and Linux):

sudo pip3 install python-levenshtein

Windows users need a precompiled binary distribution of this library if they want to use near matching.
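
To show what the Levenshtein library contributes, here is a pure-Python sketch of edit distance and of picking the closest candidate token. It assumes near matching means pairing a token with its closest candidate by edit distance; it is not the code CollateX runs, and the function names `levenshtein` and `nearest_token` are hypothetical.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b.

    Counts the minimum number of single-character insertions,
    deletions and substitutions needed to turn a into b.
    """
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute (or keep)
                )
            )
        previous = current
    return previous[-1]


def nearest_token(token, candidates):
    """Return the candidate with the smallest edit distance to token."""
    return min(candidates, key=lambda candidate: levenshtein(token, candidate))


print(levenshtein("jumped", "jumps"))
print(nearest_token("jumped", ["jumps", "over", "lazy"]))
```

The compiled library computes the same kind of distance much faster, which matters when many token pairs have to be compared.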

Simple example:

from collatex import *

collation = Collation()
collation.add_witness("A", "The quick brown fox jumps over the dog.")
collation.add_witness("B", "The brown fox jumps over the lazy dog.")

alignment_table = collate(collation)

Add

print(alignment_table)

to show the results.

When running from the command shell, run the example script with:

python ./nameofscript.py

History

2.0.0pre10 (2014-11-13)

  • Added support for Unicode character encoding

  • Ported codebase from Python 2 to Python 3

  • Separated IPython display logic from functional logic. The collate function no longer tries to determine whether you are running in an environment that is capable of displaying HTML or SVG.

2.0.0pre9 (2014-10-02)

  • Added near matching option to collate function.

  • Added variant or invariant status to columns in alignment table object and JSON output.

  • Added experimental A* decision graph search optimization.

2.0.0pre8 (2014-09-18)

  • Added WordPunctuationTokenizer (treats punctuation as separate tokens).

  • Combined suffix array and edit graph aligner approaches into one collation algorithm.

2.0.0pre7 (2014-07-14)

  • Fixed handling of segmentation parameter in pretokenized JSON function.

2.0.0pre6 (2014-06-30)

  • Added Windows support. Thanks to David J. Birnbaum.

  • Fixed handling of IPython imports.

2.0.0pre5 (2014-06-11)

  • Added JSON output to collate method.

  • Added option to collate method to enable or disable parallel segmentation.

  • Added table output to collate_pretokenized_json method, next to the already existing JSON output.

  • Cached the suffix and LCP arrays to prevent unnecessary recalculation.

  • Fixed handling of empty cells in JSON output of pretokenized JSON.

  • Fixed compatibility issue when rendering HTML or SVG with IPython 2.1 instead of IPython 0.13.

  • Corrected RST syntax in package info description.

2.0.0pre4 (2014-06-11)

  • Added pretokenized JSON support.

  • Added JSON visualization for the alignment table.

2.0.0pre3 (2014-06-10)

  • Fixed imports in __init__.py, “from collatex import *” now works correctly.

  • Added IPython HTML support for alignment table.

  • Added IPython SVG support for variant graph.

  • Added convenience constructors on Collation object.

  • Added horizontal layout for the alignment table visualization, next to vertical one.

2.0.0pre2 (2014-06-09)

  • Removed the limit of six witnesses in the aligner; any number of witnesses can now be aligned.

  • Added transposition detection.

  • Added alignment table plus plain text visualization.

  • Added collate convenience function.

2.0.0pre1 (2014-06-02)

  • First release on PyPI.

  • First pure Python development release of CollateX.

  • New collation algorithm, which performs non-progressive multiple witness alignment.
