A Python library for easy calculation of tree edit distances with visualization capabilities.

EasyTED Library

The EasyTED Library calculates the syntactic tree edit distance (TED) between two sentences. It parses each sentence into its constituency tree and compares the resulting trees, so linguistic analyses need minimal setup. Beyond distance calculation, it provides tree visualization and transformation tools for linguistics research and NLP applications.

Features

  • Tree Edit Distance Calculation: Compute the TED between any two sentences.
  • Constituency Tree Parsing: Transform sentences into their underlying constituency tree structures.
  • Tree Visualization: Generate and save visual representations of constituency trees.
  • Bracketed String Transformation: Convert constituency trees into a bracketed string format for easy comparison and analysis.
  • Simple Integration: Designed to seamlessly integrate with broader NLP and linguistic analysis workflows.

Installation

Install EasyTED directly from the Python Package Index (PyPI) using pip:

pip install easyted

High-Level Usage

Calculating Full Tree Edit Distance

from easyted.ted import TreeEditDistanceCalculator

# Initialize the calculator
calculator = TreeEditDistanceCalculator()

# Calculate the tree edit distance between two sentences
distance = calculator.calculate_ted("This is a test.", "This is only a test.")
print(f"Tree Edit Distance: {distance}")
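
When more than two sentences need to be compared, the same call can be repeated over every pair. The helper below is a minimal sketch, not part of EasyTED: `pairwise_ted` is a hypothetical name, and it only assumes that the object passed in exposes `calculate_ted(sentence_a, sentence_b, depth)` as shown in this README.

```python
from itertools import combinations


def pairwise_ted(calculator, sentences, depth="full"):
    """Return {(i, j): distance} for every unordered pair of sentences.

    `calculator` is assumed to expose calculate_ted(a, b, depth),
    as TreeEditDistanceCalculator does in the examples in this README.
    """
    distances = {}
    for (i, first), (j, second) in combinations(enumerate(sentences), 2):
        distances[(i, j)] = calculator.calculate_ted(first, second, depth)
    return distances


# Usage (requires easyted and its language models to be installed):
# from easyted.ted import TreeEditDistanceCalculator
# sentences = ["This is a test.", "This is only a test.", "Is this a test?"]
# print(pairwise_ted(TreeEditDistanceCalculator(), sentences))
```

Each sentence is re-parsed on every call, so for large collections it may be worth caching trees; whether EasyTED does this internally is not stated here.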

Calculating Tree Edit Distance for the First N Layers

from easyted.ted import TreeEditDistanceCalculator

# Initialize the calculator
calculator = TreeEditDistanceCalculator()

# Calculate the tree edit distance between two sentences for first 3 layers
distance = calculator.calculate_ted("This is a test.", "This is only a test.", 3)
print(f"Tree Edit Distance for first 3 layers: {distance}")

Visualizing Constituency Trees

# Draw and save the constituency tree to a file
calculator.draw_and_save_tree("A visualization of a constituency tree.", "tree_visualization.ps")

Main Features and Methods

The TreeEditDistanceCalculator class provides a suite of methods for parsing, manipulating, and visualizing constituency trees, as well as calculating the tree edit distance between sentences. Here's a breakdown of its core functionalities:

Initialization

calculator = TreeEditDistanceCalculator(language='en')

Initializes the calculator with a specified language for the NLP pipeline. The default language is English ('en').

Parsing Sentences into Constituency Trees

tree = calculator.get_constituency_tree("This is a test sentence.")

Parses a sentence and returns its constituency tree, enabling further linguistic analysis.

Simplifying Tree Representations

cleaned_string = calculator.remove_non_terminal_labels("(S (NP This) (VP is))")

Removes non-terminal labels from a tree string, simplifying its structure for comparison or analysis.
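
To make the idea concrete, here is a standalone sketch of one plausible way to strip labels from a bracketed string. It is an illustration only: `strip_labels` is a hypothetical function, and the exact output format of `remove_non_terminal_labels` is not documented here and may differ.

```python
import re


def strip_labels(bracketed: str) -> str:
    """Drop the label token that directly follows each opening parenthesis.

    Assumption: labels are separated from their content by whitespace,
    as in Penn Treebank style strings such as "(NP This)".
    """
    return re.sub(r"\(\s*[^\s()]+\s+", "(", bracketed)


print(strip_labels("(S (NP This) (VP is))"))  # ((This) (is))
```

With the labels gone, only the nesting of the brackets and the words remain, so two trees can be compared on shape alone.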

Converting Trees to Bracketed String Format

bracket_string = calculator.nltk_tree_to_bracket_string(tree)

Converts a constituency tree into a bracketed string format, facilitating easy comparison and visualization.

Limiting Tree Depth

limited_bracket_string = calculator.nltk_tree_to_n_bracket_string(tree, max_depth=3)

Converts a tree to a bracketed string while limiting its depth, useful for focusing on higher-level structural similarities or differences.
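
The effect of a depth limit can be sketched without any NLP dependencies. The example below uses plain nested tuples as a stand-in for a parsed tree, and `to_bracket_string` is a hypothetical helper, not the library's implementation. It assumes the root counts as layer 1; EasyTED may count layers differently.

```python
def to_bracket_string(node, max_depth=None, _depth=1):
    """Render a (label, children) tree as a bracketed string.

    Leaves (words) are plain strings. Nodes at the depth limit are
    emitted without their children. The root is treated as layer 1,
    which is an assumption made for this sketch.
    """
    if isinstance(node, str):
        return node
    label, children = node
    # Stop descending once the depth limit is reached.
    if max_depth is not None and _depth >= max_depth:
        return f"({label})"
    parts = [to_bracket_string(child, max_depth, _depth + 1) for child in children]
    return f"({label} {' '.join(parts)})" if parts else f"({label})"


tree = ("S", [("NP", ["This"]), ("VP", ["is"])])
print(to_bracket_string(tree))               # (S (NP This) (VP is))
print(to_bracket_string(tree, max_depth=2))  # (S (NP) (VP))
```

Truncating both trees to the same depth before comparison means differences deep in the tree, such as individual word choices, no longer contribute to the distance.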

Visualizing and Saving Trees

calculator.draw_and_save_tree("This is a test sentence.", "tree_output.ps")

Draws the constituency tree of a sentence and saves the visualization to a file (a PostScript file in this example), for use in presentations or further analysis.

Calculating Tree Edit Distance

distance = calculator.calculate_ted("Sentence one.", "Sentence two.", depth='full')

Calculates the Tree Edit Distance (TED) between two sentences. The depth can be 'full' for complete trees or an integer for a specific depth, offering flexibility in analyzing tree similarities.
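
For readers unfamiliar with the metric, the sketch below computes an ordered tree edit distance with unit costs for insert, delete, and rename. It is a naive memoized recursion written for illustration, not the APTED algorithm that EasyTED lists as a requirement, and the tuple-based tree representation and function names are assumptions of this example.

```python
from functools import lru_cache

# A tree is (label, children), where children is a tuple of trees.
# A forest is a tuple of trees. Tuples keep everything hashable for caching.


def forest_size(forest):
    """Count all nodes in a forest."""
    return sum(1 + forest_size(children) for _, children in forest)


@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Edit distance between two ordered forests with unit costs."""
    if not f and not g:
        return 0
    if not f:
        return forest_size(g)  # insert every node of g
    if not g:
        return forest_size(f)  # delete every node of f
    (v_label, v_children), (w_label, w_children) = f[-1], g[-1]
    # Delete the rightmost root of f; its children join the forest.
    delete = forest_distance(f[:-1] + v_children, g) + 1
    # Insert the rightmost root of g.
    insert = forest_distance(f, g[:-1] + w_children) + 1
    # Match the two rightmost roots, renaming if the labels differ.
    match = (forest_distance(v_children, w_children)
             + forest_distance(f[:-1], g[:-1])
             + (0 if v_label == w_label else 1))
    return min(delete, insert, match)


def tree_edit_distance(t1, t2):
    return forest_distance((t1,), (t2,))


a = ("S", (("NP", ()), ("VP", ())))
b = ("S", (("NP", ()), ("ADVP", ()), ("VP", ())))
print(tree_edit_distance(a, b))  # 1 (one node inserted)
```

This recursion is fine for small trees; APTED reaches the same kind of result with far better time and memory behavior on large inputs.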

Requirements

  • Python 3.6+
  • NLTK
  • stanza
  • APTED

Contributing

We welcome contributions to the EasyTED Library! If you have suggestions for improvements or wish to contribute new features, please feel free to open an issue or submit a pull request. Ensure your contributions adhere to the coding standards set forth by the project.

License

EasyTED is licensed under the MIT License. See the LICENSE file in the project repository for more details.

Acknowledgments

Thanks to NLTK for providing the foundational tools for working with natural language data. Thanks also to the Stanford NLP Group for developing the stanza library, which powers the linguistic analysis capabilities of EasyTED, and to the developers of the APTED algorithm for their work on efficient tree edit distance computation.

Citations

If you use EasyTED in your research, please consider citing the following:

@inproceedings{qi2020stanza,
    title={Stanza: A {Python} Natural Language Processing Toolkit for Many Human Languages},
    author={Qi, Peng and Zhang, Yuhao and Zhang, Yuhui and Bolton, Jason and Manning, Christopher D.},
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations",
    year={2020}
}

@article{pawlik2016tree,
    title={Tree edit distance: Robust and memory-efficient},
    author={Pawlik, Mateusz and Augsten, Nikolaus},
    journal={Information Systems},
    volume={56},
    year={2016}
}

@article{pawlik2015efficient,
    title={Efficient Computation of the Tree Edit Distance},
    author={Pawlik, Mateusz and Augsten, Nikolaus},
    journal={ACM Transactions on Database Systems (TODS)},
    volume={40},
    number={1},
    year={2015}
}

@article{pawlik2011rted,
    title={RTED: A Robust Algorithm for the Tree Edit Distance},
    author={Pawlik, Mateusz and Augsten, Nikolaus},
    journal={PVLDB},
    volume={5},
    number={4},
    year={2011}
}
