nlptoolkit-wordnet-cy

Turkish WordNet KeNet
Turkish WordNet KeNet
============
    # WordNet
    
    A wordnet, broadly defined, is a comprehensive dictionary built on distinct word senses and their definitions. Most of the words in a wordnet are open-class words such as nouns, verbs, adjectives and adverbs. The main building blocks of a wordnet are synsets, which consist of synonymous synset members. Synsets are the distinct units in wordnets, and all mappings, both intralingual and interlingual, are constructed on the basis of synsets. In lexical semantics, it is argued that words can be defined based on the relations between them. Adopting this principle, wordnets map semantic relations such as hypernymy, meronymy or antonymy through synsets.
    
    Constructing a wordnet, whether from scratch or by expanding a previous one, is a labor-intensive process that requires several steps and extensive use of both human labor and automated systems. Since the creation of the first wordnet, Princeton WordNet (PWN), in 1995 (Miller, 1995), many other wordnets have been created for several languages (e.g., Finnish WordNet FinnWordNet (Linden and Carlson, 2010), Polish WordNet (Derwojedowa et al., 2008), Norwegian WordNet (Fjeld and Nygaard, 2009), Danish WordNet (Pedersen et al., 2009), French WordNet WOLF (Sagot, 2008)). In addition, multilingual wordnets linking the wordnets of multiple languages have been created. For example, EuroWordNet (EWN) is a multilingual wordnet project that covers several European languages (English, Dutch, Italian, Spanish, German, French, Czech and Estonian) (Vossen, 2007). In EWN, the wordnets were created for each language separately and then linked through an Inter-Lingual-Index based on PWN. BalkaNet, similar to EWN, is a multilingual wordnet project consisting of six Balkan languages (Bulgarian, Czech, Greek, Romanian, Serbian, and Turkish) (Tufis et al., 2004). The project aimed to produce a multilingual semantic network fully compatible with EWN and its extensions.
    
    # Turkish WordNet
    
    The very first step in constructing KeNet, as in every other wordnet, was to create synsets. A synset can be defined as a group of words sharing the same sense and part of speech (POS). The first version of the database was constructed by mining the latest Contemporary Dictionary of Turkish (CDT) (2011 print) published by the Turkish Language Institute (TLI) (Ehsani et al., 2018). By convention, CDT marks synonyms with commas: the synonyms of a word are given after its definition, separated by commas. To decide on true synonyms that must occur in the same synsets, we sliced the definitions at commas and listed the comma-separated lemmas and the rest of the definitions as synonym candidates. These lists were then shown to linguistically-informed human annotators, who decided on the synonymy relation between the lemmas and the definitions. 49,774 pairs were annotated at the end of this phase. Although some of them were included as separate entries in CDT, passivized and causativized forms of verbs were deleted from KeNet as they share the same root with their active forms.
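
The comma-slicing step described above can be sketched roughly as follows. This is only an illustration, not the authors' actual tooling: the function name `synonym_candidates` and the sample entry are made up, and the sketch assumes that every comma in a definition is a potential synonym boundary, leaving the real decision to the annotators.

```python
def synonym_candidates(lemma, definition):
    """Slice a dictionary definition at commas and pair the head lemma
    with each slice as a synonym candidate for human annotation."""
    slices = [part.strip() for part in definition.split(",")]
    # Drop empty slices left by trailing or doubled commas.
    return [(lemma, part) for part in slices if part]


# Hypothetical entry, invented for illustration (not taken from CDT).
pairs = synonym_candidates("lemma_a", "a short defining phrase, lemma_b, lemma_c")
for head, candidate in pairs:
    print(head, "<->", candidate)
```

Each printed pair corresponds to one item an annotator would accept or reject as a true synonym.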
    
    Although the vast majority of the synsets were constructed during this process, there was a need for follow-up procedures to improve the organization of the current synsets. Since the main problem encountered in synset construction was the semantic relatedness of the synset members, two other procedures were followed in order to control the synonymy relations within the synsets: the merge process and the split process.
    
    ## Merge Process
    
    In the merge process, different synsets that should be grouped together were identified and grouped as a single synset. Three things were crucial while merging the synsets: (i) having a single and unique definition for each synset, (ii) having true synonyms as synset members in each synset and (iii) having a representative first synset member in each synset. Firstly, the synsets that were created by combining synset members with identical senses had as many definitions as synset members, since the definitions were also merged while merging the synset members. The definitions of the merged synsets were initially combined with a pipe symbol between them. A new definition was then written for each merged synset so that each synset had a single and unique definition that covers the meaning of all its synset members. None of the synset members of a synset appeared in its definition. In this process, the human annotators wrote new definitions for 10,612 synsets. Secondly, some synsets were found to include unrelated synset members. Therefore, another goal of the merge process was to include only the synset members that were synonyms. 1,144 synsets with unrelated synset members that had been identified in other parts of the work were transferred to the split process.
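
As a rough sketch of the bookkeeping in the merge step, the snippet below joins the members of two synsets, combines their definitions with a pipe symbol, and checks the rule that no synset member appears in the definition. The dictionary layout, the function names and the sample synsets are assumptions made for illustration; they are not KeNet's actual data structures.

```python
def merge_synsets(first, second):
    """Merge two synsets given as dicts with 'members' and 'definition' keys."""
    members = list(first["members"])
    for member in second["members"]:
        if member not in members:  # keep each synset member once
            members.append(member)
    # Definitions are joined with a pipe until an annotator writes a single one.
    definition = first["definition"] + "|" + second["definition"]
    return {"members": members, "definition": definition}


def definition_mentions_member(synset):
    """Check the rule that no synset member may appear in the definition."""
    words = synset["definition"].lower().replace("|", " ").split()
    return any(member.lower() in words for member in synset["members"])


# Hypothetical synsets, invented for illustration.
a = {"members": ["word_a"], "definition": "first placeholder definition"}
b = {"members": ["word_b", "word_a"], "definition": "second placeholder definition"}
merged = merge_synsets(a, b)
print(merged["members"])     # ['word_a', 'word_b']
print(merged["definition"])  # first placeholder definition|second placeholder definition
print(definition_mentions_member(merged))  # False
```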
    
    ## Split Process
    
    In the split process, the synsets that included synset members with different senses were split and separate synsets were created for each group of related synset members. In order to fix this problem, we created a pool where we collected all the synsets that had unrelated synset members. We displayed these synsets on Google Sheets. Linguistically-informed human annotators then split these wrongly-merged synsets and wrote new definitions for the newly-created ones.
    
    Currently, there are 77,330 synsets, 109,049 synset members and 80,956 distinct synset members in KeNet. The POS categories that are included are nouns, verbs, adjectives, adverbs, interjections, pronouns, postpositions and conjunctions.
    
    |Part of Speech|# of Synsets|
    |---|---|
    |Nouns|44,074|
    |Verbs|17,791|
    |Adjectives|12,416|
    |Adverbs|2,550|
    |Interjections|342|
    |Pronouns|68|
    |Conjunctions|60|
    |Postpositions|29|
    |Total|77,330|
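
The figures in the table can be cross-checked with a few lines of arithmetic. The snippet below only restates numbers given in this section; the variable names are chosen for illustration.

```python
# Synset counts per part of speech, copied from the table above.
synsets_per_pos = {
    "Nouns": 44074,
    "Verbs": 17791,
    "Adjectives": 12416,
    "Adverbs": 2550,
    "Interjections": 342,
    "Pronouns": 68,
    "Conjunctions": 60,
    "Postpositions": 29,
}

total_synsets = sum(synsets_per_pos.values())
print(total_synsets)  # 77330

# Average number of synset members per synset, using the 109,049
# synset members reported above.
members_per_synset = 109049 / total_synsets
print(round(members_per_synset, 2))  # 1.41
```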
    
    ## Data Format
    
    The structure of a sample synset is as follows:
    
    	<SYNSET>
    		<ID>TUR10-0038510</ID>
    		<LITERAL>anne<SENSE>2</SENSE>
    		</LITERAL>
    		<POS>n</POS>
    		<DEF>...</DEF>
    		<EXAMPLE>...</EXAMPLE>
    	</SYNSET>
    
    Each entry in the dictionary is enclosed by \<SYNSET> and \</SYNSET> tags. Synset members are represented as literals and their sense numbers. \<ID> shows the unique identifier given to the synset. \<POS> and \<DEF> tags denote part of speech and definition, respectively. As for the \<EXAMPLE> tag, it gives a sample sentence for the synset.
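
To make the format concrete, here is a minimal sketch that reads the sample synset with Python's standard `xml.etree.ElementTree` module. It is not how the library itself loads the data; the function name `parse_synset` and the dictionary keys are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Sample synset in the format shown above; DEF and EXAMPLE are left elided.
SAMPLE = """
<SYNSET>
    <ID>TUR10-0038510</ID>
    <LITERAL>anne<SENSE>2</SENSE>
    </LITERAL>
    <POS>n</POS>
    <DEF>...</DEF>
    <EXAMPLE>...</EXAMPLE>
</SYNSET>
"""


def parse_synset(xml_text):
    """Turn one <SYNSET> element into a plain dictionary."""
    root = ET.fromstring(xml_text.strip())
    literals = []
    for literal in root.findall("LITERAL"):
        # The literal's name is the text before its <SENSE> child.
        name = (literal.text or "").strip()
        sense = int(literal.findtext("SENSE"))
        literals.append((name, sense))
    return {
        "id": root.findtext("ID"),
        "pos": root.findtext("POS"),
        "literals": literals,
        "definition": root.findtext("DEF"),
        "example": root.findtext("EXAMPLE"),
    }


synset = parse_synset(SAMPLE)
print(synset["id"], synset["pos"], synset["literals"])
# TUR10-0038510 n [('anne', 2)]
```

Note that the literal's name is stored as mixed content next to the `<SENSE>` child, which is why the sketch reads `literal.text` rather than a dedicated tag.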
    
    Simple Web Interface
    ============
    [Turkish WordNet Link 1](http://104.247.163.162/nlptoolkit/turkish-wordnet.html) [Turkish WordNet Link 2](https://starlangsoftware.github.io/nlptoolkit-web-simple/turkish-wordnet.html)
    
    [Turkish WordNet Tree Link 1](http://104.247.163.162/nlptoolkit/turkish-wordnet-tree.html) [Turkish WordNet Tree Link 2](https://starlangsoftware.github.io/nlptoolkit-web-simple/turkish-wordnet-tree.html)
    
    [English WordNet Link 1](http://104.247.163.162/nlptoolkit/english-wordnet.html) [English WordNet Link 2](https://starlangsoftware.github.io/nlptoolkit-web-simple/english-wordnet.html)
    
    [English WordNet Tree Link 1](http://104.247.163.162/nlptoolkit/english-wordnet-tree.html) [English WordNet Tree Link 2](https://starlangsoftware.github.io/nlptoolkit-web-simple/english-wordnet-tree.html)
    
    Video Lectures
    ============
    
    [<img src="https://github.com/StarlangSoftware/TurkishWordNet/blob/master/video1.jpg" width="50%">](https://youtu.be/RLVTegHva_k)[<img src="https://github.com/StarlangSoftware/TurkishWordNet/blob/master/video2.jpg" width="50%">](https://youtu.be/DFc_XEqJshU)[<img src="https://github.com/StarlangSoftware/TurkishWordNet/blob/master/video3.jpg" width="50%">](https://youtu.be/KyA32rOv308)
    	
    For Developers
    ============
    
    You can also see the [Python](https://github.com/starlangsoftware/TurkishWordNet-Py), [Java](https://github.com/starlangsoftware/TurkishWordNet), [C++](https://github.com/starlangsoftware/TurkishWordNet-CPP), [C](https://github.com/starlangsoftware/TurkishWordNet-C), [Swift](https://github.com/starlangsoftware/TurkishWordNet-Swift), [Js](https://github.com/starlangsoftware/TurkishWordNet-Js), [Php](https://github.com/starlangsoftware/TurkishWordNet-Php), or [C#](https://github.com/starlangsoftware/TurkishWordNet-CS) repositories.
    
    ## Requirements
    
    * [Python 3.7 or higher](#python)
    * [Git](#git)
    
    ### Python 
    
    To check if you have a compatible version of Python installed, use the following command:
    
        python -V
        
    You can find the latest version of Python [here](https://www.python.org/downloads/).
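
The same check can be made from inside Python, for example at the top of a script. This is a small sketch; the helper name `is_supported` is made up, and the minimum version is the one stated in the requirements.

```python
import sys

MINIMUM = (3, 7)  # minimum version stated in the requirements above


def is_supported(version_info=sys.version_info):
    """Return True when the interpreter meets the minimum version."""
    return tuple(version_info[:2]) >= MINIMUM


print("Python", sys.version.split()[0], "supported:", is_supported())
```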
    
    ### Git
    
    Install the [latest version of Git](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git).
    
    ## Pip Install
    
    	pip3 install NlpToolkit-WordNet-Cy
    
    ## Download Code
    
    To work on the code, create a fork from the GitHub page.
    Use Git to clone the code to your local machine, for example with the line below on Ubuntu:
    
    	git clone <your-fork-git-link>
    
    A directory named after the repository (`TurkishWordNet-Cy` by default) will be created. Alternatively, you can clone the original repository to explore the code:
    
    	git clone https://github.com/starlangsoftware/TurkishWordNet-Cy.git
    
    ## Open project with PyCharm IDE
    
    Steps for opening the cloned project:
    
    * Start the IDE
    * Select **File | Open** from the main menu
    * Choose the `TurkishWordNet-Cy` directory
    * Select the open as project option
    * After a couple of seconds, the dependencies will be downloaded.
    
    Detailed Description
    ============
    
    + [WordNet](#wordnet)
    + [SynSet](#synset)
    + [Synonym](#synonym)
    
    ## WordNet
    
    To load the WordNet KeNet,
    
    	a = WordNet()
    
    To load a particular WordNet,
    
    	domain = WordNet("domain_wordnet.xml")
    
    To get all the synsets,
    
    	synSetList(self) -> list
    
    To get a particular synset by its identifier,
    
    	getSynSetWithId(self, synSetId: str) -> SynSet
    
    To get all the meanings (synsets) of a particular word,
    
    	getSynSetsWithLiteral(self, literal: str) -> list
    
    ## SynSet
    
    To find the synonymous literals of a synset, get its Synonym:
    
    	getSynonym(self) -> Synonym
    	
    To access the relations of a synset by index, use the following method.
    
    	getRelation(self, index: int) -> Relation
    
    For instance, to iterate over all the relations in a synset,
    
    	for i in range(synset.relationSize()):
    		relation = synset.getRelation(i)
    		...
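
The index-based loop can be wrapped in a small generator for convenience. The sketch below uses only the `relationSize` and `getRelation` methods named above; `iter_relations` and `StubSynSet` are hypothetical names, and the stub exists only so the snippet runs without the package installed.

```python
def iter_relations(synset):
    """Yield every relation of a synset using the index-based accessors."""
    for i in range(synset.relationSize()):
        yield synset.getRelation(i)


class StubSynSet:
    """Stand-in for SynSet so the sketch runs without the package installed."""

    def __init__(self, relations):
        self._relations = list(relations)

    def relationSize(self):
        return len(self._relations)

    def getRelation(self, index):
        return self._relations[index]


# Placeholder strings stand in for real Relation objects.
stub = StubSynSet(["relation_0", "relation_1"])
relations = list(iter_relations(stub))
print(relations)  # ['relation_0', 'relation_1']
```

With the real library, a synset returned by `getSynSetWithId` would presumably be passed in place of the stub.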
    
    ## Synonym
    
    The literals inside a Synonym are accessed by index with the following method.
    
    	getLiteral(self, index: int) -> Literal
    
    For example, all the literals inside a synonym can be retrieved as follows:
    
    	for i in range(synonym.literalSize()):
    		literal = synonym.getLiteral(i)
    		...
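
Putting the pieces together, the sketch below collects the literals of every synset that contains a given word. It relies only on the methods documented above (`getSynSetsWithLiteral`, `getSynonym`, `literalSize`, `getLiteral`). The helper `literals_for_word` and the `Stub*` classes are hypothetical; the stubs mimic the documented interface so the snippet runs without the package, and they return plain strings where the real library returns `Literal` objects.

```python
def literals_for_word(wordnet, word):
    """Collect the literals of every synset that contains the given word."""
    result = []
    for synset in wordnet.getSynSetsWithLiteral(word):
        synonym = synset.getSynonym()
        literals = [synonym.getLiteral(i) for i in range(synonym.literalSize())]
        result.append(literals)
    return result


# Minimal stand-ins mirroring only the methods documented above, so the
# sketch runs without the package; real code would pass a WordNet() instead.
class StubSynonym:
    def __init__(self, literals):
        self._literals = list(literals)

    def literalSize(self):
        return len(self._literals)

    def getLiteral(self, index):
        return self._literals[index]


class StubSynSet:
    def __init__(self, literals):
        self._synonym = StubSynonym(literals)

    def getSynonym(self):
        return self._synonym


class StubWordNet:
    def __init__(self, synsets):
        self._synsets = synsets

    def getSynSetsWithLiteral(self, literal):
        return [s for s in self._synsets
                if literal in s.getSynonym()._literals]


# Placeholder literals, invented for illustration.
stub_wordnet = StubWordNet([StubSynSet(["word_a", "word_b"]),
                            StubSynSet(["word_a"]),
                            StubSynSet(["word_c"])])
print(literals_for_word(stub_wordnet, "word_a"))
# [['word_a', 'word_b'], ['word_a']]
```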
    
    # Cite
    
    	@inproceedings{bakay21,
    	title={{T}urkish {W}ord{N}et {K}e{N}et},
    	year={2021},
    	author={O. Bakay and O. Ergelen and E. Sarmis and S. Yildirim and A. Kocabalcioglu and B. N. Arican and M. Ozcelik and E. Saniyar and O. Kuyrukcu and B. Avar and O. T. Y{\i}ld{\i}z},
    	booktitle={Proceedings of GWC 2021}
    	}
    
    For Contributors
    ============
    
    ### Setup.py file
    1. Do not forget to set package list. All subfolders should be added to the package list.
    ```
        packages=['Classification', 'Classification.Model', 'Classification.Model.DecisionTree',
                  'Classification.Model.Ensemble', 'Classification.Model.NeuralNetwork',
                  'Classification.Model.NonParametric', 'Classification.Model.Parametric',
                  'Classification.Filter', 'Classification.DataSet', 'Classification.Instance', 'Classification.Attribute',
                  'Classification.Parameter', 'Classification.Experiment',
                  'Classification.Performance', 'Classification.InstanceList', 'Classification.DistanceMetric',
                  'Classification.StatisticalTest', 'Classification.FeatureSelection'],
    ```
    2. Package name should be lowercase and may only include the _ character.
    ```
        name='nlptoolkit_math',
    ```
    3. Package data should be defined and must include pyx, pxd, c and py files.
    ```
        package_data={'NGram': ['*.pxd', '*.pyx', '*.c', '*.py']},
    ```
    4. Setup should include ext_modules with compiler directives.
    ```
        ext_modules=cythonize(["NGram/*.pyx"],
                              compiler_directives={'language_level': "3"}),
    ```
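
Since every subfolder has to appear in the package list, the list can be generated and compared against `setup.py` instead of being maintained by hand. The sketch below uses only the standard library; it assumes that each package directory contains an `__init__.py` file, which may not match the layout of every project, and `list_packages` is a made-up helper name.

```python
import os
import tempfile


def list_packages(root):
    """Return dotted names of all directories under root holding __init__.py."""
    packages = []
    for current, dirs, files in os.walk(root):
        dirs.sort()  # deterministic traversal order
        if current != root and "__init__.py" in files:
            relative = os.path.relpath(current, root)
            packages.append(relative.replace(os.sep, "."))
    return packages


# Demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as root:
    for name in ("Classification", os.path.join("Classification", "Model")):
        os.makedirs(os.path.join(root, name))
        open(os.path.join(root, name, "__init__.py"), "w").close()
    found = list_packages(root)
print(found)  # ['Classification', 'Classification.Model']
```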
    
    ### Cython files
    1. Define the class variables and class methods in the pxd file.
    ```
    cdef class DiscreteDistribution(dict):
    
        cdef float __sum
    
        cpdef addItem(self, str item)
        cpdef removeItem(self, str item)
        cpdef addDistribution(self, DiscreteDistribution distribution)
    ```
    2. For default values in class method declarations, use *.
    ```
        cpdef list constructIdiomLiterals(self, FsmMorphologicalAnalyzer fsm, MorphologicalParse morphologicalParse1,
                                   MetamorphicParse metaParse1, MorphologicalParse morphologicalParse2,
                                   MetamorphicParse metaParse2, MorphologicalParse morphologicalParse3 = *,
                                   MetamorphicParse metaParse3 = *)
    ```
    3. Define the class name as cdef, class methods as cpdef, and \_\_init\_\_ as def.
    ```
    cdef class DiscreteDistribution(dict):
    
        def __init__(self, **kwargs):
            """
            A constructor of DiscreteDistribution class which calls its super class.
            """
            super().__init__(**kwargs)
            self.__sum = 0.0
    
        cpdef addItem(self, str item):
    ```
    4. Do not forget to comment each function.
    ```
        cpdef addItem(self, str item):
            """
            The addItem method takes a String item as an input and if this map contains a mapping for the item it puts the
            item with given value + 1, else it puts item with value of 1.
    
            PARAMETERS
            ----------
            item : string
                String input.
            """
    ```
    5. Function names should follow camel case.
    ```
        cpdef addItem(self, str item):
    ```
    6. Local variables should follow snake case.
    ```
    	det = 1.0
    	copy_of_matrix = copy.deepcopy(self)
    ```
    7. Variable types should be defined for function parameters, class variables.
    ```
        cpdef double getValue(self, int rowNo, int colNo):
    ```
    8. Local variables should be defined with types.
    ```
        cpdef sortDefinitions(self):
            cdef int i, j
            cdef str tmp
    ```
    9. For abstract methods, use ABC package and declare them with @abstractmethod.
    ```
        @abstractmethod
        def train(self, train_set: list[Tensor]):
            pass
    ```
    10. For private methods, use __ as prefix in their names.
    ```
        cpdef list __linearRegressionOnCountsOfCounts(self, list countsOfCounts)
    ```
    11. For private class variables, use __ as prefix in their names.
    ```
    cdef class NGram:
        cdef int __N
        cdef double __lambda1, __lambda2
        cdef bint __interpolated
        cdef set __vocabulary
        cdef list __probability_of_unseen
    ```
    12. Implement \_\_repr\_\_ class methods to serve as toString methods.
    13. Write getter and setter class methods.
    ```
        cpdef int getN(self)
        cpdef setN(self, int N)
    ```
    14. If there are multiple constructors for a class, define them as constructor1, constructor2, ..., then from the original constructor call these methods.
    ```
    cdef class NGram:
    
        cpdef constructor1(self, int N, list corpus):
        cpdef constructor2(self, str fileName):
        def __init__(self,
                     NorFileName,
                     corpus=None):
            if isinstance(NorFileName, int):
                self.constructor1(NorFileName, corpus)
            else:
                self.constructor2(NorFileName)
    ```
    15. Extend test classes from unittest and use separate unit test methods.
    ```
    class NGramTest(unittest.TestCase):
    
        def test_GetCountSimple(self):
    ```
    16. For undefined types use object as type in the type declarations.
    ```
    cdef class WordNet:
    
        cdef object __syn_set_list
        cdef object __literal_list
    ```
    17. For boolean types use bint as type in the type declarations.
    ```
    	cdef bint is_done
    ```
    18. Enumerated types should be used when necessary as enum classes, and should be declared in py files.
    ```
    from enum import Enum, auto

    class AttributeType(Enum):
        """
        Continuous Attribute
        """
        CONTINUOUS = auto()
    ```
    19. Resource files should be taken from pkg_resources package.
    ```
    	fileName = pkg_resources.resource_filename(__name__, 'data/turkish_wordnet.xml')
    ```
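
Several of the conventions above (multiple constructors dispatched from `__init__`, private variables with a `__` prefix, getters and setters, and `__repr__` as the toString method) can be seen together in a plain-Python analogue. This is only a sketch: it is ordinary Python rather than Cython, so it omits `cdef`/`cpdef` and type declarations, and the `WordList` class is invented for illustration.

```python
class WordList:
    """Plain-Python analogue of the multiple-constructor convention."""

    def __init__(self, wordsOrText, separator=None):
        # Dispatch on the argument type, as in the NGram example above.
        if isinstance(wordsOrText, list):
            self.constructor1(wordsOrText)
        else:
            self.constructor2(wordsOrText, separator)

    def constructor1(self, words):
        # Build directly from a list of words.
        self.__words = list(words)

    def constructor2(self, text, separator):
        # Build by splitting a string; None splits on whitespace.
        self.__words = text.split(separator)

    def getWords(self):
        return self.__words

    def setWords(self, words):
        self.__words = list(words)

    def __repr__(self):
        return "WordList(" + ", ".join(self.__words) + ")"


print(WordList(["a", "b"]))  # WordList(a, b)
print(WordList("a b c"))     # WordList(a, b, c)
print(WordList("a;b", ";"))  # WordList(a, b)
```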