Morphological/Inflection/Lemmatization Engine for Croatian language

“text-hr” is a morphological/inflectional/lemmatization engine for the Croatian language, written in Python. It includes stopwords and a part-of-speech (POS) tagging engine that uses an inverse inflection algorithm for detection.

Since the API is not frozen, this project is still in alpha.

TAGS

Croatian language, lemmatization, stemming, inflection, python, natural language processing (NLP), Part-of-speech (POS) tagging, stopwords, inverse inflection, morphological lexicon

OZNAKE

Hrvatski jezik, lematizacija, Python biblioteka, morfologija, infleksija, obrnuta infleksija, prepoznavanje vrsta riječi, računalna obrada govornog jezika, zaustavne riječi, morfološki leksikon

AUTHOR

Robert Lujo, Zagreb, Croatia (the email address is in LICENCE)

FEATURES

To name the most important:
  • inflection system - for producing all forms of one word
  • detection of word types (POS tagging) - from existing list of word forms
  • list of stopwords

The system is based on Unicode strings; the default code page for converting to and from byte strings is cp-1250.
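
As a rough illustration of that conversion, the following sketch uses only Python's built-in cp1250 codec. It does not call text-hr, and it is written for Python 3, where str is already a Unicode string.

```python
# Round-trip between Unicode text and cp1250-encoded bytes (no text-hr calls).
word = "plaćeno"                      # Unicode string with a Croatian letter

encoded = word.encode("cp1250")       # Unicode -> bytes in code page 1250
decoded = encoded.decode("cp1250")    # bytes -> Unicode again

print(repr(encoded))
print(decoded == word)
```

Text read from a cp-1250 file would be decoded the same way before being handed to the library, assuming the library expects Unicode input as stated.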

Check Getting started.

INSTALLATION

If you have pip installed (http://pypi.python.org/pypi/pip), run:

pip install text-hr

If not, then do it the old-fashioned way:

GETTING STARTED

There are three important parts that this project provides:

Inflection system

Usage example - start python shell:

>>> from text_hr import Verb
>>> v = Verb("platiti")
>>> for k in sorted(v.forms.keys()):
...     print k, v.forms[k]
...
AOR/P/1 [u'platismo']
AOR/P/2 [u'platiste']
AOR/P/3 [u'plati\u0161e']
AOR/S/1 [u'platih']
AOR/S/2 [u'plati']
AOR/S/3 [u'plati']
IMP/P/1 [u'platasmo', u'pla\u0107asmo', u'platijasmo']
IMP/P/2 [u'plataste', u'pla\u0107aste', u'platijaste']
IMP/P/3 [u'platahu', u'pla\u0107ahu', u'platijahu']
...
VA_PA//P_O+S+V+N [u'pla\u0107eno']
X_INF// [u'platiti']
X_VAD_PAS// [u'plativ\u0161i']
X_VAD_PRE// [u'plate\u0107i']
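
The forms dictionary maps a grammatical key to a list of word forms. One way to use such a mapping for lemmatization is to invert it, so that a word form leads back to the keys that produce it. The sketch below works on a hand-copied subset of the output above rather than on a real Verb object, and the helper name invert_forms is made up for illustration. It is written for Python 3.

```python
from collections import defaultdict

# Subset of v.forms for "platiti", copied by hand from the session output.
forms = {
    "AOR/S/1": ["platih"],
    "AOR/S/2": ["plati"],
    "AOR/S/3": ["plati"],
    "X_INF//": ["platiti"],
    "X_VAD_PAS//": ["plativ\u0161i"],
}

def invert_forms(forms):
    """Map each word form to the sorted list of keys that generate it."""
    index = defaultdict(list)
    for key in sorted(forms):
        for form in forms[key]:
            index[form].append(key)
    return dict(index)

index = invert_forms(forms)
print(index["plati"])   # the form "plati" is shared by two aorist keys
```

A form that maps to more than one key, like "plati" here, is ambiguous on its own, which is presumably why detection works over a whole list of forms rather than a single word.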

Detection of word types (POS tagging)

TODO: this section is not written yet. See test_detect.txt for samples and detect.py for the logic.

First example in test_detect.txt:

>>> from text_hr.detect import WordTypeRecognizerExample
>>> def test_it(word_list, wt_filter=None, level=2):
...     wdh = WordTypeRecognizerExample(word_list, silent=True)
...     if wt_filter is not None:
...         wdh.detect(wt_filter=wt_filter, level=level)  # e.g. wt_filter=["N"]
...     else:
...         wdh.detect(level=level)  # all word types
...     lines_file = LinesFile()
...     wdh.dump_result(lines_file) # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
...     print "\n".join(lines_file.lines)
...     return wdh

>>> class LinesFile(object):
...     def __init__(self):
...         self.lines = []
...     def write(self, s):
...         self.lines.append(repr(s.rstrip()))

>>> word_list = [
...   "Broj    84"
... , "broji   34"
... , "Brojila  28"
... , "broje   23"
... , "brojeći 22"
... , "brojim   7"
... , "brojimo  5"
... , "brojiš   4"
... , "brojahu  2"
... , "brojaše  1"
... , "brojite  1"
... , "-brijestovu 1"
... , "brijestovi 1"   # the only one checked with endswith; all others are checked with get_freq
... , "-brijestove 1"
... , "-brijestova 1"
... ]

Lowest quality, but fastest
>>> wdh = test_it(word_list, level=4) # doctest: +ELLIPSIS
" 10/  183 -> brojati              (u'V-XX_-_JATI-je\\u0107i-0') 84/broj,34/broji,23/broje,22/broje\xe6i,7/brojim,5/brojimo,4/broji\x9a,2/brojahu,1/brojite,1/broja\x9ae"
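
The summary line appears to report the number of matched forms and their combined frequency, followed by the detected base word. The sketch below checks that reading by parsing the same "word frequency" lines and summing the counts of the forms listed in the output. It does not call text-hr; parse_word_list is a hypothetical helper, the lowercasing is inferred from the output (Broj is reported as broj), and the code is written for Python 3.

```python
# Input lines in the "word<whitespace>frequency" format used by word_list.
word_list = [
    "Broj    84", "broji   34", "Brojila  28", "broje   23", "brojeći 22",
    "brojim   7", "brojimo  5", "brojiš   4", "brojahu  2", "brojaše  1",
    "brojite  1",
]

def parse_word_list(lines):
    """Return {lowercased word: frequency} parsed from 'word freq' lines."""
    freqs = {}
    for line in lines:
        word, count = line.split()
        key = word.lower()
        freqs[key] = freqs.get(key, 0) + int(count)
    return freqs

freqs = parse_word_list(word_list)

# Forms that the sample output attributes to "brojati" (copied from that line).
matched = ["broj", "broji", "broje", "brojeći", "brojim", "brojimo",
           "brojiš", "brojahu", "brojite", "brojaše"]

n_forms = len(matched)
total = sum(freqs[w] for w in matched)
print("%3d/%5d -> brojati" % (n_forms, total))
```

The counts agree with the 10 and 183 shown in the output; "Brojila" (28) is not among the listed forms and is left out of the sum.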

List of stopwords

The list is located in std_words.txt, and you can read it directly here:

http://bitbucket.org/trebor74hr/text-hr/src/tip/text_hr/std_words.txt

The list can be updated like this:

>>> import text_hr
>>> text_hr.dump_all_std_words()
Totaly 2904 word forms dumped to r:\hg-clones\python\text-hr\text_hr\std_words.txt in codepage utf8

To iterate over all words:

from text_hr import get_all_std_words

for word_base, l_key, cnt, _suff_id, wform_key, wform in get_all_std_words():
    print word_base, l_key, cnt, _suff_id, wform_key, wform
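
A common use of such a list is filtering stopwords out of tokenized text. The sketch below assumes that the last tuple element, wform, is the word form itself. fake_std_words is a stand-in generator with invented rows so that the example runs without text-hr installed; the values in its other fields are placeholders, not real data. It is written for Python 3.

```python
def fake_std_words():
    """Stand-in for text_hr.get_all_std_words(); rows are invented placeholders."""
    # Field order: (word_base, l_key, cnt, _suff_id, wform_key, wform)
    yield ("i", "key-1", 1, "suff-1", "form-1", "i")
    yield ("biti", "key-2", 2, "suff-2", "form-2", "sam")
    yield ("biti", "key-2", 2, "suff-2", "form-3", "je")

def build_stopword_set(rows):
    """Collect the word-form field (last element) of every row into a set."""
    return {row[-1] for row in rows}

def remove_stopwords(tokens, stopwords):
    """Keep only tokens whose lowercase form is not a stopword."""
    return [t for t in tokens if t.lower() not in stopwords]

stopwords = build_stopword_set(fake_std_words())
print(remove_stopwords(["Platio", "sam", "i", "otišao"], stopwords))
```

With the real library, build_stopword_set(get_all_std_words()) would replace the stand-in, provided the tuple layout matches the loop shown earlier.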

Further

Since there is currently no good documentation, the best source of further information is the tests inside the modules and the tests in the tests directory (dev version). See Running tests for more information. You can always read the source.

DOCUMENTATION

Currently there is no documentation. In progress …

SUPPORT

Since this project is limited by my free time, support is limited.

REPORT BUG OR REQUEST FEATURE

If you encounter a bug, please report it on the Bitbucket page http://bitbucket.org/trebor74hr/text-hr.

If there is interest in development for other inflection-rich languages, I’d be glad to decouple the language-specific code and create a new project capable of handling multiple languages.

The best way to contact me is by email (the address is in LICENCE).

TODO list is in readme.txt (dev version).

CONTRIBUTION

Since the API is not yet stable, contributions should wait for a while.

RUNNING TESTS

All tests are doctests (not unittests). There are three types of tests in the package:

  1. doctests in each module - e.g. in verbs.py
  2. doctests in tests/test_*.txt - only development version
  3. tests whose output is not compared automatically - e.g. in a special call mode, detect.py can produce an output file that has to be compared manually with an existing file. Such tests are very slow; this should be changed so that the comparison is automatic.

Running a module directly runs 1., and also 2. when running from the development version. To get the development version (http://bitbucket.org/trebor74hr/text-hr):

hg clone https://bitbucket.org/trebor74hr/text-hr

Then create text_hr.pth in the Python site-packages directory, containing the path to text-hr, e.g.:

r:\hg-clones\python\text-hr
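
Creating the .pth file can also be scripted. The sketch below writes a one-line .pth file into a directory you pass in. write_pth is a hypothetical helper, and locating the right site-packages directory (for example with site.getsitepackages()) is left to the caller because it varies between installations.

```python
import os

def write_pth(site_packages_dir, project_path, name="text_hr.pth"):
    """Write a .pth file whose single line is the path to add to sys.path."""
    pth_path = os.path.join(site_packages_dir, name)
    with open(pth_path, "w") as f:
        f.write(project_path + "\n")
    return pth_path
```

For example, write_pth(some_site_packages_dir, r"r:\hg-clones\python\text-hr") would produce the file described above; Python reads .pth files at startup, so the change takes effect in the next interpreter session.
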
To run all tests:
  • go to the tests directory

  • run tests.py (sample output shown):

    > python tests.py
    testing module   __init__
    testing module   adjectives
    ...
    testing textfile R:\hg-clones\python\text-hr\tests\test_adj.txt
    ...
    testing textfile R:\hg-clones\python\text-hr\tests\test_verbs_type.txt
    
To run tests for just one module:
  • go to the text_hr directory

  • run tests by running module, e.g.:

    > py pronouns.py
    __main__: running doctests
    ..\tests\test_pronouns.txt: running doctests
    
  • if you are not running from the dev version, you’ll get output like this:

    > py pronouns.py
    __main__: running doctests
    ..\tests\test_pronouns.txt: Not found, skipping
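
The output above suggests that each module first runs its own doctests and then looks for a companion text file, skipping it when it is missing. A minimal sketch of that pattern with the standard doctest module follows; run_doctests and its arguments are assumptions made for illustration, not text-hr's actual code, and it is written for Python 3.

```python
import doctest
import os

def run_doctests(module, textfile):
    """Run a module's doctests, then a companion text file if it exists."""
    print("%s: running doctests" % module.__name__)
    failed, attempted = doctest.testmod(module)
    if os.path.exists(textfile):
        print("%s: running doctests" % textfile)
        # module_relative=False: treat textfile as an ordinary OS path.
        file_failed, file_attempted = doctest.testfile(
            textfile, module_relative=False)
        failed += file_failed
        attempted += file_attempted
    else:
        print("%s: Not found, skipping" % textfile)
    return failed, attempted
```

A module could call this under an if __name__ == "__main__" guard, passing sys.modules[__name__] and a path such as os.path.join("..", "tests", "test_pronouns.txt").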
    

ADDITIONAL

Master's thesis PDF in Croatian (134 pages), titled:

Lociranje sličnih logičkih cjelina u tekstualnim
dokumentima na hrvatskome jeziku
(Locating similar logical units in text documents in the Croatian language)

can be found at:

http://bitbucket.org/trebor74hr/text-hr/downloads/magistarski-konacni.pdf

TODO

various things, see readme.txt for details.

CHANGES

0.18

ulr1 121210
  • fixed wrong readme on bitbucket homepage

0.17

ulr1 100617
  • utf-8 setup

0.16

ulr1 100617
  • master thesis pdf added to repository (in Croatian, 134 pages)

0.15

ulr1 100617
  • minor changes

0.14

ulr1 100617
  • beta release
  • tags: lemmatization, stemming

0.13

ulr1 100610:
  • text_hr package reorganized (__init__.py with __all__ and imports …)
  • word_types.py removed
  • std_words.txt

0.12

ulr1 100608 :
  • README
  • enabled tests from tests.py for all
  • enabled running tests directly from each module

0.11

ulr1 100607:
  • recreated repo at bitbucket
  • no .suff_registry.pickle and testing_*.out put in zip

0.10

ulr1 100605:
  • first installable release

Download Files

text-hr-0.18.tar.gz (110.1 kB), source distribution, uploaded Dec 10, 2012
