Polyglot is a natural language pipeline that supports massive multilingual applications.
Project description
Polyglot is a natural language pipeline that supports massive multilingual applications.
Free software: GPLv3 license
Documentation: http://polyglot.readthedocs.org.
Features
Tokenization (165 Languages)
Language detection (196 Languages)
Named Entity Recognition (40 Languages)
Sentiment Analysis (136 Languages)
Word Embeddings (137 Languages)
Morphological analysis (135 Languages)
Transliteration (69 Languages)
Developer
Rami Al-Rfou @ rmyeid gmail com
Quick Tutorial
import polyglot
from polyglot.text import Text
Language Detection
text = Text("Bonjour, Mesdames.")
print("Language Detected: Code={}, Name={}\n".format(text.language.code, text.language.name))
Language Detected: Code=fr, Name=French
Tokenization
zen = Text("Beautiful is better than ugly. "
           "Explicit is better than implicit. "
           "Simple is better than complex.")
print(zen.words)
[u'Beautiful', u'is', u'better', u'than', u'ugly', u'.', u'Explicit', u'is', u'better', u'than', u'implicit', u'.', u'Simple', u'is', u'better', u'than', u'complex', u'.']
print(zen.sentences)
[Sentence("Beautiful is better than ugly."), Sentence("Explicit is better than implicit."), Sentence("Simple is better than complex.")]
Part of Speech Tagging
text = Text(u"O primeiro uso de desobediência civil em massa ocorreu em setembro de 1906.")
print("{:<16}{}".format("Word", "POS Tag")+"\n"+"-"*30)
for word, tag in text.pos_tags:
    print(u"{:<16}{:>2}".format(word, tag))
Word            POS Tag
------------------------------
O               PART
primeiro        SCONJ
uso             PART
de              ADP
desobediência   PART
civil           SCONJ
em              ADP
massa           PART
ocorreu         SCONJ
em              ADP
setembro        PART
de              ADP
1906            DET
.               ADV
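The (word, tag) pairs from the POS tagging step are plain tuples, so they can be summarized with the standard library. The following is a minimal sketch, not part of polyglot: `tag_counts` is a hypothetical helper, and the sample list simply copies the first four pairs from the output above so the snippet runs without any models installed.

```python
from collections import Counter


def tag_counts(pos_tags):
    """Count how often each POS tag occurs in an iterable of (word, tag) pairs."""
    return Counter(tag for _word, tag in pos_tags)


# Sample pairs copied from the tagger output above (hypothetical stand-in
# for text.pos_tags so the example runs without polyglot models).
sample = [("O", "PART"), ("primeiro", "SCONJ"), ("uso", "PART"), ("de", "ADP")]

counts = tag_counts(sample)
for tag, n in counts.most_common():
    print("{:<8}{:>3}".format(tag, n))
```

With polyglot installed, the same helper should accept `text.pos_tags` directly in place of `sample`.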
Named Entity Recognition
text = Text(u"In Großbritannien war Gandhi mit dem westlichen Lebensstil vertraut geworden")
print(text.entities)
[I-LOC([u'Gro\xdfbritannien']), I-PER([u'Gandhi'])]
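Each recognized entity pairs a tag such as I-LOC or I-PER with a list of words, so entities can be grouped by tag. The sketch below is an illustration under assumptions: `entities_by_tag` is a hypothetical helper, and the sample data mirrors the output above as plain (tag, words) tuples rather than real polyglot objects.

```python
from collections import defaultdict


def entities_by_tag(tagged_entities):
    """Group entity strings by tag, given an iterable of (tag, words) pairs."""
    grouped = defaultdict(list)
    for tag, words in tagged_entities:
        # Multi-word entities are joined back into a single string.
        grouped[tag].append(" ".join(words))
    return dict(grouped)


# Hypothetical stand-in for the entities printed above.
sample_entities = [("I-LOC", ["Großbritannien"]), ("I-PER", ["Gandhi"])]

grouped = entities_by_tag(sample_entities)
print(grouped)
```

Assuming each entity exposes its label as a `tag` attribute, the input could be built with `[(e.tag, list(e)) for e in text.entities]`.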
Polarity
print("{:<16}{}".format("Word", "Polarity")+"\n"+"-"*30)
for w in zen.words[:6]:
    print("{:<16}{:>2}".format(w, w.polarity))
Word            Polarity
------------------------------
Beautiful        0
is               0
better           1
than             0
ugly            -1
.                0
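Word-level polarities like those above can be combined into a rough sentence-level summary. This is a minimal sketch rather than polyglot's own scoring: `polarity_summary` is a hypothetical helper, and the scores are copied from the table above so the snippet runs standalone.

```python
def polarity_summary(pairs):
    """Return (positive_count, negative_count, net_score) for (word, polarity) pairs."""
    pairs = list(pairs)  # allow generators, since we iterate more than once
    positive = sum(1 for _word, p in pairs if p > 0)
    negative = sum(1 for _word, p in pairs if p < 0)
    net = sum(p for _word, p in pairs)
    return positive, negative, net


# Scores copied from the polarity table above.
scores = [("Beautiful", 0), ("is", 0), ("better", 1),
          ("than", 0), ("ugly", -1), (".", 0)]

summary = polarity_summary(scores)
print(summary)
```

With polyglot installed, the pairs could be produced by `[(w, w.polarity) for w in zen.words]`.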
Embeddings
word = zen.words[0]
print(word.vector)
[-0.08001513 -0.35475096 0.27702546 -0.20423636 0.36313248 0.06376412 0.0444247 -0.30489922 0.014972 0.13951094 0.07515849 -0.2703914 0.04650182 0.58747977 0.5101701 -0.04114699 0.37434807 -0.27707747 -0.06124159 0.21493433 -0.23498166 0.07404013 -0.23953673 -0.15044802 0.21210277 -0.58776855 0.12014424 0.30591646 0.07079886 0.44168213 0.2473582 -0.43409103 -0.25516582 0.45812422 0.33660468 0.61951864 0.16038296 -0.12069689 -0.59378242 -0.47525382 -0.03109539 0.28781402 -0.51556301 -0.26363477 -0.0820123 0.31425434 -0.10971891 0.53333962 0.3446033 -0.62146574 -0.15398794 0.11720303 0.50415224 -0.79616308 -0.25548786 0.36809164 -0.26254281 0.11736908 -0.30717522 -0.18103991 -0.03320931 -0.15692121 -0.22654058 0.56092978]
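A common use of word vectors like the one above is comparing two words by cosine similarity. The sketch below is not a polyglot API: `cosine_similarity` is a hypothetical helper written with the standard library only, and the short vectors are made-up toy values rather than real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Similarity is undefined for a zero vector; return 0.0 by convention here.
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors (assumed values, for illustration only).
u = [1.0, 0.0, 1.0]
v = [1.0, 1.0, 0.0]

sim = cosine_similarity(u, v)
print(round(sim, 3))
```

Because the helper only iterates over its arguments, it should also accept the arrays returned by `word.vector` for two words from the same language.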
Morphology
word = Text("Preprocessing is an essential step.").words[0]
print(word.morphemes)
[u'Pre', u'process', u'ing']
Transliteration
from polyglot.transliteration import Transliterator
transliterator = Transliterator(source_lang="en", target_lang="ru")
print(transliterator.transliterate(u"preprocessing"))
препрокессинг
History
“14.11” (2014-01-11)
First release on PyPI.
Download files
Source Distribution
polyglot-15.04.19.tar.gz (142.0 kB)