A simple iterator that reads conll files.
CONTENTS OF THIS FILE
- Introduction
- Setup
- Getting started
- Examples
INTRODUCTION
A simple iterator that reads CoNLL and CoNLL-U files (https://universaldependencies.org/format.html) lazily, without loading the whole file into memory. It can iterate over words, sentences, or documents.
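To illustrate what reading a file without keeping it in memory means, here is a minimal sketch of a generator that streams sentences from CoNLL-U input one line at a time. It is not the code of conll_iterator: the function name `stream_sentences` and its `fields` argument are assumptions made for illustration, and the sketch simply skips comment lines, multiword-token ranges, and empty nodes.

```python
import io

# Column order defined by the CoNLL-U format.
CONLLU_COLUMNS = ["id", "form", "lemma", "upos", "xpos",
                  "feats", "head", "deprel", "deps", "misc"]


def stream_sentences(lines, fields=("form", "upos")):
    """Hypothetical helper: yield one sentence at a time as a list of tuples.

    `lines` can be any iterable of text lines (for example an open file),
    so only the current sentence is ever held in memory.
    """
    idx = [CONLLU_COLUMNS.index(f) for f in fields]
    sentence = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
        elif line.startswith("#"):
            # Comment / metadata line.
            continue
        else:
            cols = line.split("\t")
            # Skip multiword-token ranges (1-2) and empty nodes (1.1).
            if "-" in cols[0] or "." in cols[0]:
                continue
            sentence.append(tuple(cols[i] for i in idx))
    if sentence:
        # The last sentence may not be followed by a blank line.
        yield sentence


# Tiny in-memory sample standing in for a real .conllu file.
sample = io.StringIO(
    "# sent_id = 1\n"
    "1\tIl\til\tDET\t_\t_\t2\tdet\t_\t_\n"
    "2\tPecorino\tPecorino\tPROPN\t_\t_\t0\troot\t_\t_\n"
    "\n"
    "1\tBuono\tbuono\tADJ\t_\t_\t0\troot\t_\t_\n"
)
result = list(stream_sentences(sample))
print(result)
```

With a real corpus, passing an open file object instead of the `StringIO` sample would keep memory use proportional to the longest sentence rather than to the file size.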
SETUP
pip install git+https://github.com/nicolaCirillo/conll_iterator.git
GETTING STARTED
from conll_iterator import ConllIterator
from tqdm import tqdm

# stream the corpus sentence by sentence, keeping the form and upos fields
sentences = ConllIterator('sample/sample_corpus.conllu',
                          fields=['form', 'upos'],
                          mode='sentences',
                          join_char='/')

for s in tqdm(sentences):
    pass  # do something with s
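The `fields` and `join_char` options are easier to picture with a concrete token. The sketch below mimics in plain Python how the selected fields of a token might be combined. The output shape is an assumption inferred from the examples in this README (tokens such as 'Pecorino/PROPN' when `join_char='/'` is given, and indexable field tuples when it is not); `join_fields` is a hypothetical helper, not part of conll_iterator.

```python
def join_fields(token, join_char=None):
    """Hypothetical helper: combine the selected fields of one token.

    Without join_char the fields stay separate (a tuple);
    with join_char they are merged into a single string.
    """
    if join_char is None:
        return tuple(token)
    return join_char.join(token)


# One sentence with the (form, upos) fields already selected.
sentence = [("Il", "DET"), ("Pecorino", "PROPN")]

joined = [join_fields(t, "/") for t in sentence]
separate = [join_fields(t) for t in sentence]

print(joined)    # ['Il/DET', 'Pecorino/PROPN']
print(separate)  # [('Il', 'DET'), ('Pecorino', 'PROPN')]
```

Joined tokens are convenient when a downstream tool expects plain strings, while separate fields are easier to filter by part of speech.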
EXAMPLES
Training word2vec
from gensim.models import Word2Vec
from tqdm import tqdm
from conll_iterator import ConllIterator

# stream sentences as lists of 'lemma/upos' tokens
sentences = ConllIterator('sample/sample_corpus.conllu', fields=['lemma', 'upos'], mode='sentences', join_char='/')
w2v_parameters = {'vector_size': 25, 'window': 5, 'min_count': 1, 'sg': 1, 'epochs': 15}

# train a skip-gram model (sg=1) and save it to disk
model = Word2Vec(tqdm(sentences), workers=5, **w2v_parameters)
model.save('sample_w2v')

# query the ten nearest neighbours of a token and keep only the words
word_vectors = model.wv
similar = list(zip(*word_vectors.most_similar('Pecorino/PROPN')[:10]))[0]
print("Most similar words to Pecorino/PROPN:")
print(similar)
Keyword extraction via tf-idf
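from itertools import chain
from collections import Counter
from conll_iterator import ConllIterator

docs = ConllIterator('sample/sample_corpus.conllu', fields=['lemma', 'upos'], lower=['lemma'], mode='documents')
doc_tf = list()  # term frequencies, one Counter per document
df = Counter()   # number of documents each lemma occurs in
allowed_pos = ['NOUN', 'PROPN', 'VERB', 'ADJ']
for d in docs:
    # flatten the document's sentences into one list of (lemma, upos) tokens
    tokens = list(chain(*d))
    tokens = [t[0] for t in tokens if t[1] in allowed_pos]
    tf = Counter(tokens)
    df.update(set(tokens))
    doc_tf.append(tf)

doc_keywords = list()
for d in doc_tf:
    # score each lemma by its term frequency divided by its document frequency
    doc_tfidf = [(w, d[w]/df[w]) for w in d]
    doc_tfidf = sorted(doc_tfidf, key=lambda x: x[1], reverse=True)
    # keep the ten best-scoring lemmas (an empty list if the document has none)
    doc_keywords.append([w for w, _ in doc_tfidf[:10]])

for i, k in enumerate(doc_keywords[:20]):
    print('keywords of doc {}:'.format(i+1), '; '.join(k))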
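The example above scores each lemma as term frequency divided by document frequency. A common alternative weights the term frequency by a logarithmic inverse document frequency, tf * log(N / df), where N is the number of documents. The sketch below shows that variant on a toy corpus; the documents and the `tfidf` helper are made up for illustration and do not depend on conll_iterator.

```python
import math
from collections import Counter

# Toy corpus: each document is a list of already-filtered lemmas.
toy_docs = [
    ["cheese", "milk", "cheese"],
    ["milk", "bread"],
    ["bread", "butter", "milk"],
]
n_docs = len(toy_docs)

# Document frequency: in how many documents each lemma occurs.
toy_df = Counter()
for tokens in toy_docs:
    toy_df.update(set(tokens))


def tfidf(tokens):
    """Hypothetical helper: map each lemma to tf * log(N / df)."""
    tf = Counter(tokens)
    return {w: tf[w] * math.log(n_docs / toy_df[w]) for w in tf}


scores = [tfidf(tokens) for tokens in toy_docs]

# Best-scoring lemma of the first document.
top = max(scores[0], key=scores[0].get)
print(top)  # cheese
```

With the logarithm, a lemma that occurs in every document (here "milk") gets a score of zero, so it never ranks as a keyword regardless of how often it appears.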