A simple iterator that reads conll files.

CONTENTS OF THIS FILE

  • Introduction
  • Setup
  • Getting started
  • Examples

INTRODUCTION

A simple iterator that reads CoNLL and CoNLL-U files (https://universaldependencies.org/format.html) without loading the whole file into memory. It can iterate over words, sentences, or documents.
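To illustrate what reading a CoNLL-U file without keeping it in memory can look like, here is a minimal sketch built on a plain Python generator. It is not the code used by conll_iterator: the helper name `iter_sentences` and the sample data are assumptions made for illustration, and only the ten column names come from the CoNLL-U format.

```python
import io

# Column names defined by the CoNLL-U format (10 tab-separated fields).
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def iter_sentences(lines, fields=("form", "upos")):
    """Yield one sentence at a time as a list of tuples.

    Hypothetical helper, not part of conll_iterator. `lines` can be any
    iterable of text lines, e.g. an open file, so only the current
    sentence is held in memory.
    """
    idx = [CONLLU_FIELDS.index(f) for f in fields]
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:                  # a blank line closes a sentence
            if sentence:
                yield sentence
                sentence = []
        elif line.startswith("#"):    # comment or metadata line
            continue
        else:
            cols = line.split("\t")
            sentence.append(tuple(cols[i] for i in idx))
    if sentence:                      # the file may lack a trailing blank line
        yield sentence


# Made-up sample data standing in for a file on disk.
sample = io.StringIO(
    "# sent_id = 1\n"
    "1\tDogs\tdog\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\t_\n"
    "\n"
    "1\tHello\thello\tINTJ\t_\t_\t0\troot\t_\t_\n"
)
result = list(iter_sentences(sample))
print(result)
```

Because the generator yields each sentence as soon as its closing blank line is read, memory use depends on the longest sentence rather than on the size of the file.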

SETUP

pip install git+https://github.com/nicolaCirillo/conll_iterator.git

GETTING STARTED

from conll_iterator import ConllIterator
from tqdm import tqdm


sentences = ConllIterator('sample/sample_corpus.conllu', 
                         fields=['form', 'upos'], 
                         mode='sentences', 
                         join_char='/')

for s in tqdm(sentences):
    pass  # replace with your own processing of each sentence
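The effect of join_char can be pictured with plain Python. This is only a sketch under an assumption: that the selected fields of each token are joined into a single string when join_char is given (the word2vec example below looks up 'Pecorino/PROPN', which suggests this), and are otherwise left as a tuple. The helper `join_fields` and the sample sentence are hypothetical and not part of conll_iterator.

```python
def join_fields(token, join_char=None):
    """Hypothetical helper, not part of conll_iterator.

    Joins the fields of one token into a single string when join_char
    is given; otherwise returns the token tuple unchanged.
    """
    if join_char is None:
        return token
    return join_char.join(token)


# Made-up sentence with fields=['form', 'upos'].
sentence = [("Pecorino", "PROPN"), ("is", "AUX"), ("cheese", "NOUN")]

joined = [join_fields(t, "/") for t in sentence]
unjoined = [join_fields(t) for t in sentence]
print(joined)
print(unjoined)
```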

EXAMPLES

Training word2vec

from gensim.models import Word2Vec
from tqdm import tqdm
from conll_iterator import ConllIterator

# each token is a 'lemma/upos' string, e.g. 'Pecorino/PROPN'
sentences = ConllIterator('sample/sample_corpus.conllu', fields=['lemma', 'upos'], mode='sentences', join_char='/')
w2v_parameters = {'vector_size': 25, 'window': 5, 'min_count': 1, 'sg': 1, 'epochs': 15}
model = Word2Vec(tqdm(sentences), workers=5, **w2v_parameters)
model.save('sample_w2v')
word_vectors = model.wv
# most_similar returns (word, similarity) pairs; keep only the words
similar = [w for w, _ in word_vectors.most_similar('Pecorino/PROPN', topn=10)]
print("Most similar words to Pecorino/PROPN:")
print(similar)

Keyword extraction via tf-idf

from itertools import chain
from collections import Counter
from conll_iterator import ConllIterator

docs = ConllIterator('sample/sample_corpus.conllu', fields=['lemma', 'upos'], lower=['lemma'], mode='documents')
doc_tf = list()
df = Counter()
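allowed_pos = ['NOUN', 'PROPN', 'VERB', 'ADJ']
for d in docs:
    # flatten the document's sentences into one list of (lemma, upos) tokens
    tokens = list(chain(*d))
    # keep only the lemmas of content words
    tokens = [t[0] for t in tokens if t[1] in allowed_pos]
    tf = Counter(tokens)
    # document frequency: count each lemma once per document
    df.update(set(tokens))
    doc_tf.append(tf)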

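doc_keywords = list()
for d in doc_tf:
    # score = term frequency divided by document frequency (a simple tf-idf variant)
    doc_tfidf = [(w, d[w] / df[w]) for w in d]
    doc_tfidf = sorted(doc_tfidf, key=lambda x: x[1], reverse=True)
    # keep the ten best-scoring words; also works when a document has no allowed tokens
    doc_keywords.append([w for w, _ in doc_tfidf[:10]])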

for i, k in enumerate(doc_keywords[:20]):
    print('keywords of doc {}:'.format(i+1), '; '.join(k))
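The example above scores each word as term frequency divided by document frequency. A more common weighting multiplies the term frequency by log(N / df), where N is the number of documents. The sketch below shows that variant on made-up toy documents; the helper name `tfidf` and the data are assumptions for illustration and are not part of conll_iterator.

```python
import math
from collections import Counter

# Toy documents, already reduced to lists of lemmas (made-up data).
toy_docs = [
    ["cheese", "milk", "cheese"],
    ["milk", "cow"],
    ["cheese", "wine"],
]

toy_tf = [Counter(d) for d in toy_docs]

# Document frequency: in how many documents each lemma occurs.
toy_df = Counter()
for tf in toy_tf:
    toy_df.update(tf.keys())

n_docs = len(toy_docs)


def tfidf(tf, df, n_docs):
    """Hypothetical helper: tf * log(N / df) for every word of one document."""
    return {w: count * math.log(n_docs / df[w]) for w, count in tf.items()}


scores = [tfidf(tf, toy_df, n_docs) for tf in toy_tf]
for i, s in enumerate(scores):
    ranked = sorted(s, key=s.get, reverse=True)
    print('keywords of doc {}:'.format(i + 1), '; '.join(ranked))
```

With the logarithm, a word that occurs in every document gets a score of zero, which the plain tf/df ratio does not guarantee.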
