MultiDirectoryCorpusReader

MultiDirectoryCorpusReader provides a simple iterator over raw text files collected by globbing multiple source directories. The files can either be streamed from disk or read into memory.

Installation

It can be installed directly from GitHub using:

#> python -m pip install git+https://github.com/blackplague/multidirectorycorpusreader.git

Usage example

The minimum viable usage is to supply a list of source directories and a list of globbing filters.

from multidirectorycorpusreader import MultiDirectoryCorpusReader

mdcr = MultiDirectoryCorpusReader(
    source_directories=['data/source1', 'data/source2'],
    glob_filters=['*.txt', '*.msg', '*.doc', '*.text'])
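
To make the globbing step concrete, here is a minimal standard-library sketch of what combining several source directories with several glob filters amounts to. It is not the package's implementation: the helper name iter_matching_files, the non-recursive Path.glob call, and the de-duplication of files matched by more than one filter are all assumptions made for illustration.

```python
from pathlib import Path
from typing import Iterable, Iterator


def iter_matching_files(source_directories: Iterable[str],
                        glob_filters: Iterable[str]) -> Iterator[Path]:
    """Yield every file in each directory that matches any glob filter.

    Hypothetical helper, not part of multidirectorycorpusreader.
    """
    filters = list(glob_filters)  # materialize so it can be reused per directory
    for directory in source_directories:
        seen = set()  # avoid yielding a file twice when two filters match it
        for pattern in filters:
            # Sort so the order is deterministic across platforms.
            for path in sorted(Path(directory).glob(pattern)):
                if path.is_file() and path not in seen:
                    seen.add(path)
                    yield path
```

Because the helper is a generator, paths are discovered lazily, one directory and one filter at a time.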

It is possible to pass a preprocess function to the reader, for example the simple_preprocess function from the Gensim library. Setting print_progress=True also prints progress while the files are streamed.

from gensim.utils import simple_preprocess

mdcr = MultiDirectoryCorpusReader(
    source_directories=['data/source1', 'data/source2'],
    glob_filters=['*.txt', '*.msg', '*.doc', '*.text'],
    preprocess_function=simple_preprocess,
    print_progress=True)

This example shows how to supply a preprocess function that you have written yourself. In addition, in_memory=True reads all files into memory, and print_progress=True prints progress while doing so.

from typing import List

def preprocessor_tokenize_remove_a(s: str) -> List[str]:
    return s.replace('a', '').split(' ')

mdcr = MultiDirectoryCorpusReader(
    source_directories=['data/source1', 'data/source2'],
    glob_filters=['*.txt', '*.msg', '*.doc', '*.text'],
    preprocess_function=preprocessor_tokenize_remove_a,
    in_memory=True,
    print_progress=True)
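
The difference between streaming and in_memory=True can be pictured in plain Python: a generator reads and preprocesses one file at a time, while a list holds every preprocessed document at once. The sketch below illustrates that general technique under stated assumptions and is not the package's code; stream_documents and load_documents are hypothetical names, and UTF-8 is an assumed file encoding.

```python
from pathlib import Path
from typing import Callable, Iterable, Iterator, List


def preprocessor_tokenize_remove_a(s: str) -> List[str]:
    # Same toy preprocessor as in the usage example above.
    return s.replace('a', '').split(' ')


def stream_documents(paths: Iterable[Path],
                     preprocess_function: Callable[[str], List[str]]
                     ) -> Iterator[List[str]]:
    """Lazily read and preprocess one file at a time (hypothetical helper)."""
    for path in paths:
        yield preprocess_function(path.read_text(encoding='utf-8'))


def load_documents(paths: Iterable[Path],
                   preprocess_function: Callable[[str], List[str]]
                   ) -> List[List[str]]:
    """Read and preprocess every file up front (hypothetical helper)."""
    return list(stream_documents(paths, preprocess_function))


# The toy preprocessor splits on single spaces, so removing a standalone
# 'a' leaves empty strings behind in the token list.
print(preprocessor_tokenize_remove_a('a banana and a pear'))
# → ['', 'bnn', 'nd', '', 'per']
```

In this sketch the streaming variant keeps only one document in memory at a time but rereads the files on every pass, whereas the in-memory variant trades RAM for faster repeated iteration.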

Release History

  • 0.2.1
    • Makes the MultiDirectoryCorpusReader available through from multidirectorycorpusreader import MultiDirectoryCorpusReader
  • 0.2.0
    • The first proper release

Meta

Michael Andersen - michael10andersen+mdcr -[at]- gmail.com - Github

Distributed under the LGPL3 license. See LICENSE for more information.
