MultiDirectoryCorpusReader

MultiDirectoryCorpusReader provides a simple iterator over raw text files collected from multiple source directories using glob patterns. The files can either be streamed from disk or read into memory.

Installation

It can be installed directly from GitHub using:

#> python -m pip install git+https://github.com/blackplague/multidirectorycorpusreader.git

Usage example

The minimum viable usage is to supply a list of source directories and a list of globbing filters.

mdcr = MultiDirectoryCorpusReader(
    source_directories=['data/source1', 'data/source2'],
    glob_filters=['*.txt', '*.msg', '*.doc', '*.text'])
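
To illustrate what combining source directories with glob filters means, here is a minimal sketch using only the standard library. It is not the library's implementation: the helper name `iter_matching_files` is hypothetical, and since the text above does not say whether subdirectories are searched or in what order files are returned, the sketch assumes a non-recursive match and sorted order within each pattern.

```python
from pathlib import Path
from typing import Iterator, List


def iter_matching_files(source_directories: List[str],
                        glob_filters: List[str]) -> Iterator[Path]:
    """Yield each file in the given directories that matches any glob filter.

    Hypothetical helper for illustration only; not part of the library.
    """
    for directory in source_directories:
        seen = set()  # avoid yielding a file twice if two filters match it
        for pattern in glob_filters:
            # Path.glob matches the pattern relative to the directory
            # (non-recursive for patterns such as '*.txt').
            for path in sorted(Path(directory).glob(pattern)):
                if path.is_file() and path not in seen:
                    seen.add(path)
                    yield path
```

Files whose names match none of the filters are simply skipped, so a directory may hold unrelated files without affecting the result.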

It is possible to pass a preprocess function to the reader, for example the simple_preprocess function from the Gensim library. Setting print_progress=True additionally prints progress while the files are streamed.

from gensim.utils import simple_preprocess

mdcr = MultiDirectoryCorpusReader(
    source_directories=['data/source1', 'data/source2'],
    glob_filters=['*.txt', '*.msg', '*.doc', '*.text'],
    preprocess_function=simple_preprocess,
    print_progress=True)
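
Streaming means that documents are produced one at a time rather than collected up front. The idea can be sketched with a plain generator. The helper name `stream_documents` and the assumption of UTF-8 encoded files are illustrative and not taken from the library.

```python
from pathlib import Path
from typing import Callable, Iterable, Iterator, List


def stream_documents(paths: Iterable[Path],
                     preprocess_function: Callable[[str], List[str]]
                     ) -> Iterator[List[str]]:
    """Lazily read each file and yield its preprocessed content.

    Hypothetical illustration; only one file's text is read per step.
    """
    for path in paths:
        text = path.read_text(encoding="utf-8")  # assumed encoding
        yield preprocess_function(text)
```

Wrapping such a generator in `list(...)` would materialize every document at once. That is the general trade-off between streaming and holding a corpus in memory: more memory in exchange for cheap repeated iteration.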

This example shows how to supply a preprocess function that you have written yourself. In addition, setting in_memory=True reads all files into memory, and progress is printed while they are loaded.

from typing import List

def preprocessor_tokenize_remove_a(s: str) -> List[str]:
    return s.replace('a', '').split(' ')

mdcr = MultiDirectoryCorpusReader(
    source_directories=['data/source1', 'data/source2'],
    glob_filters=['*.txt', '*.msg', '*.doc', '*.text'],
    preprocess_function=preprocessor_tokenize_remove_a,
    in_memory=True,
    print_progress=True)
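
To see what a preprocess function like the one above returns, it can be called directly on a string. Note that `split(' ')` keeps an empty string wherever a space leads the text or two spaces end up adjacent. The second function below, `preprocessor_tokenize_remove_a_compact`, is a hypothetical variant (not from the library) that uses `split()` to drop those empty tokens.

```python
from typing import List


def preprocessor_tokenize_remove_a(s: str) -> List[str]:
    # Same function as in the example above: drop every 'a',
    # then split on single spaces.
    return s.replace('a', '').split(' ')


def preprocessor_tokenize_remove_a_compact(s: str) -> List[str]:
    # Hypothetical variant: split() with no argument splits on runs of
    # whitespace and discards empty tokens.
    return s.replace('a', '').split()


sample = "a cat sat"
print(preprocessor_tokenize_remove_a(sample))          # ['', 'ct', 'st']
print(preprocessor_tokenize_remove_a_compact(sample))  # ['ct', 'st']
```

Which behaviour is preferable depends on whether downstream code tolerates empty tokens.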

Release History

  • 0.2.0
    • The first proper release

Meta

Michael Andersen - michael10andersen+mdcr -[at]- gmail.com - GitHub

Distributed under the LGPL3 license. See LICENSE for more information.
