MultiDirectoryCorpusReader

MultiDirectoryCorpusReader provides a simple iterator over raw text files that are collected from multiple source directories using glob patterns. The files can either be streamed from disk or loaded into memory.

Installation

It can be installed directly from GitHub using:

#> python -m pip install git+https://github.com/blackplague/multidirectorycorpusreader.git

Usage example

The minimum viable usage is to supply a list of source directories and a list of globbing filters.

from multidirectorycorpusreader import MultiDirectoryCorpusReader

mdcr = MultiDirectoryCorpusReader(
    source_directories=['data/source1', 'data/source2'],
    glob_filters=['*.txt', '*.msg', '*.doc', '*.text'])
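To illustrate how a list of source directories and a list of glob filters combine, here is a minimal standard-library sketch. It is not the library's implementation: the helper names iter_matching_files and stream_texts are hypothetical, and UTF-8 decoding is an assumption. It only shows the general idea of streaming files one at a time versus collecting them all in memory.

```python
from pathlib import Path
from typing import Iterator, List


def iter_matching_files(source_directories: List[str],
                        glob_filters: List[str]) -> Iterator[Path]:
    """Yield each file that matches any glob filter in any directory.

    Hypothetical helper for illustration only.
    """
    seen = set()
    for directory in source_directories:
        for pattern in glob_filters:
            # sorted() makes the order deterministic within one pattern.
            for path in sorted(Path(directory).glob(pattern)):
                # Skip directories and files already matched by another filter.
                if path.is_file() and path not in seen:
                    seen.add(path)
                    yield path


def stream_texts(source_directories: List[str],
                 glob_filters: List[str]) -> Iterator[str]:
    """Streaming mode: read one file at a time (UTF-8 is an assumption)."""
    for path in iter_matching_files(source_directories, glob_filters):
        yield path.read_text(encoding="utf-8", errors="replace")
```

Wrapping the generator in list(...), as in list(stream_texts(dirs, filters)), gives the in-memory variant: every file is read up front, which trades memory for faster repeated passes over the corpus.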

You can also pass a preprocess function to the reader, for example the simple_preprocess function from the Gensim library. The example below also prints progress while the files are streamed.

from gensim.utils import simple_preprocess

mdcr = MultiDirectoryCorpusReader(
    source_directories=['data/source1', 'data/source2'],
    glob_filters=['*.txt', '*.msg', '*.doc', '*.text'],
    preprocess_function=simple_preprocess,
    print_progress=True)

This example shows how to supply a preprocess function that you have written yourself. It also reads all files into memory and prints progress while doing so.

from typing import List

def preprocessor_tokenize_remove_a(s: str) -> List[str]:
    return s.replace('a', '').split(' ')

mdcr = MultiDirectoryCorpusReader(
    source_directories=['data/source1', 'data/source2'],
    glob_filters=['*.txt', '*.msg', '*.doc', '*.text'],
    preprocess_function=preprocessor_tokenize_remove_a,
    in_memory=True,
    print_progress=True)
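The preprocessor above splits on single spaces only, so repeated spaces produce empty strings and newlines stay inside tokens. As a possible alternative, here is a small sketch of a custom preprocessor with the same str to List[str] shape as the example. The name preprocess_lowercase_tokens is hypothetical, and it assumes the reader calls the function with the text of one file at a time.

```python
import re
from typing import List

# \w+ matches runs of letters, digits and underscores (Unicode-aware for str).
_TOKEN_PATTERN = re.compile(r"\w+")


def preprocess_lowercase_tokens(text: str) -> List[str]:
    """Lowercase the text and return its word tokens, dropping punctuation."""
    return _TOKEN_PATTERN.findall(text.lower())


print(preprocess_lowercase_tokens("Hello,  World!\nSecond line 42"))
# → ['hello', 'world', 'second', 'line', '42']
```

It could then be passed as preprocess_function=preprocess_lowercase_tokens in the same way as preprocessor_tokenize_remove_a.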

Release History

  • 0.2.1
    • Makes the MultiDirectoryCorpusReader available through from multidirectorycorpusreader import MultiDirectoryCorpusReader
  • 0.2.0
    • The first proper release

Meta

Michael Andersen - michael10andersen+mdcr -[at]- gmail.com - GitHub

Distributed under the LGPL3 license. See LICENSE for more information.

