MultiDirectoryCorpusReader

MultiDirectoryCorpusReader provides an iterator over raw text files collected by globbing across multiple source directories. The files can be either streamed from disk or read into memory.

Installation

It can be installed directly from GitHub using:

python -m pip install git+https://github.com/blackplague/multidirectorycorpusreader.git

or from PyPI using:

python -m pip install multidirectorycorpusreader

Usage example

The minimal usage is to supply a list of source directories and a list of glob filters.

from multidirectorycorpusreader import MultiDirectoryCorpusReader

mdcr = MultiDirectoryCorpusReader(
    source_directories=['data/source1', 'data/source2'],
    glob_filters=['*.txt', '*.msg', '*.doc', '*.text'])

This makes it possible to iterate over the contents of the files in data/source1 and data/source2 that have the extensions txt, msg, doc, or text, as follows:

for file_content in mdcr:
    print(f'File content: {file_content}')
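To make the idea of multi-directory globbing concrete, here is a minimal standard-library sketch. It is not the library's implementation: `iter_file_contents` is a hypothetical helper written for illustration, and the throwaway corpus exists only so the example can run anywhere. It shows the difference between consuming files lazily and materializing them all at once.

```python
import tempfile
from pathlib import Path
from typing import Iterator, List


def iter_file_contents(source_directories: List[str],
                       glob_filters: List[str]) -> Iterator[str]:
    """Hypothetical helper: yield the text of every file that matches
    any glob filter in any of the source directories."""
    for directory in source_directories:
        for pattern in glob_filters:
            # sorted() keeps the order deterministic within one pattern.
            for path in sorted(Path(directory).glob(pattern)):
                if path.is_file():
                    yield path.read_text(encoding='utf-8')


# Build a tiny throwaway corpus so the sketch is self-contained.
with tempfile.TemporaryDirectory() as root:
    for name, text in [('source1/a.txt', 'first'),
                       ('source1/skip.csv', 'ignored'),
                       ('source2/b.msg', 'second')]:
        path = Path(root, name)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(text, encoding='utf-8')

    directories = [str(Path(root, 'source1')), str(Path(root, 'source2'))]
    filters = ['*.txt', '*.msg']

    # Streaming: a file is only read when the iterator is advanced.
    stream = iter_file_contents(directories, filters)
    first = next(stream)

    # In memory: every matching file is read up front into a list.
    in_memory = list(iter_file_contents(directories, filters))
```

Streaming keeps memory use low for large corpora, while the in-memory list avoids re-reading files when the corpus is iterated more than once.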

A preprocess function can be passed to the reader, for example the simple_preprocess function from the Gensim library. Setting print_progress=True, as in this example, also prints progress while the files are streamed.

from gensim.utils import simple_preprocess

mdcr = MultiDirectoryCorpusReader(
    source_directories=['data/source1', 'data/source2'],
    glob_filters=['*.txt', '*.msg', '*.doc', '*.text'],
    preprocess_function=simple_preprocess,
    print_progress=True)

This example shows how to supply a preprocess function that you have written yourself. In addition, setting in_memory=True reads all files into memory, and progress is printed while they are read.

from typing import List

def preprocessor_tokenize_remove_a(s: str) -> List[str]:
    return s.replace('a', '').split(' ')

mdcr = MultiDirectoryCorpusReader(
    source_directories=['data/source1', 'data/source2'],
    glob_filters=['*.txt', '*.msg', '*.doc', '*.text'],
    preprocess_function=preprocessor_tokenize_remove_a,
    in_memory=True,
    print_progress=True)
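As a quick check of what the custom preprocessor above returns, the function can be called on its own, without the reader. The sample sentence is made up for illustration.

```python
from typing import List


def preprocessor_tokenize_remove_a(s: str) -> List[str]:
    # Remove every lowercase 'a', then split on single spaces.
    return s.replace('a', '').split(' ')


sample = 'a banana and a cat'

tokens = preprocessor_tokenize_remove_a(sample)
print(tokens)  # ['', 'bnn', 'nd', '', 'ct']

# str.split() with no argument splits on runs of whitespace instead,
# so it does not produce empty tokens.
compact = sample.replace('a', '').split()
print(compact)  # ['bnn', 'nd', 'ct']
```

Because split(' ') splits on each individual space, a word that consisted only of the letter a leaves an empty string in the output. If empty tokens are unwanted, calling split() without an argument is one way to avoid them.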

Release History

  • 0.2.3
    • Unifies the _non_recursive(...) and _recursive(...) functions into a single _globber function.
  • 0.2.2
    • Improves README.md with better example code and directions for installing via pip.
  • 0.2.1
    • Makes MultiDirectoryCorpusReader importable via: from multidirectorycorpusreader import MultiDirectoryCorpusReader
  • 0.2.0
    • The first proper release

Meta

Michael Andersen - michael10andersen+mdcr -[at]- gmail.com - Github

Distributed under the LGPL3 license. See LICENSE for more information.
