MultiDirectoryCorpusReader

MultiDirectoryCorpusReader provides a simple iterator over raw text files gathered by globbing several source directories. The files can either be streamed from disk or read into memory.

Installation

It can be installed directly from GitHub using:

#> python -m pip install git+https://github.com/blackplague/multidirectorycorpusreader.git

Usage example

The minimum viable usage is to supply a list of source directories and a list of globbing filters.

mdcr = MultiDirectoryCorpusReader(
    source_directories=['data/source1', 'data/source2'],
    glob_filters=['*.txt', '*.msg', '*.doc', '*.text'])
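To make "multi-directory globbing" concrete, here is a minimal standard-library sketch of the idea. It is not the package's implementation: `collect_files` is a hypothetical helper, and whether the real reader searches subdirectories is not stated here, so this sketch assumes a non-recursive match.

```python
from pathlib import Path
from typing import Iterable, List


def collect_files(source_directories: Iterable[str],
                  glob_filters: Iterable[str]) -> List[Path]:
    """Return every file in the given directories that matches any filter.

    Hypothetical helper for illustration only; not part of the package.
    """
    filters = list(glob_filters)  # materialize so it can be reused per directory
    matches = set()               # a set avoids duplicates from overlapping filters
    for directory in source_directories:
        for pattern in filters:
            # Path.glob('*.txt') only looks at the directory's direct children.
            matches.update(p for p in Path(directory).glob(pattern) if p.is_file())
    return sorted(matches)
```

Calling `collect_files(['data/source1', 'data/source2'], ['*.txt', '*.msg'])` would return a sorted list of matching paths from both directories.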

You can also pass a preprocess function to the reader, for example the simple_preprocess function from the Gensim library. With print_progress=True, progress is printed while the files are streamed.

from gensim.utils import simple_preprocess

mdcr = MultiDirectoryCorpusReader(
    source_directories=['data/source1', 'data/source2'],
    glob_filters=['*.txt', '*.msg', '*.doc', '*.text'],
    preprocess_function=simple_preprocess,
    print_progress=True)
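As a rough picture of what streaming with a preprocess function means, the sketch below reads one file at a time and applies the function lazily. `stream_documents` and its progress format are assumptions made for illustration; the package's own internals and output may differ.

```python
from pathlib import Path
from typing import Callable, Iterable, Iterator, List, Optional, Union


def stream_documents(
    paths: Iterable[Path],
    preprocess_function: Optional[Callable[[str], List[str]]] = None,
    print_progress: bool = False,
) -> Iterator[Union[str, List[str]]]:
    """Yield each file's content, preprocessed if a function is given.

    Hypothetical helper for illustration only; not part of the package.
    """
    paths = list(paths)  # only the path list is held in memory, not the contents
    for index, path in enumerate(paths, start=1):
        text = path.read_text(encoding="utf-8", errors="ignore")
        if print_progress:
            print(f"[{index}/{len(paths)}] {path}")
        # Each document is produced on demand, so memory use stays low.
        yield preprocess_function(text) if preprocess_function else text
```

Wrapping the generator in `list(...)` materializes every document at once, which is the in-memory counterpart of the same loop.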

This example shows how to supply a preprocess function that you have written yourself. In addition, in_memory=True reads all files into memory, and progress is printed while they are read.

from typing import List

def preprocessor_tokenize_remove_a(s: str) -> List[str]:
    return s.replace('a', '').split(' ')

mdcr = MultiDirectoryCorpusReader(
    source_directories=['data/source1', 'data/source2'],
    glob_filters=['*.txt', '*.msg', '*.doc', '*.text'],
    preprocess_function=preprocessor_tokenize_remove_a,
    in_memory=True,
    print_progress=True)
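A quick check of what this custom preprocessor returns for a sample string may help when writing your own. The sample input is made up for illustration. Note that splitting on a single space leaves empty strings wherever a standalone 'a' was removed.

```python
from typing import List


def preprocessor_tokenize_remove_a(s: str) -> List[str]:
    # Remove every 'a', then split on single spaces.
    return s.replace('a', '').split(' ')


tokens = preprocessor_tokenize_remove_a('a banana and a cat')
print(tokens)  # ['', 'bnn', 'nd', '', 'ct']

# Variant (an assumption about what may be wanted): split on any run of
# whitespace, which drops the empty strings.
cleaned = 'a banana and a cat'.replace('a', '').split()
print(cleaned)  # ['bnn', 'nd', 'ct']
```

Any callable that takes a string and returns a list of tokens can be passed in the same way.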

Release History

  • 0.2.0
    • The first proper release

Meta

Michael Andersen - michael10andersen+mdcr -[at]- gmail.com - Github

Distributed under the LGPL3 license. See LICENSE for more information.


