natlang: Natural Language Data Loading Tools

Data loader/common data structures and other tools

Most of the code is compatible with both Python 2 and Python 3. To find the Python version a specific module targets, check the second line of its source file.

0. Usage

Installing with pip gets you the latest tested version of natlang.

> pip install natlang

Alternatively, you can also install from source using the following command:

> python setup.py install

To load a dataset:

> import natlang as nl
> data = nl.load(filePattern, format=chosenFormat)
> # chosenFormat can be either an imported format or a string.
> # Alternatively, you can pass in a loader function using nl.load(filePattern, loader=func)

For parallel datasets:

> import natlang as nl
> data = nl.biload(srcPattern, tgtPattern, srcFormat, tgtFormat)
> # The loader option described for nl.load also applies here. src stands for source, tgt stands for target.

1. Format

All supported formats are placed under src/format. Currently the following formats are tested:

  1. txt: simple text format. Sentences are separated by \n, tokens/words are separated by whitespace.

  2. tree: constituency tree format. Run python -i format/tree.py to play around.

  3. semanticFrame: PropBank/NomBank frame loader. Returns bundles of frames for analysis.

  4. AMR: Abstract Meaning Representation. Run python -i format/AMR.py to play around.

  5. conll: General CoNLL format loader. Default is CoNLL_U. Run python -i format/conll.py to play around.

1.1 Recommended Functions

A format that can be loaded from a file should implement a load function in its format file (see 2.1).

A format that can be exported should give each of its instances an export method that returns a string.
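As a rough illustration of the export convention, the sketch below defines a toy format whose instances know how to turn themselves back into a string. The TokenList class and its whitespace-joined output are hypothetical and are not taken from natlang's own format files.

```python
class TokenList(object):
    """Hypothetical format: one sentence stored as a list of tokens."""

    def __init__(self, tokens):
        self.tokens = list(tokens)

    def export(self):
        # Convention from this section: export returns a string.
        return " ".join(self.tokens)


sentence = TokenList("the cat sat".split())
exported = sentence.export()
print(exported)
```

Because export returns a string rather than writing to a file itself, the caller stays free to decide where the output goes.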

2. Loader

2.1 Individual Loader

Each format has its own loader. It is defined as format.FORMAT.load. The load function has the following interface:

def load(file, linesToLoad=sys.maxsize)

At test time, the load function is expected to parse the file description and read from it. It returns the first linesToLoad entries as a list.

For example, to load a file in constituency tree format (see the example in tests/sampleTree.txt), one could do the following:

from datatool.format import tree
x = tree.load("datatool/tests/sampleTree.txt")
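To make the interface concrete, here is a minimal sketch of a load function for a txt-like format (one sentence per line, tokens separated by whitespace). It is an assumption about how such a loader could be written, not natlang's actual implementation; the temporary file exists only for the demonstration.

```python
import io
import os
import sys
import tempfile
from itertools import islice


def load(file, linesToLoad=sys.maxsize):
    """Return the first linesToLoad lines of a text file as token lists."""
    with io.open(file, encoding="utf-8") as f:
        # islice stops after linesToLoad lines without reading the rest.
        return [line.split() for line in islice(f, linesToLoad)]


# Demonstration on a throwaway file with three sentences.
fd, path = tempfile.mkstemp(suffix=".txt")
with io.open(fd, "w", encoding="utf-8") as f:
    f.write(u"the cat sat\non the mat\nthe end\n")

firstTwo = load(path, linesToLoad=2)
everything = load(path)
os.remove(path)
print(firstTwo)
```

Defaulting linesToLoad to sys.maxsize means that callers who omit the argument get the whole file, while callers who only need a sample can cap the read cheaply.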

2.2 Class ParallelDataLoader

This class allows one to load parallel corpora (L1, L2) in any format. You can specify the format for L1 and L2 side separately.

from datatool.loader import ParallelDataLoader
loader = ParallelDataLoader(srcFormat='txtOrTree', tgtFormat='txtOrTree')

Here, 'txtOrTree' is the default value for srcFormat and tgtFormat. Note that besides data structures for specific formats, the format folder also contains pure loaders; 'txtOrTree' is one of them and can handle both tree and txt.

After initialising the loader, one can just go ahead and run:

loader.load(fFile, eFile, linesToLoad)

The loader will automatically align the parallel text and output a list of tuples, each containing a single entry in L1 and L2. Entries with either L1 or L2 being None or of length 0 will be omitted.
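The pairing and filtering behaviour described above can be sketched as follows. This assumes that alignment is purely positional (entry i of L1 is paired with entry i of L2); the align function is a hypothetical helper and not part of ParallelDataLoader.

```python
def align(srcEntries, tgtEntries):
    """Pair entries by position, dropping pairs with a missing or empty side."""
    pairs = []
    for src, tgt in zip(srcEntries, tgtEntries):
        # Omit the pair if either side failed to load.
        if src is None or tgt is None:
            continue
        # Omit the pair if either side is empty.
        if len(src) == 0 or len(tgt) == 0:
            continue
        pairs.append((src, tgt))
    return pairs


src = [["ein", "Haus"], [], ["guten", "Tag"], None]
tgt = [["a", "house"], ["orphan"], ["good", "day"], ["x"]]
pairs = align(src, tgt)
print(len(pairs))
```

Dropping the whole pair, rather than only the bad side, keeps the remaining L1 and L2 entries in step with each other.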

3. Exporter

Usage:

from datatool.exporter import exportToFile, RealtimeExporter

3.1 Function exportToFile

Exports a whole dataset (not a single entry) in txt or tree format to a file.

3.2 Class RealtimeExporter

The code is largely self-explanatory. This class is recommended when the export function of a specific format takes a long time to run.

