natlang: Natural Language Data Loading Tools
Data loader/common data structures and other tools
Most of the code is Python 2/3 compatible. For the Python version required by a specific module, check the second line of its source file.
0. Usage
Installing with pip gets you the latest tested version of natlang:
> pip install natlang
Alternatively, install from source with the following command:
> python setup.py install
To load a dataset:
> import natlang as nl
> data = nl.load(filePattern, format=ChosenFormat)
> # ChosenFormat here can be an actual imported format or a string.
> # Alternatively, you can pass in a loader function using nl.load(filePattern, loader=func)
For parallel datasets:
> import natlang as nl
> data = nl.biload(srcPattern, tgtPattern, srcFormat, tgtFormat)
> # The loader option of nl.load also applies here. src stands for source, tgt stands for target.
1. Format
All supported formats are placed under src/format.
Currently the following formats are tested:
- txt: simple text format. Sentences are separated by \n, tokens/words are separated by whitespace.
- tree: constituency tree format. Run python -i format/tree.py to play around.
- semanticFrame: Propbank/Nombank frame loader. Returns bundles of frames for analysis.
- AMR: Abstract Meaning Representation. Run python -i format/AMR.py to play around.
- conll: General CoNLL format loader. Default is CoNLL_U. Run python -i format/conll.py to play around.
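As an illustration of the txt format described above, here is a minimal sketch of how such text could be split into sentences and tokens. The function name parse_txt is hypothetical and not part of natlang, and skipping blank lines is an assumption of this sketch.

```python
def parse_txt(text):
    """Split raw text into sentences (one per line) and whitespace tokens."""
    sentences = []
    for line in text.split("\n"):
        # str.split() with no argument splits on any run of whitespace
        tokens = line.split()
        if tokens:  # assumption: blank lines are not kept as sentences
            sentences.append(tokens)
    return sentences


sample = "the cat sat\non the  mat\n"
parsed = parse_txt(sample)
print(parsed)  # [['the', 'cat', 'sat'], ['on', 'the', 'mat']]
```

The actual txt loader in natlang may treat blank lines or punctuation differently.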
1.1 Recommended Functions
Formats that can be loaded from a file should implement a load
function in the format file (see 2.1).
Formats that can be exported should give each instance
an export method that returns a string.
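The export recommendation could look roughly like the sketch below. The class TokenSentence is a hypothetical entry type invented for this example, not one of natlang's formats; only the convention of an export method returning a string comes from the text above.

```python
class TokenSentence:
    """Hypothetical format entry: a sentence stored as a list of tokens."""

    def __init__(self, tokens):
        self.tokens = list(tokens)

    def export(self):
        # Return the whole entry as a single string.
        return " ".join(self.tokens)


entry = TokenSentence(["hello", "world"])
exported = entry.export()
print(exported)  # hello world
```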
2. Loader
2.1 Individual Loader
Each format has its own loader.
It is defined as format.FORMAT.load.
The load function has the following interface:
def load(file, linesToLoad=sys.maxsize)
The load function is expected to parse the file description it is
given and read from that file.
It returns the first linesToLoad entries as a list.
For example, if one wishes to load a file in constituency tree format (see
example in tests/sampleTree.txt), one could do the following:
from datatool.format import tree
x = tree.load("datatool/tests/sampleTree.txt")
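For a new format, a load function following the interface in this section might be sketched as below. This assumes that file is a path to a UTF-8 text file and that each line is one whitespace-tokenised entry; neither assumption is confirmed by the natlang source, and the sketch targets Python 3 only.

```python
import os
import sys
import tempfile


def load(file, linesToLoad=sys.maxsize):
    """Read at most linesToLoad lines from the path `file` as token lists."""
    entries = []
    with open(file, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i >= linesToLoad:
                break
            entries.append(line.split())
    return entries


# Usage: write a small temporary file, then load it fully and partially.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("a b c\nd e\nf\n")
    path = tmp.name

first_two = load(path, linesToLoad=2)
everything = load(path)
os.remove(path)

print(first_two)  # [['a', 'b', 'c'], ['d', 'e']]
```

Using sys.maxsize as the default means that omitting linesToLoad reads the whole file.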
2.2 Class ParallelDataLoader
This class allows one to load parallel corpora (L1, L2) in any format. The formats for the L1 and L2 sides can be specified separately.
from datatool.loader import ParallelDataLoader
loader = ParallelDataLoader(srcFormat='txtOrTree', tgtFormat='txtOrTree')
Here, 'txtOrTree' is the default value for srcFormat and tgtFormat.
Note that, besides data structures for specific formats, the format
folder also contains pure loaders; 'txtOrTree' is one of these and can
handle both tree and txt.
After initialising the loader, one can just go ahead and run:
loader.load(fFile, eFile, linesToLoad)
The loader will automatically align the parallel text and output a list of
tuples, each containing a single entry in L1 and L2.
Entries with either L1 or L2 being None or of length 0 will be omitted.
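The pairing and filtering behaviour described above can be sketched as follows. The function align_parallel is a hypothetical stand-in written for this example, not the ParallelDataLoader implementation; it assumes both sides have already been loaded into lists whose positions correspond.

```python
def align_parallel(srcEntries, tgtEntries):
    """Pair entries by position, dropping pairs with a None or empty side."""
    pairs = []
    for src, tgt in zip(srcEntries, tgtEntries):
        if src is None or tgt is None:
            continue
        if len(src) == 0 or len(tgt) == 0:
            continue
        pairs.append((src, tgt))
    return pairs


src = [["a"], [], ["c", "d"], None]
tgt = [["x"], ["y"], ["z"], ["w"]]
aligned = align_parallel(src, tgt)
print(aligned)  # [(['a'], ['x']), (['c', 'd'], ['z'])]
```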
3. Exporter
Usage:
from datatool.exporter import exportToFile, RealtimeExporter
3.1 Function exportToFile
Exports a dataset in txt or tree format (a whole dataset, not a single
entry) to a file.
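The general idea of writing a whole dataset to a file can be sketched as below. The helper write_dataset and the class Bracketed are hypothetical and do not reflect the actual signature or behaviour of exportToFile; the sketch assumes one entry per line and uses an entry's export method when it has one.

```python
import os
import tempfile


class Bracketed:
    """Hypothetical entry type with an export method."""

    def __init__(self, tokens):
        self.tokens = tokens

    def export(self):
        return "(" + " ".join(self.tokens) + ")"


def write_dataset(dataset, path):
    """Write one entry per line; plain token lists are joined by spaces."""
    with open(path, "w", encoding="utf-8") as f:
        for entry in dataset:
            if hasattr(entry, "export"):
                line = entry.export()
            else:
                line = " ".join(entry)
            f.write(line + "\n")


outDir = tempfile.mkdtemp()
outPath = os.path.join(outDir, "dataset.txt")
write_dataset([["a", "b"], Bracketed(["c"])], outPath)

with open(outPath, encoding="utf-8") as f:
    written = f.read()

os.remove(outPath)
os.rmdir(outDir)
```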
3.2 Class RealtimeExporter
The code is largely self-explanatory. This class is recommended when the export function of a specific format takes a long time to run.