No project description provided
Project description
arpa2fst
Python wrapper for kaldi's arpa2fst.
Installation
To install kaldilm
, please run:
pip install kaldilm
Please create an issue on GitHub
if you encounter any problems while installing kaldilm
.
Usage
First, let us see the usage information of kaldi's arpa2fst:
kaldi/src/lmbin$ ./arpa2fst
./arpa2fst
Convert an ARPA format language model into an FST
Usage: arpa2fst [opts] <input-arpa> <output-fst>
e.g.: arpa2fst --disambig-symbol=#0 --read-symbol-table=data/lang/words.txt lm/input.arpa G.fst
Note: When called without switches, the output G.fst will contain
an embedded symbol table. This is compatible with the way a previous
version of arpa2fst worked.
Options:
--bos-symbol : Beginning of sentence symbol (string, default = "<s>")
--disambig-symbol : Disambiguator. If provided (e. g. #0), used on input side of backoff links, and <s> and </s> are replaced with epsilons (string, default = "")
--eos-symbol : End of sentence symbol (string, default = "</s>")
--ilabel-sort : Ilabel-sort the output FST (bool, default = true)
--keep-symbols : Store symbol table with FST. Symbols always saved to FST if symbol tables are neither read or written
(otherwise symbols would be lost entirely) (bool, default = false)
--max-arpa-warnings : Maximum warnings to report on ARPA parsing, 0 to disable, -1 to show all (int, default = 30)
--read-symbol-table : Use existing symbol table (string, default = "")
--write-symbol-table : Write generated symbol table to a file (string, default = "")
kaldilm
reuses the same arguments and provides only a single method arpa2fst:
def arpa2fst(input_arpa: str,
output_fst: str,
bos_symbol: str = '<s>',
disambig_symbol: str = '',
eos_symbol: str = '</s>',
ilabel_sort: bool = True,
keep_symbols: bool = False,
max_arpa_warnings: int = 30,
read_symbol_table: str = '',
write_symbol_table: str = '') -> str:
'''Convert an ARPA file to an FST.
This function is a wrapper of kaldi's arpa2fst and
arguments names meanings are not changed.
Args:
The input arpa file.
output_fst:
The output fst file. Note that it is a binary file.
This function will return a text format of it.
bos_symbol:
Beginning of sentence symbol.
disambig_symbol:
Disambiguator. If provided (e.g., #0), used on input side of backoff
links, and <s> and </s> are replaced with epsilons.
eos_symbol:
End of sentence symbol.
ilabel_sort:
Ilabel-sort the output FST.
keep_symbols:
Store symbol table with FST. Symbols always saved to FST if symbol
tables are neither read or written (otherwise symbols would be lost
entirely).
max_arpa_warnings:
Maximum warnings to report on ARPA parsing, 0 to disable, -1 to
show all.
read_symbol_table:
use existing symbol table.
write_symbol_table:
Write generated symbol table to a file.
Returns:
Return a text format of the resulting FST with integer labels.
'''
Example usage
Suppose you have an arpa file input.arpa
with the following content:
\data\
ngram 1=4
ngram 2=2
ngram 3=2
\1-grams:
-5.234679 a -3.3
-3.456783 b
0.0000000 <s> -2.5
-4.333333 </s>
\2-grams:
-1.45678 a b -3.23
-1.30490 <s> a -4.2
\3-grams:
-0.34958 <s> a b
-0.23940 a b </s>
\end\
You can use the following code to convert it to an FSA:
input_file = './input.arpa'
import kaldilm
s = kaldilm.arpa2fst(input_file, 'a.fst', write_symbol_table='words.txt')
with open('a.fst.txt', 'w') as f:
f.write(s)
It generates 3 files:
a.fst
(a binary file in OpenFST format)words.txt
(symbol table ofa.fst
in text format)a.fst.txt
(a text format ofa.fst
with integer labels)`
Their contents are shown below:
cat words.txt
<eps> 0
<s> 1
</s> 2
a 3
b 4
cat a.fst.txt
5 4 1 0
0 1 2 9.97787
0 2 3 12.0533
0 3 4 7.95954
1 0
2 0 0 7.59853
2 6 4 3.35436
3 0 0 -0
4 0 0 5.75646
4 7 3 3.00464
6 3 0 7.43735
6 1 2 0.551239
7 2 0 9.67086
7 6 4 0.804938
fstprint --acceptor a.fst
5 4 1
0 1 2 9.97786808
0 2 3 12.0532942
0 3 4 7.95953703
1
2 0 0 7.59853077
2 6 4 3.35435987
3 0 0
4 0 0 5.75646257
4 7 3 3.00464344
6 3 0 7.4373498
6 1 2 0.551238894
7 2 0 9.67085648
7 6 4 0.804937661
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
kaldilm-1.0.1.tar.gz
(45.4 kB
view hashes)