Gluon Audio Toolkit

Gluon Audio is a toolkit providing deep learning based audio recognition algorithms. The project is still under development, and its introduction was originally written only in Chinese.

GluonAR Introduction:

GluonAR is based on MXNet Gluon. If you are new to it, please check out the dmlc 60-minute crash course.

Although the project is named GluonAR, for now and for the foreseeable future it only covers Text-Independent Speaker Recognition.

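Implemented features:

  • Audio data loading with av (the Pythonic binding for ffmpeg) and librosa.

  • Modules support Hybridize(). The forward pass does not use pysound, librosa, or scipy, which makes it more efficient and enables fast training and end-to-end deployment. This includes:

      • A short-time Fourier transform based on nd.contrib.fft (STFTBlock) and a z-score block. Training is 12% more efficient than preprocessing with numpy and scipy and then loading the data onto the GPU.

      • MelSpectrogram, DCT1D, MFCC, PowerToDB

      • The SincBlock proposed in 1808.00158

  • Gluon-style loading of the VOX dataset.

  • Speaker Verification, similar to face verification.

  • An example of training voiceprint features from spectrograms. 1:1 verification accuracy on VOX1: 0.941152+-0.004926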
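To make two of the preprocessing steps named above concrete, here is a minimal pure-Python sketch of z-score normalization and power-to-decibel conversion. It is not the toolkit's code: the function names, the `eps`, `ref`, and `amin` parameters, and their defaults are assumptions for illustration, and the real blocks operate on MXNet NDArrays and may differ in detail.

```python
import math
from statistics import fmean, pstdev


def z_score(samples, eps=1e-8):
    """Scale a sequence to zero mean and (approximately) unit variance.

    `eps` is an assumed guard against division by zero for constant input.
    """
    mean = fmean(samples)
    std = pstdev(samples, mu=mean)  # population standard deviation
    return [(x - mean) / (std + eps) for x in samples]


def power_to_db(power, ref=1.0, amin=1e-10):
    """Convert power values to decibels relative to `ref`.

    Values below `amin` are clamped so that log10 never sees zero.
    """
    return [10.0 * math.log10(max(p, amin) / ref) for p in power]


if __name__ == "__main__":
    print(z_score([1.0, 2.0, 3.0]))
    print(power_to_db([1.0, 10.0, 0.0]))
```

Both functions are element-wise or reduce over a single axis, which is why steps like these can run on the GPU inside the network rather than as a separate CPU preprocessing stage.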

Example:

import numpy as np
import mxnet as mx
import librosa as rosa
from gluonar.utils.viz import view_spec
from gluonar.nn.basic_blocks import STFTBlock

data = rosa.load(r"resources/speaker_recognition/speaker0_0.m4a", sr=16000)[0][:35840]
nd_data = mx.nd.array([data], ctx=mx.gpu())

stft = STFTBlock(35840, hop_length=160, win_length=400)
stft.initialize(ctx=mx.gpu())

# stft block forward
ret = stft(nd_data).asnumpy()[0][0]
spec = np.transpose(ret, (1, 0)) ** 2
view_spec(spec)

# stft in librosa
spec = rosa.stft(data, hop_length=160, win_length=400, window="hamming")
spec = np.abs(spec) ** 2
view_spec(spec)

Output:

[Figure: STFTBlock]

[Figure: STFT in librosa]

For more examples, see examples/.
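As a rough aid for reading the example above, the following sketch computes the spectrogram shape that the librosa call should produce for a 35840-sample clip. It assumes librosa's defaults (n_fft=2048 and center=True, since the example does not set them); the helper name `stft_shape` is hypothetical, and the output shape of STFTBlock itself is not stated in this document and may differ.

```python
def stft_shape(num_samples, hop_length, n_fft, center=True):
    """Return (frequency bins, frames) for a one-sided STFT.

    With center=True the signal is padded so frames are centered on
    multiples of hop_length; otherwise only full windows are counted.
    """
    if center:
        n_frames = 1 + num_samples // hop_length
    else:
        n_frames = 1 + (num_samples - n_fft) // hop_length
    n_bins = 1 + n_fft // 2
    return n_bins, n_frames


if __name__ == "__main__":
    # 35840 samples at 16 kHz is 2.24 seconds of audio.
    print(35840 / 16000)
    print(stft_shape(35840, hop_length=160, n_fft=2048))  # (1025, 225)
```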

Requirements

mxnet-1.5.0+, gluonfr, av, librosa, …

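The audio library was chosen mainly for data loading speed. During training, decoding audio takes more time than decoding images. In practice, loading a short AAC-encoded clip from disk with librosa took about 8 times as long as with pyav.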
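A comparison like the one above can be reproduced with a small timing harness. The sketch below is an assumption about how such a measurement might be set up, not the author's benchmark: `mean_load_time` and `speed_ratio` are hypothetical helpers, and the two loader callables (for example one wrapping librosa.load and one decoding with av) must be supplied by the caller.

```python
import time


def mean_load_time(load_fn, path, repeats=5):
    """Average wall-clock seconds taken by load_fn(path) over `repeats` calls."""
    start = time.perf_counter()
    for _ in range(repeats):
        load_fn(path)
    return (time.perf_counter() - start) / repeats


def speed_ratio(slow_fn, fast_fn, path, repeats=5):
    """How many times longer slow_fn takes than fast_fn on the same file."""
    return mean_load_time(slow_fn, path, repeats) / mean_load_time(fast_fn, path, repeats)
```

Results depend on the codec, the file length, and disk caching, so it is worth running each loader once before timing and using the same file for both.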

  • librosa
    pip install librosa

  • ffmpeg

    # Download the ffmpeg source and enter its root directory
    ./configure --extra-cflags=-fPIC --enable-shared
    make -j
    sudo make install
  • pyav (requires ffmpeg to be installed first)
    pip install av

  • gluonfr
    pip install git+https://github.com/THUFutureLab/gluon-face.git@master
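After installing, a quick check like the following can confirm that the required packages are importable. This is only a sketch: the helper `missing_modules` is hypothetical, and the import names are assumed to match the names in the requirements list (mxnet, gluonfr, av, librosa).

```python
import importlib.util


def missing_modules(names):
    """Return the module names that cannot be found on the current Python path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed import names, taken from the requirements list.
REQUIRED = ["mxnet", "gluonfr", "av", "librosa"]

if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    print("missing:", ", ".join(missing) if missing else "none")
```

The check only looks up each module without importing it, so it does not verify versions such as the mxnet-1.5.0+ requirement.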

Datasets

TIMIT

The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus (TIMIT) Training and Test Data. Before using this dataset, please follow the instructions at the link.

A copy of this was uploaded to Google Drive by @philipperemy here.

VoxCeleb

VoxCeleb is an audio-visual dataset consisting of short clips of human speech, extracted from interview videos uploaded to YouTube.

For more information, check out this page.

Pretrained Models

Speaker Recognition

ResNet18 trained on VoxCeleb

Download: Baidu, Google Drive

I followed the ideas in the VoxCeleb2 paper (1806.05622) to train this model. The differences between the two are:

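|                 | Res18 in this repo                                          | Res34 in paper              |
| --------------- | ----------------------------------------------------------- | --------------------------- |
| Trained on      | VoxCeleb2                                                   | VoxCeleb2                   |
| Input spec size | 224x224                                                     | 512x300                     |
| Eval on         | Random 9500+ pair samples from VoxCeleb1 train and test set | Original VoxCeleb1 test set |
| Metric          | Accuracy: 0.932656+-0.005187                                | EER: 0.0504                 |
| Framework       | MXNet Gluon                                                 | MatConvNet                  |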
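The two models above are reported with different metrics (accuracy versus EER), so the numbers are not directly comparable. For reference, the equal error rate of a 1:1 verification test is the operating point where the false accept rate equals the false reject rate. The sketch below approximates it by sweeping thresholds over the observed scores. It is not the evaluation code of this repo or of the paper, and `equal_error_rate` is a hypothetical helper that assumes higher scores mean "more likely the same speaker".

```python
def equal_error_rate(scores, labels):
    """Approximate the EER of a verification test.

    scores: one similarity score per trial pair.
    labels: True for same-speaker pairs, False for different-speaker pairs.
    Returns (eer, threshold) at the threshold where the false accept rate
    and the false reject rate are closest.
    """
    genuine = [s for s, same in zip(scores, labels) if same]
    impostor = [s for s, same in zip(scores, labels) if not same]
    best = None
    for threshold in sorted(set(scores)):
        # A pair is accepted when its score reaches the threshold.
        far = sum(s >= threshold for s in impostor) / len(impostor)
        frr = sum(s < threshold for s in genuine) / len(genuine)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, threshold)
    return best[1], best[2]


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
    labels = [True, True, False, True, False, False]
    print(equal_error_rate(scores, labels))
```

With few trial pairs the two error rates may never be exactly equal, which is why the sketch returns the midpoint at the closest threshold.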

ROC

TODO

The toolchain for training speaker recognition models with MXNet Gluon will be filled in gradually. This is expected to take a very long time.

Docs

GluonAR documentation is not available yet.

Authors

{ haoxintong }

Discussion

If you have any suggestions, please open an issue.

Contributing

The final goal of this project is to provide an easy-to-use, deep learning based audio algorithm library like pytorch-kaldi.

Contributions are welcome.

References

  1. MXNet Documentation and Tutorials https://zh.diveintodeeplearning.org/

