Dasheng Base

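Developed by mispeech

A large-scale general-purpose audio encoder trained with self-supervised learning. It handles audio from multiple domains, including speech, music, and environmental sound.

Audio Classification

Transformers

License: Apache-2.0 #Multi-domain audio encoding #Self-supervised pretraining #Scales to 1.2B parameters

Downloads: 273

Released: 6/6/2024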

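Model Overview

Dasheng is a general-purpose audio encoder trained on a large-scale self-supervised learning task. It is designed to capture rich audio information across domains such as speech, music, and environmental sound.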

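Model Features

Large-scale training

Trained on 272,356 hours of diverse audio

Multi-domain coverage

Handles speech, music, environmental sound, and other audio types

Strong benchmark results

Shows significant gains on the HEAR benchmark over previous work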

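Model Capabilities

Audio feature extraction

Speech classification

Music classification

Environmental sound classification

Audio embedding generation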

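Use Cases

Speech Processing

Speech command recognition

Recognizes spoken commands

Outperforms previous work on the Speech Commands task

Spoken language identification

Identifies the language being spoken

Outperforms previous work on the VoxLingua task

Music Analysis

Music classification

Classifies music by genre

Competitive on music classification tasks

Environmental Sound Analysis

Environmental sound classification

Classifies environmental sounds

Competitive on environmental sound classification tasks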

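🚀 Dasheng (大声): A Large-Scale General-Purpose Audio Encoder

Dasheng (Deep Audio-Signal Holistic Embeddings), or "大声" (Chinese for "great sound"), is a general-purpose audio encoder trained on a large-scale self-supervised learning task. It is designed to capture rich audio information across domains including speech, music, and environmental sound. The Dasheng family is trained on 272,356 hours of diverse audio, scales up to 1.2 billion parameters, and shows significant gains on the HEAR benchmark. Dasheng outperforms previous work on CREMA-D, LibriCount, Speech Commands, and VoxLingua, and is competitive on music and environmental sound classification.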

Original repository: https://github.com/RicherMans/Dasheng

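🚀 Quick Start

✨ Key Features

General-purpose audio encoding: captures rich audio information across speech, music, and environmental sound.
Large-scale training: trained on 272,356 hours of diverse audio, with the model family scaling up to 1.2 billion parameters.
Strong results: significant gains on the HEAR benchmark, outperforming previous work on several audio classification tasks.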

📦 Installation

pip install git+https://github.com/jimbozhang/hf_transformers_custom_model_dasheng.git

💻 Usage Examples

Basic Usage

>>> model_name = "mispeech/dasheng-base"

>>> from dasheng_model.feature_extraction_dasheng import DashengFeatureExtractor
>>> from dasheng_model.modeling_dasheng import DashengModel

>>> feature_extractor = DashengFeatureExtractor.from_pretrained(model_name)
>>> model = DashengModel.from_pretrained(model_name, outputdim=None)  # no linear output layer if `outputdim` is `None`

>>> import torchaudio
>>> audio, sampling_rate = torchaudio.load("resources/JeD5V5aaaoI_931_932.wav")
>>> assert sampling_rate == 16000
>>> audio.shape
torch.Size([1, 16000])   # mono audio of 1 second

>>> inputs = feature_extractor(audio, sampling_rate=sampling_rate, return_tensors="pt")
>>> inputs.input_values.shape
torch.Size([1, 64, 101])   # 64 mel-filterbanks, 101 frames

>>> import torch
>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> outputs.hidden_states.shape
torch.Size([1, 25, 768])   # 25 T-F patches (patch size 64x4, no overlap), before mean-pooling

>>> outputs.logits.shape
torch.Size([1, 768])   # mean-pooled embedding (would be logits from a linear layer if `outputdim` was set)
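The shapes printed in the session above follow from simple arithmetic, which the sketch below reproduces for an arbitrary clip length. The helper name `dasheng_shapes` is hypothetical, and the hop length of 160 samples (10 ms at 16 kHz, with center padding) is an assumption inferred from the 16000-sample input producing 101 frames; the 64 mel bins, the 4-frame patch width, and the 768-dimensional embedding are taken from the comments in the session.

```python
def dasheng_shapes(num_samples, hop_length=160, n_mels=64,
                   patch_frames=4, embed_dim=768):
    """Estimate tensor shapes for one mono clip (batch size 1).

    hop_length=160 is an assumption inferred from the example shapes,
    not a value read from the library.
    """
    # Center-padded framing yields one frame per hop, plus one.
    num_frames = num_samples // hop_length + 1
    # Non-overlapping patches along time; leftover frames are dropped.
    num_patches = num_frames // patch_frames
    return {
        "input_values": (1, n_mels, num_frames),
        "hidden_states": (1, num_patches, embed_dim),
        "pooled": (1, embed_dim),  # mean over the patch axis
    }


shapes = dasheng_shapes(16000)  # the 1-second, 16 kHz clip used above
print(shapes)
```

For the 1-second clip this gives 101 frames and 25 patches, matching the session output. Actual shapes for other lengths may differ if the feature extractor pads or truncates its input.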

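Advanced Usage

A fine-tuning example is available as a notebook that can be opened in Colab:

example_finetune_esc50.ipynb shows how to freeze the Dasheng encoder and train a linear head on the ESC-50 dataset.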
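The frozen-encoder setup described here can be sketched in plain PyTorch. This is a minimal illustration and not the notebook's code: `LinearProbe` is a hypothetical class, the stand-in encoder and the random tensors replace the real model and ESC-50 data so the sketch runs without downloads, and the optimizer settings are arbitrary. With the real `DashengModel`, the encoder would need a thin wrapper that returns the pooled embedding (`outputs.logits` when `outputdim=None`, per the basic usage example).

```python
import torch
from torch import nn


class LinearProbe(nn.Module):
    """Frozen encoder followed by a trainable linear classification head."""

    def __init__(self, encoder: nn.Module, embed_dim: int = 768,
                 num_classes: int = 50):  # ESC-50 has 50 classes
        super().__init__()
        self.encoder = encoder
        for param in self.encoder.parameters():
            param.requires_grad = False  # freeze the encoder weights
        self.head = nn.Linear(embed_dim, num_classes)

    def train(self, mode: bool = True):
        super().train(mode)
        self.encoder.eval()  # keep the frozen encoder in eval mode
        return self

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            embedding = self.encoder(features)  # (batch, embed_dim)
        return self.head(embedding)


# Hypothetical stand-in encoder and random data, for illustration only.
torch.manual_seed(0)
stand_in_encoder = nn.Sequential(nn.Flatten(), nn.Linear(64 * 101, 768))
probe = LinearProbe(stand_in_encoder)
probe.train()

# Only the head's parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(probe.head.parameters(), lr=1e-3)
features = torch.randn(8, 64, 101)   # stand-in for log-mel inputs
labels = torch.randint(0, 50, (8,))  # stand-in for ESC-50 labels

head_before = probe.head.weight.detach().clone()
logits = probe(features)
loss = nn.functional.cross_entropy(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Because the encoder runs under `torch.no_grad()` and its parameters are excluded from the optimizer, each step updates only the linear head, which keeps training cheap.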

📄 License

This project is released under the Apache-2.0 license.

📚 Citation

If you find Dasheng useful in your research, please consider citing the following paper:

@inproceedings{dinkel2023scaling,
  title={Scaling up masked audio encoder learning for general audio classification},
  author={Dinkel, Heinrich and Yan, Zhiyong and Wang, Yongqing and Zhang, Junbo and Wang, Yujun and Wang, Bin},
  booktitle={Interspeech 2024},
  year={2024}
}