---
license: cc-by-4.0
library_name: timm
pipeline_tag: audio-classification
---
# Model card for vit_base_patch16_1024_128.audiomae_as2m_ft_as20k

A Vision Transformer (ViT) for audio. The model was pre-trained on AudioSet-2M with the self-supervised Masked Autoencoder (MAE) objective and then fine-tuned on AudioSet-20k.
- These are the AudioMAE ViT-B/16 weights, ported for use with `timm`. The naming follows the conventions of other ViT models in `timm`.
- For the original codebase, see: https://github.com/facebookresearch/AudioMAE
- For the checkpoint that was only pre-trained on AudioSet-2M (without AudioSet-20k fine-tuning), see: https://huggingface.co/gaunernst/vit_base_patch16_1024_128.audiomae_as2m
## Model Details
- **Model Type:** Audio classification / feature backbone
- **Papers:**
  - Masked Autoencoders that Listen: https://arxiv.org/abs/2207.06405
- **Pretrain Dataset:** AudioSet-2M
- **Original:** https://github.com/facebookresearch/AudioMAE
## Model Usage

### Audio Classification & Feature Extraction
```python
import timm
import torch
import torch.nn.functional as F
from torchaudio.compliance import kaldi

model = timm.create_model("hf_hub:gaunernst/vit_base_patch16_1024_128.audiomae_as2m_ft_as20k", pretrained=True)
model = model.eval()

# Dataset statistics used by AudioMAE for input normalization
MEAN = -4.2677393
STD = 4.5689974

audio = torch.randn(1, 10 * 16_000)  # 10 seconds of audio at 16 kHz
melspec = kaldi.fbank(audio, htk_compat=True, window_type="hanning", num_mel_bins=128)  # shape (n_frames, 128)

# Pad or truncate to exactly 1024 frames
if melspec.shape[0] < 1024:
    melspec = F.pad(melspec, (0, 0, 0, 1024 - melspec.shape[0]))
else:
    melspec = melspec[:1024]

melspec = (melspec - MEAN) / (STD * 2)
melspec = melspec.view(1, 1, 1024, 128)  # (batch, channel, time, freq)

output = model(melspec)
top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)
```
## Citation
```bibtex
@inproceedings{huang2022amae,
  title = {Masked Autoencoders that Listen},
  author = {Huang, Po-Yao and Xu, Hu and Li, Juncheng and Baevski, Alexei and Auli, Michael and Galuba, Wojciech and Metze, Florian and Feichtenhofer, Christoph},
  booktitle = {NeurIPS},
  year = {2022}
}
```
```bibtex
@misc{rw2019timm,
  author = {Ross Wightman},
  title = {PyTorch Image Models},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  doi = {10.5281/zenodo.4414861},
  howpublished = {\url{https://github.com/huggingface/pytorch-image-models}}
}
```