kb-whisper-medium open-source speech model: trained on more than 50,000 hours of Swedish for accurate Swedish speech recognition

Kb Whisper Medium

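Developed by KBLab

A Whisper model released by the National Library of Sweden, trained on more than 50,000 hours of Swedish speech, with strong performance on Swedish speech recognition tasks

Speech recognition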

Transformers

License: Apache-2.0 #Swedish speech recognition #Low word error rate #Multi-format support

Downloads: 691

Released: 2/14/2025

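Model overview

KB-Whisper is a family of automatic speech recognition (ASR) models optimized for Swedish. It builds on OpenAI's Whisper architecture and substantially improves Swedish recognition accuracy.

Model features

Optimized Swedish recognition

Trained specifically on Swedish; the best-performing KB-Whisper model lowers word error rate (WER) by an average of 47% relative to OpenAI's whisper-large-v3

Multi-format support

Checkpoints are provided in several formats, including Hugging Face, whisper.cpp (GGML), onnx, and ctranslate2

Multiple transcription styles

Three transcription styles are available: the more condensed Subtitle version, the default Stage 2 version, and the more verbatim Strict version

Large-scale training data

Trained on more than 50,000 hours of Swedish speech, in two stages with different quality thresholds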

Model capabilities

Swedish speech recognition

Transcription with timestamps

Inference in multiple formats

Batched transcription

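Use cases

Speech transcription

Meeting minutes

Convert recordings of Swedish-language meetings into written records

Markedly higher accuracy than OpenAI's original models

Subtitle generation

Generate subtitles for Swedish video content

Provides accurate transcripts with timestamps

Speech analytics

Speech content analysis

Analyze Swedish speech content for research or business intelligence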

🚀 KB-Whisper Medium

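The National Library of Sweden has released a new suite of Whisper models trained on more than 50,000 hours of Swedish speech. In evaluations on FLEURS, CommonVoice, and NST, our best-performing model reduces the word error rate (WER) by an average of 47% compared with OpenAI's whisper-large-v3. The smaller Whisper sizes also improve substantially on Swedish speech: kb-whisper-small, for example, outperforms openai/whisper-large-v3, a model six times its size.

✨ Key features

Strong performance: substantially lower word error rates across several Swedish speech evaluation datasets, with the smaller model sizes also performing well.
Multi-format support: checkpoints are provided in several formats, including Hugging Face, whisper.cpp (GGML), onnx, and ctranslate2.
Multiple transcription styles: in addition to the default style, there is a more condensed Subtitle version and a more verbatim Strict version.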

💻 Usage examples

Basic usage

The following example runs inference with KB-Whisper through Hugging Face:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "KBLab/kb-whisper-medium"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, use_safetensors=True, cache_dir="cache"
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

generate_kwargs = {"task": "transcribe", "language": "sv"}
# Add return_timestamps=True for output with timestamps
res = pipe(
    "audio.mp3",
    chunk_length_s=30,
    generate_kwargs=generate_kwargs,
)
print(res)
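
For the subtitle use case, the timestamped output can be turned into an SRT file. The sketch below assumes the pipeline was called with return_timestamps=True and returned a dictionary whose "chunks" list holds entries of the form {"timestamp": (start, end), "text": ...}. The helper names format_srt_time and chunks_to_srt, and the 2-second fallback for a missing end time, are illustrative choices and not part of the KB-Whisper release.

```python
def format_srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def chunks_to_srt(chunks, fallback_duration: float = 2.0) -> str:
    """Convert pipeline chunks into SRT text.

    The end time of a chunk can be None (for example when the audio is cut
    off); in that case an assumed fallback duration is added to the start.
    """
    entries = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        if end is None:
            end = start + fallback_duration
        entries.append(
            f"{index}\n"
            f"{format_srt_time(start)} --> {format_srt_time(end)}\n"
            f"{chunk['text'].strip()}\n"
        )
    return "\n".join(entries)


# Hand-written example in the shape described above (not real model output).
example = {
    "text": " Hej och välkommen. Tack.",
    "chunks": [
        {"timestamp": (0.0, 2.5), "text": " Hej och välkommen."},
        {"timestamp": (2.5, None), "text": " Tack."},
    ],
}
print(chunks_to_srt(example["chunks"]))
```

With a real result, pass res["chunks"] to chunks_to_srt and write the returned string to a .srt file.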

Advanced usage

Faster-whisper

Faster-whisper provides fast and efficient inference by reimplementing Whisper on top of ctranslate2:

#### faster-whisper model ####
from faster_whisper import WhisperModel

model_id = "KBLab/kb-whisper-medium"
model = WhisperModel(
    model_id,
    device="cuda",
    compute_type="float16",
    download_root="cache", # cache directory
)

# Transcribe audio.wav (convert to 16 kHz mono WAV first via ffmpeg).
# condition_on_previous_text=False can reduce hallucinations when no prompt is used.
segments, info = model.transcribe("audio.wav", condition_on_previous_text=False)
print("Detected language '%s' with probability %f" % (info.language, info.language_probability))

for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
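
The comment in the example above asks for a 16 kHz mono WAV file. A minimal sketch of that conversion, assuming ffmpeg is installed and on PATH, is shown below. The function names and the "_16k" output suffix are hypothetical; the ffmpeg flags (-ar for sample rate, -ac for channel count, -c:a pcm_s16le for 16-bit PCM) are standard ones.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(src, dst, sample_rate: int = 16000) -> list:
    """Build the ffmpeg argument list for a mono 16-bit PCM WAV conversion."""
    return [
        "ffmpeg",
        "-y",                      # overwrite the output file if it exists
        "-i", str(src),            # input file in any format ffmpeg can read
        "-ar", str(sample_rate),   # resample to the target sample rate
        "-ac", "1",                # downmix to a single channel
        "-c:a", "pcm_s16le",       # 16-bit little-endian PCM
        str(dst),
    ]


def convert_to_wav(src, dst=None) -> Path:
    """Run ffmpeg and return the path of the converted file."""
    src = Path(src)
    # Default to a new file name so the input is never overwritten.
    dst = Path(dst) if dst else src.with_name(src.stem + "_16k.wav")
    subprocess.run(build_ffmpeg_command(src, dst), check=True)
    return dst


print(" ".join(build_ffmpeg_command("talk.mp3", "talk_16k.wav")))
```

Calling convert_to_wav("talk.mp3") would then produce talk_16k.wav, which can be passed to model.transcribe.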

WhisperX

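WhisperX provides a convenient way to obtain accurate word-level timestamps. The library combines Whisper's text output with accurate timestamps from Wav2vec2. The following example uses KB-Whisper together with KBLab/wav2vec2-large-voxrex-swedish: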

import whisperx

device = "cuda"
audio_file = "audio.wav"
batch_size = 16  # reduce if low on GPU mem
compute_type = "float16"  # change to "int8" if low on GPU mem (may reduce accuracy)

# 1. Transcribe with original whisper (batched)
model = whisperx.load_model(
    "KBLab/kb-whisper-medium", device, compute_type=compute_type, download_root="cache"  # cache_dir
)

audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)
print(result["segments"])  # before alignment

# delete model if low on GPU resources
# import gc, torch; del model; gc.collect(); torch.cuda.empty_cache()

# 2. Align whisper output
model_a, metadata = whisperx.load_align_model(
    language_code=result["language"],
    device=device,
    model_name="KBLab/wav2vec2-large-voxrex-swedish",
    model_dir="cache",  # cache_dir
)
result = whisperx.align(
    result["segments"], model_a, metadata, audio, device, return_char_alignments=False
)

print(result["segments"])  # word level timestamps after alignment
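
After alignment, each segment is expected to carry a "words" list. The sketch below flattens it into (word, start, end) tuples. It assumes the usual WhisperX layout, in which words the alignment model cannot time (numerals, for example) appear without "start" and "end" keys. The function name collect_timed_words and the sample data are illustrative only.

```python
def collect_timed_words(segments):
    """Split aligned words into timed tuples and a list of untimed words."""
    timed, untimed = [], []
    for segment in segments:
        for word in segment.get("words", []):
            if "start" in word and "end" in word:
                timed.append((word["word"], word["start"], word["end"]))
            else:
                # No timing was assigned to this token during alignment.
                untimed.append(word["word"])
    return timed, untimed


# Hand-written sample in the assumed output shape (not real model output).
sample_segments = [
    {
        "text": "Året var 2014.",
        "words": [
            {"word": "Året", "start": 0.1, "end": 0.4, "score": 0.9},
            {"word": "var", "start": 0.5, "end": 0.7, "score": 0.8},
            {"word": "2014."},
        ],
    }
]
timed_words, untimed_words = collect_timed_words(sample_segments)
print(timed_words)
print(untimed_words)
```

With a real result, pass result["segments"] from whisperx.align to collect_timed_words.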

Whisper.cpp / GGML

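We provide GGML checkpoints that can be used with whisper.cpp and the MacWhisper application. To use our model with whisper.cpp, first clone the repository and build the library: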

git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
cmake -B build
cmake --build build --config Release

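To use the model, download one of the GGML checkpoints we have uploaded. You can use the download button in the model repository, or download with wget: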

wget https://huggingface.co/KBLab/kb-whisper-medium/resolve/main/ggml-model-q5_0.bin # Quantized version
# wget https://huggingface.co/KBLab/kb-whisper-medium/resolve/main/ggml-model.bin # Non-quantized version

Run inference by passing the model path after the -m argument and the path to the audio file as the last positional argument:

./build/bin/whisper-cli -m ggml-model-q5_0.bin ../audio.wav

onnx (optimum) and transformers.js usage

The onnx checkpoints can be used through Hugging Face's optimum library as follows:

from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from transformers import AutoProcessor

model_id = "KBLab/kb-whisper-medium"
processor = AutoProcessor.from_pretrained(model_id, cache_dir="cache")
model = ORTModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    cache_dir="cache",
    subfolder="onnx",
)

import soundfile as sf

# sf.read returns (samples, sample_rate); the feature extractor expects 16 kHz mono audio
audio, sample_rate = sf.read("audio.wav")

inputs = processor.feature_extractor(audio, sampling_rate=sample_rate, return_tensors="pt")
gen_tokens = model.generate(**inputs, max_length=300)
print(processor.decode(gen_tokens[0], skip_special_tokens=True))

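An example application that runs KB-Whisper locally in the browser with transformers.js is available at https://whisper.mesu.re/ (created by Pierre Mesure). A template for setting up such an application in JavaScript is available at https://github.com/xenova/whisper-web.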

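📚 Detailed documentation

Training data

Our models were trained on more than 50,000 hours of Swedish audio with text transcripts. Training ran in two stages, each characterized by its own quality filters and thresholds. Stage 1 used low thresholds (BLEU between 0 and 0.30, depending on the dataset), while Stage 2 used stricter ones (BLEU >= 0.7, weighted ROUGE-N >= 0.7, and CER <= 0.2 on the first and last 10 characters).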

Dataset	Continued pretraining (hours) - Stage 1	Finetuning (hours) - Stage 2
Subtitles	34,261	3,110
Swedish Parliament	21,949	5,119
ISOF	54	54
NST	250	250
Total	56,514	8,533
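
The Stage 2 thresholds can be read as a simple predicate over per-sample quality scores. The sketch below is one such reading, not the authors' filtering code. The QualityScores fields, the function names, and the assumption that all criteria must hold at once are hypothetical. Only the threshold values come from the text.

```python
from dataclasses import dataclass


@dataclass
class QualityScores:
    """Hypothetical container for the per-sample scores named in the text."""
    bleu: float
    rouge_n: float   # weighted ROUGE-N
    cer_head: float  # CER over the first 10 characters
    cer_tail: float  # CER over the last 10 characters


def passes_stage1(scores: QualityScores, bleu_threshold: float) -> bool:
    """Stage 1: a low, dataset-dependent BLEU threshold (0 to 0.30)."""
    return scores.bleu >= bleu_threshold


def passes_stage2(scores: QualityScores) -> bool:
    """Stage 2: stricter thresholds, assumed here to apply jointly."""
    return (
        scores.bleu >= 0.7
        and scores.rouge_n >= 0.7
        and scores.cer_head <= 0.2
        and scores.cer_tail <= 0.2
    )


clean = QualityScores(bleu=0.85, rouge_n=0.9, cer_head=0.0, cer_tail=0.1)
noisy = QualityScores(bleu=0.25, rouge_n=0.4, cer_head=0.5, cer_tail=0.3)
print(passes_stage2(clean), passes_stage2(noisy))
print(passes_stage1(noisy, bleu_threshold=0.2))
```

Under this reading, a loosely matching sample such as noisy can still contribute to Stage 1 while being excluded from Stage 2, which is consistent with the smaller Stage 2 hour counts in the table.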

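When you load our models through Hugging Face, Stage 2 is the default. We have also uploaded and tagged the continued-pretraining checkpoints. You can load these other checkpoints by specifying revision in .from_pretrained(). The pretrained checkpoint, for example, is tagged pretrained-checkpoint. The default Stage 2 model is tagged standard. We also provide another Stage 2 checkpoint, with a more condensed transcription style, tagged subtitle.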
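
A minimal sketch of selecting a checkpoint by tag follows. The wrapper load_kb_whisper and the KNOWN_REVISIONS tuple are illustrative names. The tuple lists only the three tags named in the paragraph above, so any other tag (for instance one for the Strict style) would need to be checked against the model repository. Calling the function downloads the model and requires transformers to be installed.

```python
# Tags named in the text; other tags may exist in the repository.
KNOWN_REVISIONS = ("standard", "subtitle", "pretrained-checkpoint")


def load_kb_whisper(
    revision: str = "standard",
    model_id: str = "KBLab/kb-whisper-medium",
    cache_dir: str = "cache",
):
    """Load the model and processor for one of the tagged checkpoints."""
    if revision not in KNOWN_REVISIONS:
        raise ValueError(
            f"Unknown revision {revision!r}; expected one of {KNOWN_REVISIONS}"
        )
    # Imported lazily so the tag check works without transformers installed.
    from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

    model = AutoModelForSpeechSeq2Seq.from_pretrained(
        model_id, revision=revision, cache_dir=cache_dir
    )
    processor = AutoProcessor.from_pretrained(
        model_id, revision=revision, cache_dir=cache_dir
    )
    return model, processor


# Example (downloads weights): model, processor = load_kb_whisper("subtitle")
```

The returned model and processor can be passed to pipeline() exactly as in the basic usage example.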

Evaluation

Word error rate (WER, %; lower is better)

Model size		FLEURS	CommonVoice	NST
tiny	KBLab	13.2	12.9	11.2
	OpenAI	59.2	67.8	85.2
base	KBLab	9.1	8.7	7.8
	OpenAI	39.6	52.1	53.4
small	KBLab	7.3	6.4	6.6
	OpenAI	20.6	26.4	26.4
medium	KBLab	6.6	5.4	5.8
	OpenAI	12.1	15.8	17.1
large-v3	KBLab	5.4	4.1	5.2
	OpenAI	7.8	9.5	11.3
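
The 47% figure quoted for the best-performing model can be reproduced from the large-v3 rows of the WER table. The sketch below averages the relative WER reduction over the three datasets. Whether the authors computed the figure in exactly this way is an assumption, and the function name is illustrative.

```python
# WER values (%) copied from the table, in the order FLEURS, CommonVoice, NST.
WER = {
    "small": {"KBLab": (7.3, 6.4, 6.6), "OpenAI": (20.6, 26.4, 26.4)},
    "large-v3": {"KBLab": (5.4, 4.1, 5.2), "OpenAI": (7.8, 9.5, 11.3)},
}


def mean_relative_reduction(kblab, openai) -> float:
    """Average of (openai - kblab) / openai across datasets."""
    reductions = [(o - k) / o for k, o in zip(kblab, openai)]
    return sum(reductions) / len(reductions)


large = WER["large-v3"]
reduction = mean_relative_reduction(large["KBLab"], large["OpenAI"])
print(f"Mean relative WER reduction, large-v3: {reduction:.1%}")

# kb-whisper-small compared with openai/whisper-large-v3 on each dataset.
small_beats_large = all(
    k < o for k, o in zip(WER["small"]["KBLab"], large["OpenAI"])
)
print(f"kb-whisper-small below OpenAI large-v3 on all three: {small_beats_large}")
```

The per-dataset reductions are roughly 31%, 57%, and 54%, which average to about 47%.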

BLEU score (higher is better)

Model size		FLEURS	CommonVoice	NST
tiny	KBLab	76.6	73.7	74.3
	OpenAI	26.9	21.1	24.0
base	KBLab	83.2	79.9	78.3
	OpenAI	41.1	32.5	36.9
small	KBLab	86.6	83.5	79.6
	OpenAI	64.0	56.5	58.2
medium	KBLab	87.6	85.0	80.2
	OpenAI	77.1	70.1	68.9
large-v3	KBLab	89.8	87.2	81.1
	OpenAI	84.9	79.1	75.1