DataGemma RAG 27B IT open model: helping large language models access and integrate reliable public statistics

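DataGemma RAG 27B IT

Developed by Google

DataGemma is a series of models fine-tuned from Gemma 2 that help large language models access and integrate reliable public statistical data from Data Commons.

Large language model

Transformers

#retrieval-augmented-generation #public-statistics-queries #multi-question-generation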

Released: 8/26/2024

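Model overview

DataGemma RAG uses retrieval-augmented generation. It is trained to take a user query and generate a list of queries that the Data Commons natural language interface can understand.

Model features

Retrieval-augmented generation

Generates queries that the Data Commons natural language interface can understand

Public statistics integration

Designed specifically to access and integrate reliable public statistical data from Data Commons

Structured question generation

Generates statistical questions in a fixed format from a user query

Model capabilities

Natural language understanding

Statistical question generation

Data query conversion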

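Use cases

Data analysis

Demographic queries

Generates queries about demographic statistics for a specific place

For example, structured questions such as 'What is the resident population of Sunnyvale?'

Economic indicator queries

Generates queries about economic indicators such as the unemployment rate

For example, structured questions such as 'What is the unemployment rate in California?'

Research assistance

Social science research

Helps researchers obtain public statistical data quickly

Automatically generates research questions that meet the requirements of the Data Commons interface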

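🚀 DataGemma RAG model card

DataGemma is a series of fine-tuned Gemma 2 models that help large language models (LLMs) access and incorporate reliable public statistical data from Data Commons into their responses, supporting answers to statistical questions.

🚀 Quick start

Environment setup

To run the model, install the required libraries. The install command depends on how you run the model:

Running on a single GPU or multiple GPUs: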

pip install -U transformers accelerate

Running with 4-bit quantization (using bitsandbytes):

pip install -U transformers bitsandbytes accelerate
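As a rough guide to choosing between the two setups, the sketch below estimates the memory needed for the model weights alone. It assumes about 27 billion parameters (inferred from the "27b" in the model name, not a measured figure) and ignores activations, the KV cache, and quantization metadata, so real usage will be higher.

```python
# Rough weight-memory estimate. ASSUMED_PARAMS is an assumption inferred
# from the "27b" in the model name, not a measured parameter count.
ASSUMED_PARAMS = 27e9


def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Return approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


bf16_gb = weight_memory_gb(ASSUMED_PARAMS, 16)  # bfloat16: 16 bits per weight
nf4_gb = weight_memory_gb(ASSUMED_PARAMS, 4)    # 4-bit quantization: 4 bits per weight

print(f"bfloat16 weights: ~{bf16_gb:.1f} GB")
print(f"4-bit weights:    ~{nf4_gb:.1f} GB")
```

Under these assumptions the bfloat16 weights need roughly four times the memory of the 4-bit weights, which is the main reason to consider the bitsandbytes route on smaller GPUs.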

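Code examples

The snippets below show how to run the fine-tuned model. This is only one step of the full RAG approach described in the DataGemma paper. You can try the end-to-end RAG flow in this Colab notebook.

Running on a single GPU or multiple GPUs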

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = 'google/datagemma-rag-27b-it'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map='auto',
    torch_dtype=torch.bfloat16,
)

input_text = """Your role is that of a Question Generator.  Given Query below, come up with a
maximum of 25 Statistical Questions that help in answering Query.

These are the only forms of Statistical Questions you can generate:
1. What is $METRIC in $PLACE?
2. What is $METRIC in $PLACE $PLACE_TYPE?
3. How has $METRIC changed over time in $PLACE $PLACE_TYPE?

where,
- $METRIC should be a metric on societal topics like demographics, economy, health,
  education, environment, etc.  Examples are unemployment rate and
  life expectancy.
- $PLACE is the name of a place like California, World, Chennai, etc.
- $PLACE_TYPE is an immediate child type within $PLACE, like counties, states,
  districts, etc.

Your response should only have questions, one per line, without any numbering
or bullet.

If you cannot come up with Statistical Questions to ask for a Query, return an
empty response.

Query: What are some interesting trends in Sunnyvale spanning gender, age, race, immigration, health conditions, economic conditions, crime and education?
Statistical Questions:"""
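# Move the inputs to the device that holds the model's first layers.
inputs = tokenizer(input_text, return_tensors='pt').to(model.device)

# Generate, then decode only the new tokens (the prompt tokens are sliced off).
outputs = model.generate(**inputs, max_new_tokens=4096)
answer = tokenizer.batch_decode(outputs[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True)[0].strip()
print(answer)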

Running with 4-bit quantization (using bitsandbytes)

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch
nf4_config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_quant_type='nf4',
   bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = 'google/datagemma-rag-27b-it'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map='auto',
    quantization_config=nf4_config,
    torch_dtype=torch.bfloat16,
)
input_text = """Your role is that of a Question Generator.  Given Query below, come up with a
maximum of 25 Statistical Questions that help in answering Query.

These are the only forms of Statistical Questions you can generate:
1. What is $METRIC in $PLACE?
2. What is $METRIC in $PLACE $PLACE_TYPE?
3. How has $METRIC changed over time in $PLACE $PLACE_TYPE?

where,
- $METRIC should be a metric on societal topics like demographics, economy, health,
  education, environment, etc.  Examples are unemployment rate and
  life expectancy.
- $PLACE is the name of a place like California, World, Chennai, etc.
- $PLACE_TYPE is an immediate child type within $PLACE, like counties, states,
  districts, etc.

Your response should only have questions, one per line, without any numbering
or bullet.

If you cannot come up with Statistical Questions to ask for a Query, return an
empty response.

Query: What are some interesting trends in Sunnyvale spanning gender, age, race, immigration, health conditions, economic conditions, crime and education?
Statistical Questions:"""
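# Move the inputs to the device that holds the model's first layers.
inputs = tokenizer(input_text, return_tensors='pt').to(model.device)

# Generate, then decode only the new tokens (the prompt tokens are sliced off).
outputs = model.generate(**inputs, max_new_tokens=4096)
answer = tokenizer.batch_decode(outputs[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True)[0].strip()
print(answer)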

Example output

What is the population of Sunnyvale?
What is the population of Sunnyvale males?
What is the population of Sunnyvale females?
What is the population of Sunnyvale asians?
What is the population of Sunnyvale blacks?
What is the population of Sunnyvale whites?
What is the population of Sunnyvale males in their 20s?
What is the population of Sunnyvale females in their 20s?
What is the population of Sunnyvale males in their 30s?
What is the population of Sunnyvale females in their 30s?
What is the population of Sunnyvale males in their 40s?
What is the population of Sunnyvale females in their 40s?
What is the population of Sunnyvale males in their 50s?
What is the population of Sunnyvale females in their 50s?
What is the population of Sunnyvale males in their 60s?
What is the population of Sunnyvale females in their 60s?
How has the population of Sunnyvale changed over time?
How has the population of Sunnyvale males changed over time?
How has the population of Sunnyvale females changed over time?
How has the population of Sunnyvale asian people changed over time?
How has the population of Sunnyvale black people changed over time?
How has the population of Sunnyvale hispanic people changed over time?
How has the population of Sunnyvale white people changed over time?
How has the score on Sunnyvale schools changed over time?
How has the number of students enrolled in Sunnyvale schools changed over time?
How has the number of students enrolled in Sunnyvale charter schools changed over time?
How has the number of students enrolled in Sunnyvale private schools changed over time?
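Because the model returns one question per line, the raw text usually needs light post-processing before it is passed on. The sketch below is one possible approach, not part of the model card: `parse_questions` and `classify` are hypothetical helpers, and the 25-question cap is an assumption taken from the prompt wording (the sample above lists 27 questions, so the model does not strictly enforce it).

```python
import re

# Patterns mirroring the question forms in the prompt: forms 1 and 2 start
# with "What is"; form 3 asks how a metric has changed over time.
POINT_RE = re.compile(r"^What is .+\?$")
TREND_RE = re.compile(r"^How has .+ changed over time.*\?$")


def parse_questions(answer: str, limit: int = 25) -> list[str]:
    """Split raw model output into unique, non-empty questions, keeping order."""
    seen = set()
    questions = []
    for line in answer.splitlines():
        question = line.strip()
        if question and question not in seen:
            seen.add(question)
            questions.append(question)
    # Assumed cap: the prompt asks for at most 25 questions.
    return questions[:limit]


def classify(question: str) -> str:
    """Label a question as 'trend', 'point', or 'other' by its surface form."""
    if TREND_RE.match(question):
        return "trend"
    if POINT_RE.match(question):
        return "point"
    return "other"


raw = """What is the population of Sunnyvale?

How has the population of Sunnyvale changed over time?
What is the population of Sunnyvale?
"""
for q in parse_questions(raw):
    print(classify(q), "|", q)
```

Questions labelled "other" do not match any of the requested forms and may be worth dropping or logging before they are sent on.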

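✨ Key features

Data integration: DataGemma is a series of fine-tuned Gemma 2 models that help large language models (LLMs) access and incorporate reliable public statistical data from Data Commons into their responses.
Retrieval-augmented generation (RAG): DataGemma RAG is trained to take a user query and generate natural language queries that the existing Data Commons natural language interface can understand.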

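📚 Detailed documentation

Model information

Description

DataGemma is a series of fine-tuned Gemma 2 models that help large language models (LLMs) access and incorporate reliable public statistical data from Data Commons into their responses. DataGemma RAG is used with retrieval-augmented generation (RAG): it is trained to take a user query and generate natural language queries that the existing Data Commons natural language interface can understand. See the DataGemma research paper for more information.

Inputs and outputs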

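Input: a text string containing a user query, together with a prompt that asks for statistical questions.
Output: a list of natural language queries that can be used to answer the user query and that the existing Data Commons natural language interface can understand.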
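The input/output contract above suggests how the generated questions might feed a larger retrieval loop. The sketch below is illustrative only: `generate_questions`, `fetch_statistic`, and `compose_answer` are hypothetical callables, not part of any DataGemma or Data Commons API, and the stand-in functions return placeholder values. The actual workflow is the one described in the DataGemma paper.

```python
from typing import Callable, Dict, List


def answer_with_statistics(
    user_query: str,
    generate_questions: Callable[[str], List[str]],
    fetch_statistic: Callable[[str], str],
    compose_answer: Callable[[str, Dict[str, str]], str],
) -> str:
    """Hypothetical glue code: questions -> retrieved statistics -> final answer."""
    # Step 1: the fine-tuned model turns the user query into statistical questions.
    questions = generate_questions(user_query)

    # Step 2: each question goes to a retrieval backend; empty results are skipped.
    evidence: Dict[str, str] = {}
    for question in questions:
        result = fetch_statistic(question)
        if result:
            evidence[question] = result

    # Step 3: a downstream component writes the answer from the retrieved evidence.
    return compose_answer(user_query, evidence)


# Stand-in callables with placeholder values, only to show the data flow.
def demo_generate(query: str) -> List[str]:
    return ["What is the population of Sunnyvale?",
            "What is the unemployment rate in Sunnyvale?"]


def demo_fetch(question: str) -> str:
    return "<placeholder value>" if "population" in question else ""


def demo_compose(query: str, evidence: Dict[str, str]) -> str:
    return f"{len(evidence)} statistic(s) retrieved for: {query}"


print(answer_with_statistics("Tell me about Sunnyvale",
                             demo_generate, demo_fetch, demo_compose))
```

Passing the three stages in as callables keeps the sketch independent of any particular model runtime or retrieval service.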

The following is an example prompt for obtaining statistical questions for a user query [User Query]:

Your role is that of a Question Generator.  Given Query below, come up with a
maximum of 25 Statistical Questions that help in answering Query.

These are the only forms of Statistical Questions you can generate:
1. What is $METRIC in $PLACE?
2. What is $METRIC in $PLACE $PLACE_TYPE?
3. How has $METRIC changed over time in $PLACE $PLACE_TYPE?

where,
- $METRIC should be a metric on societal topics like demographics, economy, health,
  education, environment, etc.  Examples are unemployment rate and
  life expectancy.
- $PLACE is the name of a place like California, World, Chennai, etc.
- $PLACE_TYPE is an immediate child type within $PLACE, like counties, states,
  districts, etc.

Your response should only have questions, one per line, without any numbering
or bullet.

If you cannot come up with Statistical Questions to ask for a Query, return an
empty response.

Query: [User Query]
Statistical Questions:
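To reuse this prompt for arbitrary queries, the [User Query] slot can be filled programmatically. The sketch below is a minimal example; `PROMPT_TEMPLATE` and `build_prompt` are hypothetical names introduced here, and the template text is copied from the prompt above.

```python
# Prompt text copied from the example above; {query} marks the [User Query] slot.
PROMPT_TEMPLATE = """Your role is that of a Question Generator.  Given Query below, come up with a
maximum of 25 Statistical Questions that help in answering Query.

These are the only forms of Statistical Questions you can generate:
1. What is $METRIC in $PLACE?
2. What is $METRIC in $PLACE $PLACE_TYPE?
3. How has $METRIC changed over time in $PLACE $PLACE_TYPE?

where,
- $METRIC should be a metric on societal topics like demographics, economy, health,
  education, environment, etc.  Examples are unemployment rate and
  life expectancy.
- $PLACE is the name of a place like California, World, Chennai, etc.
- $PLACE_TYPE is an immediate child type within $PLACE, like counties, states,
  districts, etc.

Your response should only have questions, one per line, without any numbering
or bullet.

If you cannot come up with Statistical Questions to ask for a Query, return an
empty response.

Query: {query}
Statistical Questions:"""


def build_prompt(user_query: str) -> str:
    """Insert the user query into the question-generation prompt."""
    return PROMPT_TEMPLATE.format(query=user_query.strip())


prompt = build_prompt("What is the unemployment rate in California?")
print(prompt[-80:])
```

The resulting string can be passed as `input_text` to the tokenizer in the code examples.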

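Model data

The base model was trained on a text dataset drawn from a wide variety of sources; see the Gemma 2 documentation for details. The DataGemma RAG model was fine-tuned on synthetically generated data. More details are available in the DataGemma paper.

Implementation information

Like Gemma, DataGemma RAG was trained on TPUv5e using JAX.

Evaluation

The model was evaluated as part of the evaluation of the full RAG workflow, which is documented in the DataGemma paper.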

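Ethics and safety

We are releasing an early version of the models. They are intended for academic and research use only and are not ready for commercial or general public use. This version was trained on a very small corpus of examples and may exhibit unintended, and at times controversial or inflammatory, behavior. Please expect errors and limitations as we actively develop this large language model interface.

Before launch, we red-teamed the Data Commons natural language interface and checked it against a set of potentially dangerous queries that could lead to misleading, controversial, or inflammatory results.
We ran the same queries against the outputs of the RIG and RAG models and found a few query responses that were controversial but not dangerous.
Because this model is intended for academic and research use only, it has not been put through our usual safety evaluations.

Usage and limitations

These models have certain limitations that users should be aware of.

This is a very early version of DataGemma RAG. It is intended for trusted testers (primarily for academic and research use) and is not ready for commercial or general public use. This version was trained on a very small corpus of examples and may exhibit unintended, and at times controversial or inflammatory, behavior. Please expect errors and limitations as we actively develop this large language model interface.

Your feedback and evaluations are critical to improving DataGemma's performance and will contribute directly to its training process. Known limitations are detailed in the DataGemma paper, and we encourage you to consult it for a full picture of DataGemma's current capabilities.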

🔧 Technical details

Citation

@misc{radhakrishnan2024knowing,
      title={Knowing When to Ask - Bridging Large Language Models and Data}, 
      author={Prashanth Radhakrishnan and Jennifer Chen and Bo Xu and Prem Ramaswami and Hannah Pho and Adriana Olmos and James Manyika and R. V. Guha},
      year={2024},
      eprint={},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://datacommons.org/link/DataGemmaPaper}, 
}