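🚀 DataGemma RIG model card
DataGemma is a series of fine-tuned Gemma 2 models that help large language models (LLMs) access and incorporate reliable public statistical data from Data Commons into their responses. DataGemma RIG is used in the retrieval interleaved generation (RIG) approach: it is trained to annotate each statistic in a response with a natural-language query to Data Commons' existing natural language interface.
🚀 Quick start
Access Gemma
To access Gemma on Hugging Face, you must review and agree to Google's usage license. Make sure you are logged in to Hugging Face, then acknowledge the license on the model page. Requests are processed immediately.
Run the model
Running this fine-tuned model is only one step of the full RIG approach described in the DataGemma paper. You can try the end-to-end RIG flow in the accompanying Colab notebook.
First, make sure the required libraries are installed: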
pip install -U transformers accelerate
Then copy the following snippet:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = 'google/datagemma-rig-27b-it'
tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map='auto' lets accelerate place the weights on the available devices.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map='auto',
    torch_dtype=torch.bfloat16,
)

input_text = 'What are some interesting trends in Sunnyvale spanning gender, age, race, immigration, health conditions, economic conditions, crime and education?'
# Move the inputs to the device that holds the model's first parameters.
inputs = tokenizer(input_text, return_tensors='pt').to(model.device)
outputs = model.generate(**inputs, max_new_tokens=4096)
# Decode only the newly generated tokens, skipping the prompt.
answer = tokenizer.batch_decode(outputs[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True)[0].strip()
print(answer)
Run in 4-bit precision
To run the model in 4-bit precision, first make sure the required libraries are installed:
pip install -U transformers bitsandbytes accelerate
Then copy the following snippet:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

# Store the weights as 4-bit NF4 and run the computation in bfloat16.
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = 'google/datagemma-rig-27b-it'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map='auto',
    quantization_config=nf4_config,
    torch_dtype=torch.bfloat16,
)

input_text = 'What are some interesting trends in Sunnyvale spanning gender, age, race, immigration, health conditions, economic conditions, crime and education?'
# Move the inputs to the device that holds the model's first parameters.
inputs = tokenizer(input_text, return_tensors='pt').to(model.device)
outputs = model.generate(**inputs, max_new_tokens=4096)
# Decode only the newly generated tokens, skipping the prompt.
answer = tokenizer.batch_decode(outputs[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True)[0].strip()
print(answer)
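As a rough guide to why 4-bit loading helps, the sketch below estimates the memory needed for the weights alone. It assumes about 27 billion parameters (inferred from the "27b" in the model ID), 2 bytes per parameter for bfloat16, and 0.5 bytes per parameter for NF4. It ignores activations, the KV cache, and quantization metadata, so actual usage will be higher.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Assumption: parameter count inferred from the "27b" in the model ID.
NUM_PARAMS = 27e9

bf16_gb = weight_memory_gb(NUM_PARAMS, 2.0)  # bfloat16: 16 bits per weight
nf4_gb = weight_memory_gb(NUM_PARAMS, 0.5)   # NF4: 4 bits per weight

print(f"bfloat16 weights: ~{bf16_gb:.1f} GB")
print(f"NF4 weights:      ~{nf4_gb:.1f} GB")
```

Under these assumptions the quantized weights take roughly a quarter of the bfloat16 footprint, which is the main reason to consider the 4-bit path on smaller GPUs.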
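✨ Key features
- Data integration: DataGemma is a series of fine-tuned Gemma 2 models that help LLMs access and incorporate reliable public statistical data from Data Commons into their responses.
- Retrieval interleaved generation: DataGemma RIG is trained to annotate each statistic in a response with a natural-language query to Data Commons' existing natural language interface.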
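📚 Detailed documentation
Model information
Description
DataGemma is a series of fine-tuned Gemma 2 models that help large language models (LLMs) access and incorporate reliable public statistical data from Data Commons into their responses. DataGemma RIG is used in the retrieval interleaved generation (RIG) approach: it is trained to annotate each statistic in a response with a natural-language query to Data Commons' existing natural language interface. See the research paper for more information.
Inputs and outputs
- Input: a text string, such as a question or a prompt.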
- Output: generated English-language text, with each statistic annotated as [__DC__("<natural-language query to fetch the statistic from Data Commons>") --> "<LLM-generated statistic>"].
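To make the annotation format concrete, here is a minimal sketch of how the query and the model's own value could be pulled out of a response with a regular expression. The pattern and the function name extract_annotations are illustrative, not part of any DataGemma tooling, and the pattern assumes that neither the query nor the value contains a double quote.

```python
import re
from typing import List, Tuple

# Matches [__DC__("<query>") --> "<value>"].
# Assumption: neither the query nor the value contains a double quote.
DC_PATTERN = re.compile(r'\[__DC__\("([^"]*)"\)\s*-->\s*"([^"]*)"\]')


def extract_annotations(text: str) -> List[Tuple[str, str]]:
    """Return (query, llm_value) pairs in order of appearance."""
    return DC_PATTERN.findall(text)


sample = (
    'Sunnyvale has a median age of around '
    '[__DC__("what was the median age of residents in Sunnyvale, CA in 2020?") --> "35"].'
)

for query, value in extract_annotations(sample):
    print(f"{value!r} <- {query}")
```

Each pair keeps the model's guess next to the question that would verify it, which is the information a downstream retrieval step needs.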
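Model data
The base model was trained on a dataset of text that includes a wide variety of sources; see the Gemma 2 documentation for details. The DataGemma RIG model is fine-tuned on synthetically generated data. More details can be found in the DataGemma paper.
Implementation information
Like Gemma, DataGemma RIG was trained on TPUv5e using JAX.
Evaluation
Model evaluation was carried out as part of the evaluation of the full RIG workflow and is documented in the DataGemma paper.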
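One way to picture what the full workflow adds on top of generation is a post-processing pass that swaps each annotation for a retrieved value. The sketch below is illustrative only: lookup is a hypothetical callable standing in for a Data Commons query, the place name and numbers are made up, and falling back to the model's own value when nothing is retrieved is an assumption rather than the behavior documented in the DataGemma paper.

```python
import re
from typing import Callable, Optional

# Matches [__DC__("<query>") --> "<value>"]; assumes no double quotes inside either part.
DC_PATTERN = re.compile(r'\[__DC__\("([^"]*)"\)\s*-->\s*"([^"]*)"\]')


def resolve_annotations(text: str, lookup: Callable[[str], Optional[str]]) -> str:
    """Replace each annotation with lookup(query), or the LLM value if lookup returns None."""

    def _replace(match: re.Match) -> str:
        query, llm_value = match.group(1), match.group(2)
        retrieved = lookup(query)
        return retrieved if retrieved is not None else llm_value

    return DC_PATTERN.sub(_replace, text)


# Hypothetical stand-in for a real Data Commons call; the data is invented.
fake_store = {"what is the population of Exampleville?": "12,345"}

annotated = (
    'About [__DC__("what is the population of Exampleville?") --> "12,000"] people, '
    'median age [__DC__("what is the median age in Exampleville?") --> "40"].'
)
resolved = resolve_annotations(annotated, fake_store.get)
print(resolved)
```

Passing the lookup in as a callable keeps the text handling separate from the retrieval backend, so a stub like the dictionary above can be replaced by a real client without touching the parsing code.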
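Ethics and safety
We are releasing an early version of the models. They are intended for academic and research use only and are not ready for commercial or general public use. This version was trained on a very small corpus of examples and may exhibit unintended, and at times controversial or inflammatory, behavior. Please anticipate errors and limitations as we actively develop this large language model interface.
- We red-teamed the Data Commons natural language interface before release and checked it against a set of potentially dangerous queries that could produce misleading, controversial, or inflammatory results.
- We ran the same queries against the outputs of the RIG and RAG models and found a few query responses that were controversial but not dangerous.
- Because this model is intended for academic and research purposes only, it has not been through our usual safety evaluations.
Usage and limitations
These models have certain limitations that users should be aware of.
This is a very early version of DataGemma RIG. It is meant for trusted tester use (primarily for academic and research purposes) and is not yet ready for commercial or general public use. This version was trained on a very small corpus of examples and may exhibit unintended, and at times controversial or inflammatory, behavior. Please anticipate errors and limitations as we actively develop this large language model interface.
Your feedback and evaluations are critical to improving DataGemma's performance and will directly contribute to its training process. Known limitations are detailed in the DataGemma paper, and we encourage you to consult it for a comprehensive understanding of DataGemma's current capabilities.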
🔧 Technical details
Resources and technical documentation
Citation
@misc{radhakrishnan2024knowing,
title={Knowing When to Ask - Bridging Large Language Models and Data},
author={Prashanth Radhakrishnan and Jennifer Chen and Bo Xu and Prem Ramaswami and Hannah Pho and Adriana Olmos and James Manyika and R. V. Guha},
year={2024},
eprint={},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://datacommons.org/link/DataGemmaPaper},
}
📄 License
Gemma
Authors
Terms of use
Example output
Sunnyvale, CA has seen significant progress in gender, age, and educational demographics in recent years. Here's a breakdown:
**Gender:**
* **Overall:** Sunnyvale has a relatively equal gender distribution, with a slight skew towards male residents.
* **Recent Shifts:**
* **2010-2020:** The female population in Sunnyvale has been steadily increasing, reaching a [__DC__("what was the percentage of females in Sunnyvale, CA in 2020?") --> "51.6%"] majority in 2020. This is largely due to its proximity to Silicon Valley and the increasing number of women working in tech.
* **Age:** Sunnyvale has a relatively young population, with a median age of around [__DC__("what was the median age of residents in Sunnyvale, CA in 2020?") --> "35"]. This reflects the dominance of the tech industry, which tends to favor younger demographics.
* **Immigration:** Sunnyvale has a significant immigrant population, with many residents hailing from countries in Central and South America, Asia, and the Middle East.
**Age:**
* **Recent Shifts:**
* **2010-2020:** Sunnyvale has seen a slight decrease in the median age, driven by a competitive job market and a desire for better work-life balance.
* **Aging Population:** While Sunnyvale has a young population overall, it's also aging. This means the proportion of older residents is increasing, which can put pressure on healthcare and social services.
**Race & Ethnicity:**
* **Majority-Minority City:** Sunnyvale is a majority-minority city, meaning that no single racial or ethnic group constitutes a majority of the population.
* **Key Demographics:**
* **White:** [__DC__("what percentage of residents in Sunnyvale, CA are white in 2020?") --> "55.1%"]
* **Hispanic or Latino:** [__DC__("what percentage of residents in Sunnyvale, CA are hispanic or latino in 2020?") --> "30.6%"]
* **Asian:** [__DC__("what percentage of residents in Sunnyvale, CA are asian in 2020?") --> "16.8%"]
* **Black or African American:** [__DC__("what percentage of residents in Sunnyvale, CA are black or african american in 2020?") --> "3.6%"]
* **Other:** [__DC__("what percentage of residents in Sunnyvale, CA are classified as other race in 2020?") --> "0.5%"]
* **Trends & Issues:**
* **Tech Dominance:** Sunnyvale's proximity to tech jobs drives a younger, more diverse workforce.
* **Education & Income:**
A highly educated population in Sunnyvale translates to a highly skilled workforce, commanding higher salaries on average.
* **Racial & Ethnic Diversity:**
The city's commitment to diversity and inclusion is reflected in its demographics.
* **Aging Population:**
The aging population presents challenges for healthcare and social services, while a younger population puts pressure on schools and childcare.
**Economic Conditions:**
* **Low Unemployment Rate:** Sunnyvale consistently boasts a low unemployment rate, indicating a strong tech sector.
* **High Median Household Income:**
The median household income in Sunnyvale is significantly higher than the national average, driven by the high demand for skilled labor in the area.
* **Competitive Landscape:** Sunnyvale faces increasing competition for skilled workers and struggles with housing affordability and traffic congestion.
**Education Levels:**
* **High Percentage of Adults with Bachelor's Degree or Higher:**
A highly educated population is a key driver of the city's workforce and economy.
* **Emphasis on STEM Education & Innovation:**
Sunnyvale schools place a strong emphasis on STEM education and innovation, reflecting the region's tech-driven culture.
**Challenges:**
* **Gender Gap:**
The tech industry has a persistent gender gap, with women holding a smaller percentage of jobs than men.
* **Age Diversity:**
While Sunnyvale has a young population overall, it's important to ensure a balance of age groups in the workforce.
* **Immigration Policy:**
Immigration policy debates and enforcement can create uncertainty and hardship for immigrant communities.
**Resources:**
* **Sunnyvale Chamber of Commerce:** https://www.sunnyvalecoc.org/
* **City of Sunnyvale:** https://www.sunnyvale.ca.gov/
* **Stanford University:** https://www.stanford.edu/
**Note:**
These are just overarching trends. It's important to consult reliable sources like the U.S. Census Bureau and the Bureau of Labor Statistics for more detailed and up-to-date information.



