Open DataGemma-RIG-27b-it model: free integration of public statistical data with source annotation

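DataGemma RIG 27b IT

Developed by Google

DataGemma is a family of models fine-tuned from Gemma 2 that integrate public statistical data from Data Commons. This variant uses Retrieval Interleaved Generation (RIG) to annotate where each statistic comes from.

Large language model

Transformers

#public-data-integration #retrieval-interleaved-generation #statistical-query-annotation

Downloads: 1,574

Released: 8/27/2024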

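Model overview

The DataGemma RIG model annotates the statistics in its responses with natural-language queries to Data Commons, so that a full RIG pipeline can retrieve each figure and check it against its source.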

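Model features

Data integration

Helps LLM responses access and incorporate public statistical data from Data Commons, which makes the figures in a response easier to verify.

Retrieval Interleaved Generation

Interleaves retrieval queries with generation and annotates the source of each statistic in a standard format.

Academic and research focus

Designed for academic and research use, with a citation mechanism that lets the data be verified.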

Model capabilities

Public data queries

Statistical data analysis

Multi-domain trend report generation

Structured data annotation

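Use cases

Social trend analysis

City demographics

Analyze how a city's population is distributed by gender, age, race, and similar attributes, and how that distribution changes over time.

Produces a detailed report whose statistics are annotated with Data Commons queries (see the example output).

Economic research

Regional economic indicators

Query and compare economic indicators such as employment rates and income levels across regions.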

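🚀 DataGemma RIG model card

DataGemma is a series of fine-tuned Gemma 2 models that help large language models (LLMs) access and incorporate reliable public statistical data from Data Commons into their responses. DataGemma RIG is used in the Retrieval Interleaved Generation approach: it is trained to annotate each statistic in its response with a natural-language query addressed to Data Commons' existing natural language interface.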

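🚀 Quick start

Accessing Gemma

To access Gemma on Hugging Face, you must review and agree to Google's usage license. Make sure you are logged in to Hugging Face before acknowledging the license; requests are processed immediately.

Running the model

The snippet below runs the fine-tuned model, which is only one step of the full RIG approach described in the DataGemma paper. You can try the end-to-end RIG flow in the Colab notebook.

First, make sure the required libraries are installed:

pip install -U transformers accelerate

Then copy the following snippet: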

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = 'google/datagemma-rig-27b-it'
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Load the weights in bfloat16 and let accelerate place them on the available devices.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map='auto',
    torch_dtype=torch.bfloat16,
)

input_text = 'What are some interesting trends in Sunnyvale spanning gender, age, race, immigration, health conditions, economic conditions, crime and education?'
# Send the inputs to the device of the model's first parameters rather than hard-coding 'cuda'.
inputs = tokenizer(input_text, return_tensors='pt').to(model.device)

outputs = model.generate(**inputs, max_new_tokens=4096)
# Decode only the newly generated tokens, skipping the prompt.
answer = tokenizer.batch_decode(outputs[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True)[0].strip()
print(answer)
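
The decoded answer still contains annotations of the form [__DC__("<query>") --> "<LLM value>"]; resolving them against Data Commons belongs to the later steps of the RIG flow, which this card does not show. The sketch below is only an illustration of what such a step might look like. The `lookup` callable, the `STUB_STORE` dictionary, and the "(retrieved)" / "(unverified)" markers are all hypothetical and stand in for a real Data Commons call; they are not the method from the DataGemma paper.

```python
import re
from typing import Callable, Optional

# Matches [__DC__("<query>") --> "<value>"]; assumes neither field contains a double quote.
DC_PATTERN = re.compile(r'\[__DC__\("([^"]*)"\)\s*-->\s*"([^"]*)"\]')


def resolve_annotations(text: str, lookup: Callable[[str], Optional[str]]) -> str:
    """Replace each annotation with a looked-up value, or keep the LLM value if none is found."""

    def _replace(match: re.Match) -> str:
        query, llm_value = match.group(1), match.group(2)
        retrieved = lookup(query)
        if retrieved is None:
            # Nothing retrieved: keep the model's own number but flag it.
            return f"{llm_value} (unverified)"
        return f"{retrieved} (retrieved)"

    return DC_PATTERN.sub(_replace, text)


# Hypothetical stand-in for Data Commons; the stored value is a placeholder, not real data.
STUB_STORE = {
    "what was the median age of residents in Sunnyvale, CA in 2020?": "<retrieved value>",
}


def stub_lookup(query: str) -> Optional[str]:
    return STUB_STORE.get(query)


demo = (
    'Median age: [__DC__("what was the median age of residents in Sunnyvale, CA in 2020?") --> "35"]; '
    'female share: [__DC__("what was the percentage of females in Sunnyvale, CA in 2020?") --> "51.6%"].'
)
resolved = resolve_annotations(demo, stub_lookup)
print(resolved)
```

Passing the lookup as a callable keeps the text handling separate from whatever retrieval backend is used, so the same function can be exercised with a stub like the one here.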

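Running at 4-bit precision

To run this model at 4-bit precision, first make sure the required libraries are installed:

pip install -U transformers bitsandbytes accelerate

Then copy the following snippet: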

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

# Store the weights as 4-bit NF4 and run the computation in bfloat16.
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = 'google/datagemma-rig-27b-it'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map='auto',
    quantization_config=nf4_config,
    torch_dtype=torch.bfloat16,
)

input_text = 'What are some interesting trends in Sunnyvale spanning gender, age, race, immigration, health conditions, economic conditions, crime and education?'
# Send the inputs to the device of the model's first parameters rather than hard-coding 'cuda'.
inputs = tokenizer(input_text, return_tensors='pt').to(model.device)

outputs = model.generate(**inputs, max_new_tokens=4096)
# Decode only the newly generated tokens, skipping the prompt.
answer = tokenizer.batch_decode(outputs[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True)[0].strip()
print(answer)
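
To see why the 4-bit path matters, a back-of-the-envelope estimate of weight storage can help. The sketch below assumes a nominal 27 billion parameters (taken from the "27b" in the model name, not from a measured count) and counts weights only; activations, the KV cache, and the quantization constants that bitsandbytes stores alongside the 4-bit weights are ignored, so real usage will be higher.

```python
def approx_weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Rough weight-only footprint in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 27e9  # assumption: nominal size implied by the model name

bf16_gb = approx_weight_memory_gb(NUM_PARAMS, 16)  # bfloat16: 2 bytes per weight
nf4_gb = approx_weight_memory_gb(NUM_PARAMS, 4)    # NF4: half a byte per weight

print(f"bfloat16 weights: ~{bf16_gb:.1f} GB")  # ~54.0 GB
print(f"4-bit weights:    ~{nf4_gb:.1f} GB")   # ~13.5 GB
```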

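📚 Detailed documentation

Model information

Description

Inputs and outputs

Input: a text string, such as a question or a prompt.
Output: generated English-language text in which each statistic is annotated as [__DC__("<natural-language query to fetch the statistic from Data Commons>") --> "<LLM-generated statistic>"].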
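
As an illustration of the output format above, the following sketch pulls every annotation out of a response as a (query, LLM value) pair. The regular expression and the `DCAnnotation` / `extract_annotations` names are assumptions made for this example, not part of any DataGemma or Data Commons library, and the pattern assumes that neither field contains a double quote.

```python
import re
from typing import List, NamedTuple


class DCAnnotation(NamedTuple):
    query: str      # natural-language query meant for Data Commons
    llm_value: str  # statistic the model generated on its own


# Matches [__DC__("<query>") --> "<value>"].
DC_PATTERN = re.compile(r'\[__DC__\("([^"]*)"\)\s*-->\s*"([^"]*)"\]')


def extract_annotations(text: str) -> List[DCAnnotation]:
    """Return all annotations in the order they appear in the text."""
    return [DCAnnotation(m.group(1), m.group(2)) for m in DC_PATTERN.finditer(text)]


# Sentence fragment taken from the example output in this card.
sample = (
    'with a median age of around [__DC__("what was the median age of residents '
    'in Sunnyvale, CA in 2020?") --> "35"].'
)
annotations = extract_annotations(sample)
for annotation in annotations:
    print(f"{annotation.query} -> {annotation.llm_value}")
```

Keeping the query and the model's own value side by side makes it straightforward to compare the generated figure with whatever a retrieval step returns.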

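Model data

The base model was trained on a dataset of text drawn from a wide variety of sources; see the Gemma 2 documentation for details. The DataGemma RIG model is fine-tuned on synthetically generated data. More details can be found in the DataGemma paper.

Implementation information

Like Gemma, DataGemma RIG was trained on TPUv5e using JAX.

Evaluation

The model was evaluated as part of the evaluation of the full RIG workflow, and the results are documented in the DataGemma paper.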

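Ethics and safety

We are releasing an early version of the models. They are meant for academic and research purposes only and are not ready for commercial or general public use. This version was trained on a very small corpus of examples and may exhibit unintended, and at times controversial or inflammatory, behavior. Please anticipate errors and limitations as we actively develop this large language model interface.

Prior to release, we red teamed the Data Commons natural language interface and checked it against a set of potentially dangerous queries that could lead to misleading, controversial, or inflammatory results.
We ran the same queries against the outputs of the RIG and RAG models and found a few query responses that were controversial, but not dangerous.
Because this model is for academic and research purposes only, it has not been subjected to our usual safety evaluations.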

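Usage and limitations

These models have certain limitations that users should be aware of.

This is a very early version of DataGemma RIG. It is meant for trusted tester use (primarily academic and research use) and is not yet ready for commercial or general public use. This version was trained on a very small corpus of examples and may exhibit unintended, and at times controversial or inflammatory, behavior. Please anticipate errors and limitations as we actively develop this large language model interface.

Your feedback and evaluations are critical to refining DataGemma's performance and will directly contribute to its training process. Known limitations are detailed in the DataGemma paper, and we encourage you to consult it for a comprehensive understanding of DataGemma's current capabilities.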

🔧 Technical details

Resources and technical documentation

Citation

@misc{radhakrishnan2024knowing,
      title={Knowing When to Ask - Bridging Large Language Models and Data}, 
      author={Prashanth Radhakrishnan and Jennifer Chen and Bo Xu and Prem Ramaswami and Hannah Pho and Adriana Olmos and James Manyika and R. V. Guha},
      year={2024},
      eprint={},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://datacommons.org/link/DataGemmaPaper}, 
}

📄 License

Gemma

Authors

Google

Terms of use

Terms

Example output

Sunnyvale, CA has seen significant progress in gender, age, and educational demographics in recent years. Here's a breakdown:

**Gender:**

* **Overall:** Sunnyvale has a relatively equal gender distribution, with a slight skew towards male residents.
* **Recent Shifts:**

    * **2010-2020:** The female population in Sunnyvale has been steadily increasing, reaching a [__DC__("what was the percentage of females in Sunnyvale, CA in 2020?") --> "51.6%"] majority in 2020. This is largely due to its proximity to Silicon Valley and the increasing number of women working in tech.
    * **Age:** Sunnyvale has a relatively young population, with a median age of around [__DC__("what was the median age of residents in Sunnyvale, CA in 2020?") --> "35"]. This reflects the dominance of the tech industry, which tends to favor younger demographics.
    * **Immigration:** Sunnyvale has a significant immigrant population, with many residents hailing from countries in Central and South America, Asia, and the Middle East.

**Age:**

* **Recent Shifts:**

    * **2010-2020:** Sunnyvale has seen a slight decrease in the median age, driven by a competitive job market and a desire for better work-life balance.
    * **Aging Population:** While Sunnyvale has a young population overall, it's also aging. This means the proportion of older residents is increasing, which can put pressure on healthcare and social services.

**Race & Ethnicity:**

* **Majority-Minority City:** Sunnyvale is a majority-minority city, meaning that no single racial or ethnic group constitutes a majority of the population.
* **Key Demographics:**

    * **White:** [__DC__("what percentage of residents in Sunnyvale, CA are white in 2020?") --> "55.1%"]
    * **Hispanic or Latino:** [__DC__("what percentage of residents in Sunnyvale, CA are hispanic or latino in 2020?") --> "30.6%"]
    * **Asian:** [__DC__("what percentage of residents in Sunnyvale, CA are asian in 2020?") --> "16.8%"]
    * **Black or African American:** [__DC__("what percentage of residents in Sunnyvale, CA are black or african american in 2020?") --> "3.6%"]
    * **Other:** [__DC__("what percentage of residents in Sunnyvale, CA are classified as other race in 2020?") --> "0.5%"]
* **Trends & Issues:**

    * **Tech Dominance:** Sunnyvale's proximity to tech jobs drives a younger, more diverse workforce.
    * **Education & Income:**

A highly educated population in Sunnyvale translates to a highly skilled workforce, commanding higher salaries on average.
* **Racial & Ethnic Diversity:**

The city's commitment to diversity and inclusion is reflected in its demographics.
* **Aging Population:**

The aging population presents challenges for healthcare and social services, while a younger population puts pressure on schools and childcare.

**Economic Conditions:**

* **Low Unemployment Rate:** Sunnyvale consistently boasts a low unemployment rate, indicating a strong tech sector.
* **High Median Household Income:**

The median household income in Sunnyvale is significantly higher than the national average, driven by the high demand for skilled labor in the area.
* **Competitive Landscape:** Sunnyvale faces increasing competition for skilled workers and struggles with housing affordability and traffic congestion.

**Education Levels:**

* **High Percentage of Adults with Bachelor's Degree or Higher:**

A highly educated population is a key driver of the city's workforce and economy.
* **Emphasis on STEM Education & Innovation:**

Sunnyvale schools place a strong emphasis on STEM education and innovation, reflecting the region's tech-driven culture.

**Challenges:**

* **Gender Gap:**

The tech industry has a persistent gender gap, with women holding a smaller percentage of jobs than men.

* **Age Diversity:**

While Sunnyvale has a young population overall, it's important to ensure a balance of age groups in the workforce.
* **Immigration Policy:**

Immigration policy debates and enforcement can create uncertainty and hardship for immigrant communities.

**Resources:**

* **Sunnyvale Chamber of Commerce:** https://www.sunnyvalecoc.org/
* **City of Sunnyvale:** https://www.sunnyvale.ca.gov/
* **Stanford University:** https://www.stanford.edu/

**Note:**

These are just overarching trends. It's important to consult reliable sources like the U.S. Census Bureau and the Bureau of Labor Statistics for more detailed and up-to-date information.