Firefly Replicates Vicuna-13B on a Single GPU, Scoring 0.2 Points Higher on the Open LLM Leaderboard

Source: YeungNLP

01

Introduction

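Over the past few months, through continuous iteration, the Firefly project has added support for instruction tuning of LLaMA-2, Ziya, Baichuan, InternLM, LLaMA-1, Bloom and other models, and has open-sourced several model weights fine-tuned on Chinese instructions, which have been well received. The project currently has more than 1,100 stars on GitHub. Thank you all for your interest and support.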

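Until now we have spent most of our effort on optimizing the training pipeline, but as the project evolved we increasingly recognized the importance of model evaluation. Previously we mostly inspected individual cases and judged model quality subjectively, which is neither rigorous nor convincing and cannot properly measure Firefly's capabilities.

We are gradually improving the project's model evaluation. This time we are open-sourcing firefly-llama-13b and have evaluated it objectively on Hugging Face's Open LLM leaderboard.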

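On the leaderboard, firefly-llama-13b achieves good results: it scores 0.2 points higher than vicuna-13b-1.1, 0.5 points lower than llama-2-13b-chat, and 0.6 points lower than vicuna-13b-v1.3. Judging by these scores, firefly-llama-13b is very close to vicuna-13b and llama-2-13b-chat.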

Note: the columns in the table are, in order, Average, ARC, HellaSwag, MMLU and TruthfulQA (MC).

[Table image: Open LLM leaderboard scores]

Firefly project:

https://github.com/yangjianxin1/Firefly

firefly-llama-13b weights:

https://huggingface.co/YeungNLP/firefly-llama-13b

Open LLM leaderboard:

https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard

02

Model metrics

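The Open LLM leaderboard is a public leaderboard set up by Hugging Face to evaluate open large language models. It covers the following four benchmarks:

  • AI2 Reasoning Challenge (ARC): a set of grade-school science questions.

  • HellaSwag: a test of commonsense reasoning that is easy for humans (~95%) but challenging for language models.

  • MMLU: a test of a text model's multitask accuracy. It covers 57 tasks, including elementary mathematics, US history, computer science, law and more.

  • TruthfulQA MC: a benchmark that tests whether a language model is truthful when generating answers to questions.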

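The results of firefly-llama-13b on the Open LLM leaderboard are as follows:

  1. firefly-llama-13b scores 59.4, ahead of vicuna-13b-1.1 (59.2) and guanaco-13b (59.1).

  2. It trails the recently popular Llama-2-13b-chat-hf (59.9) by only 0.5 points, and vicuna-13b-v1.3 (60) by 0.6 points.

  3. It leads 7B-parameter models such as vicuna-7b-v1.3 and mpt-7b-chat by a wide margin.

  4. firefly-llama-13b scores 4.4 points higher than vicuna-13b-v1.3 on ARC and 5.5 points higher than Llama-2-13b-chat-hf on TruthfulQA. The one shortcoming is that on MMLU and TruthfulQA it still lags somewhat behind vicuna-13b.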

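These results validate the effectiveness of Firefly's QLoRA training pipeline. To address the weaker TruthfulQA and MMLU results, we are currently running a data experiment.

It is worth noting that vicuna-13b was trained with full-parameter fine-tuning, which demands a great deal of training resources. firefly-llama-13b, by contrast, was trained with QLoRA, which can fine-tune a 13B model with as little as 16 GB of GPU memory.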

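We used Firefly's QLoRA training pipeline to instruction-tune llama-13b. With a sequence length of 1024 and a batch size of 1, GPU memory usage stays under 16 GB, so a single 24 GB card is enough for training. This greatly lowers the barrier to training models at the ten-billion-parameter scale.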
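As a rough back-of-the-envelope check on why a 13B model can fit on a single card, the sketch below compares the memory needed just to hold the base weights at 16-bit precision and at the 4-bit precision that QLoRA uses for the frozen base model. The parameter count of 13e9 is a round-number assumption, and the estimate ignores the adapters, optimizer state, activations and quantization metadata, so it is a lower bound rather than a measurement.

```python
# Rough lower bound on GPU memory for the frozen base weights only.
# Assumption: 13e9 parameters (a round number); adapters, optimizer state,
# activations and quantization constants are deliberately ignored.

def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Return the memory needed to store the weights, in GiB."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 1024**3

NUM_PARAMS = 13e9

fp16_gib = weight_memory_gib(NUM_PARAMS, 16)  # half-precision weights
int4_gib = weight_memory_gib(NUM_PARAMS, 4)   # 4-bit quantized weights

print(f"16-bit weights: {fp16_gib:.1f} GiB")
print(f" 4-bit weights: {int4_gib:.1f} GiB")
```

Under these assumptions the 4-bit weights take a quarter of the space of the 16-bit weights, which is what leaves headroom for the adapters and activations on a 24 GB card.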

02

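Training strategy

We trained firefly-llama-13b with QLoRA. Readers who are not yet familiar with how QLoRA works or with its training procedure can refer to our earlier articles: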

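  1. A step-by-step tutorial on fine-tuning Baichuan-13B: training a ten-billion-parameter model hands-on

  2. Firefly-13B open-sourced: QLoRA plus one million samples, a ten-billion-parameter model trainable on a single GPU

  3. Firefly | Testing baichuan-7B: fine-tuning with QLoRA and one million instruction samples

  4. Firefly | QLoRA plus one million samples: efficient multi-GPU fine-tuning of bloom-7b1

  5. [QLoRA in practice] Efficient single-GPU fine-tuning of bloom-7b1, with impressive results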

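We use huggyllama/llama-13b as the base model and train on the UltraChat dataset, an English multi-turn dialogue dataset open-sourced by Tsinghua University that contains 1.4 million multi-turn conversations. We train for only one epoch.

We organize the training data in jsonl format. Each record has the following format, in which the conversation field is required.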

{
    "conversation_id":"4",
    "dataset":"UltraChat",
    "conversation":[
        {
            "human":"What is the most popular smartphone brand these days?",
            "assistant":"As an AI language model, I don't have updated information on the current market scenario. However, according to recent statistics, Samsung, Apple, and Huawei are among the most popular smartphone brands worldwide."
        },
        {
            "human":"Interesting! Among these three brands, which one do you think has the most features and benefits for its price?",
            "assistant":"As an AI language model, I cannot have personal opinions or bias towards any brand. However, all three brands - Samsung, Apple, and Huawei - offer different features and benefits at different price points. It ultimately comes down to your personal preferences and requirements, like camera quality, design, software, operating system, and other factors, to choose the right brand and model that fits your needs and budget."
        },
        {
            "human":"Can you tell me more about the camera quality of these three brands? Which one has the best camera performance?",
            "assistant":"Sure, I can provide you with some information about the camera quality of these three smartphone brands.\n\n1. Apple: iPhones are well known for their camera performance. The latest iPhone models come with a dual-camera or triple-camera setup that is capable of capturing stunning pictures and videos. Apple also introduced features like Night mode, Deep Fusion, and LiDAR Scanner that enhance the camera quality even further.\n\n2. Samsung: Samsung's flagship smartphone models like Galaxy S and Galaxy Note series are known for their camera capabilities. The latest models of Samsung smartphones come with a high-resolution camera that produces excellent pictures with vibrant colors and sharp details. Samsung also offers features like a 108-megapixel camera, Space Zoom, and Single Take mode that enhance the camera quality.\n\n3. Huawei: Huawei smartphones are known for their camera performance and innovation. The latest models like P40 Pro and P40 Pro+ come with a high-resolution camera, and Huawei's proprietary camera technology produces excellent picture quality. Huawei phones also offer features like SuperZoom, AI Cinema, and other editing tools to enhance the camera experience.\n\nAll three brands offer excellent camera quality, but the best camera among these three brands would depend on individual preferences and usage. It's recommended to compare the camera specifications, sample pictures, and features of each brand's models to choose the one that fits your requirements."
        },
        {
            "human":"That's helpful. I'm particularly interested in low-light photography. Which of these three brands would you suggest for taking good quality pictures in low light?",
            "assistant":"All three brands - Apple, Samsung, and Huawei - offer excellent low-light photography options, but their approach to low-light photography is different, and each brand has its own strengths and weaknesses when it comes to capturing photos in low light. Here's a brief overview of each brand's approach to low-light photography.\n\n1. Apple: Apple's Night mode feature, which is available on recent iPhone models, is designed to take bright and detailed photos in low light. Night mode uses a combination of multi-frame image processing and advanced machine learning to capture crisp and clear images in both indoor and outdoor low-light environments.\n\n2. Samsung: Samsung's low-light photography capabilities are mainly powered by its camera's sensor and lens technology. The newer Samsung Galaxy S and Galaxy Note models come with a larger sensor and wider aperture, which allows more light to enter the device, resulting in brighter and more detailed photos. These models also use Multi-Frame Noise Reduction and Scene Optimizer to improve the quality of photos taken in low light.\n\n3. Huawei: Huawei is known for its exceptional low-light photography capabilities. Huawei's proprietary camera technology, which uses a combination of large sensors, advanced algorithms, and AI, produces bright and detailed low-light photos. The Huawei P40 Pro and P40 Pro+ models come with a SuperSensing 50-megapixel main lens, which enables capturing clear images in low-light conditions.\n\nAll three brands offer great options, but if you are looking for capturing low light photos primarily, then Huawei's P40 Pro, P40 Pro+ or Mate 40 Pro might be a good choice for you due to the brand's focus on low-light photography features."
        },
        {
            "human":"I'm also curious about the battery life of these smartphones. Which brand offers the longest battery life on average?",
            "assistant":"Battery life is an important factor for anyone looking to buy a smartphone, and all three brands - Apple, Samsung, and Huawei - offer different battery life options. 1. Apple: Battery life on iPhones can last anywhere from 4 to 10 hours, depending on the model you choose. Apple's latest iPhone models come with improved battery life with all-day battery performance. However, if you tend to use your smartphone frequently or for more power-intensive tasks, such as gaming or video playback, you may need to charge your iPhone more frequently. 2. Samsung: Samsung's Galaxy series has excellent battery life, and the newer models like the Galaxy S21 and Galaxy Note 20 are designed to last all day long. The battery life on Samsung phones varies from one model to another, but in general, they offer better battery life compared to iPhones. 3. Huawei: Huawei phones are often praised for their long battery life. They come with large-capacity batteries that can last up to two days, depending on usage. The latest Huawei models like the P40 Pro and Mate 40 Pro, also have fast charging features, which means you can quickly refuel your smartphone's battery when it's running low. In conclusion, all three brands offer decent battery life, but Huawei typically offers the longest battery life on average, followed by Samsung."
        },
        {
            "human":"That's good to know. I'm also wondering about the durability of these smartphones. Which brand offers the most durable phones? I tend to drop my phone frequently, so I want something that can handle a few falls.",
            "assistant":"All three Smartphone brands - Apple, Samsung, and Huawei - offer durable smartphones, but it's hard to say which one is the most durable since different models have different levels of durability. That being said, here is some information on their durability.\n\n1. Apple: Apple is known for offering durable and sturdy smartphones that can withstand some rough handling. Recent iPhone models come with reinforced glass and are water and dust-resistant. However, with its sleek designs, the iPhone models may be more prone to shattering and cracking if dropped.\n\n2. Samsung: Samsung smartphones like the Galaxy S and Galaxy Note series are generally sturdy and are made with scratch-resistant glass. The newer models are water and dust resistant and come with reinforced corners to protect against drops and bumps.\n\n3. Huawei: Huawei also offers durable smartphones that can handle some falls and rough handling. The latest Huawei models usually come with a sturdy build, reinforced corners, and scratch-resistant glass. The Huawei Mate 40 Pro even has a ceramic back, making it extra durable.\n\nIn conclusion, all three brands - Apple, Samsung, and Huawei - offer durable smartphones that can handle some drops, but the level of durability varies from model to model. If frequent drops are a concern, it's recommended to choose a phone with a rugged case or invest in a protective case to ensure better protection."
        }
    ]
}
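To make the record layout concrete, here is a minimal sketch of how one jsonl line in this shape could be parsed and checked for the required conversation field. The function name `parse_record` and the tiny sample record are made up for illustration and are not taken from the Firefly code base.

```python
import json
from typing import List, Tuple

def parse_record(line: str) -> List[Tuple[str, str]]:
    """Parse one jsonl line and return its (human, assistant) turns."""
    record = json.loads(line)
    if "conversation" not in record:
        raise ValueError("missing required field: conversation")
    turns = []
    for turn in record["conversation"]:
        turns.append((turn["human"], turn["assistant"]))
    return turns

# A tiny made-up record in the same shape as the example above.
sample_line = json.dumps({
    "conversation_id": "0",
    "dataset": "demo",
    "conversation": [
        {"human": "Hi", "assistant": "Hello!"},
        {"human": "How are you?", "assistant": "Fine, thanks."},
    ],
})

turns = parse_record(sample_line)
print(len(turns), turns[0])
```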

During training, we concatenate each multi-turn conversation into the following format and then tokenize it.

<s>input1</s>target1</s>input2</s>target2</s>...

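When computing the loss, we apply a mask so that the loss on the input parts does not contribute to parameter updates; only the loss on the "target</s>" parts does. This makes full use of the model's parallel computation, so training is more efficient, and every target in a multi-turn conversation takes part in training, so the data is used more thoroughly. Otherwise, an n-turn conversation would have to be split into n separate samples, with the loss computed only on the last target of each, which greatly reduces training efficiency.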
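The concatenation and masking described above can be sketched as follows. This is an illustrative sketch, not the project's implementation: the special-token ids, the `build_example` helper and the character-level `toy_encode` function are assumptions standing in for a real tokenizer.

```python
from typing import Callable, List, Tuple

BOS_ID, EOS_ID = 1, 2  # assumed special-token ids for this sketch

def build_example(
    turns: List[Tuple[str, str]],
    encode: Callable[[str], List[int]],
    max_length: int = 1024,
) -> Tuple[List[int], List[int]]:
    """Concatenate turns as <s>input1</s>target1</s>... and mark target tokens.

    Returns (input_ids, target_mask); target_mask[i] is 1 where the token
    belongs to a "target</s>" span and 0 elsewhere.
    """
    input_ids, target_mask = [BOS_ID], [0]
    for human, assistant in turns:
        src = encode(human) + [EOS_ID]      # input part: excluded from the loss
        tgt = encode(assistant) + [EOS_ID]  # target part, including its </s>
        input_ids += src + tgt
        target_mask += [0] * len(src) + [1] * len(tgt)
    # Truncate both sequences to the maximum length.
    return input_ids[:max_length], target_mask[:max_length]

# Toy encoder for illustration only: one id per character.
def toy_encode(text: str) -> List[int]:
    return [ord(ch) for ch in text]

ids, mask = build_example([("ab", "cd"), ("e", "f")], toy_encode)
print(ids)
print(mask)
```

A single sequence built this way carries every turn's target, so one forward pass yields a training signal for all n turns instead of only the last one.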

For the implementation of the loss computation, see the following code:

https://github.com/yangjianxin1/Firefly/blob/master/component/loss.py#L3
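For readers who want the idea without opening the repository, below is a dependency-free sketch of a masked next-token loss. It is not a copy of the linked file; the function name `masked_lm_loss` and the `IGNORE_INDEX` value are assumptions chosen for illustration, and a real implementation would operate on batched tensors rather than Python lists.

```python
import math
from typing import List

IGNORE_INDEX = -100  # label value that marks positions excluded from the loss

def masked_lm_loss(
    logits: List[List[float]],   # [seq_len][vocab_size]
    input_ids: List[int],        # [seq_len]
    target_mask: List[int],      # [seq_len], 1 for target tokens
) -> float:
    """Mean next-token cross-entropy over target positions only."""
    labels = [tok if m == 1 else IGNORE_INDEX
              for tok, m in zip(input_ids, target_mask)]
    total, count = 0.0, 0
    # Position t predicts token t + 1, so logits and labels are shifted by one.
    for step_logits, label in zip(logits[:-1], labels[1:]):
        if label == IGNORE_INDEX:
            continue
        # Numerically stable log-softmax evaluated at the label's logit.
        peak = max(step_logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in step_logits))
        total += log_z - step_logits[label]
        count += 1
    return total / count if count else 0.0

# Uniform logits over a 4-token vocabulary: every counted position costs ln(4).
logits = [[0.0] * 4 for _ in range(4)]
loss = masked_lm_loss(logits, input_ids=[1, 3, 2, 0], target_mask=[0, 0, 1, 1])
print(loss)
```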

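For QLoRA, we insert adapters into all fully connected layers except embedding and lm_head, with lora_rank set to 64, lora_alpha to 16 and lora_dropout to 0.05. The final number of trainable parameters is about 250 million.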
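The figure of roughly 250 million trainable parameters can be sanity-checked with a short calculation. The sketch below assumes the standard LLaMA-13B shapes (hidden size 5120, MLP size 13824, 40 decoder layers, seven linear layers per layer), which are not stated in this article, and counts the two low-rank matrices a LoRA adapter adds to each linear layer.

```python
# Assumed LLaMA-13B shapes (not stated in the article):
# hidden size 5120, MLP intermediate size 13824, 40 decoder layers.
HIDDEN, INTERMEDIATE, NUM_LAYERS, LORA_RANK = 5120, 13824, 40, 64

# (in_features, out_features) of each linear layer in one decoder layer.
LINEAR_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """A LoRA adapter adds A (rank x in) and B (out x rank)."""
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, LORA_RANK) for i, o in LINEAR_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{total:,} trainable LoRA parameters")
```

With these assumed shapes the count comes to 250,347,520, which is consistent with the figure quoted above.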

The training hyperparameters are as follows:

max length: 1024
lr_scheduler_type: constant_with_warmup
batch size: 64
lr: 1e-4
warmup steps: 3000
optimizer: paged_adamw_32bit
training steps: 23k
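To illustrate what these settings imply, the sketch below models a constant-with-warmup schedule as a linear ramp from 0 to the base learning rate followed by a flat phase, and checks how many samples 23k steps at batch size 64 would cover. The helper `lr_at_step` is hypothetical, the linear ramp is an assumption about how the scheduler behaves, and 23k is a rounded figure, so the sample count is approximate.

```python
def lr_at_step(step: int, base_lr: float = 1e-4, warmup_steps: int = 3000) -> float:
    """Linear warmup from 0 to base_lr, then constant (assumed schedule shape)."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr

for step in (0, 1500, 3000, 23000):
    print(step, lr_at_step(step))

# Samples seen over the whole run, assuming the batch size counts conversations.
samples_seen = 23_000 * 64
print(samples_seen)
```

Under that assumption the run covers about 1.47 million samples, in line with a single epoch over the roughly 1.4 million UltraChat conversations.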

The figure below shows how the training loss changes over the course of training.

[Figure: training loss curve]

03

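Comparison of results

We tested with some of the prompts from the Vicuna website to compare the outputs of vicuna-13b and firefly-llama-13b.

Note: the vicuna-13b results are the samples shown on its official website.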

Example 1: [Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions.]

[Images: generated responses from the two models]

Example 2: [What if the Black Death had not occurred in the 14th century?]

[Images: generated responses from the two models]

Example 3: [What if Isaac Newton had focused on biology instead of physics?]

[Images: generated responses from the two models]

Example 4: [What are the potential effects on the environment and human health of using single-use plastic bottles instead of reusable bottles?]

[Images: generated responses from the two models]

We will add more comparison results to a shared document.

Judging from these generated examples, firefly-llama-13b performs comparably to vicuna-13b.

04

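Conclusion

This time we open-sourced firefly-llama-13b. Its results on the Open LLM leaderboard are very close to those of vicuna-13b and Llama-2-13b-chat, which validates the effectiveness of Firefly's QLoRA training pipeline.

As soon as the QLoRA work was released, the Firefly project began studying and exploring QLoRA training. Through continuous iteration, the project has come to support fine-tuning of more and more models and has open-sourced several Chinese models, but it has always lacked an evaluation step. This evaluation fills that gap and confirms that the training pipeline works.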


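Copyright notice: This is an original article by the author, licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.
Article link: https://blog.csdn.net/qq_27590277/article/details/131874928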
