AutoGPTQ

An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.

Note: The English README is likely to be more up to date.

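The Path to v1.0.0

Hi, fellows in the community, long time no see! I'm sorry that, for personal reasons, I haven't been able to update this project as frequently as I would have liked. The past few weeks have been very significant for my career planning. Not long ago I formally left the startup team that I joined right after graduation and worked with for two years. I'm very grateful to the team's leaders and colleagues for their trust and guidance, which let me grow rapidly over those two years. I'm also very grateful that the team allowed me to use its internal A100 GPU server cluster free of charge, ever since the AutoGPTQ project was started, to run experiments and performance evaluations. (Of course I can no longer use it from now on, so I would be very grateful for any new hardware sponsorship!) Over the past two years I worked as an algorithm engineer in this team, responsible for the architecture design and development of LLM-based dialogue systems. We once successfully launched a product called gemsouls, but unfortunately it has ceased operation. Now the team is about to launch a new product called modelize, an LLM-native AI agent platform where users can build a highly automated team out of multiple AI agents that cooperate in workflows to complete complex projects efficiently.

Back to the topic: I'm very excited to see that research on optimizing LLM inference performance has made tremendous progress over the past few months. Today we can run LLM inference not only on high-end GPUs but also, with ease, on CPUs and edge devices. This series of technical advances makes me eager to contribute more to the open-source community. So, first of all, I will spend about four weeks iterating AutoGPTQ to the official v1.0.0 release; during this period there will also be 2~3 minor releases so that users can try performance optimizations and new features in time. In my vision, by the time v1.0.0 is released, AutoGPTQ will be able to serve as a flexible and extensible quantization backend that supports all GPTQ-like methods and automatically quantizes LLMs written in PyTorch. I have described the development plan in detail here; feel free to head over there to discuss it and give your suggestions!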

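News or Update

2023-08-23 - (News) - 🤗 Transformers, optimum and peft have integrated auto-gptq, so running inference with and training GPTQ models is now easier! Read this blog post and the related resources for more details!
2023-08-21 - (News) - The Qwen team released a 4-bit quantized version of Qwen-7B based on auto-gptq and provided detailed benchmark results.
2023-08-06 - (Update) - Support exllama's q4 CUDA kernel, giving int4 quantized models at least a 1.3x inference speedup.
2023-08-04 - (Update) - Support ROCm so that AMD GPU users can use auto-gptq's CUDA extensions.
2023-07-26 - (Update) - An elegant PPL benchmark script that produces results which can be fairly compared with other libraries such as llama.cpp.
2023-06-05 - (Update) - Integrate 🤗 peft to train adapters on GPTQ-quantized models; LoRA, AdaLoRA, AdaptionPrompt, etc. are supported.
2023-05-30 - (Update) - Support downloading quantized models from the 🤗 Hub and uploading quantized models to the 🤗 Hub.

For more history, please go here.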

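Performance Comparison

Inference Speed

The results below were generated with this script, using an input batch size of 1, beam search as the decoding strategy, and forcing the model to generate 512 tokens. Speed is measured in tokens/s (higher is better).

The quantized model is loaded in the way that maximizes inference speed.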

model	GPU	num_beams	fp16	gptq-int4
llama-7b	1xA100-40G	1	18.87	25.53
llama-7b	1xA100-40G	4	68.79	91.30
moss-moon 16b	1xA100-40G	1	12.48	15.25
moss-moon 16b	1xA100-40G	4	OOM	42.67
moss-moon 16b	2xA100-40G	1	06.83	06.78
moss-moon 16b	2xA100-40G	4	13.10	10.80
gpt-j 6b	1xRTX3060-12G	1	OOM	29.55
gpt-j 6b	1xRTX3060-12G	4	OOM	47.36
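
To read the table as relative speedups, the throughput figures can simply be divided. The following is a minimal sketch that re-uses only a few rows from the table above; the helper name `speedup` is made up for illustration, and OOM entries are represented as `None`.

```python
from typing import Optional

# (model, GPU, num_beams) -> (fp16 tokens/s, gptq-int4 tokens/s); None marks an OOM run.
RESULTS = {
    ("llama-7b", "1xA100-40G", 1): (18.87, 25.53),
    ("llama-7b", "1xA100-40G", 4): (68.79, 91.30),
    ("moss-moon 16b", "1xA100-40G", 4): (None, 42.67),
    ("moss-moon 16b", "2xA100-40G", 4): (13.10, 10.80),
}


def speedup(fp16: Optional[float], int4: Optional[float]) -> Optional[float]:
    """Return int4 throughput relative to fp16, or None if either run went OOM."""
    if fp16 is None or int4 is None:
        return None
    return int4 / fp16


for key, (fp16, int4) in RESULTS.items():
    ratio = speedup(fp16, int4)
    label = "n/a (OOM)" if ratio is None else f"{ratio:.2f}x"
    print(key, label)
```

A ratio below 1, as in the two-GPU moss-moon row, means the quantized model was slower than fp16 in that configuration.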

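Perplexity (PPL)

For a perplexity comparison, you can refer to here and here.

Installation

Quick Installation

You can install the latest stable release of AutoGPTQ from pip, with pre-built wheels compatible with PyTorch 2.0.1: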

For CUDA 11.7: pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu117/
For CUDA 11.8: pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/
For ROCm 5.4.2: pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/rocm542/
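
If the install command is generated by a setup script, the choice of index can be expressed as a small lookup. This is only a sketch: the URLs are copied from the commands above, while the `WHEEL_INDEXES` table and the `pip_command` helper are hypothetical names introduced for illustration.

```python
# Index URLs copied from the install commands above, keyed by backend and version.
WHEEL_INDEXES = {
    ("cuda", "11.7"): "https://huggingface.github.io/autogptq-index/whl/cu117/",
    ("cuda", "11.8"): "https://huggingface.github.io/autogptq-index/whl/cu118/",
    ("rocm", "5.4.2"): "https://huggingface.github.io/autogptq-index/whl/rocm542/",
}


def pip_command(backend: str, version: str) -> str:
    """Build the pip command for a listed backend/version, or raise if unlisted."""
    try:
        index_url = WHEEL_INDEXES[(backend, version)]
    except KeyError:
        raise ValueError(f"no pre-built wheel index listed for {backend} {version}") from None
    return f"pip install auto-gptq --extra-index-url {index_url}"


print(pip_command("cuda", "11.8"))
```

Combinations that are not listed raise an error rather than guessing a URL; in that case installing from source is the documented route.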

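Warning: the pre-built wheels may not work with PyTorch nightly builds. To use PyTorch nightly, install AutoGPTQ from source.

Disable CUDA extensions

By default, the CUDA extensions are installed automatically when torch and CUDA are already installed on your machine. If you don't want these extensions, use the following install command: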

BUILD_CUDA_EXT=0 pip install auto-gptq

To make sure that the extension, autogptq_cuda, is no longer present in your virtual environment, also run:

pip uninstall autogptq_cuda -y
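
To confirm the result from Python, one option is to ask the import system whether the module can still be found. This is a minimal sketch that assumes the extension is importable under the name `autogptq_cuda` used in this section; `is_module_installed` is a hypothetical helper name.

```python
import importlib.util


def is_module_installed(name: str) -> bool:
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(name) is not None


# Assumption: the extension is importable under the name used in this section.
print("autogptq_cuda installed:", is_module_installed("autogptq_cuda"))
```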

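Use triton for acceleration

To use triton to speed up model inference, use the following command:

Warning: triton currently supports Linux only, and 3-bit quantization is not supported when using triton.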

pip install auto-gptq[triton]
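
The two constraints in the warning can be checked before deciding whether to enable triton. The sketch below encodes only those two rules; `can_use_triton` is a hypothetical helper, not part of auto_gptq.

```python
import sys


def can_use_triton(bits: int, platform: str = sys.platform) -> bool:
    """Apply the two constraints from the warning above: Linux only, no 3-bit."""
    return platform.startswith("linux") and bits != 3


print(can_use_triton(bits=4))
```

The result could then be passed as the `use_triton` argument that `from_quantized` accepts in the quick tour example.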

Install from source

Clone the source code:

git clone https://github.com/PanQiWei/AutoGPTQ.git && cd AutoGPTQ

Then, install from the project directory:

pip install .

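As in the Quick Installation section, you can set BUILD_CUDA_EXT=0 to skip building the CUDA extensions.

If you want to use triton and your operating system supports it, install with .[triton] instead.

For AMD GPUs, set the ROCM_VERSION environment variable to install from source with ROCm support. Setting PYTORCH_ROCM_ARCH (reference) can also speed up compilation; for example, it can be set to gfx90a for MI200-series devices. Example: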

ROCM_VERSION=5.6 pip install .

On ROCm systems, the following packages must be installed before building from source: rocsparse-dev, hipsparse-dev, rocthrust-dev, rocblas-dev and hipblas-dev.
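
When the build is driven from a script, both variables can be set together in the environment passed to pip. The following is a sketch under that assumption; `rocm_build_env` is a hypothetical helper, and the version and architecture values are the ones used as examples above.

```python
import os
from typing import Dict, Optional


def rocm_build_env(rocm_version: str, arch: Optional[str] = None,
                   base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with the ROCm build variables set."""
    env = dict(os.environ if base is None else base)
    env["ROCM_VERSION"] = rocm_version
    if arch is not None:
        env["PYTORCH_ROCM_ARCH"] = arch
    return env


env = rocm_build_env("5.6", arch="gfx90a")
print({k: env[k] for k in ("ROCM_VERSION", "PYTORCH_ROCM_ARCH")})
# To run the build with this environment (not executed here):
# subprocess.run([sys.executable, "-m", "pip", "install", "."], env=env, check=True)
```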

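Quick Tour

Quantization and Inference

Warning: this is only a demonstration of the basic interfaces in AutoGPTQ. It uses a single text sample to quantize a very small model, so the result may not be as good as what you would expect from quantizing a large model.

Below is the simplest usage of auto_gptq to quantize a model and run inference: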

from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig


pretrained_model_dir = "facebook/opt-125m"
quantized_model_dir = "opt-125m-4bit"


tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
examples = [
    tokenizer(
        "auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."
    )
]

quantize_config = BaseQuantizeConfig(
    bits=4,  # quantize the model to 4-bit
    group_size=128,  # it is generally recommended to set this value to 128
    desc_act=False,  # False can significantly speed up inference, but perplexity may be slightly worse
)

# load the un-quantized model; by default the model is always loaded into CPU memory
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)

# quantize the model; examples should be a List[Dict] whose dicts have exactly the keys input_ids and attention_mask
model.quantize(examples)

# save the quantized model
model.save_quantized(quantized_model_dir)

# save the quantized model using safetensors
model.save_quantized(quantized_model_dir, use_safetensors=True)

# push the quantized model directly to the Hugging Face Hub
# when using use_auth_token=True, make sure you have logged in first with huggingface-cli login
# or pass use_auth_token="hf_xxxxxxx" to provide the account token explicitly
# (uncomment the following three lines to enable this feature)
# repo_id = f"YourUserName/{quantized_model_dir}"
# commit_message = f"AutoGPTQ model for {pretrained_model_dir}: {quantize_config.bits}bits, gr{quantize_config.group_size}, desc_act={quantize_config.desc_act}"
# model.push_to_hub(repo_id, commit_message=commit_message, use_auth_token=True)

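# alternatively, save the quantized model locally and push it to the Hugging Face Hub at the same time
# (uncomment the following three lines to enable this feature)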
# repo_id = f"YourUserName/{quantized_model_dir}"
# commit_message = f"AutoGPTQ model for {pretrained_model_dir}: {quantize_config.bits}bits, gr{quantize_config.group_size}, desc_act={quantize_config.desc_act}"
# model.push_to_hub(repo_id, save_dir=quantized_model_dir, use_safetensors=True, commit_message=commit_message, use_auth_token=True)

# load the quantized model onto the first visible GPU
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0")

# download the quantized model from the Hugging Face Hub and load it onto the first visible GPU
# model = AutoGPTQForCausalLM.from_quantized(repo_id, device="cuda:0", use_safetensors=True, use_triton=False)

# run inference with model.generate
print(tokenizer.decode(model.generate(**tokenizer("auto_gptq is", return_tensors="pt").to(model.device))[0]))

# or use TextGenerationPipeline
pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer)
print(pipeline("auto-gptq is")[0]["generated_text"])

See this example script for more advanced usage.
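
Since the quantization samples must be a list of dicts with exactly the keys input_ids and attention_mask, it can help to check them before starting a long quantization run. The sketch below is one way to do that; `check_examples` is a hypothetical helper, and the equal-length check is an extra sanity check that this section does not itself require.

```python
from typing import Any, Dict, List

REQUIRED_KEYS = {"input_ids", "attention_mask"}


def check_examples(examples: List[Dict[str, Any]]) -> None:
    """Raise ValueError unless every example has exactly the two expected keys
    and the two sequences have the same length."""
    for i, example in enumerate(examples):
        if set(example.keys()) != REQUIRED_KEYS:
            raise ValueError(f"example {i} has keys {sorted(example.keys())}")
        if len(example["input_ids"]) != len(example["attention_mask"]):
            raise ValueError(f"example {i}: input_ids and attention_mask differ in length")


# Hand-written token ids, used here only to exercise the check.
check_examples([{"input_ids": [2, 100, 7], "attention_mask": [1, 1, 1]}])
```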

Customize Model

Below is an example of extending `auto_gptq` to support the `OPT` model; as you will see, it is very easy:

from auto_gptq.modeling import BaseGPTQForCausalLM


class OPTGPTQForCausalLM(BaseGPTQForCausalLM):
    # chained attribute name of transformer layer block
    layers_block_name = "model.decoder.layers"
    # chained attribute names of other nn modules that in the same level as the transformer layer block
    outside_layer_modules = [
        "model.decoder.embed_tokens", "model.decoder.embed_positions", "model.decoder.project_out",
        "model.decoder.project_in", "model.decoder.final_layer_norm"
    ]
    # chained attribute names of linear layers in transformer layer module
    # normally, there are four sub lists, for each one the modules in it can be seen as one operation, 
    # and the order should be the order when they are truly executed, in this case (and usually in most cases), 
    # they are: attention q_k_v projection, attention output projection, MLP project input, MLP project output
    inside_layer_modules = [
        ["self_attn.k_proj", "self_attn.v_proj", "self_attn.q_proj"],
        ["self_attn.out_proj"],
        ["fc1"],
        ["fc2"]
    ]

Then you can use OPTGPTQForCausalLM.from_pretrained and the other methods as shown in the Quantization and Inference section above.
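
The "chained attribute names" in the class above are dotted paths that are followed from the model object down to a submodule. The sketch below illustrates that idea on a toy object; it is not the library's own implementation, and `resolve` is a hypothetical helper.

```python
from functools import reduce
from types import SimpleNamespace


def resolve(root, dotted_name: str):
    """Follow a dotted attribute path, e.g. "model.decoder.layers", from root."""
    return reduce(getattr, dotted_name.split("."), root)


# Toy stand-in for a model object; a real model would hold nn.Module instances.
toy = SimpleNamespace(
    model=SimpleNamespace(decoder=SimpleNamespace(layers=["layer0", "layer1"]))
)
print(resolve(toy, "model.decoder.layers"))
```

Running the same kind of lookup on a real model is a quick way to check that the names written into a new subclass actually exist.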

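Evaluation on Downstream Tasks

You can use the tasks defined in auto_gptq.eval_tasks to evaluate a model's performance on a specific downstream task before and after quantization.

These predefined tasks support all causal language models implemented in 🤗 transformers and in this project.

Below is an example of evaluating `EleutherAI/gpt-j-6b` on a sequence classification (text classification) task with the `cardiffnlp/tweet_sentiment_multilingual` dataset: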

from functools import partial

import datasets
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from auto_gptq.eval_tasks import SequenceClassificationTask


MODEL = "EleutherAI/gpt-j-6b"
DATASET = "cardiffnlp/tweet_sentiment_multilingual"
TEMPLATE = "Question:What's the sentiment of the given text? Choices are {labels}.\nText: {text}\nAnswer:"
ID2LABEL = {
    0: "negative",
    1: "neutral",
    2: "positive"
}
LABELS = list(ID2LABEL.values())


def ds_refactor_fn(samples):
    text_data = samples["text"]
    label_data = samples["label"]

    new_samples = {"prompt": [], "label": []}
    for text, label in zip(text_data, label_data):
        prompt = TEMPLATE.format(labels=LABELS, text=text)
        new_samples["prompt"].append(prompt)
        new_samples["label"].append(ID2LABEL[label])

    return new_samples


#  model = AutoModelForCausalLM.from_pretrained(MODEL).eval().half().to("cuda:0")
model = AutoGPTQForCausalLM.from_pretrained(MODEL, BaseQuantizeConfig())
tokenizer = AutoTokenizer.from_pretrained(MODEL)

task = SequenceClassificationTask(
        model=model,
        tokenizer=tokenizer,
        classes=LABELS,
        data_name_or_path=DATASET,
        prompt_col_name="prompt",
        label_col_name="label",
        **{
            "num_samples": 1000,  # how many samples will be drawn for evaluation
            "sample_max_len": 1024,  # max tokens for each sample
            "block_max_len": 2048,  # max tokens for each data block
            # function to load the dataset; it must accept only data_name_or_path as input
            # and return a datasets.Dataset
            "load_fn": partial(datasets.load_dataset, name="english"),
            # function to preprocess the dataset, used with datasets.Dataset.map;
            # it must return Dict[str, list] with only two keys: [prompt_col_name, label_col_name]
            "preprocess_fn": ds_refactor_fn,
            # truncate label when sample's length exceeds sample_max_len
            "truncate_prompt": False
        }
    )

# note that max_new_tokens will be automatically specified internally based on given classes
print(task.run())

# self-consistency
print(
    task.run(
        generation_config=GenerationConfig(
            num_beams=3,
            num_return_sequences=3,
            do_sample=True
        )
    )
)
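
To see what the preprocessing function produces without downloading the dataset, it can be run on a hand-written batch. This sketch repeats the template and the function from the example above so that it is self-contained; the two texts are made up.

```python
TEMPLATE = "Question:What's the sentiment of the given text? Choices are {labels}.\nText: {text}\nAnswer:"
ID2LABEL = {0: "negative", 1: "neutral", 2: "positive"}
LABELS = list(ID2LABEL.values())


def ds_refactor_fn(samples):
    """Same logic as above: map a batch of columns to prompt/label columns."""
    new_samples = {"prompt": [], "label": []}
    for text, label in zip(samples["text"], samples["label"]):
        new_samples["prompt"].append(TEMPLATE.format(labels=LABELS, text=text))
        new_samples["label"].append(ID2LABEL[label])
    return new_samples


# Hand-written batch in the column-oriented layout used for batched mapping.
batch = {"text": ["I love this!", "Meh."], "label": [2, 1]}
out = ds_refactor_fn(batch)
print(out["prompt"][0])
print(out["label"])
```

Each prompt ends with "Answer:", so the model's continuation is what gets compared against the label strings.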

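Learn More

The tutorials provide step-by-step guidance and best practices for integrating auto_gptq into your own project.

The examples provide plenty of example scripts for using auto_gptq in different domains.

Supported Models

You can compare model.config.model_type against the table below to check whether the model you are using is supported by auto_gptq.

For example, the model_type of WizardLM, vicuna and gpt4all is llama in each case, so all of these models are supported by auto_gptq.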

model type	quantization	inference	peft-lora	peft-ada-lora	peft-adaption_prompt
bloom	✅	✅	✅	✅
gpt2	✅	✅	✅	✅
gpt_neox	✅	✅	✅	✅	✅ (requires the peft from this branch)
gptj	✅	✅	✅	✅	✅ (requires the peft from this branch)
llama	✅	✅	✅	✅	✅
moss	✅	✅	✅	✅	✅ (requires the peft from this branch)
opt	✅	✅	✅	✅
gpt_bigcode	✅	✅	✅	✅
codegen	✅	✅	✅	✅
falcon(RefinedWebModel/RefinedWeb)	✅	✅	✅	✅
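
The lookup described above can be written as a simple membership test. In this sketch the set is transcribed from the first column of the table (with the three falcon spellings listed separately), and `is_supported` is a hypothetical helper rather than an auto_gptq function.

```python
# model_type values taken from the first column of the table above.
SUPPORTED_MODEL_TYPES = {
    "bloom", "gpt2", "gpt_neox", "gptj", "llama", "moss", "opt",
    "gpt_bigcode", "codegen", "falcon", "RefinedWebModel", "RefinedWeb",
}


def is_supported(model_type: str) -> bool:
    """Return True if the given model_type appears in the table."""
    return model_type in SUPPORTED_MODEL_TYPES


print(is_supported("llama"))
```

With a loaded model, the argument would be `model.config.model_type`.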

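Supported Evaluation Tasks

Currently, auto_gptq supports the following evaluation tasks: LanguageModelingTask, SequenceClassificationTask and TextSummarizationTask; more evaluation tasks are coming soon!

Acknowledgement

Special thanks to Elias Frantar, Saleh Ashkboos, Torsten Hoefler and Dan Alistarh for proposing the GPTQ algorithm and open-sourcing the code.
Special thanks to qwopqwop200; the quantization-related code in this project is mainly based on GPTQ-for-LLaMa.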
