A Step-by-Step Guide to Deploying Tsinghua University KEG's ChatGLM-6B Model Locally (CPU/GPU)

HZAAAAAAA, published 2023-04-27 from Guangdong

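ChatGLM-6B is an open-source, ChatGPT-style conversational model released by the Knowledge Engineering and Data Mining group (KEG) at Tsinghua University. It was trained on roughly 1T tokens of Chinese and English text, most of it Chinese, which makes it well suited to users in China.

This article comes from the official DataLearner blog: A Step-by-Step Guide to Deploying Tsinghua University KEG's ChatGLM-6B Model Locally: Windows + 6 GB GPU and CPU Versions | DataLearner official website

ChatGLM-6B's model card on DataLearner: ChatGLM-6B (ChatGLM-6B) details | DataLearner

According to the project's GitHub repository, the full ChatGLM-6B model needs 13 GB of GPU memory for inference, while the INT4-quantized version runs in just 6 GB, which makes it very friendly to local deployment on personal hardware. Unfortunately, the official documentation leaves out some details, and people run into many problems when deploying locally. This article records in detail how to deploy and use ChatGLM-6B on Windows with either a GPU or a CPU, and how to avoid those problems.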

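Notes before installation

Setting up the environment before deployment

1. Download the official code and install the Python dependencies

2. Download the INT4-quantized pretrained weights

Windows + GPU deployment

1. Prerequisites for Windows + GPU

2. Running the INT4-quantized ChatGLM-6B model on GPU

Windows + CPU deployment

1. Prerequisites for Windows + CPU

2. Running the INT4-quantized ChatGLM-6B model on CPU

Summary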

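Notes before installation

Although the ChatGLM-6B GitHub repository provides an installation and deployment tutorial, the code, pretrained model, and configuration files it relies on are not all hosted in one place, so newcomers can easily run into all kinds of errors.

In addition, most people probably have a GPU with limited memory, or only a CPU, and can therefore deploy only a quantized version of the model, for which the steps differ somewhat.

Finally, deploying ChatGLM-6B currently involves downloads from GitHub, HuggingFace, and Tsinghua Cloud. The sections below explain each step in detail.

Setting up the environment before deployment

Before deploying ChatGLM-6B, we need to set up the runtime environment. The following two steps are required whether you deploy the CPU or the GPU version.

1. Download the official code and install the Python dependencies

First, we need the requirements.txt file from the ChatGLM GitHub repository to install the dependencies. That file is the only thing you need from GitHub. Download address: https://github.com/THUDM/ChatGLM-6B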

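The original post shows a screenshot of the file in the repository at this point.

This file lists the Python libraries ChatGLM-6B depends on and their versions. Click Code in the upper right of the repository page, choose Download ZIP, and unzip the archive locally to get the file. Then run the following command: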

pip install -r requirements.txt

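Note that this command is run from cmd inside the directory containing requirements.txt. This is basic Python usage, so I will not go into it here.

Also note that ChatGLM depends on HuggingFace's transformers library. The official documentation says:

Install the dependencies with pip: pip install -r requirements.txt. The recommended transformers version is 4.27.1, but in theory any version no lower than 4.23.1 should work.

In practice, however, version 4.27.1 or later is required. Older versions of transformers fail with the following error: AttributeError: 'Logger' object has no attribute 'warning_once'

So be sure to check that your transformers version is correct.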
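One way to run that check from Python is sketched below. It is only an illustration: the helper names version_tuple and meets_minimum are my own, and the comparison looks at the first three numeric components of the version string only.

```python
from importlib import metadata


def version_tuple(version: str) -> tuple:
    """Turn '4.27.1' into (4, 27, 1); non-numeric suffixes such as 'dev0' are ignored."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "4.27.1") -> bool:
    """True if the installed version is at least the minimum this article found to work."""
    return version_tuple(installed) >= version_tuple(minimum)


try:
    installed = metadata.version("transformers")
    status = "OK" if meets_minimum(installed) else "too old, upgrade to 4.27.1 or later"
    print("transformers", installed, status)
except metadata.PackageNotFoundError:
    print("transformers is not installed")
```

If the script reports a version that is too old, upgrading transformers with pip and rerunning it should confirm the fix.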

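ChatGLM-6B also depends on torch. If you have a GPU with 6 GB of memory or more, deploying the GPU version is recommended, but you need to install a CUDA-enabled build of torch rather than the default CPU build.

2. Download the INT4-quantized pretrained weights

Once the dependencies above are installed, the next step is to download the pretrained weights.

Download address for the INT4-quantized pretrained files: https:///THUDM/chatglm-6b-int4/tree/main

Note that the GitHub page also gives an official Tsinghua Cloud download link for the model, but that link contains only the pretrained weight files, that is, the .bin files. Running ChatGLM-6B also requires the model's configuration files, such as config.json.

We therefore recommend downloading all of the files from HuggingFace. Save them together in one local directory; ours is D:\\data\\llm\\chatglm-6b-int4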
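Before moving on, it can help to confirm that the download directory really contains both kinds of files. The sketch below is a minimal check under my own assumptions: the function name check_model_dir is hypothetical, and it looks only for config.json and at least one .bin file, not for every file the model needs.

```python
from pathlib import Path


def check_model_dir(model_dir: str) -> list:
    """Return a list of problems found in the download directory (empty if none were found)."""
    root = Path(model_dir)
    if not root.is_dir():
        return [f"directory not found: {root}"]
    problems = []
    if not (root / "config.json").is_file():
        problems.append("config.json is missing")
    if not any(root.glob("*.bin")):
        problems.append("no .bin weight file found")
    return problems


# The directory used in this article; change it to wherever you saved the files.
for problem in check_model_dir("D:\\data\\llm\\chatglm-6b-int4"):
    print(problem)
```

An empty result only means these two checks passed; it does not prove the download is complete.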

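Windows + GPU deployment

1. Prerequisites for Windows + GPU

Deploying the GPU version of ChatGLM-6B requires a CUDA build of torch. You can check whether your torch installation is the right one with the following Python code:

import torch
print(torch.cuda.is_available())

If the code prints True, congratulations: you have a CUDA build of torch installed. (Note that having a graphics card is not enough; CUDA and cuDNN must also be downloaded and installed successfully, and tutorials for that are easy to find online.)

Note that three versions of ChatGLM-6B are currently available: the unquantized version needs 13 GB of GPU memory for inference, the INT8-quantized version needs 8 GB, and the INT4-quantized version needs 6 GB.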
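A rough back-of-the-envelope calculation shows where these numbers come from. The sketch below counts the memory for the weights alone and assumes a round 6 billion parameters, which is my own simplification of "6B". It gives about 12, 6, and 3 GB for FP16, INT8, and INT4; the published requirements of 13, 8, and 6 GB are higher, presumably because inference also needs memory for activations and other runtime buffers.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Memory needed for the weights alone, in GB (1 GB taken as 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS = 6e9  # assumption: "6B" treated as a round 6 billion parameters

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: about {weight_memory_gb(PARAMS, bits):.0f} GB for weights")
```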

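Quantization costs some model quality, but according to testing, ChatGLM-6B still generates natural, fluent text under 4-bit quantization.

This machine has only 6 GB of GPU memory, so the INT4 version is the only option.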
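The same decision can be sketched in code. The thresholds below are the 13 GB, 8 GB, and 6 GB figures quoted above; the function name pick_variant and its return labels are my own, and the detected memory is rounded because a card may report slightly less than its nominal size.

```python
def pick_variant(vram_gb: float) -> str:
    """Map available GPU memory (GB) to the largest ChatGLM-6B variant that should fit."""
    if vram_gb >= 13:
        return "fp16"
    if vram_gb >= 8:
        return "int8"
    if vram_gb >= 6:
        return "int4"
    return "int4-cpu"  # too little GPU memory: fall back to the CPU route


try:
    import torch

    if torch.cuda.is_available():
        total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
        print(f"GPU 0: {total_gb:.1f} GB, suggested variant: {pick_variant(round(total_gb))}")
    else:
        print("No CUDA device visible, suggested variant:", pick_variant(0))
except ImportError:
    print("torch is not installed")
```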

2. Running the INT4-quantized ChatGLM-6B model on GPU

Deploying the GPU version is simple; once the two steps above are done, the model is ready to run. The code is as follows:

from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('D:\\data\\llm\\chatglm-6b-int4', trust_remote_code=True, revision='')
model = AutoModel.from_pretrained('D:\\data\\llm\\chatglm-6b-int4', trust_remote_code=True, revision='').half().cuda()
model = model.eval()
response, history = model.chat(tokenizer, '你好', history=[])
print(response)

A few points here need explanation.

First, all paths here are written in the form D:\\data\\llm\\chatglm-6b-int4, that is, with \\ as the separator, and not as D:/data/llm/chatglm-6b-int4. Otherwise you may get the following error:

>>> tokenizer = AutoTokenizer.from_pretrained('D:\\OneDrive\\Programs\\llm\\chatglm-6b-int4', trust_remote_code=True)
Explicitly passing a `revision` is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
>>> model = AutoModel.from_pretrained('D:/OneDrive/Programs/llm/chatglm-6b-int4', trust_remote_code=True).half().cuda()
Explicitly passing a `revision` is encouraged when loading a configuration with custom code to ensure no malicious code has been contributed in a newer revision.
Traceback (most recent call last):
  File '', line 1, in
  File 'C:\Users\DataLearner\AppData\Local\Programs\Python\Python39\lib\site-packages\transformers\models\auto\auto_factory.py', line 441, in from_pretrained
    config, kwargs = AutoConfig.from_pretrained(
  File 'C:\Users\DataLearner\AppData\Local\Programs\Python\Python39\lib\site-packages\transformers\models\auto\configuration_auto.py', line 911, in from_pretrained
    config_class = get_class_from_dynamic_module(
  File 'C:\Users\DataLearner\AppData\Local\Programs\Python\Python39\lib\site-packages\transformers\dynamic_module_utils.py', line 388, in get_class_from_dynamic_module
    final_module = get_cached_module_file(
  File 'C:\Users\DataLearner\AppData\Local\Programs\Python\Python39\lib\site-packages\transformers\dynamic_module_utils.py', line 273, in get_cached_module_file
    create_dynamic_module(full_submodule)
  File 'C:\Users\DataLearner\AppData\Local\Programs\Python\Python39\lib\site-packages\transformers\dynamic_module_utils.py', line 59, in create_dynamic_module
    os.makedirs(dynamic_module_path, exist_ok=True)
  File 'C:\Users\DataLearner\AppData\Local\Programs\Python\Python39\lib\os.py', line 215, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File 'C:\Users\DataLearner\AppData\Local\Programs\Python\Python39\lib\os.py', line 215, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File 'C:\Users\DataLearner\AppData\Local\Programs\Python\Python39\lib\os.py', line 215, in makedirs
    makedirs(head, exist_ok=exist_ok)
  [Previous line repeated 1 more time]
  File 'C:\Users\DataLearner\AppData\Local\Programs\Python\Python39\lib\os.py', line 225, in makedirs
    mkdir(name, mode)
OSError: [WinError 123] The filename, directory name, or volume label syntax is incorrect.: 'C:\\Users\\DataLearner\\.cache\\huggingface\\modules\\transformers_modules\\D:'

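This is caused by the path separator on Windows: as the last line of the traceback shows, transformers ends up trying to create a cache directory named D:, which is not a valid directory name. Watch out for this!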
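If a path reaches your script with forward slashes, one way to sidestep the problem is to rewrite it before handing it to from_pretrained. The sketch below uses pathlib.PureWindowsPath from the standard library; the helper name to_windows_path is my own, and this is a workaround I am suggesting rather than something the ChatGLM-6B documentation prescribes.

```python
from pathlib import PureWindowsPath


def to_windows_path(path: str) -> str:
    """Rewrite a path with Windows separators, e.g. D:/a/b becomes D:\\a\\b."""
    return str(PureWindowsPath(path))


model_dir = to_windows_path("D:/data/llm/chatglm-6b-int4")
print(model_dir)
```

PureWindowsPath only manipulates the string, so the conversion behaves the same on any operating system and does not check whether the directory exists.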

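In addition, our code passes the revision='' argument, mainly to avoid the following warning: Explicitly passing a revision is encouraged when loading a configuration with custom code to ensure no malicious code has been contributed in a newer revision.

The warning does not affect execution, but it does clutter the output.

Following the steps above produces the model's reply (shown as a screenshot in the original post).

It works perfectly. In my tests the GPU version returns a result in roughly 1 to 2 seconds (not a rigorous measurement; I did not test complex inputs).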
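The call above sends a single message with an empty history. Since model.chat returns the updated history along with the response, a multi-turn conversation can be sketched by feeding that history back into the next call. In the sketch below, run_conversation is my own helper, and EchoModel is a hypothetical stand-in that only mimics the (response, history) return shape so the example runs without the real model.

```python
def run_conversation(model, tokenizer, prompts):
    """Send prompts one by one, passing the returned history into the next call."""
    history = []
    replies = []
    for prompt in prompts:
        response, history = model.chat(tokenizer, prompt, history=history)
        replies.append(response)
    return replies, history


class EchoModel:
    """Hypothetical stand-in for the real model; it just echoes the query."""

    def chat(self, tokenizer, query, history=None):
        history = list(history or [])
        response = f"echo: {query}"
        history.append((query, response))
        return response, history


replies, history = run_conversation(EchoModel(), tokenizer=None, prompts=["hello", "and again"])
print(replies)
print(len(history), "turns recorded")
```

With the real model, you would pass the model and tokenizer objects loaded earlier in place of the stand-in.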

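Windows + CPU deployment

1. Prerequisites for Windows + CPU

Deploying the CPU version of ChatGLM-6B is slightly more troublesome than the GPU version, mainly because a kernel has to be compiled.

Before starting, besides installing all of the Python dependencies in requirements.txt as described above, you only need the regular CPU build of torch.

Beyond that, the CPU version also requires a C/C++ build environment on your local Windows machine. TDM-GCC is recommended; download address: https://jmeubank./tdm-gcc/

Click the TDM-GCC 10.3.0 release on that page to download and install it. During installation, simply select the full installation. After installing, run "gcc -v" in cmd to check that it works.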
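The same check can be done from Python, which is handy if you want a deployment script to fail early. This is a minimal sketch: compiler_version is my own helper name, and it calls gcc --version rather than gcc -v because the former prints to standard output.

```python
import shutil
import subprocess


def compiler_version(name: str = "gcc"):
    """Return the first line of `<name> --version`, or None if the compiler is not on PATH."""
    exe = shutil.which(name)
    if exe is None:
        return None
    result = subprocess.run([exe, "--version"], capture_output=True, text=True)
    lines = result.stdout.splitlines()
    return lines[0] if lines else exe


print(compiler_version() or "gcc not found on PATH")
```

If the function returns None right after installing TDM-GCC, opening a new cmd window so that the updated PATH is picked up is worth trying first.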

TDM-GCC is needed mainly to compile two of the files downloaded earlier: quantization_kernels.c and quantization_kernels_parallel.c. If you see the following messages at run time:

No compiled kernel found.
Compiling kernels : C:\Users\DuFei\.cache\huggingface\modules\transformers_modules\chatglm-6b-int4\quantization_kernels_parallel.c
Compiling gcc -O3 -fPIC -pthread -fopenmp -std=c99 C:\Users\DuFei\.cache\huggingface\modules\transformers_modules\chatglm-6b-int4\quantization_kernels_parallel.c -shared -o C:\Users\DuFei\.cache\huggingface\modules\transformers_modules\chatglm-6b-int4\quantization_kernels_parallel.so
Kernels compiled : C:\Users\DuFei\.cache\huggingface\modules\transformers_modules\chatglm-6b-int4\quantization_kernels_parallel.so
Cannot load cpu kernel, don't use quantized model on cpu.
Using quantization cache
Applying quantization to glm layers

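then something went wrong with compiling these two files, and we need to compile them manually.

Open cmd in the D:\\data\\llm\\chatglm-6b-int4 directory downloaded above and run the following two compile commands:

gcc -fPIC -pthread -fopenmp -std=c99 quantization_kernels.c -shared -o quantization_kernels.so
gcc -fPIC -pthread -fopenmp -std=c99 quantization_kernels_parallel.c -shared -o quantization_kernels_parallel.so

A successful run is shown as a screenshot in the original post.

You should then see two new files in the D:\\data\\llm\\chatglm-6b-int4 directory: quantization_kernels_parallel.so and quantization_kernels.so. This means the compilation succeeded, and we can load the kernel manually later.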
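If you would rather drive the compilation from Python than type the commands by hand, the two gcc invocations can be sketched with subprocess. The flags are the ones given above; the function names compile_command and compile_kernels are my own, and compile_kernels assumes gcc is on PATH and both .c files are in the model directory.

```python
import subprocess
from pathlib import Path


def compile_command(source_file: str) -> list:
    """Build the gcc command used in this article for one kernel source file."""
    output = str(Path(source_file).with_suffix(".so"))
    return ["gcc", "-fPIC", "-pthread", "-fopenmp", "-std=c99",
            source_file, "-shared", "-o", output]


def compile_kernels(model_dir: str) -> None:
    """Compile both kernel sources inside model_dir; raises CalledProcessError if gcc fails."""
    for name in ("quantization_kernels.c", "quantization_kernels_parallel.c"):
        subprocess.run(compile_command(name), cwd=model_dir, check=True)


# Show the command without running it; call compile_kernels(model_dir) to actually compile.
print(" ".join(compile_command("quantization_kernels.c")))
```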

2. Running the INT4-quantized ChatGLM-6B model on CPU

The code for the CPU version of the quantized model differs slightly from the GPU version:

from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('D:\\data\\llm\\chatglm-6b-int4', trust_remote_code=True, revision='')
model = AutoModel.from_pretrained('D:\\data\\llm\\chatglm-6b-int4', trust_remote_code=True, revision='').float()
model = model.eval()
response, history = model.chat(tokenizer, '你好', history=[])
print(response)

Note that the only difference is the float() at the end of the third line:

model = AutoModel.from_pretrained('D:\\data\\llm\\chatglm-6b-int4', trust_remote_code=True, revision='').float()

The GPU version ends with .half().cuda(), whereas here it is .float().
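Because the two listings differ only in that final call chain, they can be folded into one loader with a flag. The sketch below is one possible arrangement, not the official API: place_model and load_chatglm are names I made up, and transformers is imported inside the function so the file can be loaded even where the library is missing.

```python
def place_model(model, use_gpu: bool):
    """Apply the precision/device calls from the two listings, then switch to eval mode."""
    model = model.half().cuda() if use_gpu else model.float()
    return model.eval()


def load_chatglm(model_dir: str, use_gpu: bool):
    """Load tokenizer and model from a local directory (needs transformers installed)."""
    from transformers import AutoTokenizer, AutoModel  # lazy import: only needed when called

    tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True, revision='')
    model = AutoModel.from_pretrained(model_dir, trust_remote_code=True, revision='')
    return tokenizer, place_model(model, use_gpu)
```

A call such as load_chatglm('D:\\data\\llm\\chatglm-6b-int4', use_gpu=False) would then reproduce the CPU listing, and use_gpu=True the GPU one.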

If running the CPU listing produces the following error:

The dtype of attention mask (torch.int64) is not bool
Traceback (most recent call last):
  File '', line 1, in
  File 'C:\Users\DuFei\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\utils\_contextlib.py', line 115, in decorate_context
    return func(*args, **kwargs)
  File 'C:\Users\DuFei/.cache\huggingface\modules\transformers_modules\chatglm-6b-int4\modeling_chatglm.py', line 1255, in chat
    outputs = self.generate(**inputs, **gen_kwargs)
  File 'C:\Users\DuFei\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\utils\_contextlib.py', line 115, in decorate_context
    return func(*args, **kwargs)
  File 'C:\Users\DuFei\AppData\Local\Programs\Python\Python39\lib\site-packages\transformers\generation\utils.py', line 1452, in generate
    return self.sample(
  File 'C:\Users\DuFei\AppData\Local\Programs\Python\Python39\lib\site-packages\transformers\generation\utils.py', line 2468, in sample
    outputs = self(
  File 'C:\Users\DuFei\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\nn\modules\module.py', line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File 'C:\Users\DuFei/.cache\huggingface\modules\transformers_modules\chatglm-6b-int4\modeling_chatglm.py', line 1160, in forward
    transformer_outputs = self.transformer(
  File 'C:\Users\DuFei\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\nn\modules\module.py', line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File 'C:\Users\DuFei/.cache\huggingface\modules\transformers_modules\chatglm-6b-int4\modeling_chatglm.py', line 973, in forward
    layer_ret = layer(
  File 'C:\Users\DuFei\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\nn\modules\module.py', line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File 'C:\Users\DuFei/.cache\huggingface\modules\transformers_modules\chatglm-6b-int4\modeling_chatglm.py', line 614, in forward
    attention_outputs = self.attention(
  File 'C:\Users\DuFei\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\nn\modules\module.py', line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File 'C:\Users\DuFei/.cache\huggingface\modules\transformers_modules\chatglm-6b-int4\modeling_chatglm.py', line 439, in forward
    mixed_raw_layer = self.query_key_value(hidden_states)
  File 'C:\Users\DuFei\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\nn\modules\module.py', line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File 'C:\Users\DuFei/.cache\huggingface\modules\transformers_modules\chatglm-6b-int4\quantization.py', line 338, in forward
    output = W8A16LinearCPU.apply(input, self.weight, self.weight_scale, self.weight_bit_width, self.quantization_cache)
  File 'C:\Users\DuFei\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\autograd\function.py', line 506, in apply
    return super().apply(*args, **kwargs) # type: ignore[misc]
  File 'C:\Users\DuFei/.cache\huggingface\modules\transformers_modules\chatglm-6b-int4\quantization.py', line 76, in forward
    weight = extract_weight_to_float(quant_w, scale_w, weight_bit_width, quantization_cache=quantization_cache)
  File 'C:\Users\DuFei/.cache\huggingface\modules\transformers_modules\chatglm-6b-int4\quantization.py', line 260, in extract_weight_to_float
    func = cpu_kernels.int4WeightExtractionFloat
AttributeError: 'NoneType' object has no attribute 'int4WeightExtractionFloat'

then the kernel compilation described earlier went wrong. You must perform the manual compilation above to obtain the two .so files and then load the kernel manually. The new code is:

from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('D:\\data\\llm\\chatglm-6b-int4', trust_remote_code=True, revision='')
model = AutoModel.from_pretrained('D:\\data\\llm\\chatglm-6b-int4', trust_remote_code=True, revision='').float()
model = model.quantize(bits=4, kernel_file='D:\\data\\llm\\chatglm-6b-int4\\quantization_kernels.so')
model = model.eval()
response, history = model.chat(tokenizer, '你好', history=[])
print(response)

Compared with the original code, there is one extra line, model = model.quantize(bits=4, kernel_file='D:\\data\\llm\\chatglm-6b-int4\\quantization_kernels.so'), which loads the kernel manually.
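To avoid passing a kernel path that does not exist, the manual-loading line can be guarded with a small file check. This is a sketch under my own assumptions: find_kernel_file and load_cpu_kernel are hypothetical helper names, and the only model method used is the quantize(bits=4, kernel_file=...) call shown above.

```python
from pathlib import Path


def find_kernel_file(model_dir: str, name: str = "quantization_kernels.so"):
    """Return the compiled kernel's path as a string, or None if it has not been built yet."""
    candidate = Path(model_dir) / name
    return str(candidate) if candidate.is_file() else None


def load_cpu_kernel(model, model_dir: str):
    """Call model.quantize with the compiled kernel, failing early if the file is missing."""
    kernel = find_kernel_file(model_dir)
    if kernel is None:
        raise FileNotFoundError("quantization_kernels.so not found; compile it first")
    return model.quantize(bits=4, kernel_file=kernel)
```

Failing early with a clear message is easier to diagnose than the long AttributeError traceback that appears when the kernel silently fails to load.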

You should then see the model's output (shown as a screenshot in the original post).