Self-Evaluation and Self-Evolution of Open-Source Large-Model Generative AI

taotao_2016 2024-10-06


ELHS Newsletter · Issue 12

Applying generative AI (GenAI) and learning health systems (LHS) to pave the way toward global health equity

Please visit the platform at https:///zh

Dear friends,

~~~~~~~~

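Last month was a productive one for academia, with many new GenAI studies published, including:

A UK survey found that 20% of general practitioners (GPs) use GenAI tools.
A large language model can accurately assess the clinical acuity of adult patients presenting to the emergency department.
A new pathology foundation model outperformed the best existing models in cancer diagnosis.
A GenAI chatbot can effectively deliver cancer genetic services to patients.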

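A research program on automatic AI error correction

ARPA-H, an agency of the U.S. Department of Health and Human Services, announced a new research program that will fund work under the name Performance and Reliability Evaluation for Continuous Modifications and Useability of Artificial Intelligence (PRECISE-AI). PRECISE-AI has a very ambitious goal: to develop new technology that can detect, in real clinical settings, when an AI tool may be drifting away from its underlying training data, and that can automatically correct such tools to keep them at peak performance.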
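To make "detecting drift away from the training data" concrete, here is a minimal sketch using the Population Stability Index (PSI), a common drift statistic. This is an illustrative stand-in, not the method PRECISE-AI prescribes; the function names, the single numeric feature, and the 0.2 alert threshold (a widely used rule of thumb) are all assumptions.

```python
import math
import random

def psi(baseline, live, bins=10):
    """Population Stability Index between two samples of one numeric feature."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the baseline is constant

    def proportions(sample):
        counts = [0] * bins
        for x in sample:
            # Clamp values outside the baseline range into the first or last bin.
            idx = min(max(int((x - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(sample), 1e-6) for c in counts]

    p, q = proportions(baseline), proportions(live)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))

def drifted(baseline, live, threshold=0.2):
    """Flag drift when PSI exceeds the (assumed) rule-of-thumb threshold."""
    return psi(baseline, live) > threshold

rng = random.Random(0)
training = [rng.gauss(50, 10) for _ in range(5000)]  # hypothetical feature, e.g. patient age
same = [rng.gauss(50, 10) for _ in range(5000)]      # live data from the same population
shifted = [rng.gauss(65, 10) for _ in range(5000)]   # live data from an older population

print("same population drifted:", drifted(training, same))
print("shifted population drifted:", drifted(training, shifted))
```

A monitor like this only raises an alarm; the automatic correction step that the program also targets (for example, recalibration or retraining) is outside this sketch.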

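Self-taught AI evaluators

To reduce the cost of model-based evaluation in large language model development, Meta proposed a new method that improves evaluators using only synthetic training data, with no human annotation. Without any labeled preference data, Meta's Self-Taught Evaluator raised the RewardBench score of a strong LLM (Llama3-70B-Instruct) from 75.4 to 88.3. It outperforms commonly used LLM judges such as GPT-4 and performs on par with top reward models trained on labeled examples.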
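As a rough illustration of what a score like this measures, the sketch below computes how often a judge picks the preferred response in a set of response pairs. It is a simplified stand-in: the real benchmark's aggregation may differ, and `pairwise_accuracy`, `length_judge`, and the toy pairs are hypothetical names and data invented for the example.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str, str]  # (prompt, preferred response, rejected response)

def pairwise_accuracy(pairs: List[Pair], judge: Callable[[str, str, str], int]) -> float:
    """Percentage of comparisons where the judge picks the preferred response.

    `judge(prompt, a, b)` returns 0 if it prefers `a` and 1 if it prefers `b`.
    Each pair is shown in both orders to cancel out position bias.
    """
    hits = 0
    for prompt, good, bad in pairs:
        hits += judge(prompt, good, bad) == 0
        hits += judge(prompt, bad, good) == 1
    return 100.0 * hits / (2 * len(pairs))

def length_judge(prompt: str, a: str, b: str) -> int:
    # Toy judge (assumption): prefers the longer answer, ties go to the first.
    return 0 if len(a) >= len(b) else 1

toy_pairs = [
    ("Define triage.", "Sorting patients by urgency of need.", "A hospital."),
    ("What is an SAE?", "A sparse autoencoder.", "It is a thing that people sometimes talk about."),
]
score = pairwise_accuracy(toy_pairs, length_judge)

# Improvement reported in the text, in percentage points.
gain = round(88.3 - 75.4, 1)
print(score, gain)
```

The toy judge is right on the first pair and wrong on the second, which shows why a naive heuristic such as "longer is better" makes a weak evaluator.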

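Reflection tuning

A new open-source large language model, Reflection Llama-3.1 70B, was released on Hugging Face. The model was trained with a new technique called reflection tuning, which teaches an LLM to recognize mistakes in its reasoning and correct them. In reflection tuning, the LLM is trained on synthetic structured data to learn reasoning and self-correction. In my view, although its initial benchmarks show it outperforming OpenAI's GPT-4o, whether it can surpass GPT-4o across all benchmarks remains to be seen.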

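Interpreting large models

Explaining how large models process their input has long been a challenge. Previously, the Anthropic team reported how to interpret a single layer of its neural networks, or small networks. Now, Google has released a system called Gemma Scope that reveals how each layer in the Gemma 2 large language models responds to a given input token. Gemma Scope comprises a suite of sparse autoencoders (SAEs) trained on all layers of the Gemma 2 2B and 9B models. An SAE is an unsupervised method for learning a sparse decomposition of a neural network's latent representations into interpretable features. Gemma Scope's open SAE weights and tutorials will make it easier for the AI community to pursue safety and interpretability research. You can also try the Gemma Scope interactive demo.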
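For readers unfamiliar with sparse autoencoders, here is a minimal sketch of the idea in plain Python: an encoder maps a layer activation to a wider, mostly-zero feature vector, and a decoder reconstructs the activation from it. The ReLU activation, the L1 sparsity penalty, the dimensions, and the class name `TinySAE` are assumptions for illustration; the SAEs actually released with Gemma Scope may differ in architecture and are trained, whereas these weights are random.

```python
import random

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(m, v):
    """Multiply matrix `m` (list of rows) by vector `v`."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

class TinySAE:
    """Untrained toy sparse autoencoder: activation -> sparse features -> reconstruction."""

    def __init__(self, d_model, d_features, seed=0):
        rng = random.Random(seed)
        self.w_enc = [[rng.gauss(0, 0.5) for _ in range(d_model)] for _ in range(d_features)]
        self.b_enc = [-0.5] * d_features  # negative bias pushes many features to zero
        self.w_dec = [[rng.gauss(0, 0.5) for _ in range(d_features)] for _ in range(d_model)]
        self.b_dec = [0.0] * d_model

    def encode(self, x):
        pre = [a + b for a, b in zip(matvec(self.w_enc, x), self.b_enc)]
        return relu(pre)

    def decode(self, f):
        return [a + b for a, b in zip(matvec(self.w_dec, f), self.b_dec)]

    def loss(self, x, l1_coeff=1e-3):
        """Reconstruction error plus an L1 penalty that encourages sparse features."""
        f = self.encode(x)
        x_hat = self.decode(f)
        recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
        sparsity = sum(f)  # equals the L1 norm because features are non-negative
        return recon + l1_coeff * sparsity

sae = TinySAE(d_model=4, d_features=16)
activation = [0.3, -1.2, 0.8, 0.1]  # hypothetical activation vector for one token
features = sae.encode(activation)
active = [i for i, v in enumerate(features) if v > 0]
print("active feature indices:", active)
print("loss:", sae.loss(activation))
```

Training would minimize `loss` over many activations; interpretability work then inspects which inputs make each individual feature fire.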

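Read on below for the latest literature and my conversation with ChatGPT about how AI can grow through self-learning.

Warm regards,

AJ
Dr. Anjun Chen | Institute for Equitable Health Systems | https://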

Image credit: ARPA-H

Featured news from Page Mill

Recently published papers, news, and events showcasing progress in GenAI and LHS

Blease CR, Locher C, Gaab J, Hägglund M, Mandl KD. Generative artificial intelligence in primary care: an online survey of UK general practitioners. BMJ Health Care Inform. 2024 Sep 17;31(1):e101102. doi: 10.1136/bmjhci-2024-101102.

https://informatics./content/31/1/e101102

[2024/9] A UK survey showed that about 20% of primary care doctors are using generative AI tools in clinical practice. The top applications are writing documentation after patient appointments and assisting with differential diagnosis.

Williams CYK, Zack T, Miao BY, et al. Use of a Large Language Model to Assess Clinical Acuity of Adults in the Emergency Department. JAMA Netw Open. 2024;7(5):e248895. doi:10.1001/jamanetworkopen.2024.8895

https:///journals/jamanetworkopen/fullarticle/2818387

[2024/5] This cross-sectional study of 251,401 adult ED visits investigated the potential for an LLM to classify acuity levels of patients in the ED based on the Emergency Severity Index across 10,000 patient pairs. The LLM demonstrated accuracy of 89% and was comparable with human physician classification in a 500-pair subsample.

Wang, X., Zhao, J., Marostica, E. et al. A pathology foundation model for cancer diagnosis and prognosis prediction. Nature (2024). https:///10.1038/s41586-024-07894-z

https://www./articles/s41586-024-07894-z

[2024/9] We devised the Clinical Histopathology Imaging Evaluation Foundation (CHIEF) model, a general-purpose weakly supervised machine learning framework to extract pathology imaging features for systematic cancer evaluation. We successfully validated CHIEF using 19,491 whole-slide images from 32 independent slide sets collected from 24 hospitals and cohorts internationally. Overall, CHIEF outperformed the state-of-the-art deep learning methods by up to 36.1%, showing its ability to address domain shifts observed in samples from diverse populations and processed by different slide preparation methods. CHIEF provides a generalizable foundation for efficient digital pathology evaluation for patients with cancer.

Zimolzak AJ, Wei L, Mir U, et al. Machine learning to enhance electronic detection of diagnostic errors. JAMA Netw Open. 2024;7(9):e2431982. doi:10.1001/jamanetworkopen.2024.31982

https:///journals/jamanetworkopen/fullarticle/2823187

[2024/9] Based on expert input and existing frameworks, we designed rules-based e-triggers to find possible missed opportunities in diagnosis (MODs) in emergency departments (EDs). Using Veterans Affairs national EHR data covering more than 20 million unique individuals, we identified 2 high-risk cohorts: (1) patients with stroke risk factors discharged from the ED after presenting with dizziness or vertigo who were subsequently hospitalized for stroke or TIA within 30 days; and (2) patients discharged from the ED with abdominal pain and abnormal temperature who were subsequently hospitalized within 10 days. The best-performing machine learning algorithm achieved a positive predictive value for MODs of 92% in the stroke cohort and 93% among the patients with abdominal pain and fever.

Kaphingst KA, Kohlmann WK, Lorenz Chambers R, et al. Uptake of Cancer Genetic Services for Chatbot vs Standard-of-Care Delivery Models: The BRIDGE Randomized Clinical Trial. JAMA Netw Open. 2024;7(9):e2432143. doi:10.1001/jamanetworkopen.2024.32143

https:///journals/jamanetworkopen/fullarticle/2823183

[2024/9] Despite this trial’s limitations, its findings advance the state of the science in delivery of cancer genetic services to unaffected patients meeting criteria for genetic evaluation. By comparing chatbot service delivery vs SOC using a multisite RCT design, the findings of the BRIDGE equivalence trial support the use of chatbot approaches to offer cancer genetic services. The findings show that use of chatbots to deliver pretest genetic services has strong potential to increase access to these services for unaffected patients. These trial findings therefore support the implementation of systemwide population health management strategies to deliver cancer genetic services.

Liu T, Hetherington TC, Stephens C, et al. AI-Powered Clinical Documentation and Clinicians’ Electronic Health Record Experience: A Nonrandomized Clinical Trial. JAMA Netw Open. 2024;7(9):e2432460. doi:10.1001/jamanetworkopen.2024.32460

https:///journals/jamanetworkopen/fullarticle/2823302

[2024/9] Approximately half of clinicians using the AI-powered clinical documentation tool based on interest reported a positive outcome, potentially reducing burnout. However, a significant subset did not find time-saving benefits or improved EHR experience. Study limitations include potential selection bias and recall bias in both groups. Further research is needed to identify opportunities for improvement and understand the impact on different clinician subsets and health systems.

Pfohl SR, Cole-Lewis H, Sayres R, et al. A toolbox for surfacing health equity harms and biases in large language models. Nat Med. 2024 Sep 23. doi: 10.1038/s41591-024-03258-2.

https://www./articles/s41591-024-03258-2

[2024/9] Reliably evaluating equity-related model failures is a critical step toward developing systems that promote health equity. Through our empirical study, we find that our approach surfaces biases that may be missed by narrower evaluation approaches. Our experience underscores the importance of using diverse assessment methodologies and involving raters of varying backgrounds and expertise.

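The world of human health through the eyes of generative AI

AJ talks with ChatGPT to reveal how machines see human healthcare

Meta recently reported a self-taught evaluator for LLMs. What does that mean?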

ChatGPT

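It means they have developed a system that can evaluate the performance of large language models (LLMs) without human-annotated data. Rather than relying on human feedback or labeled data, the self-taught evaluator uses synthetic data to judge and improve a model's performance. This approach lets an LLM evaluate itself and automatically improve its capabilities, reducing the need for costly and time-consuming human annotation.

Since evaluation can be self-taught, could an LLM automatically improve its performance during development?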

ChatGPT

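Yes, an LLM can indeed improve its performance automatically during development and grow through self-learning. Using techniques such as synthetic data generation, self-evaluation, and reinforcement learning, an LLM can identify its own errors and weaknesses, then adjust and retrain accordingly. This creates a feedback loop that lets the model keep improving without relying on human intervention. Over time, this autonomous process can significantly boost the model's performance.

Wow, that's great! How does an LLM automatically improve its performance? Please describe the process in detail.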

ChatGPT

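An LLM improves its performance automatically through a series of steps that together form a feedback loop of learning and self-improvement. The detailed process is as follows:

1. Initial training

First, the LLM undergoes initial training on a large dataset using conventional supervised or self-supervised learning. At this stage, the model learns to generate appropriate outputs from the data it is exposed to.

2. Synthetic data generation

After initial training, the LLM begins generating its own synthetic data. This data can be based on tasks or problems the model has not yet been trained on, or it can be variations of data the model has already seen. This step lets the model explore new scenarios that may be absent from the original dataset.

3. Self-evaluation

Next, the model uses a self-taught evaluator to assess its own performance. At this stage, the LLM uses metrics or a learned scoring system to judge the quality of its outputs. For example, it can check its outputs against expected patterns, for logical consistency, or against previously generated examples.

4. Reinforcement learning

Based on the evaluation results, the LLM undergoes some form of reinforcement learning in which its outputs are rewarded or penalized. This step helps the model learn what makes a better response. In some cases, the model may discover flaws in its reasoning and adjust accordingly.

5. Self-correction and fine-tuning

The LLM fine-tunes itself based on feedback from the evaluation stage. This involves adjusting its internal weights and parameters to improve future outputs. The process can be repeated so that the model gradually reduces its errors and improves its capabilities.

6. Continuous self-learning

The LLM continues this cycle of synthetic data generation, self-evaluation, reinforcement learning, and fine-tuning. Over time, it can autonomously discover weaknesses in its performance and improve itself without human intervention.

7. Validation

Finally, the model can be validated on real-world data or test cases to ensure that its automatic improvements translate into actual performance gains.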
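The seven steps above can be sketched as a short control loop. This is a toy under stated assumptions, not how any production LLM is trained: the "model" is a single number (the mean of a sampler), the evaluator is a hand-written score that knows the target (a real self-taught evaluator would be learned), and step 4 is simplified to keeping the top-scoring samples rather than running full reinforcement learning. All function names are hypothetical.

```python
import random

def self_improvement_loop(model, generate, evaluate, fine_tune, validate,
                          rounds=8, keep_fraction=0.25):
    """Generate -> self-evaluate -> reward by selection -> fine-tune -> validate."""
    history = [validate(model)]                      # baseline after step 1 (initial training)
    for _ in range(rounds):                          # step 6: repeat the cycle
        samples = generate(model)                    # step 2: synthetic data
        scored = sorted(((evaluate(s), s) for s in samples), reverse=True)  # step 3
        n_keep = max(1, int(len(scored) * keep_fraction))
        rewarded = [s for _, s in scored[:n_keep]]   # step 4: keep the best-scored outputs
        model = fine_tune(model, rewarded)           # step 5: update on rewarded outputs
        history.append(validate(model))              # step 7: check for real gains
    return model, history

# Toy instantiation: the model is the mean of a Gaussian sampler.
rng = random.Random(0)
TARGET = 10.0  # stands in for "good behaviour"; known only to evaluator and validator

def generate(mean):
    return [rng.gauss(mean, 2.0) for _ in range(200)]

def evaluate(sample):
    return 1.0 / (1.0 + abs(sample - TARGET))  # higher is better

def fine_tune(mean, rewarded):
    return sum(rewarded) / len(rewarded)       # move the model toward rewarded outputs

def validate(mean):
    return -abs(mean - TARGET)                 # 0.0 would be a perfect model

final_model, history = self_improvement_loop(0.0, generate, evaluate, fine_tune, validate)
print("final model:", round(final_model, 2))
print("validation history:", [round(h, 2) for h in history])
```

The loop improves only as far as the evaluator is trustworthy, which is why the final validation step uses data the model did not generate itself.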

Has any research been published on autonomous LLM development or on large models that grow through self-learning?

ChatGPT

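Yes, research on autonomous LLM development has been published. These studies explore techniques such as self-learning, self-evaluation, reinforcement learning, and self-supervised learning that let LLMs improve their performance with minimal human intervention. For example, Meta reported its self-taught evaluator, which lets an LLM evaluate itself using synthetic data. These advances are part of a research area aimed at creating more autonomous AI systems.

Some key research areas include:

Self-taught evaluators, for autonomous assessment of LLM performance.
Reinforcement learning without human feedback, which lets LLMs learn from their own outputs.
Self-supervised learning, in which LLMs keep improving using unlabeled data.

Researchers from OpenAI, Meta, Google DeepMind, and other organizations have made important contributions to this area, and many related papers can be found on platforms such as arXiv.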

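GenAI/LHS toolbox

Tools to help you build your generative AI future

ELHS generative AI medical assistant platform (https:///zh)

The assistant platform makes it easy for medical students and healthcare professionals to begin using large models for healthcare services and learning:

Free use of multiple LLM chatbots, including GPT-4o, Gemini 1.5 Pro, and Llama 3.1.
Free trial of your own dedicated open-source LLM and chatbot.
Deployment and fine-tuning of your own dedicated open-source LLM and chatbot.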

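Learning how large models reason

OpenAI introduced OpenAI o1 (https:///index/learning-to-reason-with-llms/), a large language model trained with reinforcement learning to perform complex reasoning. o1 thinks before it answers. On a benchmark of physics, biology, and chemistry problems (GPQA), OpenAI o1 exceeded the accuracy of human PhD-level experts.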