Applying generative AI (GenAI) and learning health systems (LHS) to explore paths toward global health equity. Visit the platform at https:///zh

Last month was a fruitful one for academia, with many new GenAI research results published, including:
A UK survey showed that 20% of general practitioners (GPs) use GenAI tools.
LLMs can accurately assess the clinical acuity of adult patients presenting to the emergency department.
A new pathology foundation model outperformed the best existing models in cancer diagnosis.
GenAI chatbots can effectively deliver cancer genetics information services to patients.

ARPA-H, the US Advanced Research Projects Agency for Health, announced a new research program offering funding through the "AI Performance and Reliability Evaluation and Continuous Improvement" program (PRECISE-AI). PRECISE-AI has an ambitious goal: to develop new technologies that can detect when AI tools deployed in real clinical environments drift from their underlying training data, and that can automatically correct those tools to maintain peak performance.

To reduce the cost of model-based evaluation in LLM development, Meta proposed a new method that improves evaluators using only synthetic training data, with no human annotation. Without any labeled preference data, Meta's Self-Taught Evaluator raised the RewardBench score of a strong LLM (Llama3-70B-Instruct) from 75.4 to 88.3. It outperforms commonly used LLM judges such as GPT-4 and performs on par with top reward models trained on labeled examples.

HuggingFace announced a new open-source LLM, Reflection Llama-3.1 70B, trained with a new technique called reflection tuning, which teaches an LLM to recognize mistakes in its reasoning and correct them. In reflection tuning, the LLM is trained on synthetic, structured data to learn reasoning and self-correction. Although its preliminary benchmarks show it beating OpenAI GPT-4o, I think it remains to be seen whether it can surpass GPT-4o across all benchmarks.

Explaining how LLMs process their inputs has long been a challenge. Anthropic previously reported how to interpret a single layer or small networks. Now Google has released Gemma Scope, a system that reveals how every layer of the Gemma 2 large language models responds to a given input token. Gemma Scope is a suite of sparse autoencoders (SAEs) trained on all layers of the Gemma 2 2B and 9B models. An SAE is an unsupervised method for learning a sparse decomposition of a neural network's latent representations into interpretable features. Gemma Scope's open SAE weights and tutorials will make it easier for the AI community to conduct safety and interpretability research. You can also try the interactive Gemma Scope demo. Read on for the latest publications and my conversation with ChatGPT about self-taught, self-improving AI.
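To make the SAE idea concrete, here is a minimal NumPy sketch of a sparse autoencoder over layer activations. The class name, dimensions, and initialization are invented for illustration; Gemma Scope's actual SAEs are far wider and use a JumpReLU activation rather than the plain ReLU shown here.

```python
import numpy as np

class SparseAutoencoder:
    """Toy sparse autoencoder (SAE) over a layer's activation vectors:
    an overcomplete ReLU encoder whose mostly-zero feature activations
    are the candidates for interpretable features."""

    def __init__(self, d_model, d_features, seed=0):
        rng = np.random.default_rng(seed)
        self.W_enc = rng.normal(0.0, 0.02, (d_model, d_features))
        self.b_enc = np.zeros(d_features)
        self.W_dec = rng.normal(0.0, 0.02, (d_features, d_model))
        self.b_dec = np.zeros(d_model)

    def encode(self, x):
        # ReLU keeps only positively activated features, so codes are sparse.
        return np.maximum(0.0, (x - self.b_dec) @ self.W_enc + self.b_enc)

    def decode(self, f):
        return f @ self.W_dec + self.b_dec

    def loss(self, x, l1_coeff=1e-3):
        # Reconstruction error plus an L1 penalty pushing activations toward
        # zero: the two terms an SAE trades off during training.
        f = self.encode(x)
        mse = np.mean((x - self.decode(f)) ** 2)
        return mse + l1_coeff * np.abs(f).mean()

# Toy usage: 8 activation vectors from a hypothetical 16-dim layer,
# decomposed into 64 candidate features.
acts = np.random.default_rng(1).normal(size=(8, 16))
sae = SparseAutoencoder(d_model=16, d_features=64)
features = sae.encode(acts)   # shape (8, 64); many entries are exactly 0
recon = sae.decode(features)  # shape (8, 16)
```

In Gemma Scope, one such SAE is trained per layer of Gemma 2, and a feature's meaning is read off from the inputs that activate it most strongly.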
Warm regards,
AJ (Dr. Anjun Chen) | ELHS Institute | https://

Recent publications, news, and events showcasing progress in GenAI and LHS

Blease CR, Locher C, Gaab J, Hägglund M, Mandl KD. Generative artificial intelligence in primary care: an online survey of UK general practitioners. BMJ Health Care Inform. 2024 Sep 17;31(1):e101102. doi: 10.1136/bmjhci-2024-101102. https://informatics./content/31/1/e101102 [2024/9]
A UK survey showed that about 20% of primary care doctors are using generative AI tools in clinical practice. The top applications are writing documentation after patient appointments and assisting with differential diagnosis.

Williams CYK, Zack T, Miao BY, et
al. Use of a Large Language Model to Assess Clinical Acuity of Adults in the
Emergency Department. JAMA Netw Open. 2024;7(5):e248895.
doi:10.1001/jamanetworkopen.2024.8895 https:///journals/jamanetworkopen/fullarticle/2818387 [2024/5] This cross-sectional study of 251,401 adult
ED visits investigated the potential for an LLM to classify acuity levels of
patients in the ED based on the Emergency Severity Index across 10,000 patient pairs. The LLM demonstrated accuracy of 89% and was comparable
with human physician classification in a 500-pair subsample.

Wang, X., Zhao, J., Marostica, E. et
al. A pathology foundation model for cancer diagnosis and prognosis
prediction. Nature (2024). https:///10.1038/s41586-024-07894-z https://www./articles/s41586-024-07894-z [2024/9] We devised the Clinical
Histopathology Imaging Evaluation Foundation (CHIEF) model, a general-purpose
weakly supervised machine learning framework to extract pathology imaging
features for systematic cancer evaluation. We successfully validated CHIEF
using 19,491 whole-slide images from 32 independent slide sets collected from
24 hospitals and cohorts internationally. Overall, CHIEF outperformed the
state-of-the-art deep learning methods by up to 36.1%, showing its ability to
address domain shifts observed in samples from diverse populations and
processed by different slide preparation methods. CHIEF provides a
generalizable foundation for efficient digital pathology evaluation for
patients with cancer.

Zimolzak AJ, Wei L, Mir U, et al. Machine learning to enhance electronic detection
of diagnostic errors. JAMA Netw Open.
2024;7(9):e2431982. doi:10.1001/jamanetworkopen.2024.31982 https:///journals/jamanetworkopen/fullarticle/2823187 [2024/9] Based on expert input and existing
frameworks, we designed rules-based e-triggers to find possible missed opportunities in diagnosis (MODs) in
emergency departments (ED). Using Veterans Affairs national EHR data covering
more than 20 million unique individuals, we identified 2 high-risk cohorts: (1)
patients with stroke risk factors discharged from ED after presenting with
dizziness or vertigo who were subsequently hospitalized for stroke or TIA
within 30 days; and (2) patients discharged from ED with abdominal pain and
abnormal temperature who were subsequently hospitalized within 10 days. The
best-performing machine learning algorithm achieved a positive predictive value
for MODs in the stroke cohort of 92% and 93% among the patients with abdominal
pain and fever.

Kaphingst
KA, Kohlmann WK, Lorenz Chambers R, et al. Uptake of Cancer Genetic
Services for Chatbot vs Standard-of-Care Delivery Models: The BRIDGE
Randomized Clinical Trial. JAMA Netw Open. 2024;7(9):e2432143.
doi:10.1001/jamanetworkopen.2024.32143 https:///journals/jamanetworkopen/fullarticle/2823183 [2024/9]
Despite this trial’s limitations, its findings advance the state of the science
in delivery of cancer genetic services to unaffected patients meeting criteria
for genetic evaluation. By comparing chatbot service delivery vs SOC using a
multisite RCT design, the findings of the BRIDGE equivalence trial support the
use of chatbot approaches to offer cancer genetic services. The findings show
that use of chatbots to deliver pretest genetic services has strong potential
to increase access to these services for unaffected patients. These trial
findings therefore support the implementation of systemwide population health
management strategies to deliver cancer genetic services.

Liu T, Hetherington TC, Stephens
C, et al. AI-Powered Clinical Documentation and Clinicians’ Electronic Health
Record Experience: A Nonrandomized Clinical Trial. JAMA Netw Open. 2024;7(9):e2432460.
doi:10.1001/jamanetworkopen.2024.32460 https:///journals/jamanetworkopen/fullarticle/2823302 [2024/9] Approximately half of the clinicians who adopted the AI-powered clinical documentation tool out of interest reported a
positive outcome, potentially reducing burnout. However, a significant subset
did not find time-saving benefits or improved EHR experience. Study limitations
include potential selection bias and recall bias in both groups. Further
research is needed to identify opportunities for improvement and understand the
impact on different clinician subsets and health systems.

Pfohl SR, Cole-Lewis H, Sayres R, et al. A toolbox for surfacing health equity harms and biases in large language models. Nat Med. 2024 Sep 23. doi: 10.1038/s41591-024-03258-2. https://www./articles/s41591-024-03258-2 [2024/9] Reliably evaluating equity-related
model failures is a critical step toward developing systems that promote health
equity. Through our empirical study, we find that our approach surfaces
biases that may be missed by narrower evaluation approaches. Our experience
underscores the importance of using diverse assessment methodologies and
involving raters of varying backgrounds and expertise.

AJ in Conversation with ChatGPT: How Machines See Human Healthcare

AJ: Meta recently reported a self-taught evaluator for LLMs. What does that mean?

ChatGPT: It means they have developed a system that can evaluate the performance of large language models (LLMs) without human-annotated data. Instead of relying on human feedback or labeled data, the self-taught evaluator uses synthetic data to judge and improve model performance. This approach lets an LLM evaluate itself and automatically improve its capabilities, reducing the need for costly and time-consuming human annotation.

AJ: If evaluation can be self-taught, could an LLM automatically improve its own performance during development?

ChatGPT: Yes, an LLM can indeed improve its own performance automatically during development, learning and growing on its own. Using techniques such as synthetic data generation, self-evaluation, and reinforcement learning, an LLM can identify its own errors and weaknesses, then adjust and retrain accordingly. This creates a feedback loop that lets the model continually improve its capabilities without human intervention. Over time, this autonomous process can significantly boost model performance.

AJ: Wow, that is great! How does an LLM improve its performance automatically? Please describe the workflow in detail.

ChatGPT: An LLM improves its performance automatically through a series of steps that form a feedback loop of learning and self-improvement.

AJ: Has any research been published on autonomous LLM development or self-taught, self-improving LLMs?

ChatGPT: Yes, there is published research on autonomous LLM development. These studies explore techniques such as self-learning, self-evaluation, reinforcement learning, and self-supervised learning, which enable LLMs to improve their performance with minimal human intervention. For example, Meta reported its Self-Taught Evaluator, which lets LLMs evaluate themselves using synthetic data. These advances are part of a research field focused on creating more autonomous AI systems. Examples include:
Self-taught evaluators for autonomous assessment of LLM performance.
Reinforcement learning without human feedback, which lets LLMs learn from their own outputs.
Self-supervised learning, in which LLMs keep improving from unlabeled data.
Researchers at OpenAI, Meta, Google DeepMind, and elsewhere have made important contributions in this area, and many related papers can be found on platforms such as arXiv.
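The feedback loop described in this dialogue can be illustrated in a few lines of code. This is a hypothetical sketch, not Meta's implementation: `self_improvement_round` and the length-based `toy_judge` are invented stand-ins for a real generator model and a self-taught evaluator.

```python
def self_improvement_round(model_answers, judge, dataset):
    """One round of the loop: for each prompt, rank the model's own candidate
    answers with a learned judge (no human labels), then keep the best and
    worst as a synthetic preference pair for the next round of fine-tuning."""
    for prompt, candidates in model_answers.items():
        ranked = sorted(candidates, key=judge, reverse=True)
        dataset.append({"prompt": prompt,
                        "chosen": ranked[0],      # preferred answer
                        "rejected": ranked[-1]})  # dispreferred answer
    return dataset

# Stand-in judge: prefers longer, more detailed answers. A real self-taught
# evaluator would itself be an LLM trained on synthetic judgments.
toy_judge = len

answers = {
    "What is an LHS?": [
        "A system.",
        "A learning health system that continuously improves care "
        "by learning from its own data.",
    ]
}
data = self_improvement_round(answers, toy_judge, dataset=[])
# data now holds one preference pair, ready for DPO/RLHF-style fine-tuning.
```

Iterating this round, retraining on the accumulated pairs each time, is the feedback loop the dialogue refers to.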
ELHS Generative AI Medical Assistant Platform (https:///zh)
The assistant platform makes it easy for medical students and healthcare professionals to begin their journey with LLM-powered healthcare services and learning:
Use multiple LLM chatbots for free, including GPT-4o, Gemini 1.5 Pro, and Llama 3.1.
Try your own dedicated open-source LLMs and chatbots for free.
Deploy and fine-tune your own dedicated open-source LLMs and chatbots.
OpenAI o1 (https:///index/learning-to-reason-with-llms/)
OpenAI introduced OpenAI o1, a large language model trained with reinforcement learning to perform complex reasoning. o1 thinks before it answers. On a benchmark of physics, biology, and chemistry problems (GPQA), OpenAI o1 exceeded human PhD-level accuracy.
Meta Llama 3.2 (https://ai./blog/llama-3-2-connect-2024-vision-edge-mobile-devices/)
Meta released Llama 3.2, which includes small and medium-sized vision LLMs (11B and 90B) and lightweight, text-only models (1B and 3B) that run on edge and mobile devices, available in pretrained and instruction-tuned versions.
Qwen2.5 chatbot demo (https:///spaces/Qwen/Qwen2.5)
The Qwen models are pretrained on multilingual data covering many industries and domains; Qwen-72B was trained on a remarkable 3 trillion tokens. Capabilities include multimodal understanding and generation, along with API support.
GenAI resources:
ELHS Generative AI Medical Assistant Platform (https:///zh)
ELHS Institute website (https://elhsi.org)
ELHS Newsletter, English edition (https:///Newsletters)
International LHS Technology Forum (http://www./2020-lhs-technology-forum)
Toutiao video channel: AJ24機器對話
WeChat video channel: 陳安均AJ2.4