

The Next Frontier For Large Language Models Is Biology




Editor's note

As many modern observers have noted, the 21st century is the century of biology. Over the past few years, AI technologies exemplified by AlphaFold have become essential tools for deciphering the complexity of life, and emerging large language models promise to further advance life-science research on multiple levels.

Research areas: large language models, artificial intelligence, AI for Science, protein design, biological foundation models


Source: 圖靈人工智能
Author: Rob Toews
Via: 科技世代千高原
 




Large language models like GPT-4 have taken the world by storm thanks to their astonishing command of natural language. Yet the most significant long-term opportunity for LLMs will entail an entirely different type of language: the language of biology.

One striking theme has emerged from the long march of research progress across biochemistry, molecular biology and genetics over the past century: it turns out that biology is a decipherable, programmable, in some ways even digital system.

DNA encodes the complete genetic instructions for every living organism on earth using just four variables—A (adenine), C (cytosine), G (guanine) and T (thymine). Compare this to modern computing systems, which use two variables—0 and 1—to encode all the world’s digital electronic information. One system is binary and the other is quaternary, but the two have a surprising amount of conceptual overlap; both systems can properly be thought of as digital.
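As a toy illustration of that overlap, the Python sketch below encodes a DNA string in two bits per base; the particular base-to-bit assignment is arbitrary:

    # Toy illustration: DNA's quaternary alphabet maps cleanly onto binary,
    # two bits per base. The base-to-bit assignment here is arbitrary.
    DNA_TO_BITS = {"A": "00", "C": "01", "G": "10", "T": "11"}

    def encode_dna(seq: str) -> str:
        """Encode a DNA string as a bit string, two bits per base."""
        return "".join(DNA_TO_BITS[base] for base in seq.upper())

    print(encode_dna("GATTACA"))  # -> 10001111000100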

To take another example, every protein in every living being consists of and is defined by a one-dimensional string of amino acids linked together in a particular order. Proteins range from a few dozen to several thousand amino acids in length, with 20 different amino acids to choose from.

This, too, represents an eminently computable system, one that language models are well-suited to learn.
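Concretely, a protein sequence is just a string over a 20-character vocabulary, much as a sentence is a string over an alphabet. A minimal sketch:

    # A protein is a string over the 20-letter amino acid alphabet.
    AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # one-letter codes, 20 standard residues

    def is_valid_protein(seq: str) -> bool:
        return len(seq) > 0 and set(seq.upper()) <= AMINO_ACIDS

    # The 30-residue B chain of human insulin:
    print(is_valid_protein("FVNQHLCGSHLVEALYLVCGERGFFYTPKT"))  # True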

As DeepMind CEO/cofounder Demis Hassabis put it: “At its most fundamental level, I think biology can be thought of as an information processing system, albeit an extraordinarily complex and dynamic one. Just as mathematics turned out to be the right description language for physics, biology may turn out to be the perfect type of regime for the application of AI.”

Large language models are at their most powerful when they can feast on vast volumes of signal-rich data, inferring latent patterns and deep structure that go well beyond the capacity of any human to absorb. They can then use this intricate understanding of the subject matter to generate novel, breathtakingly sophisticated output.

By ingesting all of the text on the internet, for instance, tools like ChatGPT have learned to converse with thoughtfulness and nuance on any imaginable topic. By ingesting billions of images, text-to-image models like Midjourney have learned to produce creative original imagery on demand.

Pointing large language models at biological data—enabling them to learn the language of life—will unlock possibilities that will make natural language and images seem almost trivial by comparison.

What, concretely, will this look like?

In the near term, the most compelling opportunity to apply large language models in the life sciences is to design novel proteins.



Proteins 101



Proteins are at the center of life itself. As prominent biologist Arthur Lesk put it, “In the drama of life at a molecular scale, proteins are where the action is.”

Proteins are involved in virtually every important activity that happens inside every living thing: digesting food, contracting muscles, moving oxygen throughout the body, attacking foreign viruses. Your hormones are made out of proteins; so is your hair.

Proteins are so important because they are so versatile. They are able to undertake a vast array of different structures and functions, far more than any other type of biomolecule. This incredible versatility is a direct consequence of how proteins are built.

As mentioned above, every protein consists of a string of building blocks known as amino acids strung together in a particular order. Based on this one-dimensional amino acid sequence, proteins fold into complex three-dimensional shapes that enable them to carry out their biological functions.

A protein’s shape relates closely to its function. To take one example, antibody proteins fold into shapes that enable them to precisely identify and target foreign bodies, like a key fitting into a lock. As another example, enzymes—proteins that speed up biochemical reactions—are specifically shaped to bind with particular molecules and thus catalyze particular reactions. Understanding the shapes that proteins fold into is thus essential to understanding how organisms function, and ultimately how life itself works.

Determining a protein’s three-dimensional structure based solely on its one-dimensional amino acid sequence has stood as a grand challenge in the field of biology for over half a century. Referred to as the “protein folding problem,” it has stumped generations of scientists. One commentator in 2007 described the protein folding problem as “one of the most important yet unsolved issues of modern science.”



Deep Learning And Proteins: A Match Made In Heaven



In late 2020, in a watershed moment in both biology and computing, an AI system called AlphaFold produced a solution to the protein folding problem. Built by Alphabet’s DeepMind, AlphaFold correctly predicted proteins’ three-dimensional shapes to within the width of about one atom, far outperforming any other method that humans had ever devised.
 
It is hard to overstate AlphaFold’s significance. Long-time protein folding expert John Moult summed it up well: “This is the first time a serious scientific problem has been solved by AI.”

Yet when it comes to AI and proteins, AlphaFold was just the beginning.

AlphaFold was not built using large language models. It relies on an older bioinformatics construct called multiple sequence alignment (MSA), in which a protein’s sequence is compared to evolutionarily similar proteins in order to deduce its structure.
 
MSA can be powerful, as AlphaFold made clear, but it has limitations.

For one, it is slow and compute-intensive because it needs to reference many different protein sequences in order to determine any one protein’s structure. More importantly, because MSA requires the existence of numerous evolutionarily and structurally similar proteins in order to reason about a new protein sequence, it is of limited use for so-called “orphan proteins”—proteins with few or no close analogues. Such orphan proteins represent roughly 20% of all known protein sequences.

Recently, researchers have begun to explore an intriguing alternative approach: using large language models, rather than multiple sequence alignment, to predict protein structures.

“Protein language models”—LLMs trained not on English words but rather on protein sequences—have demonstrated an astonishing ability to intuit the complex patterns and interrelationships between protein sequence, structure and function: say, how changing certain amino acids in certain parts of a protein’s sequence will affect the shape that the protein folds into. Protein language models are able to, if you will, learn the grammar or linguistics of proteins.
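Under the hood, many protein language models (ESM-2 among them) are trained with the same masked-prediction objective used for text: hide residues and ask the model to reconstruct them from context. A minimal sketch of the idea, with details that vary from model to model:

    # Minimal sketch of the masked-language-model objective commonly used
    # to train protein language models (specifics vary across models).
    import random

    def mask_one(seq: str, mask_token: str = "<mask>"):
        """Hide one residue; the model must predict it from its context."""
        i = random.randrange(len(seq))
        return seq[:i] + mask_token + seq[i + 1:], i, seq[i]

    # An arbitrary string of valid residues, for illustration:
    masked, pos, answer = mask_one("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
    print(masked)
    # Training minimizes cross-entropy between the model's 20-way
    # prediction at position `pos` and the true residue `answer`.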
 
The idea of a protein language model dates back to the 2019 UniRep work out of George Church’s lab at Harvard (though UniRep used LSTMs rather than today’s state-of-the-art transformer models).

In late 2022, Meta debuted ESM-2 and ESMFold, one of the largest and most sophisticated protein language models published to date, weighing in at 15 billion parameters. (ESM-2 is the LLM itself; ESMFold is its associated structure prediction tool.)
 
ESM-2/ESMFold is about as accurate as AlphaFold at predicting proteins’ three-dimensional structures. But unlike AlphaFold, it is able to generate a structure based on a single protein sequence, without requiring any structural information as input. As a result, it is up to 60 times faster than AlphaFold. When researchers are looking to screen millions of protein sequences at once in a protein engineering workflow, this speed advantage makes a huge difference. ESMFold can also produce more accurate structure predictions than AlphaFold for orphan proteins that lack evolutionarily similar analogues.
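For readers who want to experiment, Meta released these models in its open-source fair-esm package. The sketch below follows the usage published in that repository to extract per-residue embeddings with a smaller ESM-2 variant; treat it as a starting point rather than a vetted pipeline:

    # Sketch following the published usage of facebookresearch/esm
    # (pip install fair-esm torch). Loads a 650M-parameter ESM-2 variant;
    # the full 15B model is esm2_t48_15B_UR50D.
    import torch
    import esm

    model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
    batch_converter = alphabet.get_batch_converter()
    model.eval()

    data = [("example", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]
    _, _, tokens = batch_converter(data)

    with torch.no_grad():
        out = model(tokens, repr_layers=[33])

    # Final-layer embeddings: [batch, sequence length + special tokens, hidden]
    print(out["representations"][33].shape)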

Language models’ ability to develop a generalized understanding of the “latent space” of proteins opens up exciting possibilities in protein science.
 
But an even more powerful conceptual advance has taken place in the years since AlphaFold.

In short, these protein models can be inverted: rather than predicting a protein’s structure based on its sequence, models like ESM-2 can be reversed and used to generate totally novel protein sequences that do not exist in nature based on desired properties.
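What “inverting” the model means in practice varies: masked in-filling, conditional decoding toward desired properties, and other schemes. The simplest version is sampling residues one at a time from the model’s predicted distributions, as in this hedged sketch, where plm_scores is a hypothetical stand-in for a real model’s forward pass:

    # Hedged sketch of generation by iterative sampling. `plm_scores` is a
    # hypothetical placeholder; a real protein language model returns
    # context-dependent scores over the 20 residues, not random noise.
    import random

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

    def plm_scores(prefix: str) -> list[float]:
        return [random.random() for _ in AMINO_ACIDS]  # placeholder

    def sample_sequence(length: int) -> str:
        seq = ""
        for _ in range(length):
            seq += random.choices(AMINO_ACIDS, weights=plm_scores(seq), k=1)[0]
        return seq

    print(sample_sequence(60))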
 



Inventing New Proteins



All the proteins that exist in the world today represent but an infinitesimally tiny fraction of all the proteins that could theoretically exist. Herein lies the opportunity.

To give some rough numbers: the total set of proteins that exist in the human body—the so-called “human proteome”—is estimated to number somewhere between 80,000 and 400,000 proteins. Meanwhile, the number of proteins that could theoretically exist is in the neighborhood of 10^1300—an unfathomably large number, many times greater than the number of atoms in the universe. (To be clear, not all of these 10^1300 possible amino acid combinations would result in biologically viable proteins. Far from it. But some subset would.)
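One plausible reconstruction of the arithmetic behind a figure of that magnitude (an assumption for illustration, not a calculation from the article): consider proteins 1,000 residues long, with 20 choices at each position.

    # Back-of-the-envelope: 20 options at each of 1,000 positions.
    import math

    log10_count = 1000 * math.log10(20)
    print(f"20**1000 = 10**{log10_count:.0f}")  # -> 10**1301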

Over many millions of years, the meandering process of evolution has stumbled upon tens or hundreds of thousands of these viable combinations. But this is merely the tip of the iceberg.

In the words of Molly Gibson, cofounder of leading protein AI startup Generate Biomedicines: “The amount of sequence space that nature has sampled through the history of life would equate to almost just a drop of water in all of Earth’s oceans.”
 
An opportunity exists for us to improve upon nature. After all, as powerful of a force as it is, evolution by natural selection is not all-seeing; it does not plan ahead; it does not reason or optimize in top-down fashion. It unfolds randomly and opportunistically, propagating combinations that happen to work.

Using AI, we can for the first time systematically and comprehensively explore the vast uncharted realms of protein space in order to design proteins unlike anything that has ever existed in nature, purpose-built for our medical and commercial needs.
 
We will be able to design new protein therapeutics to address the full gamut of human illness—from cancer to autoimmune diseases, from diabetes to neurodegenerative disorders. Looking beyond medicine, we will be able to create new classes of proteins with transformative applications in agriculture, industrials, materials science, environmental remediation and beyond.

Some early efforts to use deep learning for de novo protein design have not made use of large language models.

One prominent example is ProteinMPNN, which came out of David Baker’s world-renowned lab at the University of Washington. Rather than using LLMs, the ProteinMPNN architecture relies heavily on protein structure data in order to generate novel proteins.
 
The Baker lab more recently published RFdiffusion, a more advanced and generalized protein design model. As its name suggests, RFdiffusion is built using diffusion models, the same AI technique that powers text-to-image models like Midjourney and Stable Diffusion. RFdiffusion can generate novel, customizable protein “backbones”—that is, proteins’ overall structural scaffoldings—onto which sequences can then be layered.

Structure-focused models like ProteinMPNN and RFdiffusion are impressive achievements that have advanced the state of the art in AI-based protein design. Yet we may be on the cusp of a new step-change in the field, thanks to the transformative capabilities of large language models.
 
Why are language models such a promising path forward compared to other computational approaches to protein design? One key reason: scaling.
 



Scaling Laws



One of the key forces behind the dramatic recent progress in artificial intelligence is so-called “scaling laws”: the fact that almost unbelievable improvements in performance result from continued increases in LLM parameter count, training data and compute.
 
At each order-of-magnitude increase in scale, language models have demonstrated remarkable, unexpected, emergent new capabilities that transcend what was possible at smaller scales.
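The canonical quantitative form of these laws expresses loss as a power law in model size. The sketch below uses constants of roughly the magnitude reported for text models by Kaplan et al. (2020); they are reference points for illustration, not measured values for protein models:

    # Scaling law for loss versus parameter count N: L(N) = (N_c / N) ** alpha.
    # alpha ~ 0.076 and N_c ~ 8.8e13 approximate the values reported by
    # Kaplan et al. (2020) for text language models; illustrative only.
    def scaling_loss(n_params: float, alpha: float = 0.076, n_c: float = 8.8e13) -> float:
        return (n_c / n_params) ** alpha

    for n in (1.2e9, 15e9, 175e9):  # ProGen, ESM-2, GPT-3 parameter counts
        print(f"N = {n:.1e} -> predicted loss ~ {scaling_loss(n):.2f}")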

It is OpenAI’s commitment to the principle of scaling, more than anything else, that has catapulted the organization to the forefront of the field of artificial intelligence in recent years. As they moved from GPT-2 to GPT-3 to GPT-4 and beyond, OpenAI has built larger models, deployed more compute and trained on larger datasets than any other group in the world, unlocking stunning and unprecedented AI capabilities.

How are scaling laws relevant in the realm of proteins?

Thanks to scientific breakthroughs that have made gene sequencing vastly cheaper and more accessible over the past two decades, the amount of DNA and thus protein sequence data available to train AI models is growing exponentially, far outpacing protein structure data.
 
Protein sequence data can be tokenized and for all intents and purposes treated as textual data; after all, it consists of linear strings of amino acids in a certain order, like words in a sentence. Large language models can be trained solely on protein sequences to develop a nuanced understanding of protein structure and biology.
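A minimal sketch of what that tokenization can look like; real vocabularies also include special tokens such as padding, masking and sequence boundaries:

    # Character-level tokenization of a protein sequence, mirroring how
    # text is tokenized for an LLM.
    VOCAB = {aa: i for i, aa in enumerate("ACDEFGHIKLMNPQRSTVWY")}

    def tokenize(seq: str) -> list[int]:
        return [VOCAB[aa] for aa in seq.upper()]

    print(tokenize("MKT"))  # -> [10, 8, 16]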

This domain is thus ripe for massive scaling efforts powered by LLMs—efforts that may result in astonishing emergent insights and capabilities in protein science.

The first work to use transformer-based LLMs to design de novo proteins was ProGen, published by Salesforce Research in 2020. The original ProGen model was 1.2 billion parameters.
 
Ali Madani, the lead researcher on ProGen, has since founded a startup named Profluent Bio to advance and commercialize the state of the art in LLM-driven protein design.

While he pioneered the use of LLMs for protein design, Madani is also clear-eyed about the fact that, by themselves, off-the-shelf language models trained on raw protein sequences are not the most powerful way to tackle this challenge. Incorporating structural and functional data is essential.

“The greatest advances in protein design will be at the intersection of careful data curation from diverse sources and versatile modeling that can flexibly learn from that data,” Madani said. “This entails making use of all high-signal data at our disposal—including protein structures and functional information derived from the laboratory.”
 
Another intriguing early-stage startup applying LLMs to design novel protein therapeutics is Nabla Bio. Spun out of George Church’s lab at Harvard and led by the team behind UniRep, Nabla is focused specifically on antibodies. Given that 60% of all protein therapeutics today are antibodies and that the two highest-selling drugs in the world are antibody therapeutics, it is hardly a surprising choice.

Nabla has decided not to develop its own therapeutics but rather to offer its cutting-edge technology to biopharma partners as a tool to help them develop their own drugs.

Expect to see much more startup activity in this area in the months and years ahead as the world wakes up to the fact that protein design represents a massive and still underexplored field to which to apply large language models’ seemingly magical capabilities.
 



The Road Ahead



In her acceptance speech for the 2018 Nobel Prize in Chemistry, Frances Arnold said: “Today we can for all practical purposes read, write, and edit any sequence of DNA, but we cannot compose it. The code of life is a symphony, guiding intricate and beautiful parts performed by an untold number of players and instruments. Maybe we can cut and paste pieces from nature’s compositions, but we do not know how to write the bars for a single enzymic passage.”
As recently as five years ago, this was true.

But AI may give us the ability, for the first time in the history of life, to actually compose entirely new proteins (and their associated genetic code) from scratch, purpose-built for our needs. It is an awe-inspiring possibility.

These novel proteins will serve as therapeutics for a wide range of human illnesses, from infectious diseases to cancer; they will help make gene editing a reality; they will transform materials science; they will improve agricultural yields; they will neutralize pollutants in the environment; and so much more that we cannot yet even imagine.
 
The field of AI-powered—and especially LLM-powered—protein design is still nascent and unproven. Meaningful scientific, engineering, clinical and business obstacles remain. Bringing these new therapeutics and products to market will take years.

Yet over the long run, few market applications of AI hold greater promise.

In future articles, we will delve deeper into LLMs for protein design, including exploring the most compelling commercial applications for the technology as well as the complicated relationship between computational outcomes and real-world wet lab experiments.
 
Let’s end by zooming out. De novo protein design is not the only exciting opportunity for large language models in the life sciences.

Language models can be used to generate other classes of biomolecules, notably nucleic acids. A buzzy startup named Inceptive, for example, is applying LLMs to generate novel RNA therapeutics.

Other groups have even broader aspirations, aiming to build generalized “foundation models for biology” that can fuse diverse data types spanning genomics, protein sequences, cellular structures, epigenetic states, cell images, mass spectrometry, spatial transcriptomics and beyond.
 
The ultimate goal is to move beyond modeling an individual molecule like a protein to modeling proteins’ interactions with other molecules, then to modeling whole cells, then tissues, then organs—and eventually entire organisms.

The idea of building an artificial intelligence system that can understand and design every intricate detail of a complex biological system is mind-boggling. In time, this will be within our grasp.
 
The twentieth century was defined by fundamental advances in physics: from Albert Einstein’s theory of relativity to the discovery of quantum mechanics, from the nuclear bomb to the transistor. As many modern observers have noted, the twenty-first century is shaping up to be the century of biology. Artificial intelligence and large language models will play a central role in unlocking biology’s secrets and unleashing its possibilities in the decades ahead.

Buckle up.



Original title: The Next Frontier For Large Language Models Is Biology

Original article: https://www.forbes.com/sites/robtoews/2023/07/16/the-next-frontier-for-large-language-models-is-biology/


Image credits: U OF W, ROYAL SOCIETY, HARVARD


