Multimodal Foundation Model Methodology

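What becomes possible when several modalities are tied into one model

Foundation models are expanding from single-cell RNA to protein, DNA, structure, image, and spatial data. This page summarizes which layer math they use (Transformer, VAE, discrete diffusion), how genes, cells, and residues are tokenized, which tasks a single modality can solve and which require paired multimodal data, and how architecture patterns are chosen on top of those choices. Between 2024 and 2026, newer directions such as scaling laws, token unification, perturbation atlases, and geometric DL have been taking hold; the page walks through the model catalog and recent trends collected in our wiki.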


section 01

Layer math — Transformer · VAE · Discrete Diffusion

attention · ELBO · absorbing kernel

The multimodal foundation model corpus splits into three broad families: Transformer (self-attention based, handling each modality as a token sequence), VAE (modality-specific encoders plus a shared latent space), and discrete diffusion (an absorbing state masks categorical tokens, and the model learns to restore them).
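For the third family, the forward (noising) half of an absorbing-state process can be sketched in a few lines. This is a minimal sketch, assuming each token is independently replaced by a mask token with probability t. The `<mask>` symbol, the function name `absorb`, and the toy amino-acid and structure tokens are illustrative and not taken from any model's code.

```python
import random

MASK = "<mask>"  # hypothetical absorbing symbol

def absorb(tokens, t, rng):
    """Forward process: replace each token by MASK with probability t.

    A token that is already MASK stays MASK, which is what makes the
    state absorbing.
    """
    return [MASK if rng.random() < t else tok for tok in tokens]

# Toy stream: three amino-acid tokens followed by three structure tokens.
tokens = ["M", "K", "T", "s17", "s04", "s22"]
rng = random.Random(0)

clean = absorb(tokens, 0.0, rng)      # t = 0: nothing is masked
noisy = absorb(tokens, 0.5, rng)      # t = 0.5: roughly half is masked
noisier = absorb(noisy, 0.5, rng)     # masked positions never come back
fully_masked = absorb(tokens, 1.0, rng)
```

A denoising model would be trained to predict the original token at every masked position; that reverse half is not shown here.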

Transformer block

Standard layout: token embedding → multi-head self-attention → FFN → LayerNorm, repeated N times. Genes in single-cell data have no inherent order, so the first decision when applying a transformer is how to supply positional information.

Q = X W_Q,  K = X W_K,  V = X W_V
multi-head: head_i = Attn(Q W_i^Q, K W_i^K, V W_i^V)
output: Concat(head_1, …, head_h) W_O
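The formulas above leave Attn itself implicit. Here is a minimal single-head sketch, assuming the standard scaled dot-product definition Attn(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V. The helper names and the tiny matrices are illustrative; identity projections are used only to keep the numbers easy to follow.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

# Three tokens (for example three genes), embedding width 2; values are arbitrary.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_Q = [[1.0, 0.0], [0.0, 1.0]]  # identity projections, for readability only
W_K = [[1.0, 0.0], [0.0, 1.0]]
W_V = [[1.0, 0.0], [0.0, 1.0]]

Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
out, weights = attention(Q, K, V)
```

Each row of `weights` is a probability distribution over the tokens, and `out` has the same shape as `V`. A multi-head layer would run this once per head with separate projections and concatenate the results before applying W_O.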

scGPT generative attention masking: at each step some genes are left unknown and predicted from the visible set, and the predicted genes join the visible set at the next step. This reproduces causal masking on an unordered set. Condition tokens (modality, batch, perturbation) enter as separate tokens and are combined by elementwise sum.
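The step-by-step reveal can be sketched as a mask that grows over iterations. This is a minimal sketch, not scGPT's implementation: it assumes an unknown gene may attend to visible genes and to itself but not to other unknown genes, and it promotes genes in index order, whereas a real model would decide which predictions to accept. All function names are hypothetical.

```python
def build_mask(n_genes, visible):
    """mask[i][j] is True when query gene i may attend to key gene j."""
    return [[(j in visible) or (i == j) for j in range(n_genes)]
            for i in range(n_genes)]

def reveal_in_steps(n_genes, visible, per_step):
    """Yield (visible set, mask) while unknown genes are promoted to visible."""
    visible = set(visible)
    unknown = [g for g in range(n_genes) if g not in visible]
    while unknown:
        yield set(visible), build_mask(n_genes, visible)
        # Assumption: promote in index order; the choice rule is model-specific.
        promoted, unknown = unknown[:per_step], unknown[per_step:]
        visible.update(promoted)
    yield set(visible), build_mask(n_genes, visible)

# Five genes, two visible at the start, two promoted per step.
steps = list(reveal_in_steps(n_genes=5, visible={0, 1}, per_step=2))
first_visible, first_mask = steps[0]
last_visible, last_mask = steps[-1]
```

In the first step gene 2 can read genes 0 and 1 but not gene 3; by the last step every gene is visible and the mask is fully open, which mirrors how a causal mask opens up along a sequence.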

scFoundation asymmetric encoder/decoder: a deep encoder with a shallow decoder, in the MAE style. Instead of binning, the expression scalar is preserved through a continuous projection.

section 02

Tokenization strategies: same cell, different vocabulary

rank · bin · projection · text · epigenome · perturbation

A foundation model starts from the question of what counts as a token. Given the same single-cell data, Geneformer sees ranks, scGPT sees bins, scFoundation sees continuous projections, and scELMo sees LLM-generated text embeddings. EpiAgent tokenizes cCREs, and Tahoe-x1 tokenizes drugs as well.
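The first three vocabularies can be compared on one toy cell. This is a simplified sketch under stated assumptions: the rank view skips the corpus-wide normalization a real rank tokenizer would apply first, the bin view uses per-cell equal-frequency bins, and the continuous view is a plain scaling of a made-up weight vector. Function names, gene values, and weights are all illustrative.

```python
def rank_tokens(expr):
    """Rank view: gene names ordered by descending expression, zeros dropped."""
    nonzero = [(gene, value) for gene, value in expr.items() if value > 0]
    return [gene for gene, _ in sorted(nonzero, key=lambda gv: gv[1], reverse=True)]

def bin_tokens(expr, n_bins):
    """Bin view: per-cell equal-frequency bins 1..n_bins; zero expression stays 0."""
    nonzero = sorted((g for g, v in expr.items() if v > 0), key=lambda g: expr[g])
    bins = {gene: 0 for gene in expr}
    for rank, gene in enumerate(nonzero):
        bins[gene] = 1 + (rank * n_bins) // len(nonzero)
    return bins

def project(value, weights):
    """Continuous view: scale a weight vector by the raw expression value."""
    return [value * w for w in weights]

# One toy cell; the expression values are invented for illustration.
cell = {"GATA1": 5.0, "CD3E": 0.0, "ACTB": 9.0, "MYC": 2.0}

ranked = rank_tokens(cell)
binned = bin_tokens(cell, n_bins=3)
projected = {gene: project(value, [0.5, -1.0]) for gene, value in cell.items()}
```

The rank view discards magnitudes and keeps only order, the bin view keeps coarse magnitudes, and the continuous view keeps the exact scalar, which is the trade-off the three models make differently.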

section 03

Is a single modality enough, or is multimodal required?

cell type · translation · cis-regulation · zero-shot

"More modalities is always better" is not the answer. Some tasks reach SOTA with RNA alone (Geneformer predicts dosage-sensitive genes at AUC 0.91). Others cannot even be defined without paired multimodal data (cis-regulatory links, separating primed from committed progenitors). The two groups are compared below.

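Single-modal is sufficient (adding modalities only raises accuracy)

Tasks where a single RNA modality reaches SOTA, or where multimodal data raises accuracy slightly without changing the task definition.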

Cell type annotation: SOTA on PBMC/pancreas/lung benchmarks is reached by RNA-only models. (scGPT · Geneformer · CellFM)
Batch correction · integration: on par with Harmony, Scanorama, and scVI. (scGPT competitive)
Expression denoising · imputation: read-depth-aware MLM. (scFoundation RDA · scPRINT-2 XPressor)
Genetic perturbation prediction: predicting drug and knockout (KO) effects. (scGPT · GEARS · Tahoe-x1)
GRN inference: attention probing and in-silico deletion. (scGPT · Geneformer)
Dosage-sensitive gene prediction: AUC 0.91, validated in dilated cardiomyopathy iPSCs. (Geneformer)

Multimodal is required (undefined without paired data)

Tasks that inherently need paired or aligned multimodal data. With a single modality the task itself cannot be defined.

Cross-modal translation: RNA→ATAC, RNA→protein, ATAC→RNA. (scButterfly · BABEL · Polarbear · sciPENN · UnitedNet)
Cross-modal imputation: recovering an expensive modality (ATAC, protein, structure) from a cheap one (RNA).
Cis-regulatory link discovery: Pearson correlation between ATAC peaks and RNA expression in the same cells, within a ±500 kb window. (requires 10x Multiome)
Cell-state resolution beyond RNA: cases where primed and committed progenitors look the same in RNA but differ in chromatin. (Multiome)
Cross-modal explainability: UnitedNet+SHAP quantifies how much a feature in one modality contributes to predictions in another.
Joint sequence-structure protein generation: folding, inverse folding, and motif scaffolding in one model. (DPLM-2)
Spatial multi-omics integration: joint training on dissociated and spatial data. (Nicheformer)
Cross-modal zero-shot queries: "In which cell type does this DNA variant act?", "Which disease is this protein structure linked to?". (Cui 2025 vision)
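The cis-regulatory link item in the list above shows most directly why pairing matters: the correlation is taken across the same cells. Below is a minimal sketch, assuming one gene, a handful of cells, and a correlation threshold chosen only for illustration. The positions, accessibility values, expression values, and the `min_r` cutoff are all made up.

```python
import math

WINDOW = 500_000  # +/- 500 kb around the gene, as in the list above

def pearson(x, y):
    """Pearson correlation of two equal-length lists; 0.0 if either is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def link_peaks(tss, expression, peaks, min_r):
    """Return (peak position, r) for peaks inside the window with |r| >= min_r."""
    links = []
    for position, accessibility in sorted(peaks.items()):
        if abs(position - tss) > WINDOW:
            continue  # too far from the gene to count as a cis link
        r = pearson(accessibility, expression)
        if abs(r) >= min_r:
            links.append((position, r))
    return links

# Five cells measured for both RNA and ATAC (paired); all numbers are invented.
expression = [1, 2, 3, 4, 5]
peaks = {
    100_000: [0, 1, 1, 2, 2],    # nearby and correlated
    450_000: [1, 0, 1, 0, 1],    # nearby but uncorrelated
    1_000_000: [1, 2, 3, 4, 5],  # perfectly correlated but outside the window
}
links = link_peaks(tss=200_000, expression=expression, peaks=peaks, min_r=0.5)
```

Only the peak at 100,000 survives both filters. With unpaired data the two lists would come from different cells, and the correlation would have no meaning.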

section 04

Architecture patterns: how to tie modalities together

fusion · alignment · diffusion · graph

Laid out side by side, the multimodal foundation model corpus reduces to a handful of patterns: from the simplest, concatenation (rarely used), through Seurat's WNN, the shared-latent VAE of totalVI/Matilda, the dual-aligned design of scButterfly, the multi-task fusion of UnitedNet, the condition tokens of scGPT, the CLIP-style approach envisioned by Cui 2025, and the discrete diffusion of DPLM-2, to the geometric DL of PINNACLE/ATOMICA.

Concatenation + shared encoder

pattern 01 · early fusion

section 05

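2024–2026 trends: where things are moving on top of the patterns

scaling · token unification · perturbation · CLIP · geometric

The "pattern steady state" has ended. In 2024–2026, on top of those patterns, scaling laws have been confirmed empirically, token unification has expanded from protein to the entire central dogma, perturbation atlases have become a first-class modality, and further branches have taken hold: general LLM reuse, spatial-native models, tri-modal CLIP, stricter evaluation, virtual cells, and geometric DL.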

trend 01 · scaling laws

Scaling is now data, not a hypothesis

C2S-Scale 27B · Geneformer-v2 316M · CellFM 800M · Tahoe-x1 3B

Until 2023, scaling laws in this field were an analogy borrowed from NLP. Now the 410M → 1B → 27B Pythia/Gemma family and the Geneformer-v2 scaling experiments (38M/104M/316M with a 4,096-gene context) confirm the power law directly. The 27B C2S-Scale outperforms both GPT-4 and scGPT on perturbation prediction. The result: scratch pretraining plus LoRA/QLoRA fine-tuning is the default.
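Confirming a power law usually means fitting a straight line in log-log space. The sketch below assumes the form loss = a * N^b and uses synthetic points generated from an exact power law, so the fit only demonstrates the procedure. The sizes and losses are not measurements from any of the models named above.

```python
import math

def fit_power_law(sizes, losses):
    """Least-squares fit of log(loss) = log(a) + b * log(size). Returns (a, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(value) for value in losses]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = math.exp(my - b * mx)
    return a, b

# Synthetic points from loss = 10 * N^-0.1; invented, not model measurements.
sizes = [1e7, 1e8, 1e9]
losses = [10 * n ** -0.1 for n in sizes]

a, b = fit_power_law(sizes, losses)
```

The fit recovers the exponent used to generate the data. With real measurements the points scatter around the line, and the quality of the fit across model sizes is what supports or undermines a scaling claim.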

trend 02 · token unification

Beyond protein to the entire central dogma

MIMIC 1B · MaxToki 1B · DPLM-2

DPLM-2 tied sequence and structure tokens into one stream. MIMIC (2026) goes further and integrates DNA + RNA + protein sequence + AlphaFold backbone + DSSP + SASA + MaSIF + phyloP + ATAC-seq + CAGE + RASP2 + biomedical text. MIMIC 1B outperforms ESM3-open 1.4B and Evo 2 7B with fewer parameters, the strongest signal so far that multimodal pretraining can overtake single-modality pretraining at an equal or smaller parameter count.

trend 03 · perturbation atlas

Observation alone cannot teach causality

Tahoe-100M · scLAMBDA · causal cell-tissue atlas vision

Since 2024, perturbation atlases have been first-class training data. Tahoe-100M covers 50 cancer cell lines × 1,100+ compounds, about 100 million perturbed cells in total. Tahoe-x1 takes the drug ID as input alongside gene tokens. scLAMBDA combines LLM-derived gene embeddings with a disentangled VAE and generalizes to unseen gene perturbations. The causal vision of Rood 2024: observational data alone cannot teach the causal structure of cellular mechanisms.

trend 04 · general LLM reuse

A cell sentence is just text

C2S-Scale (Gemma-2 / Pythia) · scELMo · GenePT · scLAMBDA

Instead of building a domain-specific transformer from scratch, this line adapts a frontier general-purpose LLM directly. C2S-Scale represents each cell as a "cell sentence" and trains the LLM on it as ordinary text. scELMo and GenePT use GPT-3.5/4 text embeddings directly as gene embeddings, with no domain pretraining at all. The path forks: domain-specific models (scGPT, scFoundation) versus LLM repurposing (C2S-Scale). Cui 2025 expects both to coexist, as an instruction-following layer plus a molecular substrate layer.

trend 05 · spatial native

Spatial is now a modality

Nicheformer 110M · Virtual Embryo (Cao 2026)

Nicheformer is trained on 110M cells (53.8M spatial). Technology-specific mean vectors normalize MERFISH, Xenium, CosMx, and dissociated data into the same space. Spatial niche queries are first-class. Cao 2026's Virtual Embryo uses flow matching plus foundation model embeddings to generate a volumetric embryo map conditioned on spatial-temporal coordinates.
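The idea of a technology-specific mean vector can be illustrated with a small normalization routine. This is a sketch under assumptions the text does not confirm: it computes one per-gene mean per technology and divides each cell by the mean of its own technology, and it assumes every cell reports every gene. Function names and expression values are invented.

```python
def technology_means(cells):
    """Per-technology mean expression of each gene.

    `cells` is a list of (technology, {gene: value}) pairs.
    """
    sums, counts = {}, {}
    for tech, expr in cells:
        counts[tech] = counts.get(tech, 0) + 1
        tech_sums = sums.setdefault(tech, {})
        for gene, value in expr.items():
            tech_sums[gene] = tech_sums.get(gene, 0.0) + value
    return {tech: {g: s / counts[tech] for g, s in gene_sums.items()}
            for tech, gene_sums in sums.items()}

def normalize(cells, means):
    """Divide each value by its technology's gene mean (assumed operation)."""
    out = []
    for tech, expr in cells:
        scaled = {g: (v / means[tech][g] if means[tech][g] else 0.0)
                  for g, v in expr.items()}
        out.append((tech, scaled))
    return out

# Two technologies with a 10x difference in raw scale; values are invented.
cells = [
    ("MERFISH", {"ACTB": 2.0, "MYC": 1.0}),
    ("MERFISH", {"ACTB": 6.0, "MYC": 3.0}),
    ("dissociated", {"ACTB": 20.0, "MYC": 10.0}),
    ("dissociated", {"ACTB": 60.0, "MYC": 30.0}),
]
means = technology_means(cells)
normalized = normalize(cells, means)
```

After normalization the first MERFISH cell and the first dissociated cell have identical profiles, even though their raw values differ tenfold, which is the sense in which the technologies land in the same space.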

trend 06 · tri-modal CLIP

CLIP-style alignment beats single modality

ProteinAligner 867M (ESM-2 + ESM-IF1 + text)

The Cui 2025 vision (Pattern 7) first became concrete on the protein side. ProteinAligner ties sequence (ESM-2 650M), 3D structure (ESM-IF1 124M GVP-GNN), and functional text (8-layer transformer) together with a sequence-anchored contrastive objective. At 867M parameters it outperforms ESM3 1.4B on most downstream tasks. The key finding: a purely contrastive objective avoids the conflict between MLM and contrastive losses.
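A sequence-anchored contrastive objective can be sketched as two InfoNCE terms that share the sequence embedding as anchor. This is a minimal sketch, not ProteinAligner's loss: it is one-directional for brevity (CLIP-style training is usually symmetric), and the temperature, the 2-dimensional embeddings, and the function names are assumptions made for illustration.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def info_nce(anchors, others, temperature=0.1):
    """Mean cross-entropy of matching anchor i to others[i] among all others."""
    a = [normalize(v) for v in anchors]
    o = [normalize(v) for v in others]
    loss = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, oj)) / temperature for oj in o]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_z - logits[i]
    return loss / len(a)

# Toy embeddings for two proteins in three modalities; values are invented.
seq = [[1.0, 0.0], [0.0, 1.0]]
struct = [[0.9, 0.1], [0.1, 0.9]]
text = [[0.8, 0.2], [0.2, 0.8]]

# Sequence is the anchor for both other modalities.
total_loss = info_nce(seq, struct) + info_nce(seq, text)

# Pairing each sequence with the wrong structure gives a much higher loss.
mismatched_loss = info_nce(seq, struct[::-1])
```

Because structure and text are each pulled toward the sequence embedding, they end up aligned with each other without a direct structure-text term, which is what anchoring buys.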

trend 07 · stricter evaluation

Claims that "foundation model > every baseline" now come with quantitative conditions

Wei 2025 · Csendes 2025 · Kedzierska 2025 · scPRINT-2 42-config

Benchmarks in 2025 exposed weaknesses in earlier claims. Wei 2025 (27 perturbation methods × 29 datasets × 6 metrics): foundation models beat baselines only when the fine-tuning data is large enough, and no model is always first. Csendes 2025: the downstream architecture matters more than the source of the gene embedding (scGPT vs scFoundation vs scELMo). Kedzierska 2025: in zero-shot settings, Geneformer and scGPT are not robustly better than simple baselines. The 42-config scheme of scPRINT-2 is the new standard.

trend 08 · virtual cell program

AIVC is no longer a slogan

Bunne 2024 (HCA) · Cui 2025 (Nature) · Cao 2026 · Yang 2024

"AI Virtual Cell" is emerging as a concrete program: the Bunne 2024 roadmap signed by HCA leadership → the Cui 2025 multimodal foundation model (MFM) blueprint → the Cao 2026 virtual embryo predictive simulation → the Yang 2024 cancer-focused AIVC perspective. The virtual cell lineage that began with VCell in 2010 now uses foundation models as its substrate.

trend 09 · geometric DL

3D does not reduce to tokens

PINNACLE 394,760 representations · ATOMICA 2M+ interfaces

Patterns 1–8 are all token-based, whether transformer, VAE, or diffusion. A separate geometric DL family is growing alongside them. PINNACLE: a multiscale GNN over the hierarchy PPI × cell-type interaction × tissue, producing 394,760 context-specific protein representations across 156 cell-type contexts in 24 tissues. ATOMICA: SE(3)-equivariant geometric DL over 2 million+ protein-X interface complexes (X = protein/ligand/RNA/DNA/metal/peptide). This family is expected to grow in 2026–2027 as AlphaFold3-class structure data accumulates.

section 06

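Model catalog: multimodal foundation models

single-cell · protein · DNA · spatial · multimodal-anchor

The models collected in the wiki are classified by modality, anchor, and architecture pattern and shown on one screen. Search and filters narrow the list, and each card shows the model name, parameter count, input modalities, and core innovation. The catalog covers 9 protein-anchor models, 12 single-cell models, and 8 genomic/spatial models, which makes it useful for comparing how the same pattern (for example CLIP-style) is implemented differently on the protein and single-cell sides.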

section 07

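Pattern recommendations by goal: what to pick for what you want to do

goal → pattern → why

Choosing among the patterns ultimately comes down to what data you have and what deliverable you need. This is a recommendation matrix based on the wiki material. Each entry lists the task, then the recommended pattern or model, then the reason for the choice.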

Fast embedding for CITE-seq · Multiome

Pattern 02 WNN (Seurat v4) or Pattern 03 totalVI / MultiVI

An analytic graph or a lightweight VAE is enough: this is routine integration that does not need a large model.

CITE-seq / SHARE-seq / TEA-seq multi-task

Pattern 03 Matilda: multi-task VAE

Handles four tasks in one model: simulation, dimensionality reduction (DR), classification, and feature selection. Also supports rare cell type augmentation.

cross-modal translation as the main goal

Pattern 04 scButterfly

The dual-aligned VAE with adversarial alignment outperforms BABEL, Polarbear, and UnitedNet on RNA↔ATAC and RNA↔protein translation.

cross-modal explainability (Patch-seq, GRN)

Pattern 05 UnitedNet + SHAP

Quantifies how much a feature in one modality contributes to predictions in another, which is central for multimodal assays such as Patch-seq.

Leverage a large RNA-only corpus → multi-omic fine-tune

Pattern 06 scGPT · scFoundation

Pretrain on 33–50 million cells, then add ATAC peaks, protein, and perturbation as condition tokens, reusing the pretraining investment as is.

integrated protein sequence + structure tasks

Pattern 08 DPLM-2

Handles folding, inverse folding, co-generation, and motif scaffolding in one model, using absorbing diffusion over LFQ structure tokens and AA tokens.

Zero-shot cross-modal queries such as DNA → cell or protein → disease

Pattern 07 CLIP-style modular (Cui 2025 blueprint)

Aligns unimodal FMs (Evo, ESM, Geneformer, scGPT, …) with a contrastive objective. ProteinAligner is the first success on the protein side. No end-to-end integrated model has been released yet.

3D molecular interface and binding-site analysis

Pattern 09 PINNACLE / ATOMICA

3D molecular interactions that need SE(3) equivariance lose structural information under token-based modeling, so graph/geometric DL is required.

spatial transcriptomics + dissociated integration

Pattern 06 + spatial native Nicheformer

Technology-specific mean vectors align MERFISH, Xenium, CosMx, and dissociated data in the same space. Trained on 110M cells (53.8M spatial).

drug perturbation prediction (compound-aware cell representation)

Pattern 06 + perturbation token Tahoe-x1

Takes the drug ID as input alongside gene tokens; training on 100 million perturbed cells lets it generalize to unseen compounds.

section 08

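Limitations and future: the road toward AIVC

paired data · batch effects · scalability · interpretability

Five limitations shared by all patterns, and the open direction proposed by Cui 2025. What still has to be solved before an AIVC (AI Virtual Cell) substrate becomes genuinely feasible?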

limit 01

Paired data scarcity

Even CITE-seq and 10x Multiome datasets are small next to RNA-only corpora (CELLxGENE, 50M+ cells). This caps true multimodal pretraining.

limit 02

Modality-specific batch effects

RNA batch effects are driven by capture efficiency and ATAC batch effects by Tn5 bias, so the drivers differ. Joint correction is harder than within-modality correction.

limit 03

Scalability

Modeling ~20K genes and ~500K ATAC peaks together makes the sequence length explode. Gene patching in CellPatch shortens it by 10–100×.
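The arithmetic behind this limit is short. The sketch below assumes full self-attention, where the number of query-key pairs grows with the square of the sequence length, and treats the 10–100× shortening as a plain division of the token count. The function name is illustrative.

```python
def attention_pairs(seq_len):
    """Number of query-key pairs in full self-attention over seq_len tokens."""
    return seq_len * seq_len

n_genes, n_peaks = 20_000, 500_000  # approximate sizes quoted above
full_len = n_genes + n_peaks
full_pairs = attention_pairs(full_len)

# How many times fewer pairs remain after shortening the sequence.
savings = {}
for factor in (10, 100):  # the 10-100x shortening quoted above
    patched_len = full_len // factor
    savings[factor] = full_pairs // attention_pairs(patched_len)
```

One token per feature gives 520,000 tokens and about 2.7 × 10^11 attention pairs per layer. A 10× shorter sequence cuts the pair count 100-fold, and a 100× shorter one cuts it 10,000-fold, which is why shortening the sequence pays off more than linearly.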

limit 04

Interpretability

scGPT attention maps and UnitedNet SHAP help, but they do not reach biological causality. This is why counterfactual perturbation is needed.

limit 05

Missing modality at test

Pattern 3 (joint VAE) needs variational marginalization, Pattern 4 (dual-aligned) needs translation, and Pattern 6 (condition token) needs explicit handling of "missing" inputs.

open direction · Cui 2025

The road to true joint pretraining

DNA + RNA + protein + chromatin + image + spatial in one model, with rigorous benchmarks (cell type prediction, disease pseudo-sample generation, in silico perturbation), guarded-uncertainty output, and an open leaderboard. Only then does AIVC become possible.

section 09

Key paper catalog: material collected in the wiki

overviews · single-cell FM · protein FM · genomic FM · spatial

Search and filter for a quick scan. Clicking a card opens a Google Scholar search. Entries are classified as overview, FM, method, review, or benchmark.