๋งํฌ: ๋…ผ๋ฌธ PDF๋กœ ๋ฐ”๋กœ ์—ด๊ธฐ

์ €์ž: Xiaojuan Tang, Fanxu Meng, Pingzhi Tang, Yuxuan Wang, Di Yin, Xing Sun, Muhan Zhang

Key Research Objective

๋ณธ ๋…ผ๋ฌธ์€ DeepSeek-V2์—์„œ ๋„์ž…๋œ Multi-Head Latent Attention (MLA)์ด Tensor Parallelism (TP) ํ™˜๊ฒฝ์—์„œ KV ์บ์‹œ ๋ฉ”๋ชจ๋ฆฌ ์ ˆ๊ฐ ํšจ๊ณผ๋ฅผ ์žƒ๋Š” ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ณ ์ž ํ•ฉ๋‹ˆ๋‹ค. ํŠนํžˆ, TP ํ™˜๊ฒฝ์—์„œ ๊ฐ ๋””๋ฐ”์ด์Šค๊ฐ€ ์ „์ฒด latent vector (cKV)๋ฅผ ๋กœ๋“œํ•ด์•ผ ํ•˜๋Š” ๋น„ํšจ์œจ์„ฑ์„ ๊ฐœ์„ ํ•˜์—ฌ, MLA์˜ ์••์ถ• ์ด์ ๊ณผ TP ํšจ์œจ์„ฑ์„ ๋™์‹œ์— ๋‹ฌ์„ฑํ•˜๋ฉด์„œ๋„ ํ‘œํ˜„ ๋Šฅ๋ ฅ(representational capacity)์„ ์œ ์ง€ํ•˜๋Š” ๊ฒƒ์„ ๋ชฉํ‘œ๋กœ ํ•ฉ๋‹ˆ๋‹ค.

Core Methodology

The proposed Tensor-Parallel Latent Attention (TPLA) partitions the latent representation and each head's input dimension across devices, performs attention independently on each shard, and then combines the results with an all-reduce. TPLA preserves representational capacity by letting every attention head still exploit the full latent representation, while each device loads only its partition of the KV cache. In addition, orthogonal transforms such as the Hadamard transform or PCA are applied around the RMSNorm and softmax operations to mitigate cross-shard interference and minimize accuracy loss. Finally, a Prefill/Decode Separation strategy is adopted: the prefill stage runs standard MLA while the decoding stage runs TPLA, optimizing the efficiency of each stage.
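The shard-and-combine step can be sketched in NumPy for a single head (a minimal sketch with toy shapes: the up/down projections and the RoPE path are omitted, and the latent vectors double as both keys and values, which is a simplification for brevity, not the paper's exact formulation):

```python
import numpy as np

rng = np.random.default_rng(0)
d_latent, n_tokens, tp = 64, 16, 2     # toy sizes, assumed

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy stand-ins: one head's query over the latent space, the latent KV
# cache (used directly as keys/values here), and an output projection.
q = rng.standard_normal(d_latent)
c_kv = rng.standard_normal((n_tokens, d_latent))
w_o = rng.standard_normal((d_latent, d_latent))

# Reference: attention over the full latent representation.
full = softmax(q @ c_kv.T / np.sqrt(d_latent)) @ c_kv @ w_o

# TPLA-style: split q, the cache, and w_o along the latent dimension,
# run attention independently on each shard, and sum the projected
# outputs -- the sum is what the all-reduce computes across devices.
shard = d_latent // tp
out = np.zeros(d_latent)
for r in range(tp):
    sl = slice(r * shard, (r + 1) * shard)
    o_r = softmax(q[sl] @ c_kv[:, sl].T / np.sqrt(shard)) @ c_kv[:, sl]
    out += o_r @ w_o[sl]               # all-reduce := elementwise sum

# Because softmax is nonlinear, the per-shard attention is an
# approximation of `full`; the orthogonal (Hadamard/PCA)
# reparameterization described above is what shrinks that gap.
print(out.shape)
```

The key structural point is that the combine is a plain sum over shards, so it maps directly onto the row-parallel all-reduce pattern already used for TP linear layers.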

Key Results

On DeepSeek-V3 and Kimi-K2, reducing the per-device KV cache yields 1.79x and 1.93x speedups, respectively, at a 32K-token context length. These gains come with no performance degradation on LongBench and commonsense benchmarks, and compatibility with FlashAttention-3 shows that a practical implementation is feasible. In particular, PCA-based reparameterization consistently delivered the best results when parallelizing across both RMSNorm and softmax.
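The intuition behind the orthogonal reparameterization can be shown with a small Hadamard example (a sketch; the Sylvester-construction helper and the 8-dimensional worst case are illustrative assumptions, not the paper's setup): rotating the latent space spreads energy evenly across dimensions, so no single shard's slice dominates after partitioning.

```python
import numpy as np

def hadamard(n):
    """Orthonormal Hadamard matrix via Sylvester construction (n = 2^k)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(H.shape[0])

d = 8
H = hadamard(d)

# Worst case for sharding: all the energy sits in one latent dimension,
# so one shard would carry the whole signal and the others nothing.
x = np.zeros(d)
x[0] = 1.0

# After the orthogonal rotation the same energy is spread uniformly,
# so every shard's slice contributes equally. Being orthogonal, H can
# be folded into the surrounding weights without changing the model.
print(np.round(H @ x, 3))
```

PCA plays the analogous role in the paper but is data-driven, which is consistent with it performing best in the reported ablations.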

AI ์‹ค๋ฌด์ž๋ฅผ ์œ„ํ•œ ์‹œ์‚ฌ์ 

TPLA offers a practical way to cut long-context inference cost by substantially improving the Tensor-Parallelism inference efficiency of MLA-based LLMs. It can be applied to existing pretrained MLA models without retraining, which lowers the barrier to adoption, and its compatibility with optimized libraries such as FlashAttention-3 makes end-to-end speedups achievable. The Prefill/Decode separation strategy enables optimizations tailored to the distinct compute and memory characteristics of each phase, raising the efficiency of the entire inference pipeline.

โš ๏ธ ์•Œ๋ฆผ: ์ด ๋ฆฌ๋ทฐ๋Š” AI๋กœ ์ž‘์„ฑ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.
