Papers I Like
MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs
Paper • 2402.15627 • Published • 34
Note: Megatron:
> Much of our model parallel approach can be characterized as techniques aimed at reducing communication and keeping the GPUs compute bound
MegaScale:
1. Architectural changes to improve pipelining/performance (PTB, SWA)
2. Overlap communication (ZeRO, fuse collectives with the weight projection, triple-buffering between layers) - overlap idea sketched below
3. Other "micro"-optimizations (FlashAttention, fuse all the things, LAMB optimizer)
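A minimal sketch of the communication/compute overlap idea in point 2, assuming ZeRO-style sharded weights and a hypothetical `layer(x, weights)` call; this illustrates the pattern, not MegaScale's actual implementation:

```python
# Overlap the all-gather for the *next* layer's sharded weights with the
# current layer's forward compute (the prefetch idea behind ZeRO-style
# overlap / triple buffering). Names and shapes are illustrative.
import torch
import torch.distributed as dist

def forward_with_prefetch(layers, weight_shards, x):
    world = dist.get_world_size()
    # One full-weight buffer per layer for simplicity; a real implementation
    # would rotate two or three buffers instead of keeping one per layer.
    full = [torch.empty(world * s.numel(), dtype=s.dtype, device=s.device)
            for s in weight_shards]
    # Start gathering layer 0's weights before the loop.
    work = dist.all_gather_into_tensor(full[0], weight_shards[0], async_op=True)
    for i, layer in enumerate(layers):
        work.wait()                                   # layer i's weights are ready
        if i + 1 < len(layers):
            # Prefetch layer i+1's weights; this collective overlaps with the
            # compute of layer i below.
            work = dist.all_gather_into_tensor(full[i + 1], weight_shards[i + 1],
                                               async_op=True)
        x = layer(x, full[i].view(layer.weight_shape))  # hypothetical layer call
    return x
```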
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
Paper • 2402.17177 • Published • 88
Note: Argues that Sora likely:
1. Encodes video into a discrete tokenized latent space (e.g. VQ-ViViT style) with space-time latent patches
2. Adds noise
3. Feeds it into a standard DiT with conditioning cross-attending to the visual tokens
4. Autoregressively generates/removes noise on the whole video or the next frame
5. Runs the decoder on the cleaned-up latent patches (reassembled in the correct aspect ratio) to get back to video pixel space
I think they encode then patch (like WALT), since 3D convs preserve aspect ratio. (Pipeline sketched below.)
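A hedged pseudocode of the hypothesized sampling pipeline, with every component (`dit`, `decoder`, `patchify`, `unpatchify`, `denoise_step`) passed in as a placeholder callable, since none of Sora's actual modules are public; the video encoder would only appear at training time:

```python
# Sketch of the pipeline the review hypothesizes: start from noise in latent
# space, denoise space-time patches with a text-conditioned DiT, then decode
# back to pixels at the native aspect ratio. Everything here is a placeholder.
import torch

def generate(dit, decoder, patchify, unpatchify, denoise_step,
             text_emb, latent_shape, steps=50):
    z = torch.randn(latent_shape)                          # noise in latent space
    patches = patchify(z)                                  # (N, D) space-time latent patches
    for t in reversed(range(steps)):
        eps = dit(patches, timestep=t, context=text_emb)   # cross-attends to text conditioning
        patches = denoise_step(patches, eps, t)            # one reverse-diffusion update
    z_clean = unpatchify(patches, latent_shape)            # reassemble the latent grid
    return decoder(z_clean)                                # back to video pixel space
```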
Beyond Language Models: Byte Models are Digital World Simulators
Paper • 2402.19155 • Published • 49
Note: CPU instructions:
> In this example, bGPT flawlessly executed all 251 consecutive instructions, achieving a perfect performance in modelling CPU states by predicting the next state from the current state and an instruction. For clarity, we translate byte sequences into a readable format, with the original binary file accessible here.
(Byte-level next-state setup sketched below.)
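A rough sketch of the setup the quote describes, assuming a hypothetical byte-level `model` that returns next-byte logits over the 256 byte values; the (state, instruction) layout is a guess, not bGPT's actual data format:

```python
# Serialize (current CPU state, instruction) as raw bytes and greedily decode
# the bytes of the next state with a next-byte model. Illustrative only.
import torch

def predict_next_state(model, state_bytes: bytes, instr_bytes: bytes, state_len: int):
    prompt = torch.tensor(list(state_bytes + instr_bytes), dtype=torch.long)[None]  # (1, L)
    out = prompt
    for _ in range(state_len):                    # generate next-state bytes one by one
        logits = model(out)                       # (1, L, 256) next-byte logits
        nxt = logits[:, -1].argmax(dim=-1, keepdim=True)
        out = torch.cat([out, nxt], dim=1)
    return bytes(out[0, -state_len:].tolist())    # decoded next CPU state
```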
Hydragen: High-Throughput LLM Inference with Shared Prefixes
Paper • 2402.05099 • Published • 19
Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
Paper • 2006.16236 • Published • 3
Simple linear attention language models balance the recall-throughput tradeoff
Paper • 2402.18668 • Published • 18
The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"
Paper • 2309.12288 • Published • 3
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
Paper • 2403.03507 • Published • 182
Note:
1. Grab the gradient from backpropagating the loss:
   - G_t = -∇_W φ_t(W_t), where G_t is the gradient matrix at timestep t
2. Gradient projection:
   - SVD for P and Q: [U, S, V] = SVD(G_t), then P_t = U[:, :r] and Q_t = V[:, :r]. Only recomputed every T steps.
   - Low-rank projection: R_t = P_t^T G_t Q_t (note R_t could be diagonal)
3. Weight update:
   - N_t = ρ_t(R_t), where ρ_t represents the optimizer (e.g. Adam) update
   - G̃_t = P_t N_t Q_t^T (unproject)
   - W_{t+1} = W_t + η · G̃_t, where η is the learning rate
A minimal sketch of this loop is below.
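A minimal sketch of this update for a single weight matrix, following the two-sided projection written above (practical implementations often project on one side only); Adam state handling is simplified and the hyperparameters are illustrative:

```python
# GaLore-style step: project the gradient into a low-rank space defined by an
# SVD refreshed every T steps, run Adam there, unproject, and apply.
import torch

def galore_step(W, grad, state, r=4, T=200, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # 1-2. Refresh projectors P, Q from an SVD of the gradient every T steps.
    if state["step"] % T == 0 or "P" not in state:
        U, S, Vh = torch.linalg.svd(grad, full_matrices=False)
        k = min(r, *grad.shape)
        state["P"], state["Q"] = U[:, :k], Vh.T[:, :k]   # P: (m, k), Q: (n, k)
        state["m"] = torch.zeros(k, k)                    # reset Adam moments in the
        state["v"] = torch.zeros(k, k)                    # low-rank space (simplification)
    P, Q = state["P"], state["Q"]

    R = P.T @ grad @ Q                                    # low-rank projection, (k, k)

    # 3. Optimizer update rho_t (Adam) on the small matrix R.
    state["step"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * R
    state["v"] = beta2 * state["v"] + (1 - beta2) * R * R
    m_hat = state["m"] / (1 - beta1 ** state["step"])
    v_hat = state["v"] / (1 - beta2 ** state["step"])
    N = m_hat / (v_hat.sqrt() + eps)

    # Unproject and take the descent step: W_{t+1} = W_t - lr * P N Q^T
    W = W - lr * (P @ N @ Q.T)
    return W, state
```

Call it in a training loop with `state = {"step": 0}` and the raw loss gradient for each 2D parameter.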
Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models
Paper • 2304.08818 • Published • 7
Note: Extra interesting mention in the WALT paper I missed the first few times:
> However, similar to Blattmann et al. [4], we can also potentially leverage pretrained image LDMs with transformer backbones by simply interleaving STW layers.
They do share very similar ViT designs - e.g. the interleaved spatial layers (frozen for Blattmann, windowed for WALT) and (spatio)temporal layers (full space-time attention + Conv3D for Blattmann, windowed for WALT). (Interleaving sketched below.)
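A rough sketch of the interleaving pattern both designs share: a spatial attention layer applied per frame followed by a temporal layer applied per spatial location, so pretrained image (spatial) blocks can be reused or frozen; the wiring is illustrative, not either paper's exact block:

```python
# Alternate spatial (within-frame) and temporal (across-frame) attention over
# a (batch, frames, tokens, dim) tensor. Norms and MLPs omitted for brevity.
import torch
import torch.nn as nn

class InterleavedSTBlock(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                                # x: (B, T, N, D)
        B, T, N, D = x.shape
        s = x.reshape(B * T, N, D)                       # spatial attention within each frame
        s = s + self.spatial(s, s, s)[0]                 # residual
        x = s.reshape(B, T, N, D)
        t = x.permute(0, 2, 1, 3).reshape(B * N, T, D)   # temporal attention across frames
        t = t + self.temporal(t, t, t)[0]                # residual
        return t.reshape(B, N, T, D).permute(0, 2, 1, 3)
```

For example, `InterleavedSTBlock(512)(torch.randn(2, 8, 64, 512))` keeps the (batch, frames, tokens, dim) shape.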
Zoology: Measuring and Improving Recall in Efficient Language Models
Paper • 2312.04927 • Published • 2
Unfamiliar Finetuning Examples Control How Language Models Hallucinate
Paper • 2403.05612 • Published • 3
MoAI: Mixture of All Intelligence for Large Language and Vision Models
Paper • 2403.07508 • Published • 75
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
Paper • 2310.05737 • Published • 4
What Algorithms can Transformers Learn? A Study in Length Generalization
Paper • 2310.16028 • Published • 2
Function Vectors in Large Language Models
Paper • 2310.15213 • Published • 1
In-Context Learning Creates Task Vectors
Paper • 2310.15916 • Published • 42
Scattered Mixture-of-Experts Implementation
Paper • 2403.08245 • Published • 1
Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference
Paper • 2403.09636 • Published • 2