Foundation AI Papers
Curated list of must-reads on LLM reasoning from the Temus AI team
Paper • 2310.04406 • Published • 8Note Top reasoning trick on HumanEval: MCTS + LLM + Feedback + Reflection @UIUC
Chain-of-Thought Reasoning Without Prompting
Paper • 2402.10200 • Published • 101Note Our re-implementation code: https://github.com/fangyuan-ksgk/CoT-Reasoning-without-Prompting Insight: decoding-time reasoning is cheap, effective, and can bring out the 'inherent' reasoning capacity of a pre-trained LLM. Drawback: identification of the answer span, and its location, remains the million-dollar question.
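A hedged sketch of the branch-and-score decoding idea (not the authors' code; see the repo above for a faithful re-implementation). The model name is a placeholder, and the confidence here is averaged over all generated tokens rather than the answer span, which is exactly the open problem noted above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def cot_decode(prompt: str, k: int = 5, max_new_tokens: int = 64):
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        first_logits = model(**inputs).logits[0, -1]
    top_k = torch.topk(first_logits, k).indices  # branch on the k most likely first tokens
    paths = []
    for t in top_k:
        ids = torch.cat([inputs.input_ids, t.view(1, 1)], dim=-1)
        out = model.generate(ids, max_new_tokens=max_new_tokens, do_sample=False,
                             output_scores=True, return_dict_in_generate=True)
        # Confidence = average margin between top-1 and top-2 probabilities along the path;
        # the paper restricts this to the answer tokens, which is the hard part noted above.
        margins = []
        for score in out.scores:
            p = torch.softmax(score[0], dim=-1)
            top2 = torch.topk(p, 2).values
            margins.append((top2[0] - top2[1]).item())
        text = tok.decode(out.sequences[0][inputs.input_ids.shape[1]:])
        paths.append((text, sum(margins) / max(len(margins), 1)))
    return max(paths, key=lambda x: x[1])  # most confident branch wins
```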
ICDPO: Effectively Borrowing Alignment Capability of Others via In-context Direct Preference Optimization
Paper • 2402.09320 • Published • 6Note In-context-learning based preference alignment, with performance on par with Supervised Fine-Tuning (SFT). Can be used to generate optimal preference pairs, or to augment the preference dataset.
Self-Discover: Large Language Models Self-Compose Reasoning Structures
Paper • 2402.03620 • Published • 109Note Self-Discover solves any task in three steps: picking a reasoning structure, designing a stepwise reasoning plan, then implementing the thinking process to get the answer. Significant performance improvement is observed.
Self-Rewarding Language Models
Paper • 2401.10020 • Published • 144Note Meta's work on iterative self-improvement of LLM.
Direct Language Model Alignment from Online AI Feedback
Paper • 2402.04792 • Published • 29Note A simplification of Meta's self-rewarding LLM: it relies on the LLM's innate capacity to understand the preferences shown in the original labeled dataset, and uses it to give thumbs up & down, which are then fed back into the model weights through DPO (see the sketch below).
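For reference, the DPO update those thumbs up/down feed into is just a logistic loss on log-likelihood ratios. A minimal sketch, assuming the summed per-response log-probs under the policy and a frozen reference model are already computed:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # The preferred response should gain log-likelihood relative to the reference,
    # the rejected one should lose it.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```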
Matryoshka Representation Learning
Paper • 2205.13147 • Published • 9Note Matryoshka embedding blends PCA's flexibility with the precision of learned embeddings, via a simple change to the loss function. Instead of only aiming for a high-quality D-dimensional compression, the Matryoshka approach requires that truncated prefixes of this embedding (the first 2, 4, ..., up to D/2 dimensions) also be effective in isolation. This dual focus ensures that even smaller segments of the embedding retain useful information. The intuition mirrors nested Matryoshka dolls: coarse information sits in the leading dimensions, finer detail in the later ones (loss sketch below).
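A minimal sketch of that loss change, assuming a shared encoder output and one linear classifier head per nesting dimension (the paper also has a weight-tied variant):

```python
import torch
import torch.nn as nn

class MatryoshkaHead(nn.Module):
    def __init__(self, num_classes: int, nesting=(8, 16, 32, 64)):
        super().__init__()
        self.nesting = nesting
        self.heads = nn.ModuleList(nn.Linear(d, num_classes) for d in nesting)

    def forward(self, emb: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Apply the same classification loss to each nested prefix of the embedding,
        # so truncated embeddings stay useful on their own.
        loss = 0.0
        for d, head in zip(self.nesting, self.heads):
            loss = loss + nn.functional.cross_entropy(head(emb[:, :d]), labels)
        return loss / len(self.nesting)
```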
Learning to Learn Faster from Human Feedback with Language Model Predictive Control
Paper • 2402.11450 • Published • 21Note In-context learning by day & fine-tuning by night. This could shake up the back-prop learning tradition and move toward human-level learning efficiency.
LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens
Paper • 2402.13753 • Published • 112Note Essentially extrapolates the rotary position embedding adaptively, using an evolutionary search algorithm (to my vague understanding; not thoroughly read yet). Seems to be Microsoft's response to Google's RingAttention.
StarCoder: may the source be with you!
Paper • 2305.06161 • Published • 30Note May the Source be with you :>
When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method
Paper • 2402.17193 • Published • 23Note Updated Version of Fine-Tuning guide from Google.
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Paper • 2305.10601 • Published • 11Note Prompt-based deliberate reasoning from Princeton & DeepMind: tree search (breadth-first / depth-first) over atomic thoughts, each step controlled with propose and value prompts. Essentially these guys are doing chess here (sketch below).
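A minimal BFS-over-thoughts sketch of that recipe; `propose_thoughts` and `score_thought` are hypothetical stand-ins for LLM calls driven by the propose and value prompts:

```python
def tree_of_thoughts_bfs(problem: str, propose_thoughts, score_thought,
                         breadth: int = 5, beam: int = 3, depth: int = 3):
    frontier = [""]  # partial reasoning traces
    for _ in range(depth):
        candidates = []
        for state in frontier:
            for thought in propose_thoughts(problem, state, n=breadth):
                new_state = state + "\n" + thought
                candidates.append((score_thought(problem, new_state), new_state))
        # Keep only the most promising partial solutions (BFS with beam pruning).
        candidates.sort(key=lambda x: x[0], reverse=True)
        frontier = [s for _, s in candidates[:beam]]
    return frontier[0]
```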
Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping
Paper • 2402.14083 • Published • 47Note "...the presented experiments use token sequences that are significantly longer than the sequences used to train LLMs such as Llama 2..." More compute than what they put into Llama 2 is now poured into training on... a maze game. This simply can't be the way towards reasoning capacity, guys.
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
Paper • 2402.17177 • Published • 88Note For obvious reasons, we have to read this one.
Online normalizer calculation for softmax
Paper • 1805.02867 • Published • 1Note This is the foundation of the modern accelerated attention mechanism. Softmax requires a normalization constant to be computed in order to get the normalized attention weights for each value. A simple but powerful observation: the global normalization constant can be assembled from two sub-chunks' statistics, so chunks can be computed in parallel and merged in a final step. This accelerates attention computation and leads to efficient attention mechanisms (sketch below).
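A toy version of that merge rule, assuming a 1-D score vector split into two chunks; each chunk keeps a running (max, normalizer) pair and the pairs merge exactly:

```python
import numpy as np

def chunk_stats(x):
    m = x.max()
    return m, np.exp(x - m).sum()

def merge_stats(m1, s1, m2, s2):
    m = max(m1, m2)
    # Rescale each partial sum to the shared maximum before adding.
    return m, s1 * np.exp(m1 - m) + s2 * np.exp(m2 - m)

x = np.random.randn(8)
m_a, s_a = chunk_stats(x[:4])
m_b, s_b = chunk_stats(x[4:])
m, s = merge_stats(m_a, s_a, m_b, s_b)
# The merged statistics reproduce the full softmax exactly.
assert np.allclose(np.exp(x - m) / s, np.exp(x) / np.exp(x).sum())
```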
Self-attention Does Not Need O(n^2) Memory
Paper • 2112.05682 • Published • 3Note Similar idea to the online-softmax paper: parallelism is achieved by breaking the sequence length (number of queries) into chunks. Because the softmax statistics can be merged, such chunking is exactly equivalent to the full attention mechanism, while allowing speed-ups on parallel hardware like GPUs/TPUs.
Blockwise Parallel Transformer for Long Context Large Models
Paper • 2305.19370 • Published • 3Note A rewrite of the memory-efficient attention mechanism: queries, keys and values are all separated into blocks along their sequence-length dimension, and results are merged after computing the normalization constants and attention-weighted values on each block in parallel. The paper also observes that the feedforward and residual operations are parallelizable and can be done within each query block: key/value block attention -> merge per query block -> feedforward & residual per query block (see the sketch below).
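The per-block accumulation described above fits in a few lines. A minimal single-head sketch (not the paper's implementation), assuming q, k, v are 2-D [seq, dim] tensors:

```python
import torch

def blockwise_attention(q, k, v, block: int = 128):
    scale = q.shape[-1] ** -0.5
    out = torch.zeros_like(q)
    m = torch.full(q.shape[:-1], float("-inf"))   # running max per query
    denom = torch.zeros(q.shape[:-1])             # running normalizer per query
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        scores = (q @ kb.T) * scale                       # [num_q, block]
        m_new = torch.maximum(m, scores.max(dim=-1).values)
        correction = torch.exp(m - m_new)                 # rescale old accumulators
        p = torch.exp(scores - m_new[:, None])
        denom = denom * correction + p.sum(dim=-1)
        out = out * correction[:, None] + p @ vb
        m = m_new
    return out / denom[:, None]
```

Comparing the output against the plain `torch.softmax(q @ k.T * scale, -1) @ v` gives matching results, which is the equivalence claim behind the memory savings.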
Ring Attention with Blockwise Transformers for Near-Infinite Context
Paper • 2310.01889 • Published • 10Note Sacrifices latency to gain memory efficiency compared to BPT, by using ring communication to restrict communication to one copy per host, instead of passing all parallel computation results into one device (which is very memory hungry). This loosens the memory requirement for achieving higher context length, therefore scaling the context length of the transformer to 10M tokens.
World Model on Million-Length Video And Language With RingAttention
Paper • 2402.08268 • Published • 37Note UC Berkeley rocks
RoFormer: Enhanced Transformer with Rotary Position Embedding
Paper • 2104.09864 • Published • 10Note This is the preliminary technique behind LongRoPE. Rotary position embedding addresses the need for 'relative position embedding' by applying a block-diagonal rotation matrix, rotating each pair of dimensions independently. Microsoft's approach to scaling the attention context is extrapolation / interpolation on top of these rotary embedding vectors (minimal sketch below).
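A minimal sketch of the rotation itself (interleaved-pair convention), not Microsoft's extrapolation scheme:

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # x: [seq_len, dim] with even dim; each pair of dimensions is rotated by an angle
    # that grows linearly with position, so q.k dot products depend only on relative offsets.
    seq_len, dim = x.shape
    inv_freq = base ** (-torch.arange(0, dim, 2).float() / dim)        # [dim/2]
    angles = torch.arange(seq_len).float()[:, None] * inv_freq[None]   # [seq_len, dim/2]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    rotated = torch.empty_like(x)
    rotated[:, 0::2] = x1 * cos - x2 * sin
    rotated[:, 1::2] = x1 * sin + x2 * cos
    return rotated
```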
Instruction-tuned Language Models are Better Knowledge Learners
Paper • 2402.12847 • Published • 25
DoRA: Weight-Decomposed Low-Rank Adaptation
Paper • 2402.09353 • Published • 26Note Rotation carries much more entropy than scaling, is much friendlier to composition, and is less prone to exploding / vanishing values. Rotating the neural network parameters just seems much more important than scaling them. Eventually people might just converge towards binary parameters, where you do not need to scale anything (decomposition sketch below).
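A hedged sketch of the magnitude/direction decomposition (W' = m · (W0 + BA) / ||W0 + BA||, column-wise norm), assuming a frozen linear weight; details like initialization follow the paper only loosely:

```python
import torch
import torch.nn as nn

class DoRALinear(nn.Module):
    def __init__(self, weight: torch.Tensor, rank: int = 8):
        super().__init__()
        out_dim, in_dim = weight.shape
        self.weight = nn.Parameter(weight, requires_grad=False)    # frozen W0
        self.m = nn.Parameter(weight.norm(dim=0))                  # per-column magnitude
        self.A = nn.Parameter(torch.randn(rank, in_dim) * 0.01)    # LoRA factors update
        self.B = nn.Parameter(torch.zeros(out_dim, rank))          # only the direction

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        direction = self.weight + self.B @ self.A
        direction = direction / direction.norm(dim=0, keepdim=True)  # unit columns
        return x @ (direction * self.m[None, :]).T
```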
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
Paper • 2402.17764 • Published • 603Note Confirmed to be real by open-source people -- learn this as the new bible if you wish to pre-train your own model. The linear layers of any ViT / Transformer can be replaced by BitNet layers with ternary parameters and achieve similar performance via quantization-aware training. Re-implementation with MoE-BitNet too: https://github.com/fangyuan-ksgk/1bitNet So is there free lunch? Or is float16 just too redundant with so many layers? (Toy quantizer below.)
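A toy version of the absmean ternary quantizer the paper describes; the straight-through estimator used in actual quantization-aware training is omitted here:

```python
import torch

def absmean_ternary(w: torch.Tensor, eps: float = 1e-5):
    # Scale by the mean absolute value, then round and clip to {-1, 0, +1}.
    scale = w.abs().mean().clamp(min=eps)
    w_q = (w / scale).round().clamp(-1, 1)
    return w_q, scale   # the forward pass uses w_q * scale in place of w
```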
Scalable Diffusion Models with Transformers
Paper • 2212.09748 • Published • 16Note UC Berkeley rocks
Linear Alignment: A Closed-form Solution for Aligning Human Preferences without Tuning and Feedback
Paper • 2401.11458 • Published • 2Note This work is similar to ICDPO, trying to utilize the knowledge inside the LLM through Contrastive Decoding (CD). Linear extrapolation is adopted to shift the logits along the alignment direction, calculated as the difference in logits between a preference-principle-prompted LLM and a plain LLM (sketch below).
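A minimal sketch of the logit-extrapolation idea; the exact weighting used in the paper may differ:

```python
import torch

def aligned_logits(plain_logits: torch.Tensor,
                   principled_logits: torch.Tensor,
                   alpha: float = 1.0) -> torch.Tensor:
    # alpha = 0 recovers the plain model; larger alpha extrapolates further along the
    # alignment direction given by the principle-prompted vs. plain forward passes.
    return plain_logits + alpha * (principled_logits - plain_logits)
```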
Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models
Paper • 2403.03003 • Published • 9
NeuroPrompts: An Adaptive Framework to Optimize Prompts for Text-to-Image Generation
Paper • 2311.12229 • Published • 26
How Far Are We from Intelligent Visual Deductive Reasoning?
Paper • 2403.04732 • Published • 18
Trial and Error: Exploration-Based Trajectory Optimization for LLM Agents
Paper • 2403.02502 • Published • 3
Common 7B Language Models Already Possess Strong Math Capabilities
Paper • 2403.04706 • Published • 16
CLLMs: Consistency Large Language Models
Paper • 2403.00835 • Published • 3Note Essentially an LLM can be viewed as a perplexity-score refiner: given any garbage, nonsensical sentence, it can re-predict a better one with simple causal masking, and such causal-masked prediction can naturally be carried out in parallel. The catch is that it may require too many iterations, or lose accuracy because it is conditioning on garbage, so fine-tuning tricks are applied to force out more performance.
DeLLMa: A Framework for Decision Making Under Uncertainty with Large Language Models
Paper • 2402.02392 • Published • 4
Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks
Paper • 1602.07868 • Published • 2Note OG work from OpenAI: decouple weight magnitude & direction for learning. Everything boils down to one equation (shown below).
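The one equation, w = g · v / ||v||, sketched for a single output unit; PyTorch ships the layer-level version as torch.nn.utils.weight_norm:

```python
import torch

def weight_norm_forward(x: torch.Tensor, v: torch.Tensor, g: torch.Tensor) -> torch.Tensor:
    # Decouple magnitude (g, a learned scalar) from direction (v / ||v||).
    w = g * v / v.norm()
    return x @ w   # x: [batch, in_dim], v: [in_dim] -> output: [batch]
```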
AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation
Paper • 2312.13010 • Published • 4
Code Generation with AlphaCodium: From Prompt Engineering to Flow Engineering
Paper • 2401.08500 • Published • 5
Hyena Hierarchy: Towards Larger Convolutional Language Models
Paper • 2302.10866 • Published • 7Note The Hyena operator utilizes long convolutions to achieve better performance than the attention operator at a fraction of the cost, particularly with longer contexts. Interestingly, the convolution filter is controlled by the position of the token, which also allows for a structured window to control the attention-like mechanism. This enables behaviors such as decay-through-time, which more closely resembles human attention.
Fine Tuning vs. Retrieval Augmented Generation for Less Popular Knowledge
Paper • 2403.01432 • Published • 2
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Paper • 2403.05530 • Published • 60Note something like 'we use python'
Adaptive Skeleton Graph Decoding
Paper • 2402.12280 • Published • 2Note parallel decoding for parallel thoughts, sequential decoding for serial deduction process
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Horizon Generation
Paper • 2403.05313 • Published • 9Note CoT but with step-wise RAG (bit crazy but multi-processing allows parallel RAG implementation, so why not?)
QLoRA: Efficient Finetuning of Quantized LLMs
Paper • 2305.14314 • Published • 45Note Trade more computation for less memory. The extra computation involves dequantizing the quantized weight matrices and propagating gradients into the low-rank adapter matrices. Less memory means no need to store the pre-trained float16 weight matrix, and no need to store gradients for the pre-trained weights. There is no free lunch. BTW, fine-tuning an X GB model requires roughly 4X GB of memory to store weights and gradients; LoRA gets to ~X GB by using adapters, and QLoRA achieves ~X/N GB with quantization (typical stack sketched below).
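A hedged sketch of a typical QLoRA setup with the transformers / peft / bitsandbytes stack; the model name and hyperparameters are illustrative, not taken from the paper:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store frozen base weights in 4-bit NF4
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bf16 for the matmuls
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", quantization_config=bnb_config)
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)         # only the small adapters carry gradients
model.print_trainable_parameters()
```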
Is Cosine-Similarity of Embeddings Really About Similarity?
Paper • 2403.05440 • Published • 3
Language Agents as Optimizable Graphs
Paper • 2402.16823 • Published • 3Note work from KAUST optimizes Graph-based Agent cooperation
Stealing Part of a Production Language Model
Paper • 2403.06634 • Published • 90Note How to hack OpenAI
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
Paper • 2403.03507 • Published • 182Note Pre-train on a 24GB GPU !! We know the LoRA part; the 'Ga' is Gradient: the gradients, not the weights, are projected into a low-rank subspace.
BitNet: Scaling 1-bit Transformers for Large Language Models
Paper • 2310.11453 • Published • 96Note Predecessor of the era-of-1-bit-LLMs paper; here they actually scale everything down to 1 bit, and even the activations get quantized. One could argue that quantizing the activations is what hurts the performance of these quantization-aware trained models.
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
Paper • 2310.11511 • Published • 74
JudgeLM: Fine-tuned Large Language Models are Scalable Judges
Paper • 2310.17631 • Published • 33
Learning From Mistakes Makes LLM Better Reasoner
Paper • 2310.20689 • Published • 28
RLVF: Learning from Verbal Feedback without Overgeneralization
Paper • 2402.10893 • Published • 10Note In-context RL:
Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought
Paper • 2403.05518 • Published • 2
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
Paper • 2403.06764 • Published • 25
Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning
Paper • 2012.13255 • Published • 3Note This is the foundation of modern fine-tuning & quantization, and the experimental discovery here relates closely to theory of mind. Essentially, with more parameters the model becomes more reluctant to learn in a high intrinsic dimension, since a lower-dimensional representation suffices given the sheer number of parameters. As a result, the model is always trying to minimize 'surprise' while reducing 'energy' -- in this case, the intrinsic dimension.
Towards General Computer Control: A Multimodal Agent for Red Dead Redemption II as a Case Study
Paper • 2403.03186 • Published • 5
Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM
Paper • 2403.07816 • Published • 39Note Branch: copy the base LLM multiple times. Train: shard the training data into multiple domain-specific subsets, and train each LLM in parallel (zero communication required). Mix: merge all domain-specific LLMs into a MoE model (token-level routing + multiple FFN layers), then MoE-finetune this model to adapt to the full training dataset. Interesting discussion of the dying-expert effect (routing weights collapse to zero) and the 'load balancing' trick as a simple fix by adding a balancing loss (sketch below).
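A sketch of a standard Switch-Transformer-style auxiliary balancing loss, which is the usual form of the fix mentioned above (the paper's exact loss may differ): it penalizes mismatch between each expert's routed token fraction and its average router probability.

```python
import torch

def load_balancing_loss(router_logits: torch.Tensor, num_experts: int) -> torch.Tensor:
    # router_logits: [num_tokens, num_experts]
    probs = torch.softmax(router_logits, dim=-1)
    assignments = torch.nn.functional.one_hot(probs.argmax(dim=-1), num_experts).float()
    tokens_per_expert = assignments.mean(dim=0)   # fraction of tokens routed to each expert
    prob_per_expert = probs.mean(dim=0)           # average routing probability per expert
    return num_experts * (tokens_per_expert * prob_per_expert).sum()
```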
Unfamiliar Finetuning Examples Control How Language Models Hallucinate
Paper • 2403.05612 • Published • 3
PlanGPT: Enhancing Urban Planning with Tailored Language Model and Efficient Retrieval
Paper • 2402.19273 • Published • 3
Unleashing Cognitive Synergy in Large Language Models: A Task-Solving Agent through Multi-Persona Self-Collaboration
Paper • 2307.05300 • Published • 18
Demystifying Embedding Spaces using Large Language Models
Paper • 2310.04475 • Published • 3
Simple and Scalable Strategies to Continually Pre-train Large Language Models
Paper • 2403.08763 • Published • 49
MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries
Paper • 2401.15391 • Published • 6
GiT: Towards Generalist Vision Transformer through Universal Language Interface
Paper • 2403.09394 • Published • 25
DeAL: Decoding-time Alignment for Large Language Models
Paper • 2402.06147 • Published • 7
SELFI: Autonomous Self-Improvement with Reinforcement Learning for Social Navigation
Paper • 2403.00991 • Published • 2
Uni-SMART: Universal Science Multimodal Analysis and Research Transformer
Paper • 2403.10301 • Published • 52
Evolutionary Optimization of Model Merging Recipes
Paper • 2403.13187 • Published • 50Note Goal-oriented model merging with evolutionary search. PS (slerp / DARE -- succeeds) + DFS (passthrough with a scaling adaptor -- fails!). Our re-implementation here: https://github.com/fangyuan-ksgk/Evolutionary-Model-Merge
Reverse Training to Nurse the Reversal Curse
Paper • 2403.13799 • Published • 13
PERL: Parameter Efficient Reinforcement Learning from Human Feedback
Paper • 2403.10704 • Published • 57
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding
Paper • 2403.11481 • Published • 12
DiPaCo: Distributed Path Composition
Paper • 2403.10616 • Published • 12Note 'Pathways' vision at Google & imaginative work from DeepMind: the ML model is a set of decomposable paths, each capable of processing input and producing output. With this model structure, distributed training across different clusters requires little synchronization and communication. Inference with such a model also only requires loading the few offline-routed paths that matter to the task at hand, alleviating the storage requirement of the computation.
Training Neural Networks from Scratch with Parallel Low-Rank Adapters
Paper • 2402.16828 • Published • 3
RAFT: Adapting Language Model to Domain Specific RAG
Paper • 2403.10131 • Published • 67Note Simple yet powerful observation: noise gets included in the RAG pipeline, so fine-tune your LLM against such noise through data augmentation.
Recurrent Drafter for Fast Speculative Decoding in Large Language Models
Paper • 2403.09919 • Published • 20
LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression
Paper • 2403.12968 • Published • 24
SOLAR 10.7B: Scaling Large Language Models with Simple yet Effective Depth Up-Scaling
Paper • 2312.15166 • Published • 56Note Successful data-flow merging: two Mistral 7B models are concatenated with overlap, the overlapped layers get merged, and continual pre-training is done while observing "fast performance recovery". It seems a good LLM stores much more compressed & generalizable information in each layer than we think. Continual pretraining from a stack of these layers might be a good idea for efficiently scaling model size.
Resolving Interference When Merging Models
Paper • 2306.01708 • Published • 13
Blending Is All You Need: Cheaper, Better Alternative to Trillion-Parameters LLM
Paper • 2401.02994 • Published • 48
WARM: On the Benefits of Weight Averaged Reward Models
Paper • 2401.12187 • Published • 18
A Unified Framework for Model Editing
Paper • 2403.14236 • Published • 2
LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement
Paper • 2403.15042 • Published • 25
Detoxifying Large Language Models via Knowledge Editing
Paper • 2403.14472 • Published • 3
GAIA: a benchmark for General AI Assistants
Paper • 2311.12983 • Published • 183
LLM Agent Operating System
Paper • 2403.16971 • Published • 65
The Unreasonable Ineffectiveness of the Deeper Layers
Paper • 2403.17887 • Published • 78
LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning
Paper • 2403.17919 • Published • 16
Long-form factuality in large language models
Paper • 2403.18802 • Published • 24Note Split, then revise each piece to be self-contained so the atomic thought is maintained and the argument can be broken down into coherent small claims. On those smaller claims one can fact-check, reflect, run MCTS, do all sorts of stuff. This little trick opens a door to immense possibility, and is especially useful with their open-sourced code!
Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints
Paper • 2212.05055 • Published • 5
Scaling Expert Language Models with Unsupervised Domain Discovery
Paper • 2303.14177 • Published • 2Note An extension of the Branch-Train-Mix approach from Meta. Multiple copies are made from the base LLM, an unsupervised clustering-based domain discovery (the main difference) is done by traversing the training corpus, and offline task-level routing then ensures each LLM is trained into a domain expert. At inference, the weighted logit predictions from the routed top-K experts are mixed.
Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference
Paper • 2110.03742 • Published • 3Note Google's 2021 work on the multilingual machine translation task. Token-level MoE suffers from having to keep all experts in memory while dynamic token routing activates only a few of them -- this creates redundancy. By assigning each (input language, output language) pair as a separate task, task-level routing (Task-MoE) is shown to outperform Token-MoE on multilingual machine translation, avoiding interference from mixed training.
RouterBench: A Benchmark for Multi-LLM Routing System
Paper • 2403.12031 • Published • 3
Fusing Models with Complementary Expertise
Paper • 2310.01542 • Published • 1
Lumos: Learning Agents with Unified Data, Modular Design, and Open-Source LLMs
Paper • 2311.05657 • Published • 27
Teaching Large Language Models to Reason with Reinforcement Learning
Paper • 2403.04642 • Published • 46Note It seems Meta is not playing RL in the right way; check Quiet-STaR, a significant improvement is possible. Again, one potential reason is that the thought-generation process here is not PARALLEL. As Matt puts it, the thinking process is fundamentally parallel, while the saying process isn't. So should the RL be done on the thoughts, and not on the greedy-1 decoding path? Need to dig into it.
Model Stock: All we need is just a few fine-tuned models
Paper • 2403.19522 • Published • 10
Advancing LLM Reasoning Generalists with Preference Trees
Paper • 2404.02078 • Published • 44
Octopus v2: On-device language model for super agent
Paper • 2404.01744 • Published • 57Note LLM OS == fine-tuning on a specific sequence of actions within a constrained environment == a few-billion-parameter agent?
Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity
Paper • 2403.14403 • Published • 6
BLADE: Enhancing Black-box Large Language Models with Small Domain-Specific Models
Paper • 2403.18365 • Published • 2
LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model
Paper • 2404.01331 • Published • 25
Mixture-of-Depths: Dynamically allocating compute in transformer-based language models
Paper • 2404.02258 • Published • 104Note 'Skipping' transformer blocks yields performance on par with sequential propagation without skipping. This is essentially a 'hard' residual, as opposed to the 'soft' residual in ResNet (sketch below).
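A toy sketch of the 'hard residual' routing, assuming a per-token scalar router and treating `block` as the transformer block applied only to the selected tokens; the paper's exact router weighting and capacity schedule may differ:

```python
import torch
import torch.nn as nn

class MoDBlock(nn.Module):
    def __init__(self, dim: int, block: nn.Module, capacity: float = 0.125):
        super().__init__()
        self.router = nn.Linear(dim, 1)   # scores each token
        self.block = block
        self.capacity = capacity          # fraction of tokens that get compute

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [seq_len, dim]
        scores = self.router(x).squeeze(-1)                 # [seq_len]
        k = max(1, int(self.capacity * x.shape[0]))
        idx = scores.topk(k).indices                        # top-k tokens get processed
        weights = scores[idx].sigmoid().unsqueeze(-1)       # router stays differentiable
        out = x.clone()                                     # everyone else skips the block
        out[idx] = x[idx] + weights * self.block(x[idx])
        return out
```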
Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models
Paper • 2404.03622 • Published • 4
ReFT: Representation Finetuning for Language Models
Paper • 2404.03592 • Published • 90Note Given a pre-trained LLM, the input sequence is represented in latent space across transformer blocks. These representations can be intervened on to 'control' the model output, with higher efficiency than modifying model weights. Representation fine-tuning uses learnable representation interventions for fine-tuning, achieving a 10x-50x efficiency boost compared to LoRA & DoRA. Compositional learning is done via mutually exclusive interventions on separate positions & layers.
Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences
Paper • 2404.03715 • Published • 60Note Iterative DPO. The difference from self-rewarding LLMs is that it uses a strong LLM to RANK the generated responses, making it much more efficient at 'aligning' the small / weak model towards the strong model. Moreover, since evaluation is simpler than generation, one could even surpass GPT-4 performance with such iteration schemes.
Model Editing with Canonical Examples
Paper • 2402.06155 • Published • 11
Model Editing Can Hurt General Abilities of Large Language Models
Paper • 2401.04700 • Published • 3
Let's Focus on Neuron: Neuron-Level Supervised Fine-tuning for Large Language Model
Paper • 2403.11621 • Published • 2
LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders
Paper • 2404.05961 • Published • 64
Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence
Paper • 2404.05892 • Published • 31
CodecLM: Aligning Language Models with Tailored Synthetic Data
Paper • 2404.05875 • Published • 16
RS-DPO: A Hybrid Rejection Sampling and Direct Preference Optimization Method for Alignment of Large Language Models
Paper • 2402.10038 • Published • 6
Context versus Prior Knowledge in Language Models
Paper • 2404.04633 • Published • 5
WILBUR: Adaptive In-Context Learning for Robust and Accurate Web Agents
Paper • 2404.05902 • Published • 20
Best Practices and Lessons Learned on Synthetic Data for Language Models
Paper • 2404.07503 • Published • 29Note Position paper on data augmentation approaches adopted in LLM training. Real data is expensive to get and will run out very quickly, and most public benchmark datasets are already inside the training corpora of many LLMs. One interesting point: replacing natural language labels with random labels improves model performance -- the classic 'increase difficulty to improve generalization' trick.
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies
Paper • 2404.06395 • Published • 21
Evaluating Mathematical Reasoning Beyond Accuracy
Paper • 2404.05692 • Published • 2
Rho-1: Not All Tokens Are What You Need
Paper • 2404.07965 • Published • 84
Aligner: Achieving Efficient Alignment through Weak-to-Strong Correction
Paper • 2402.02416 • Published • 4Note Residual connections have proven to be the right way to compose neural network layers. This work from Peking University shows the same trick can be used to compose LLMs (not surprising at all). The image reminds me of Isaac Newton's quote: "If I have seen further, it is by standing on the shoulders of giants."
Stream of Search (SoS): Learning to Search in Language
Paper • 2404.03683 • Published • 25
THOUGHTSCULPT: Reasoning with Intermediate Revision and Search
Paper • 2404.05966 • Published • 2
Autonomous Evaluation and Refinement of Digital Agents
Paper • 2404.06474 • Published • 1
Symbol tuning improves in-context learning in language models
Paper • 2305.08298 • Published • 3Note This work touches on a fundamental issue of ML model training: the model always picks the shortcut. To predict natural language labels, the shortcut is to check semantic similarity. But what if we replace the natural language labels with random ones? Then the model is forced to 'learn' what it should do, and then do it.
TransformerFAM: Feedback attention is working memory
Paper • 2404.09173 • Published • 43
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Paper • 2401.10774 • Published • 54
From r to Q^*: Your Language Model is Secretly a Q-Function
Paper • 2404.12358 • Published • 2
Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing
Paper • 2404.12253 • Published • 53Note Not the right way to go. Model would reinforce itself.
TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
Paper • 2404.11912 • Published • 16
Reuse Your Rewards: Reward Model Transfer for Zero-Shot Cross-Lingual Alignment
Paper • 2404.12318 • Published • 14
Compression Represents Intelligence Linearly
Paper • 2404.09937 • Published • 27Note Equivalence between optimal compression and the perplexity score. An LLM encodes rare sequences with more bits and common sequences with fewer; the fewer bits required, the more compressed the encoding. The pre-training loss is precisely modeling this compression efficiency, and the paper identifies a linear correspondence between compression and intelligence (conversion sketch below).
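A quick conversion from pre-training loss to code length, which is the quantity the paper correlates with benchmark scores; the numbers below are made up purely for illustration:

```python
import math

def bits_per_byte(loss_nats_per_token: float, num_tokens: int, num_bytes: int) -> float:
    # Cross-entropy in nats/token is the expected code length of an arithmetic coder
    # driven by the LLM; convert nats -> bits and normalize by raw byte count.
    total_bits = loss_nats_per_token * num_tokens / math.log(2)
    return total_bits / num_bytes

# e.g. loss of 2.0 nats/token over 1000 tokens covering 4000 bytes of raw text
print(bits_per_byte(2.0, 1000, 4000))   # ~0.72 bits per byte
```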
Many-Shot In-Context Learning
Paper • 2404.11018 • Published • 4Note Scale the amount of few-shot prompting up massively (>500k). Interesting discovery: "Unsupervised ICL", which presents problems rather than problem-solution pairs, achieves on-par performance with question-answer prompting. We need to present enough context to the LLM so that its domain-specific knowledge can be revealed.
Align Your Steps: Optimizing Sampling Schedules in Diffusion Models
Paper • 2404.14507 • Published • 21
SnapKV: LLM Knows What You are Looking for Before Generation
Paper • 2404.14469 • Published • 23
Achieving >97% on GSM8K: Deeply Understanding the Problems Makes LLMs Perfect Reasoners
Paper • 2404.14963 • Published • 2Note The observation is that LLM errors mostly occur because the model fails to understand the question. Having the LLM rephrase the question gives it an explicit requirement to address this; combining question rephrasing, information extraction, and then step-by-step reasoning leads to higher accuracy. More compute, more accuracy.
LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding
Paper • 2404.16710 • Published • 74
NExT: Teaching Large Language Models to Reason about Code Execution
Paper • 2404.14662 • Published • 4Note Getting an error message and re-programming is generally what happens when we code. However, that is not exactly it: the error message helps us locate the exact place where the error occurs, which helps break down the code issue and lets us trace it back. This paper sort of does that.
Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization
Paper • 2405.15071 • Published • 37