Foundation AI Papers
Curated list of must-reads on LLM reasoning from the Temus AI team
Paper • 2310.04406 • Published • 8Note Top reasoning trick on HumanEval: MCTS + LLM + Feedback + Reflection @UIUC
Chain-of-Thought Reasoning Without Prompting
Paper • 2402.10200 • Published • 101Note Our re-implementation code: https://github.com/fangyuan-ksgk/CoT-Reasoning-without-Prompting Insight: decoding-time reasoning is cheap, effective, and can bring out the 'inherent' reasoning capacity of a pre-trained LLM. Drawback: identification of the answer span, and its location, remains the million-dollar question.
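A hedged sketch of the branch-and-score decoding idea (not the authors' code; see the repo above for a faithful re-implementation). The model name is a placeholder, and the confidence here is averaged over all generated tokens rather than the answer span, which is exactly the open problem noted above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def cot_decode(prompt: str, k: int = 5, max_new_tokens: int = 64):
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        first_logits = model(**inputs).logits[0, -1]
    top_k = torch.topk(first_logits, k).indices  # branch on the k most likely first tokens
    paths = []
    for t in top_k:
        ids = torch.cat([inputs.input_ids, t.view(1, 1)], dim=-1)
        out = model.generate(ids, max_new_tokens=max_new_tokens, do_sample=False,
                             output_scores=True, return_dict_in_generate=True)
        # Confidence = average margin between top-1 and top-2 probabilities along the path;
        # the paper restricts this to the answer tokens, which is the hard part noted above.
        margins = []
        for score in out.scores:
            p = torch.softmax(score[0], dim=-1)
            top2 = torch.topk(p, 2).values
            margins.append((top2[0] - top2[1]).item())
        text = tok.decode(out.sequences[0][inputs.input_ids.shape[1]:])
        paths.append((text, sum(margins) / max(len(margins), 1)))
    return max(paths, key=lambda x: x[1])  # most confident branch wins
```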
ICDPO: Effectively Borrowing Alignment Capability of Others via In-context Direct Preference Optimization
Paper • 2402.09320 • Published • 6Note In-context-learning based preference alignment, with performance on par with Supervised Fine-Tuning (SFT). Can be used to generate optimal preference pairs, or to augment the preference dataset.
Self-Discover: Large Language Models Self-Compose Reasoning Structures
Paper • 2402.03620 • Published • 109Note Self-Discover solves any task in three steps: picking a reasoning structure, designing a stepwise reasoning plan, then implementing the thinking process to get the answer. Significant performance improvement is observed.
Self-Rewarding Language Models
Paper • 2401.10020 • Published • 144Note Meta's work on iterative self-improvement of LLM.
Direct Language Model Alignment from Online AI Feedback
Paper • 2402.04792 • Published • 29Note A simplification of Meta's self-rewarding LLM: it relies on the LLM's innate capacity to understand the preferences shown in the original labeled dataset, and uses it to give thumbs up & down, which are then fed back into the model weights through DPO (see the sketch below).
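For reference, the DPO update those thumbs up/down feed into is just a logistic loss on log-likelihood ratios. A minimal sketch, assuming the summed per-response log-probs under the policy and a frozen reference model are already computed:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # The preferred response should gain log-likelihood relative to the reference,
    # the rejected one should lose it.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```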
Matryoshka Representation Learning
Paper • 2205.13147 • Published • 9Note Matryoshka embedding blends PCA's flexibility with the precision of learned embeddings, via a simple change to the loss function. Instead of only aiming for a high-quality D-dimensional compression, the Matryoshka approach requires that truncated prefixes of this embedding (the first 2, 4, ..., up to D/2 dimensions) also be effective in isolation. This dual focus ensures that even smaller segments of the embedding retain useful information. The intuition mirrors nested Matryoshka dolls: coarse information sits in the leading dimensions, finer detail in the later ones (loss sketch below).
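A minimal sketch of that loss change, assuming a shared encoder output and one linear classifier head per nesting dimension (the paper also has a weight-tied variant):

```python
import torch
import torch.nn as nn

class MatryoshkaHead(nn.Module):
    def __init__(self, num_classes: int, nesting=(8, 16, 32, 64)):
        super().__init__()
        self.nesting = nesting
        self.heads = nn.ModuleList(nn.Linear(d, num_classes) for d in nesting)

    def forward(self, emb: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Apply the same classification loss to each nested prefix of the embedding,
        # so truncated embeddings stay useful on their own.
        loss = 0.0
        for d, head in zip(self.nesting, self.heads):
            loss = loss + nn.functional.cross_entropy(head(emb[:, :d]), labels)
        return loss / len(self.nesting)
```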
Learning to Learn Faster from Human Feedback with Language Model Predictive Control
Paper • 2402.11450 • Published • 21Note In-context learning by day & fine-tuning by night. This could shake up the back-prop learning tradition and move toward human-level learning efficiency.
LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens
Paper • 2402.13753 • Published • 112Note Essentially extrapolates the rotary position embedding adaptively, using an evolutionary search algorithm (to my vague understanding; not thoroughly read yet). Seems to be Microsoft's response to Google's RingAttention.
StarCoder: may the source be with you!
Paper • 2305.06161 • Published • 30Note May the Source be with you :>
When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method
Paper • 2402.17193 • Published • 23Note Updated Version of Fine-Tuning guide from Google.
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Paper • 2305.10601 • Published • 11Note Prompt-based deliberate reasoning from Princeton & DeepMind: tree search (breadth-first / depth-first) over atomic thoughts, each step controlled with propose and value prompts. Essentially these guys are doing chess here (sketch below).
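A minimal BFS-over-thoughts sketch of that recipe; `propose_thoughts` and `score_thought` are hypothetical stand-ins for LLM calls driven by the propose and value prompts:

```python
def tree_of_thoughts_bfs(problem: str, propose_thoughts, score_thought,
                         breadth: int = 5, beam: int = 3, depth: int = 3):
    frontier = [""]  # partial reasoning traces
    for _ in range(depth):
        candidates = []
        for state in frontier:
            for thought in propose_thoughts(problem, state, n=breadth):
                new_state = state + "\n" + thought
                candidates.append((score_thought(problem, new_state), new_state))
        # Keep only the most promising partial solutions (BFS with beam pruning).
        candidates.sort(key=lambda x: x[0], reverse=True)
        frontier = [s for _, s in candidates[:beam]]
    return frontier[0]
```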
Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping
Paper • 2402.14083 • Published • 47Note "...the presented experiments use token sequences that are significantly longer than the sequences used to train LLMs such as Llama 2..." More compute than what they put into Llama 2 is now poured into training on... a maze game. This simply can't be the way towards reasoning capacity, guys.
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
Paper • 2402.17177 • Published • 88Note For obvious reasons, we have to read this one.
Online normalizer calculation for softmax
Paper • 1805.02867 • Published • 1Note This is the foundation of the modern accelerated attention mechanism. Softmax requires a normalization constant to be computed in order to get the normalized attention weights for each value. A simple but powerful observation: the global normalization constant can be assembled from two sub-chunks' statistics, so chunks can be computed in parallel and merged in a final step. This accelerates attention computation and leads to efficient attention mechanisms (sketch below).
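A toy version of that merge rule, assuming a 1-D score vector split into two chunks; each chunk keeps a running (max, normalizer) pair and the pairs merge exactly:

```python
import numpy as np

def chunk_stats(x):
    m = x.max()
    return m, np.exp(x - m).sum()

def merge_stats(m1, s1, m2, s2):
    m = max(m1, m2)
    # Rescale each partial sum to the shared maximum before adding.
    return m, s1 * np.exp(m1 - m) + s2 * np.exp(m2 - m)

x = np.random.randn(8)
m_a, s_a = chunk_stats(x[:4])
m_b, s_b = chunk_stats(x[4:])
m, s = merge_stats(m_a, s_a, m_b, s_b)
# The merged statistics reproduce the full softmax exactly.
assert np.allclose(np.exp(x - m) / s, np.exp(x) / np.exp(x).sum())
```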
Self-attention Does Not Need O(n^2) Memory
Paper • 2112.05682 • Published • 3Note Similar idea to the online-softmax paper: parallelism is achieved by breaking the sequence length (number of queries) into chunks. Because the softmax statistics can be merged, such chunking is exactly equivalent to the full attention mechanism, while allowing speed-ups on parallel hardware like GPUs/TPUs.
Blockwise Parallel Transformer for Long Context Large Models
Paper • 2305.19370 • Published • 3Note A rewrite of the memory-efficient attention mechanism: queries, keys and values are all separated into blocks along their sequence-length dimension, and results are merged after computing the normalization constants and attention-weighted values on each block in parallel. The paper also observes that the feedforward and residual operations are parallelizable and can be done within each query block: key/value block attention -> merge per query block -> feedforward & residual per query block (see the sketch below).
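The per-block accumulation described above fits in a few lines. A minimal single-head sketch (not the paper's implementation), assuming q, k, v are 2-D [seq, dim] tensors:

```python
import torch

def blockwise_attention(q, k, v, block: int = 128):
    scale = q.shape[-1] ** -0.5
    out = torch.zeros_like(q)
    m = torch.full(q.shape[:-1], float("-inf"))   # running max per query
    denom = torch.zeros(q.shape[:-1])             # running normalizer per query
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        scores = (q @ kb.T) * scale                       # [num_q, block]
        m_new = torch.maximum(m, scores.max(dim=-1).values)
        correction = torch.exp(m - m_new)                 # rescale old accumulators
        p = torch.exp(scores - m_new[:, None])
        denom = denom * correction + p.sum(dim=-1)
        out = out * correction[:, None] + p @ vb
        m = m_new
    return out / denom[:, None]
```

Comparing the output against the plain `torch.softmax(q @ k.T * scale, -1) @ v` gives matching results, which is the equivalence claim behind the memory savings.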
Ring Attention with Blockwise Transformers for Near-Infinite Context
Paper • 2310.01889 • Published • 10Note Sacrifices latency to gain memory efficiency compared to BPT, by using ring communication to restrict communication to one copy per host, instead of passing all parallel computation results into one device (which is very memory hungry). This loosens the memory requirement for achieving higher context length, therefore scaling the context length of the transformer to 10M tokens.
World Model on Million-Length Video And Language With RingAttention
Paper • 2402.08268 • Published • 37Note UC Berkeley rocks
RoFormer: Enhanced Transformer with Rotary Position Embedding
Paper • 2104.09864 • Published • 10Note This is the preliminary technique behind LongRoPE. Rotary position embedding addresses the need for 'relative position embedding' by applying a block-diagonal rotation matrix, rotating each pair of dimensions independently. Microsoft's approach to scaling the attention context is extrapolation / interpolation on top of these rotary embedding vectors (minimal sketch below).
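A minimal sketch of the rotation itself (interleaved-pair convention), not Microsoft's extrapolation scheme:

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # x: [seq_len, dim] with even dim; each pair of dimensions is rotated by an angle
    # that grows linearly with position, so q.k dot products depend only on relative offsets.
    seq_len, dim = x.shape
    inv_freq = base ** (-torch.arange(0, dim, 2).float() / dim)        # [dim/2]
    angles = torch.arange(seq_len).float()[:, None] * inv_freq[None]   # [seq_len, dim/2]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    rotated = torch.empty_like(x)
    rotated[:, 0::2] = x1 * cos - x2 * sin
    rotated[:, 1::2] = x1 * sin + x2 * cos
    return rotated
```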
Instruction-tuned Language Models are Better Knowledge Learners
Paper • 2402.12847 • Published • 25
DoRA: Weight-Decomposed Low-Rank Adaptation
Paper • 2402.09353 • Published • 26Note Rotation carries much more entropy than scaling, is much friendlier to composition, and is less prone to exploding / vanishing values. Rotating the neural network parameters just seems much more important than scaling them. Eventually people might just converge towards binary parameters, where you do not need to scale anything (decomposition sketch below).
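A hedged sketch of the magnitude/direction decomposition (W' = m · (W0 + BA) / ||W0 + BA||, column-wise norm), assuming a frozen linear weight; details like initialization follow the paper only loosely:

```python
import torch
import torch.nn as nn

class DoRALinear(nn.Module):
    def __init__(self, weight: torch.Tensor, rank: int = 8):
        super().__init__()
        out_dim, in_dim = weight.shape
        self.weight = nn.Parameter(weight, requires_grad=False)    # frozen W0
        self.m = nn.Parameter(weight.norm(dim=0))                  # per-column magnitude
        self.A = nn.Parameter(torch.randn(rank, in_dim) * 0.01)    # LoRA factors update
        self.B = nn.Parameter(torch.zeros(out_dim, rank))          # only the direction

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        direction = self.weight + self.B @ self.A
        direction = direction / direction.norm(dim=0, keepdim=True)  # unit columns
        return x @ (direction * self.m[None, :]).T
```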
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
Paper • 2402.17764 • Published • 603Note Confirmed to be real by open-source people -- learn this as the new bible if you wish to pre-train your own model. The linear layers of any ViT / Transformer can be replaced by BitNet layers with ternary parameters and achieve similar performance via quantization-aware training. Re-implementation with MoE-BitNet too: https://github.com/fangyuan-ksgk/1bitNet So is there free lunch? Or is float16 just too redundant with so many layers? (Toy quantizer below.)
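A toy version of the absmean ternary quantizer the paper describes; the straight-through estimator used in actual quantization-aware training is omitted here:

```python
import torch

def absmean_ternary(w: torch.Tensor, eps: float = 1e-5):
    # Scale by the mean absolute value, then round and clip to {-1, 0, +1}.
    scale = w.abs().mean().clamp(min=eps)
    w_q = (w / scale).round().clamp(-1, 1)
    return w_q, scale   # the forward pass uses w_q * scale in place of w
```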
Scalable Diffusion Models with Transformers
Paper • 2212.09748 • Published • 16Note UC Berkeley rocks
Linear Alignment: A Closed-form Solution for Aligning Human Preferences without Tuning and Feedback
Paper • 2401.11458 • Published • 2Note This work is similar to ICDPO, trying to utilize the knowledge inside the LLM through Contrastive Decoding (CD). Linear extrapolation is adopted to shift the logits along the alignment direction, calculated as the difference in logits between a preference-principle-prompted LLM and a plain LLM (sketch below).
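A minimal sketch of the logit-extrapolation idea; the exact weighting used in the paper may differ:

```python
import torch

def aligned_logits(plain_logits: torch.Tensor,
                   principled_logits: torch.Tensor,
                   alpha: float = 1.0) -> torch.Tensor:
    # alpha = 0 recovers the plain model; larger alpha extrapolates further along the
    # alignment direction given by the principle-prompted vs. plain forward passes.
    return plain_logits + alpha * (principled_logits - plain_logits)
```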
Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models
Paper • 2403.03003 • Published • 9
NeuroPrompts: An Adaptive Framework to Optimize Prompts for Text-to-Image Generation
Paper • 2311.12229 • Published • 26
How Far Are We from Intelligent Visual Deductive Reasoning?
Paper • 2403.04732 • Published • 18
Trial and Error: Exploration-Based Trajectory Optimization for LLM Agents
Paper • 2403.02502 • Published • 3
Common 7B Language Models Already Possess Strong Math Capabilities
Paper • 2403.04706 • Published • 16
CLLMs: Consistency Large Language Models
Paper • 2403.00835 • Published • 3Note Essentially an LLM can be viewed as a perplexity-score refiner: given any garbage, nonsensical sentence, it can re-predict a better one with simple causal masking, and such causal-masked prediction can naturally be carried out in parallel. The catch is that it may require too many iterations, or lose accuracy because it is conditioning on garbage, so fine-tuning tricks are applied to force out more performance.
DeLLMa: A Framework for Decision Making Under Uncertainty with Large Language Models
Paper • 2402.02392 • Published • 4
Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks
Paper • 1602.07868 • Published • 2Note OG work from OpenAI: decouple weight magnitude & direction for learning. Everything boils down to one equation (shown below).
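The one equation, w = g · v / ||v||, sketched for a single output unit; PyTorch ships the layer-level version as torch.nn.utils.weight_norm:

```python
import torch

def weight_norm_forward(x: torch.Tensor, v: torch.Tensor, g: torch.Tensor) -> torch.Tensor:
    # Decouple magnitude (g, a learned scalar) from direction (v / ||v||).
    w = g * v / v.norm()
    return x @ w   # x: [batch, in_dim], v: [in_dim] -> output: [batch]
```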
AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation
Paper • 2312.13010 • Published • 4
Code Generation with AlphaCodium: From Prompt Engineering to Flow Engineering
Paper • 2401.08500 • Published • 5
Hyena Hierarchy: Towards Larger Convolutional Language Models
Paper • 2302.10866 • Published • 7Note The Hyena operator utilizes long convolutions to achieve better performance than the attention operator at a fraction of the cost, particularly with longer contexts. Interestingly, the convolution filter is controlled by the position of the token, which also allows for a structured window to control the attention-like mechanism. This enables behaviors such as decay-through-time, which more closely resembles human attention.
Fine Tuning vs. Retrieval Augmented Generation for Less Popular Knowledge
Paper • 2403.01432 • Published • 2
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Paper • 2403.05530 • Published • 60Note something like 'we use python'
Adaptive Skeleton Graph Decoding
Paper • 2402.12280 • Published • 2Note parallel decoding for parallel thoughts, sequential decoding for serial deduction process
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Horizon Generation
Paper • 2403.05313 • Published • 9Note CoT but with step-wise RAG (bit crazy but multi-processing allows parallel RAG implementation, so why not?)
QLoRA: Efficient Finetuning of Quantized LLMs
Paper • 2305.14314 • Published • 45Note Trade more computation for less memory. The extra computation involves dequantizing the quantized weight matrices and propagating gradients into the low-rank adapter matrices. Less memory means no need to store the pre-trained float16 weight matrix, and no need to store gradients for the pre-trained weights. There is no free lunch. BTW, fine-tuning an X GB model requires roughly 4X GB of memory to store weights and gradients; LoRA gets to ~X GB by using adapters, and QLoRA achieves ~X/N GB with quantization (typical stack sketched below).
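A hedged sketch of a typical QLoRA setup with the transformers / peft / bitsandbytes stack; the model name and hyperparameters are illustrative, not taken from the paper:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store frozen base weights in 4-bit NF4
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bf16 for the matmuls
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", quantization_config=bnb_config)
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)         # only the small adapters carry gradients
model.print_trainable_parameters()
```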
Is Cosine-Similarity of Embeddings Really About Similarity?
Paper • 2403.05440 • Published • 3
Language Agents as Optimizable Graphs
Paper • 2402.16823 • Published • 3Note work from KAUST optimizes Graph-based Agent cooperation
Stealing Part of a Production Language Model
Paper • 2403.06634 • Published • 90Note How to hack OpenAI
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
Paper • 2403.03507 • Published • 182Note Pre-train on a 24GB GPU !! We know the LoRA part; the 'Ga' is Gradient: the gradients, not the weights, are projected into a low-rank subspace.
BitNet: Scaling 1-bit Transformers for Large Language Models
Paper • 2310.11453 • Published • 96Note Predecessor of the era-of-1-bit-LLMs paper; here they actually scale everything down to 1 bit, and even the activations get quantized. One could argue that quantizing the activations is what hurts the performance of these quantization-aware trained models.
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
Paper • 2310.11511 • Published • 74
JudgeLM: Fine-tuned Large Language Models are Scalable Judges
Paper • 2310.17631 • Published • 33
Learning From Mistakes Makes LLM Better Reasoner
Paper • 2310.20689 • Published • 28
RLVF: Learning from Verbal Feedback without Overgeneralization
Paper • 2402.10893 • Published • 10Note In-context RL:
Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought
Paper • 2403.05518 • Published • 2
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
Paper • 2403.06764 • Published • 25
Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning
Paper • 2012.13255 • Published • 3Note This is the foundation of modern fine-tuning & quantization, and the experimental discovery here relates closely to theory of mind. Essentially, with more parameters the model becomes more reluctant to learn in a high intrinsic dimension, since a lower-dimensional representation suffices given the sheer number of parameters. As a result, the model is always trying to minimize 'surprise' while reducing 'energy' -- in this case, the intrinsic dimension.
Towards General Computer Control: A Multimodal Agent for Red Dead Redemption II as a Case Study
Paper • 2403.03186 • Published • 5
Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM
Paper • 2403.07816 • Published • 39Note Branch: copy the base LLM multiple times. Train: shard the training data into multiple domain-specific subsets, and train each LLM in parallel (zero communication required). Mix: merge all domain-specific LLMs into a MoE model (token-level routing + multiple FFN layers), then MoE-finetune this model to adapt to the full training dataset. Interesting discussion of the dying-expert effect (routing weights collapse to zero) and the 'load balancing' trick as a simple fix by adding a balancing loss (sketch below).
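A sketch of a standard Switch-Transformer-style auxiliary balancing loss, which is the usual form of the fix mentioned above (the paper's exact loss may differ): it penalizes mismatch between each expert's routed token fraction and its average router probability.

```python
import torch

def load_balancing_loss(router_logits: torch.Tensor, num_experts: int) -> torch.Tensor:
    # router_logits: [num_tokens, num_experts]
    probs = torch.softmax(router_logits, dim=-1)
    assignments = torch.nn.functional.one_hot(probs.argmax(dim=-1), num_experts).float()
    tokens_per_expert = assignments.mean(dim=0)   # fraction of tokens routed to each expert
    prob_per_expert = probs.mean(dim=0)           # average routing probability per expert
    return num_experts * (tokens_per_expert * prob_per_expert).sum()
```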
Unfamiliar Finetuning Examples Control How Language Models Hallucinate
Paper • 2403.05612 • Published • 3
PlanGPT: Enhancing Urban Planning with Tailored Language Model and Efficient Retrieval
Paper • 2402.19273 • Published • 3
Unleashing Cognitive Synergy in Large Language Models: A Task-Solving Agent through Multi-Persona Self-Collaboration
Paper • 2307.05300 • Published • 18
Demystifying Embedding Spaces using Large Language Models
Paper • 2310.04475 • Published • 3
Simple and Scalable Strategies to Continually Pre-train Large Language Models
Paper • 2403.08763 • Published • 49
MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries
Paper • 2401.15391 • Published • 6
GiT: Towards Generalist Vision Transformer through Universal Language Interface
Paper • 2403.09394 • Published • 25
DeAL: Decoding-time Alignment for Large Language Models
Paper • 2402.06147 • Published • 7
SELFI: Autonomous Self-Improvement with Reinforcement Learning for Social Navigation
Paper • 2403.00991 • Published • 2
Uni-SMART: Universal Science Multimodal Analysis and Research Transformer
Paper • 2403.10301 • Published • 52
Evolutionary Optimization of Model Merging Recipes
Paper • 2403.13187 • Published • 50Note Goal-oriented model merging with evolutionary search. PS (slerp / DARE -- succeeds) + DFS (passthrough with a scaling adaptor -- fails!). Our re-implementation here: https://github.com/fangyuan-ksgk/Evolutionary-Model-Merge
Reverse Training to Nurse the Reversal Curse
Paper • 2403.13799 • Published • 13
PERL: Parameter Efficient Reinforcement Learning from Human Feedback
Paper • 2403.10704 • Published • 57
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding
Paper • 2403.11481 • Published • 12
DiPaCo: Distributed Path Composition
Paper • 2403.10616 • Published • 12Note 'Pathways' vision at Google & imaginative work from DeepMind: the ML model is a set of decomposable paths, each capable of processing input and producing output. With this model structure, distributed training across different clusters requires little synchronization and communication. Inference with such a model also only requires loading the few offline-routed paths that matter to the task at hand, alleviating the storage requirement of the computation.
Training Neural Networks from Scratch with Parallel Low-Rank Adapters
Paper • 2402.16828 • Published • 3
RAFT: Adapting Language Model to Domain Specific RAG
Paper • 2403.10131 • Published • 67Note Simple yet powerful observation: noise gets included in the RAG pipeline, so fine-tune your LLM against such noise through data augmentation.
Recurrent Drafter for Fast Speculative Decoding in Large Language Models
Paper • 2403.09919 • Published • 20
LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression
Paper • 2403.12968 • Published • 24
SOLAR 10.7B: Scaling Large Language Models with Simple yet Effective Depth Up-Scaling
Paper • 2312.15166 • Published • 56Note Successful data-flow merging: two Mistral 7B models are concatenated with overlap, the overlapped layers get merged, and continual pre-training is done while observing "fast performance recovery". It seems a good LLM stores much more compressed & generalizable information in each layer than we think. Continual pretraining from a stack of these layers might be a good idea for efficiently scaling model size.
Resolving Interference When Merging Models
Paper • 2306.01708 • Published • 13
Blending Is All You Need: Cheaper, Better Alternative to Trillion-Parameters LLM
Paper • 2401.02994 • Published • 48
WARM: On the Benefits of Weight Averaged Reward Models
Paper • 2401.12187 • Published • 18
A Unified Framework for Model Editing
Paper • 2403.14236 • Published • 2
LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement
Paper • 2403.15042 • Published • 25
Detoxifying Large Language Models via Knowledge Editing
Paper • 2403.14472 • Published • 3
GAIA: a benchmark for General AI Assistants
Paper • 2311.12983 • Published • 183
LLM Agent Operating System
Paper • 2403.16971 • Published • 65
The Unreasonable Ineffectiveness of the Deeper Layers
Paper • 2403.17887 • Published • 78
LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning
Paper • 2403.17919 • Published • 16
Long-form factuality in large language models
Paper • 2403.18802 • Published • 24Note Split, then revise each piece to be self-contained so the atomic thought is maintained and the argument can be broken down into coherent small claims. On those smaller claims one can fact-check, reflect, run MCTS, do all sorts of stuff. This little trick opens a door to immense possibility, and is especially useful with their open-sourced code!
Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints
Paper • 2212.05055 • Published • 5
Scaling Expert Language Models with Unsupervised Domain Discovery
Paper • 2303.14177 • Published • 2Note An extension of the Branch-Train-Mix approach from Meta. Multiple copies are made from the base LLM, an unsupervised clustering-based domain discovery (the main difference) is done by traversing the training corpus, and offline task-level routing then ensures each LLM is trained into a domain expert. At inference, the weighted logit predictions from the routed top-K experts are mixed.
Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference
Paper • 2110.03742 • Published • 3Note Google's 2021 work on the multilingual machine translation task. Token-level MoE suffers from having to keep all experts in memory while dynamic token routing activates only a few of them -- this creates redundancy. By assigning each (input language, output language) pair as a separate task, task-level routing (Task-MoE) is shown to outperform Token-MoE on multilingual machine translation, avoiding interference from mixed training.
RouterBench: A Benchmark for Multi-LLM Routing System
Paper • 2403.12031 • Published • 3
Fusing Models with Complementary Expertise
Paper • 2310.01542 • Published • 1
Lumos: Learning Agents with Unified Data, Modular Design, and Open-Source LLMs
Paper • 2311.05657 • Published • 27
Teaching Large Language Models to Reason with Reinforcement Learning
Paper • 2403.04642 • Published • 46Note It seems Meta is not playing RL in the right way; check Quiet-STaR, a significant improvement is possible. Again, one potential reason is that the thought-generation process here is not PARALLEL. As Matt puts it, the thinking process is fundamentally parallel, while the saying process isn't. So should the RL be done on the thoughts, and not on the greedy-1 decoding path? Need to dig into it.
Model Stock: All we need is just a few fine-tuned models
Paper • 2403.19522 • Published • 10
Advancing LLM Reasoning Generalists with Preference Trees
Paper • 2404.02078 • Published • 44
Octopus v2: On-device language model for super agent
Paper • 2404.01744 • Published • 57Note LLM OS == fine-tuning on a specific sequence of actions within a constrained environment == a few-billion-parameter agent?
Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity
Paper • 2403.14403 • Published • 6
BLADE: Enhancing Black-box Large Language Models with Small Domain-Specific Models
Paper • 2403.18365 • Published • 2
LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model
Paper • 2404.01331 • Published • 25
Mixture-of-Depths: Dynamically allocating compute in transformer-based language models
Paper • 2404.02258 • Published • 104Note 'Skipping' transformer blocks yields performance on par with sequential propagation without skipping. This is essentially a 'hard' residual, as opposed to the 'soft' residual in ResNet (sketch below).
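A toy sketch of the 'hard residual' routing, assuming a per-token scalar router and treating `block` as the transformer block applied only to the selected tokens; the paper's exact router weighting and capacity schedule may differ:

```python
import torch
import torch.nn as nn

class MoDBlock(nn.Module):
    def __init__(self, dim: int, block: nn.Module, capacity: float = 0.125):
        super().__init__()
        self.router = nn.Linear(dim, 1)   # scores each token
        self.block = block
        self.capacity = capacity          # fraction of tokens that get compute

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [seq_len, dim]
        scores = self.router(x).squeeze(-1)                 # [seq_len]
        k = max(1, int(self.capacity * x.shape[0]))
        idx = scores.topk(k).indices                        # top-k tokens get processed
        weights = scores[idx].sigmoid().unsqueeze(-1)       # router stays differentiable
        out = x.clone()                                     # everyone else skips the block
        out[idx] = x[idx] + weights * self.block(x[idx])
        return out
```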
Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models
Paper • 2404.03622 • Published • 4
ReFT: Representation Finetuning for Language Models
Paper • 2404.03592 • Published • 90Note Given a pre-trained LLM, the input sequence is represented in latent space across transformer blocks. These representations can be intervened on to 'control' the model output, with higher efficiency than modifying model weights. Representation fine-tuning uses learnable representation interventions for fine-tuning, achieving a 10x-50x efficiency boost compared to LoRA & DoRA. Compositional learning is done via mutually exclusive interventions on separate positions & layers.
Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences
Paper • 2404.03715 • Published • 60Note Iterative DPO. The difference from self-rewarding LLMs is that it uses a strong LLM to RANK the generated responses, making it much more efficient at 'aligning' the small / weak model towards the strong model. Moreover, since evaluation is simpler than generation, one could even surpass GPT-4 performance with such iteration schemes.
Model Editing with Canonical Examples
Paper • 2402.06155 • Published • 11
Model Editing Can Hurt General Abilities of Large Language Models
Paper • 2401.04700 • Published • 3
Let's Focus on Neuron: Neuron-Level Supervised Fine-tuning for Large Language Model
Paper • 2403.11621 • Published • 2
LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders
Paper • 2404.05961 • Published • 64
Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence
Paper • 2404.05892 • Published • 31
CodecLM: Aligning Language Models with Tailored Synthetic Data
Paper • 2404.05875 • Published • 16
RS-DPO: A Hybrid Rejection Sampling and Direct Preference Optimization Method for Alignment of Large Language Models
Paper • 2402.10038 • Published • 6
Context versus Prior Knowledge in Language Models
Paper • 2404.04633 • Published • 5
WILBUR: Adaptive In-Context Learning for Robust and Accurate Web Agents
Paper • 2404.05902 • Published • 20
Best Practices and Lessons Learned on Synthetic Data for Language Models
Paper • 2404.07503 • Published • 29Note Position paper on data augmentation approaches adopted in LLM training. Real data is expensive to get and will run out very quickly, and most public benchmark datasets are already inside the training corpora of many LLMs. One interesting point: replacing natural language labels with random labels improves model performance -- the classic 'increase difficulty to improve generalization' trick.
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies
Paper • 2404.06395 • Published • 21
Evaluating Mathematical Reasoning Beyond Accuracy
Paper • 2404.05692 • Published • 2
Rho-1: Not All Tokens Are What You Need
Paper • 2404.07965 • Published • 84
Aligner: Achieving Efficient Alignment through Weak-to-Strong Correction
Paper • 2402.02416 • Published • 4Note Residual connections have proven to be the right way to compose neural network layers. This work from Peking University shows the same trick can be used to compose LLMs (not surprising at all). The image reminds me of Isaac Newton's quote: "If I have seen further, it is by standing on the shoulders of giants."
Stream of Search (SoS): Learning to Search in Language
Paper • 2404.03683 • Published • 25
THOUGHTSCULPT: Reasoning with Intermediate Revision and Search
Paper • 2404.05966 • Published • 2
Autonomous Evaluation and Refinement of Digital Agents
Paper • 2404.06474 • Published • 1
Symbol tuning improves in-context learning in language models
Paper • 2305.08298 • Published • 3Note This work touches on a fundamental issue of ML model training: the model always picks the shortcut. To predict natural language labels, the shortcut is to check semantic similarity. But what if we replace the natural language labels with random ones? Then the model is forced to 'learn' what it should do, and then do it.
TransformerFAM: Feedback attention is working memory
Paper • 2404.09173 • Published • 43
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Paper • 2401.10774 • Published • 54
From r to Q^*: Your Language Model is Secretly a Q-Function
Paper • 2404.12358 • Published • 2
Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing
Paper • 2404.12253 • Published • 53Note Not the right way to go. Model would reinforce itself.
TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
Paper • 2404.11912 • Published • 16
Reuse Your Rewards: Reward Model Transfer for Zero-Shot Cross-Lingual Alignment
Paper • 2404.12318 • Published • 14
Compression Represents Intelligence Linearly
Paper • 2404.09937 • Published • 27Note Equivalence between optimal compression and the perplexity score. An LLM encodes rare sequences with more bits and common sequences with fewer; the fewer bits required, the more compressed the encoding. The pre-training loss is precisely modeling this compression efficiency, and the paper identifies a linear correspondence between compression and intelligence (conversion sketch below).
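A quick conversion from pre-training loss to code length, which is the quantity the paper correlates with benchmark scores; the numbers below are made up purely for illustration:

```python
import math

def bits_per_byte(loss_nats_per_token: float, num_tokens: int, num_bytes: int) -> float:
    # Cross-entropy in nats/token is the expected code length of an arithmetic coder
    # driven by the LLM; convert nats -> bits and normalize by raw byte count.
    total_bits = loss_nats_per_token * num_tokens / math.log(2)
    return total_bits / num_bytes

# e.g. loss of 2.0 nats/token over 1000 tokens covering 4000 bytes of raw text
print(bits_per_byte(2.0, 1000, 4000))   # ~0.72 bits per byte
```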
Many-Shot In-Context Learning
Paper • 2404.11018 • Published • 4Note Scale the amount of few-shot prompting up massively (>500k). Interesting discovery: "Unsupervised ICL", which presents problems rather than problem-solution pairs, achieves on-par performance with question-answer prompting. We need to present enough context to the LLM so that its domain-specific knowledge can be revealed.
Align Your Steps: Optimizing Sampling Schedules in Diffusion Models
Paper • 2404.14507 • Published • 21
SnapKV: LLM Knows What You are Looking for Before Generation
Paper • 2404.14469 • Published • 23
Achieving >97% on GSM8K: Deeply Understanding the Problems Makes LLMs Perfect Reasoners
Paper • 2404.14963 • Published • 2Note The observation is that LLM errors mostly occur because the model fails to understand the question. Having the LLM rephrase the question gives it an explicit requirement to address this; combining question rephrasing, information extraction, and then step-by-step reasoning leads to higher accuracy. More compute, more accuracy.
LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding
Paper • 2404.16710 • Published • 74
NExT: Teaching Large Language Models to Reason about Code Execution
Paper • 2404.14662 • Published • 4Note Getting an error message and re-programming is generally what happens when we code. However, that is not exactly it: the error message helps us locate the exact place where the error occurs, which helps break down the code issue and lets us trace it back. This paper sort of does that.
Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization
Paper • 2405.15071 • Published • 37