stereoplegic's Collections
LLM architecture
The Impact of Depth and Width on Transformer Language Model Generalization
Paper • 2310.19956 • Published • 9
Retentive Network: A Successor to Transformer for Large Language Models
Paper • 2307.08621 • Published • 170
RWKV: Reinventing RNNs for the Transformer Era
Paper • 2305.13048 • Published • 14
Attention Is All You Need
Paper • 1706.03762 • Published • 44
READ: Recurrent Adaptation of Large Transformers
Paper • 2305.15348 • Published • 2
Emergent Mixture-of-Experts: Can Dense Pre-trained Transformers Benefit from Emergent Modular Structures?
Paper • 2310.10908 • Published • 1
Parameter-Efficient Mixture-of-Experts Architecture for Pre-trained Language Models
Paper • 2203.01104 • Published • 2
Scaling Pre-trained Language Models to Deeper via Parameter-efficient Architecture
Paper • 2303.16753 • Published • 1
GateLoop: Fully Data-Controlled Linear Recurrence for Sequence Modeling
Paper • 2311.01927 • Published • 1
White-Box Transformers via Sparse Rate Reduction
Paper • 2306.01129 • Published • 1
Improving Transformers with Probabilistic Attention Keys
Paper • 2110.08678 • Published • 1
Wide Attention Is The Way Forward For Transformers?
Paper • 2210.00640 • Published • 1
Architecture Matters in Continual Learning
Paper • 2202.00275 • Published • 1
Scaling TransNormer to 175 Billion Parameters
Paper • 2307.14995 • Published • 21
FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores
Paper • 2311.05908 • Published • 12
Hiformer: Heterogeneous Feature Interactions Learning with Transformers for Recommender Systems
Paper • 2311.05884 • Published • 5
AutoML in the Age of Large Language Models: Current Challenges, Future Opportunities and Risks
Paper • 2306.08107 • Published • 1
Continual Learning with Dependency Preserving Hypernetworks
Paper • 2209.07712 • Published • 1
Are We Falling in a Middle-Intelligence Trap? An Analysis and Mitigation of the Reversal Curse
Paper • 2311.07468 • Published • 1
Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers
Paper • 2311.10642 • Published • 23
Trellis Networks for Sequence Modeling
Paper • 1810.06682 • Published • 1
ProSG: Using Prompt Synthetic Gradients to Alleviate Prompt Forgetting of RNN-like Language Models
Paper • 2311.01981 • Published • 1
Exponentially Faster Language Modelling
Paper • 2311.10770 • Published • 118
Replacing softmax with ReLU in Vision Transformers
Paper • 2309.08586 • Published • 17
Transformer Language Models without Positional Encodings Still Learn Positional Information
Paper • 2203.16634 • Published • 5
Maestro: Uncovering Low-Rank Structures via Trainable Decomposition
Paper • 2308.14929 • Published • 1
Robust low-rank training via approximate orthonormal constraints
Paper • 2306.01485 • Published • 1
Low Rank Optimization for Efficient Deep Learning: Making A Balance between Compact Architecture and Fast Training
Paper • 2303.13635 • Published • 1
Cuttlefish: Low-Rank Model Training without All the Tuning
Paper • 2305.02538 • Published • 1
Relaxed Attention for Transformer Models
Paper • 2209.09735 • Published • 1
I3D: Transformer architectures with input-dependent dynamic depth for speech recognition
Paper • 2303.07624 • Published • 1
Emergence of Segmentation with Minimalistic White-Box Transformers
Paper • 2308.16271 • Published • 13
White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is?
Paper • 2311.13110 • Published • 1
Linear Self-Attention Approximation via Trainable Feedforward Kernel
Paper • 2211.04076 • Published • 1
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Paper • 2312.00752 • Published • 138
Decoder-only Architecture for Speech Recognition with CTC Prompts and Text Data Augmentation
Paper • 2309.08876 • Published • 1
Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models
Paper • 2312.04410 • Published • 14
HyperMixer: An MLP-based Low Cost Alternative to Transformers
Paper • 2203.03691 • Published • 1
Interpret Vision Transformers as ConvNets with Dynamic Convolutions
Paper • 2309.10713 • Published • 1
Laughing Hyena Distillery: Extracting Compact Recurrences From Convolutions
Paper • 2310.18780 • Published • 3
LLM360: Towards Fully Transparent Open-Source LLMs
Paper • 2312.06550 • Published • 57
Text Generation with Diffusion Language Models: A Pre-training Approach with Continuous Paragraph Denoise
Paper • 2212.11685 • Published • 2
AR-Diffusion: Auto-Regressive Diffusion Model for Text Generation
Paper • 2305.09515 • Published • 2
TESS: Text-to-Text Self-Conditioned Simplex Diffusion
Paper • 2305.08379 • Published • 1
DiffuSeq: Sequence to Sequence Text Generation with Diffusion Models
Paper • 2210.08933 • Published • 5
DiffuSIA: A Spiral Interaction Architecture for Encoder-Decoder Text Diffusion
Paper • 2305.11517 • Published • 1
Diffusion Language Models Can Perform Many Tasks with Scaling and Instruction-Finetuning
Paper • 2308.12219 • Published • 1
Likelihood-Based Diffusion Language Models
Paper • 2305.18619 • Published • 1
ParaGuide: Guided Diffusion Paraphrasers for Plug-and-Play Textual Style Transfer
Paper • 2308.15459 • Published • 1
SSD-LM: Semi-autoregressive Simplex-based Diffusion Language Model for Text Generation and Modular Control
Paper • 2210.17432 • Published • 1
Self-conditioned Embedding Diffusion for Text Generation
Paper • 2211.04236 • Published • 1
Cached Transformers: Improving Transformers with Differentiable Memory Cache
Paper • 2312.12742 • Published • 12
SwitchGPT: Adapting Large Language Models for Non-Text Outputs
Paper • 2309.07623 • Published • 1
Learning to Skip for Language Modeling
Paper • 2311.15436 • Published • 1
Zoology: Measuring and Improving Recall in Efficient Language Models
Paper • 2312.04927 • Published • 2
Beyond Surface: Probing LLaMA Across Scales and Layers
Paper • 2312.04333 • Published • 18
DeLighT: Deep and Light-weight Transformer
Paper • 2008.00623 • Published • 1
Leveraging Contextual Information for Effective Entity Salience Detection
Paper • 2309.07990 • Published • 7
Block-State Transformers
Paper • 2306.09539 • Published • 9
Blockwise Parallel Transformer for Long Context Large Models
Paper • 2305.19370 • Published • 3
Block-Recurrent Transformers
Paper • 2203.07852 • Published • 1
Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model
Paper • 2401.09417 • Published • 59
RNNs of RNNs: Recursive Construction of Stable Assemblies of Recurrent Neural Networks
Paper • 2106.08928 • Published • 1
LKCA: Large Kernel Convolutional Attention
Paper • 2401.05738 • Published • 1
InfoDiffusion: Information Entropy Aware Diffusion Process for Non-Autoregressive Text Generation
Paper • 2310.11976 • Published • 2
Enhancing Phrase Representation by Information Bottleneck Guided Text Diffusion Process for Keyphrase Extraction
Paper • 2308.08739 • Published • 1
Gated Linear Attention Transformers with Hardware-Efficient Training
Paper • 2312.06635 • Published • 6
Hierarchically Gated Recurrent Neural Network for Sequence Modeling
Paper • 2311.04823 • Published • 2
Improving Natural Language Capability of Code Large Language Model
Paper • 2401.14242 • Published • 1
BlackMamba: Mixture of Experts for State-Space Models
Paper • 2402.01771 • Published • 23
Can Mamba Learn How to Learn? A Comparative Study on In-Context Learning Tasks
Paper • 2402.04248 • Published • 30
Simple Hardware-Efficient Long Convolutions for Sequence Modeling
Paper • 2302.06646 • Published • 2
A Quantitative Review on Language Model Efficiency Research
Paper • 2306.01768 • Published • 1
A Unified View of Long-Sequence Models towards Modeling Million-Scale Dependencies
Paper • 2302.06218 • Published • 1
Accelerating Toeplitz Neural Network with Constant-time Inference Complexity
Paper • 2311.08756 • Published • 1
Agent Attention: On the Integration of Softmax and Linear Attention
Paper • 2312.08874 • Published • 2
Linear Transformers with Learnable Kernel Functions are Better In-Context Models
Paper • 2402.10644 • Published • 79
Enhancing Transformer RNNs with Multiple Temporal Perspectives
Paper • 2402.02625 • Published
ByT5: Towards a token-free future with pre-trained byte-to-byte models
Paper • 2105.13626 • Published • 2
Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
Paper • 2402.19427 • Published • 52
Simple linear attention language models balance the recall-throughput tradeoff
Paper • 2402.18668 • Published • 18
Linear Transformers are Versatile In-Context Learners
Paper • 2402.14180 • Published • 6
DenseMamba: State Space Models with Dense Hidden Connection for Efficient Large Language Models
Paper • 2403.00818 • Published • 15
MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers
Paper • 2305.07185 • Published • 9
OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework
Paper • 2404.14619 • Published • 126
Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length
Paper • 2404.08801 • Published • 63
Various Lengths, Constant Speed: Efficient Language Modeling with Lightning Attention
Paper • 2405.17381 • Published
The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax Mimicry
Paper • 2402.04347 • Published • 13
SVD-LLM: Truncation-aware Singular Value Decomposition for Large Language Model Compression
Paper • 2403.07378 • Published • 3
GoldFinch: High Performance RWKV/Transformer Hybrid with Linear Pre-Fill and Extreme KV-Cache Compression
Paper • 2407.12077 • Published • 54