LINEAR ATTENTION MODEL PERFORMANCE

Linear-Next: The Definitive Linear Attention Model Leaderboard

Comprehensive benchmarks and rankings for the latest linear-attention, sparse-attention, and hybrid models. Compare performance across key metrics to find the model that best fits your needs.

Training Framework

Flame: Flash Linear Attention Made Easy

Welcome to flame, a minimal and efficient framework built on torchtitan for training Flash Linear Attention (FLA) models with blazing efficiency.

Feature Highlights:

  • 🚀 Minimal, easy-to-use, extensible training framework
  • 😊 Seamless integration with fla and transformers (see the sketch below)
  • 🔄 Zero-cost data preprocessing: online tokenization, dataset shuffling, and support for multiple datasets
  • 🔮 4D parallelism (coming soon)
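
As a small illustration of the fla and transformers integration, the sketch below builds a default GLA model through the Hugging Face Auto classes, which fla registers its architectures with. The config defaults and dummy inputs are assumptions, and fla's Triton kernels generally expect a CUDA device, so treat this as a sketch rather than a verified recipe.

```python
# Minimal sketch: instantiating an FLA architecture (GLA) through the
# transformers Auto classes that the fla package registers its models with.
# Config defaults and the dummy batch are illustrative assumptions; fla's
# Triton kernels generally require a CUDA GPU.
import torch
from fla.models import GLAConfig          # importing fla registers its models with transformers
from transformers import AutoModelForCausalLM

config = GLAConfig()                      # default GLA architecture
model = AutoModelForCausalLM.from_config(config)

input_ids = torch.randint(0, config.vocab_size, (1, 32))  # dummy token ids
logits = model(input_ids=input_ids).logits
print(logits.shape)                       # (1, 32, vocab_size)
```

Training itself goes through flame's torchtitan-based entry points rather than this snippet; the point here is only that FLA models behave like ordinary transformers causal LMs.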

Training Corpus

General (50%)

  • DCLM-pro
  • Cosmopedia-v2
  • Fineweb-edu

Code (30%)

  • The-stack v2

Math (15%)

  • Finemath-shards

Reasoning (5%)

  • Natural_Reasoning
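
For illustration only, the mixture above can be written down as sampling weights. The identifiers below are the names used in this document rather than exact dataset paths, and this is not flame's actual configuration schema.

```python
# Hypothetical sketch of the corpus mixture as sampling weights.
# Names mirror the document, not exact Hugging Face dataset paths.
CORPUS_MIX = {
    "general":   {"weight": 0.50, "datasets": ["DCLM-pro", "Cosmopedia-v2", "Fineweb-edu"]},
    "code":      {"weight": 0.30, "datasets": ["The-stack v2"]},
    "math":      {"weight": 0.15, "datasets": ["Finemath-shards"]},
    "reasoning": {"weight": 0.05, "datasets": ["Natural_Reasoning"]},
}

# The weights should cover the whole corpus.
assert abs(sum(group["weight"] for group in CORPUS_MIX.values()) - 1.0) < 1e-9
```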

Training Budget

  • Small: 50B tokens
  • Medium: 300B tokens
  • Large: 1T tokens

Token Distribution

  • 74.1% of the combined 1.35T-token budget is allocated to Large models (1T tokens)
  • 22.2% is allocated to Medium models (300B tokens)
  • 3.7% is allocated to Small models (50B tokens)
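
The percentages above follow directly from pooling the three budgets into a combined 1.35T tokens; a quick check:

```python
# Quick arithmetic check of the token-distribution percentages, assuming the
# three budgets are pooled into a single 1.35T-token total.
budgets = {"Small": 50e9, "Medium": 300e9, "Large": 1e12}
total = sum(budgets.values())             # 1.35e12 tokens

for name, tokens in budgets.items():
    print(f"{name}: {100 * tokens / total:.1f}%")
# Small: 3.7%, Medium: 22.2%, Large: 74.1%
```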

Model Size

  • 90M
  • 410M
  • 1B
  • 3B
  • 7B

Scaling from 90M up to 7B parameters

Model Variants

Linear Attention

  • Lightning Attention
  • HGRN 1, HGRN 2
  • cosFormer (1+2)
  • GLA
  • MetaLA
  • DeltaNet Family
  • RWKV7
  • Titan
  • TTT
  • Mamba (1+2)
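
For context, most of the variants above can be read as extensions (gating, decay, delta-rule updates, state-space parameterizations) of the plain linear-attention recurrence sketched below; the identity feature map and the explicit per-step loop are simplifying assumptions for readability.

```python
# Plain (ungated, unnormalized) linear-attention recurrence, written as an
# explicit per-step loop for clarity. Real implementations use chunked,
# hardware-efficient kernels instead of this Python loop.
import torch

def linear_attention(q, k, v):
    """q, k, v: (seq_len, d) tensors. Returns outputs of shape (seq_len, d)."""
    d = q.shape[-1]
    state = torch.zeros(d, d)                    # running sum of k_t v_t^T
    outputs = []
    for q_t, k_t, v_t in zip(q, k, v):
        state = state + torch.outer(k_t, v_t)    # constant-size state update
        outputs.append(q_t @ state)              # o_t = q_t^T S_t
    return torch.stack(outputs)

q, k, v = (torch.randn(8, 16) for _ in range(3))
print(linear_attention(q, k, v).shape)           # torch.Size([8, 16])
```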

Sparse Attention

  • NSA
  • MoBA

Hybrid

  • MiniMax-01
  • ...

Benchmarks

Knowledge QA

  • MMLU
  • MMLU-Pro
  • SimpleQA

General Reasoning

  • GPQA
  • BBH
  • HellaSwag
  • KOR-Bench

Math

  • GSM8K
  • DROP
  • MATH

Code

  • HumanEval
  • LiveCodeBench

Long Context

  • NIAH
  • HELMET
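
As a rough illustration of how a checkpoint might be scored on a subset of these benchmarks, the sketch below uses EleutherAI's lm-evaluation-harness. This is not necessarily the pipeline behind this leaderboard, the checkpoint name is hypothetical, and task names or arguments can differ across harness versions.

```python
# Illustrative only: evaluating a (hypothetical) checkpoint on a few of the
# benchmarks listed above with lm-evaluation-harness. Not necessarily the
# pipeline behind this leaderboard; arguments may vary by version.
import lm_eval
from lm_eval.models.huggingface import HFLM

lm = HFLM(pretrained="org/linear-attn-1b-example", batch_size=8)  # hypothetical repo
results = lm_eval.simple_evaluate(
    model=lm,
    tasks=["mmlu", "hellaswag", "gsm8k"],   # a subset of the benchmarks above
)
print(results["results"])
```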