Attention is NOT all you need: Qwerky-72B trained using only 8 AMD MI300X GPUs
(substack.recursal.ai)
19 points
by: jtatarchuk
1 day ago
3 comments
inhumantsar
1 day ago
> At a high level, you take an existing transformer model, freeze all the weights, delete the attention layer, replace it with RWKV, and train it through multiple stages.
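The quoted pipeline — freeze the pretrained weights, swap attention for a recurrent mixer, then train only the replacement — can be sketched in PyTorch. This is a toy illustration under stated assumptions: `ToyAttention`, `ToyRecurrentMix`, and `Block` are hypothetical stand-ins, and the recurrent mixer is a simple decaying-state layer, not RWKV's actual time-mix.

```python
import torch
import torch.nn as nn

class ToyAttention(nn.Module):
    """Stand-in for the pretrained model's attention layer."""
    def __init__(self, d):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, num_heads=2, batch_first=True)

    def forward(self, x):
        out, _ = self.attn(x, x, x)
        return out

class ToyRecurrentMix(nn.Module):
    """Linear-cost replacement: a per-channel decaying state carried
    step to step, instead of pairwise attention over all tokens."""
    def __init__(self, d):
        super().__init__()
        self.decay = nn.Parameter(torch.zeros(d))
        self.proj = nn.Linear(d, d)

    def forward(self, x):
        B, T, D = x.shape
        state = torch.zeros(B, D)
        w = torch.sigmoid(self.decay)  # per-channel decay in (0, 1)
        outs = []
        for t in range(T):
            state = w * state + (1 - w) * x[:, t]
            outs.append(self.proj(state))
        return torch.stack(outs, dim=1)

class Block(nn.Module):
    """One transformer block: token mixer + feed-forward, both residual."""
    def __init__(self, d):
        super().__init__()
        self.mixer = ToyAttention(d)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):
        x = x + self.mixer(x)
        return x + self.ffn(x)

d = 8
block = Block(d)

# 1. Freeze everything inherited from the pretrained model.
for p in block.parameters():
    p.requires_grad = False

# 2. Delete the attention layer and drop in the recurrent mixer.
#    Its freshly initialized parameters are the only trainable ones.
block.mixer = ToyRecurrentMix(d)

trainable = [n for n, p in block.named_parameters() if p.requires_grad]
```

After the swap, only `mixer.*` parameters appear in `trainable`, so a subsequent training stage updates just the new layer while the frozen FFN (and embeddings, in a full model) preserve what the transformer already learned.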
Maybe I'm misunderstanding, but it seems like they still depend on an attention-based model to train their model? While it's interesting and would definitely enable the personalized AI they seem to be going for, I don't really see how they can say that it's not based on an attention architecture. Someone still needs to shell out the big bucks to train that teacher model, no?
pico_creator
1 day ago
(original article author)
I view it more as a shortcut. We have trained 7B and 14B models from scratch, matching transformer performance with similarly sized datasets.
This has even been shown to slightly outperform the transformer scaling law in the training we've done from 1B to 14B, and we expect it to keep doing so as we scale.
However, as of this point, settling that debate for good at 72B scale is a $5 million bill. So for now, we use the shortcut to show that it actually works, and put that money toward iterating on and improving the architecture faster.
inhumantsar
1 day ago
Thanks for the explanation! Sounds pretty exciting. I'll keep my eyes peeled for the paper.