Every week, the firehose of new AI preprints continues.
There are so many papers of late that the sheer amount of research to keep up with has become a running gag in the ML community. My goal is to cultivate a more intentional practice of reading and engaging with the new ideas I come across, even when that is difficult. I suppose that along the way I’ll also have the opportunity to joke about the golden age of Machine Learning research that we find ourselves in (or era of overheated exuberance, depending on who you ask).
Thus, I’ll try to keep up a weekly series covering some of the newer papers I’m reading, along with a short digest or takeaway for each.
The two papers that I’ve been thinking about this week are:
- Multi-token prediction.
- This one is interesting because multi-token lookahead with transformers seems like one of those ideas that would obviously improve the output of your LLM: intuitively, the autoregressive setup of emitting a single token at a time is akin to a person giving a speech stumbling out one word at a time. I’d be curious what the optimal lookahead for generation is, and whether it could be varied rather than fixed.
The architecture here seems super interesting, and is something that I plan on training with a MinGPT implementation as part of my 100 models challenge (a rough sketch of the idea follows after this list).
- Leave No Context Behind – Infini-attention.
- This was a paper that I enjoyed: it pairs the attention mechanism with a compressive memory system that retains a compressed summary of past KV pairs.
- Overall, one of the things I liked about this paper is that it takes something AI engineers deal with that always feels magical and tacked on, namely LLM context, and uses a compressive memory system to fold it more directly into the model (I’ve also tried to sketch the memory mechanism below).
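Since I plan to train a variant of this, here is a minimal sketch of how I currently understand the multi-token prediction setup: a shared decoder trunk produces hidden states, and several independent output heads each predict the token a different number of steps ahead, with their losses averaged. The names here (`MultiTokenHead`, `n_future`) are my own, and this is a toy illustration rather than the paper’s reference implementation.

```python
# Toy sketch of multi-token prediction heads (my own simplification).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTokenHead(nn.Module):
    """n_future independent linear heads on top of a shared decoder trunk;
    head i predicts the token i steps ahead of each position."""
    def __init__(self, d_model: int, vocab_size: int, n_future: int = 4):
        super().__init__()
        self.n_future = n_future
        self.heads = nn.ModuleList(
            [nn.Linear(d_model, vocab_size) for _ in range(n_future)]
        )

    def forward(self, hidden, targets):
        # hidden:  (batch, seq_len, d_model) hidden states from any decoder trunk
        # targets: (batch, seq_len) token ids aligned with those positions
        loss = 0.0
        for i, head in enumerate(self.heads, start=1):
            # Head i predicts i steps ahead: drop the last i positions of the
            # hidden states and the first i target tokens so they line up.
            logits = head(hidden[:, :-i, :])
            shifted = targets[:, i:]
            loss = loss + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), shifted.reshape(-1)
            )
        return loss / self.n_future

# Toy usage with random "trunk" outputs standing in for a real transformer:
batch, seq_len, d_model, vocab = 2, 16, 64, 100
hidden = torch.randn(batch, seq_len, d_model)
targets = torch.randint(0, vocab, (batch, seq_len))
mtp = MultiTokenHead(d_model, vocab, n_future=4)
print(mtp(hidden, targets))
```

As I understand it, the extra heads are mainly a training signal; at inference you can keep only the next-token head, or use the others for speculative-style decoding.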
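And here is a rough, single-head sketch of the compressive memory piece of Infini-attention as I read it: keys and values from past segments get folded into a fixed-size associative matrix via a linear-attention-style update, queries read from that matrix, and the result is later gated against ordinary local attention (gating omitted here). The function names and the ELU+1 feature map reflect my reading of the paper, so treat this as a toy, not a faithful reproduction.

```python
# Toy, single-head sketch of a compressive KV memory (my own simplification).
import torch
import torch.nn.functional as F

def elu_plus_one(x):
    # Non-negative feature map used for the linear-attention-style memory.
    return F.elu(x) + 1.0

def memory_update(M, z, K, V):
    # M: (d_key, d_value) associative memory, z: (d_key,) normalization term.
    # K: (seq, d_key), V: (seq, d_value) from the current segment.
    sigma_K = elu_plus_one(K)
    M = M + sigma_K.T @ V          # accumulate key/value associations
    z = z + sigma_K.sum(dim=0)     # accumulate the normalizer
    return M, z

def memory_retrieve(M, z, Q):
    # Q: (seq, d_key) -> (seq, d_value) read out of the compressed memory.
    sigma_Q = elu_plus_one(Q)
    return (sigma_Q @ M) / (sigma_Q @ z).unsqueeze(-1).clamp(min=1e-6)

# Toy usage: stream two segments into the memory, then read from it.
d_key, d_value, seg = 32, 32, 8
M = torch.zeros(d_key, d_value)
z = torch.zeros(d_key)
for _ in range(2):
    K, V = torch.randn(seg, d_key), torch.randn(seg, d_value)
    M, z = memory_update(M, z, K, V)
Q = torch.randn(seg, d_key)
A_mem = memory_retrieve(M, z, Q)   # would be gated with local attention
print(A_mem.shape)                 # torch.Size([8, 32])
```

The appeal to me is that the memory stays a fixed size no matter how much context has streamed through it, which is exactly the property that makes long context feel less tacked on.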
What I liked about both papers is that they offer elegant ideas with the potential to radically improve transformer models, yet in retrospect both ideas seem rather simple. I’m curious whether Infini-attention really will pan out in a way that is empirically better than alternatives like state space models, and whether multi-token prediction is a reliably better paradigm than single-token prediction for transformers.
I’ll share more of my experimentation with these in my next post on Training 100 models.