
Dynamic Evaluation

Old neural net technique for runtime (online) learning which boosts performance.

Dynamic evaluation or test-time finetuning is a performance-enhancing¹ online machine learning technique where the ML model is trained further at runtime on ‘new’ data: eg. an RNN/Transformer is benchmarked on predicting text, but in addition to making its prediction at each timestep, it takes an additional gradient descent step on the newly-observed text. (It is analogous to short-term memory implemented by neural plasticity.) Dynamic evaluation was introduced for RNNs by Mikolov et al 2010, where the continual learning reduced perplexity in predicting English, and was used in many RNNs afterwards for the best performance (cf. the neural cache).
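
To make the mechanics concrete, here is a minimal, self-contained sketch of such an evaluation loop in PyTorch; the toy LSTM language model, function names, and hyperparameters are illustrative stand-ins of mine, not taken from Mikolov et al 2010:

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyLM(nn.Module):
    """Tiny LSTM language model standing in for the RNN/Transformer."""
    def __init__(self, vocab=256, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.rnn = nn.LSTM(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab)

    def forward(self, x, state=None):
        h, state = self.rnn(self.embed(x), state)
        return self.head(h), state

def dynamic_eval(model, tokens, chunk=32, lr=1e-4):
    """Score `tokens` chunk by chunk; after scoring each chunk, take one
    gradient step on it, so later chunks benefit from what was just seen.
    Returns mean per-token negative log-likelihood (nats)."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    state, nll, count = None, 0.0, 0
    for i in range(0, len(tokens) - 1, chunk):
        x = tokens[i : i + chunk].unsqueeze(0)        # input chunk
        y = tokens[i + 1 : i + chunk + 1]             # next-token targets
        logits, new_state = model(x, state)
        loss = F.cross_entropy(logits[0, : len(y)], y)
        nll += loss.item() * len(y); count += len(y)
        opt.zero_grad(); loss.backward(); opt.step()  # the 'dynamic' update
        state = tuple(s.detach() for s in new_state)  # carry memory, cut graph
    return nll / count

tokens = torch.randint(0, 256, (1024,))               # stand-in test text
print(math.exp(dynamic_eval(ToyLM(), tokens)))        # perplexity (~256 here)
```

The only difference from ordinary evaluation is the gradient step after scoring each chunk; deleting the `opt.step()` line recovers standard frozen-weights evaluation.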

Dynamic evaluation is attractive because it requires no modifications to the architecture or training—it simply does more ‘training’, rather than leaving the weights frozen and relying on the hidden state (or self-attention) to do all the learning. It is especially useful when dealing with rare entities, domain shift, or personalization², and for serial tasks where the best performance is needed. It can also be augmented with retrieval methods, adding in similar datapoints which can teach the NN more than the immediate text alone (sketched below).
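
One hypothetical way to hybridize retrieval with dynamic evaluation, loosely in the spirit of test-time training on nearest neighbors (cf. Hardt et al 2023 in the footnotes): before adapting to a new document, also fine-tune on the most similar stored documents. The bag-of-tokens similarity here is a deliberately crude stand-in for learned embeddings, and the snippet reuses the illustrative `ToyLM`/`dynamic_eval` from the sketch above:

```python
import torch
import torch.nn.functional as F

def bag_of_tokens(tokens, vocab=256):
    """Crude document embedding: normalized token-count vector."""
    v = torch.zeros(vocab)
    v.scatter_add_(0, tokens, torch.ones(len(tokens)))
    return F.normalize(v, dim=0)

def retrieve(query, corpus, k=2):
    """Return the k corpus documents most cosine-similar to the query."""
    q = bag_of_tokens(query)
    sims = torch.stack([bag_of_tokens(doc) @ q for doc in corpus])
    return [corpus[i] for i in sims.topk(k).indices]

corpus = [torch.randint(0, 256, (512,)) for _ in range(10)]
query = torch.randint(0, 256, (512,))
model = ToyLM()
for neighbor in retrieve(query, corpus):
    dynamic_eval(model, neighbor)    # learn from similar datapoints first
print(dynamic_eval(model, query))    # then adapt to the new text itself
```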

Dynamic evaluation has fallen out of fashion due to emphasis on simple-to-deploy models, proprietary cloud services, and throughput over quality; but it may be revived by local NN models, or by tasks requiring cognitive flexibility not handled by pure self-attention (ARC?).


  1. In the limit, dynamic evaluation is equivalent to finetuning/arbitrarily-large context windows on the new data (cf. MCTS): finetuning on the new dataset as if it were i.i.d. will typically beat any online learning approach like dynamic evaluation, and self-attention does in-context learning which is loosely similar to gradient descent, so a large enough context window without dynamic evaluation will beat a short context window with it.

    In past work like Rarren-Triki et al 2024 or Hardt et al 2023, dynamic evaluation’s benefits are equivalent to a 10× scale-up?

    But the benefit might be much larger, depending on the scaling laws. As dynamic evaluation can be seen as a form of runtime search, scaling laws like Jones 2020 suggest that context windows may scale poorly and that dynamic evaluation gains a large compute advantage: dynamic evaluation may enable models to need only small context windows (thousands of tokens) for the immediate local context (having learned from the previous text by updating the model itself) to match models which require millions of tokens of context (because they must compute an equivalent update restricted purely to self-attention over the history); a back-of-the-envelope comparison follows these footnotes.

    And these could be hybridized: dynamic evaluation by default simply tries to predict the next token as usual, but in dealing with large corpora, we often have a specific task in mind. So a hybrid might reserve the first n tokens of the context window for the user prompt, and then do dynamic evaluation along the way (see the second sketch after these footnotes). The presence of the user prompt should “focus the LLM’s attention” on things relevant to that prompt, preserving relevant knowledge in the weights.↩︎

  2. eg. to fix Whisper transcripts by user correction of novel or proper nouns, which can’t be done by prompting because of Whisper’s short (LLM-style) context window.↩︎
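
As a back-of-the-envelope illustration of the compute argument in footnote 1 (the numbers and the accounting are assumptions of mine, not the article’s): count only the quadratic pairwise attention interactions per layer, treat the parameter-dominated cost as roughly linear in document length for both setups, and charge a backward pass as ~2× a forward pass:

```python
# Illustrative numbers, not the article's: a 1M-token document read either
# in one giant context window, or via dynamic evaluation over a 4k window.
N = 1_000_000  # document length in tokens
w = 4_000      # small dynamic-evaluation context window

full_context = N**2            # one window: ~N^2 attention interactions
dynamic = 3 * (N // w) * w**2  # N/w chunks; forward + ~2x-forward backward
print(f"attention-interaction ratio: {full_context / dynamic:.0f}x")  # ~83x
```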
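
And a minimal sketch of the prompt-reservation hybrid from footnote 1, again reusing the illustrative `ToyLM`; `hybrid_dynamic_eval` and its parameters are hypothetical names, and a real implementation would use a Transformer so the pinned prompt can be attended to directly:

```python
import torch
import torch.nn.functional as F

def hybrid_dynamic_eval(model, prompt, stream, chunk=32, lr=1e-4):
    """Pin the user `prompt` at the start of every window, but take
    gradient steps only on the streamed document tokens after it, so the
    weights adapt to the document conditioned on the task prompt."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for i in range(0, len(stream) - 1, chunk):
        x = torch.cat([prompt, stream[i : i + chunk]]).unsqueeze(0)
        y = stream[i + 1 : i + chunk + 1]
        logits, _ = model(x)  # ToyLM re-reads the prompt each chunk
        # the logit at position len(prompt)+k predicts stream[i+1+k]:
        pred = logits[0, len(prompt) : len(prompt) + len(y)]
        loss = F.cross_entropy(pred, y)
        opt.zero_grad(); loss.backward(); opt.step()

prompt = torch.randint(0, 256, (16,))    # stands in for the user prompt
stream = torch.randint(0, 256, (1024,))  # stands in for the corpus
hybrid_dynamic_eval(ToyLM(), prompt, stream)
```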
