One more non-NLP GPT experiment: prompted with a training set, can a GPT predict the weights of an MLP? 1/6
Each sequence is built as follows:
- generate eight rectangles at random in [-1,1]^2; their interior is class 1, their exterior class 0,
- generate uniformly 250 training points and 250 test points, and quantize the xs and ys into 101 values, 2/6
- train a one-hidden-layer MLP with 2 input units, 32 hidden units, and 2 output units, quantizing the weights more and more aggressively during training so that at the end they take one of 101 values, 3/6
- create a sequence of 750 tokens for the training set (x/y/class triplets), followed by a marker and the quantized weights (a sketch of these steps follows below). 4/6
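For concreteness, here is a minimal sketch of the data generation and tokenization, assuming numpy and hypothetical helper names (quantize, make_problem, to_sequence); the actual implementation is in the repo linked in the replies.

import numpy as np

Q = 101  # number of quantization levels

def quantize(v, lo=-1.0, hi=1.0, q=Q):
    # Map floats in [lo, hi] to integer levels 0..q-1.
    return np.clip(np.round((v - lo) / (hi - lo) * (q - 1)), 0, q - 1).astype(int)

def make_problem(nb_rects=8, nb_points=500, rng=np.random):
    # Eight random rectangles in [-1,1]^2; a point inside any of them is class 1.
    a = rng.uniform(-1, 1, (nb_rects, 2))
    b = rng.uniform(-1, 1, (nb_rects, 2))
    lo, hi = np.minimum(a, b), np.maximum(a, b)
    pts = rng.uniform(-1, 1, (nb_points, 2))
    inside = ((pts[:, None, :] >= lo) & (pts[:, None, :] <= hi)).all(-1)
    return pts, inside.any(-1).astype(int)  # points and their class labels

def to_sequence(pts, labels):
    # Three tokens per training point (quantized x, quantized y, class),
    # i.e. 250 points -> 750 tokens; the marker and the quantized MLP
    # weights are appended after this.
    xs, ys = quantize(pts[:, 0]), quantize(pts[:, 1])
    return np.stack([xs, ys, labels], 1).reshape(-1)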
Train a 352M-parameter GPT on 250k such sequences (I had to write code to train the MLPs by batching over both models and samples!). Given a new set of 250 training examples (hence a 750-token prompt), let it generate the weights, plug them into the MLP, and compute the error rate on 250 test points. 5/6
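A minimal PyTorch sketch of how the batched MLP training could look: all B problems are trained in parallel as batched matrix products, with a straight-through quantizer whose strength ramps up over training. Names, initialization, and hyperparameters here are assumptions, not the repo's actual code.

import torch
import torch.nn.functional as F

def train_mlps(x, y, nb_hidden=32, nb_steps=2500, lr=1e-2, q=101):
    # x: (B, N, 2) float, y: (B, N) long; B independent problems in parallel.
    B = x.size(0)
    params = [
        torch.empty(B, 2, nb_hidden).normal_(0, 0.5).requires_grad_(),  # w1
        torch.zeros(B, 1, nb_hidden, requires_grad=True),               # b1
        torch.empty(B, nb_hidden, 2).normal_(0, 0.5).requires_grad_(),  # w2
        torch.zeros(B, 1, 2, requires_grad=True),                       # b2
    ]
    opt = torch.optim.Adam(params, lr=lr)

    def quantized(w, alpha):
        # Snap to q levels in [-1, 1] with a straight-through gradient;
        # alpha ramps from 0 to 1, making quantization more aggressive.
        wq = torch.round((w.clamp(-1, 1) + 1) / 2 * (q - 1)) / (q - 1) * 2 - 1
        return w + alpha * (wq - w).detach()

    for step in range(nb_steps):
        alpha = step / (nb_steps - 1)
        w1, b1, w2, b2 = (quantized(p, alpha) for p in params)
        logits = torch.relu(x @ w1 + b1) @ w2 + b2  # (B, N, 2), batched matmul
        loss = F.cross_entropy(logits.reshape(-1, 2), y.reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()

    return [quantized(p, 1.0).detach() for p in params]  # weights on the grid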
Here is the graph, after 1 and 7 epochs of GPT training, of the ratio between the test error rate of the GPT-generated MLP and that of the backprop-trained one (e.g. 700 out of 1000 GPT-generated MLPs have a test error less than 2x the backprop one). Nothing extraordinary, but still cool IMO. 6/6
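To make the metric concrete, the plotted quantity amounts to something like the following, where err_gpt and err_bp are hypothetical arrays of per-problem test error rates:

import numpy as np

# err_gpt[i], err_bp[i]: test error rates of the GPT-generated and
# backprop-trained MLPs on held-out problem i (1000 problems total).
ratio = err_gpt / np.maximum(err_bp, 1 / 250)  # guard against zero bp error
frac_below_2x = (ratio < 2.0).mean()           # e.g. 0.7 -> 700 MLPs out of 1000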

Oct 18, 2023 · 6:37 AM UTC

Replying to @francoisfleuret
Is it open source on GitHub?
fleuret.org/cgi-bin/gitweb/g…

./main.py --task=qmlp --model=352M --nb_train_samples=250000 --result_dir=results_qmlp_352M --batch_size=2
Replying to @francoisfleuret
What if you did something like an SVM as a benchmark? I appreciate that what you're trying to do is predict the MLP weights rather than the predictions themselves, but it still seems like it would be a good benchmark.
I am not sure I understand. You mean the GPT-generated MLP vs. a vanilla SVM?
Replying to @francoisfleuret
Shouldn’t the “trained MLP” baseline be the ensemble / voting of all of the 250K models, since the GPT’s MLPs have seen 250K x more data?
Each MLP is learning a different problem, so you cannot ensemble them. But I agree that the GPT can leverage the statistics of the task itself, while backprop is agnostic to them.
Replying to @francoisfleuret
Are the classes balanced? Do you have other baselines? (e.g. if the classes aren't balanced, random predictions according to the training label proportions)
I resample the rectangles if the class split is less balanced than 40/60. The MLP test error is 0.098±0.038.
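In code, this would be a simple rejection loop; a minimal sketch reusing the hypothetical make_problem from the sketch above:

def make_balanced_problem(rng=np.random):
    # Reject rectangle sets whose class split is more skewed than 40/60.
    while True:
        pts, labels = make_problem(rng=rng)
        if 0.4 <= labels.mean() <= 0.6:
            return pts, labels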
Replying to @francoisfleuret
I'm probably being dense, but how do 'epochs' enter into the experiment? Are these two different GPTs, trained with 1-epoch and 7-epoch MLP weights?
No, it's the performance of the same GPT at different stages of its training. The training is still in progress.