So I tried out GPT-3's trick of conditioning on training data with XLNet. While it doesn't do as well as the 175B GPT-3, it does much better than the GPT-3 variant that's the same size as XLNet (0.4B). The visual below is from their paper's Winogrande results – I added the squares for XLNet.
At least part of the reason XLNet does better is that the task I evaluate on is a cloze task (i.e., fill-in-the-blank). XLNet is trained with permuted factorization orders, so it can condition on context on both sides of the blank, which makes it much more amenable to tasks like this.
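For concreteness, here's a minimal sketch of what cloze-style evaluation looks like. The example sentence, function names, and the `score` callback are all mine (hypothetical), not from the thread – `score` stands in for whatever log-likelihood the model assigns to a filled-in sentence.

```python
def build_cloze_variants(sentence, candidates):
    """Fill the blank ("_") with each candidate; the model then scores each full sentence."""
    return [sentence.replace("_", c) for c in candidates]

def pick_answer(variants, score):
    """Pick the candidate-filled sentence the model finds most likely."""
    return max(variants, key=score)

# Toy Winogrande-style item (not an actual dataset example).
example = "The trophy didn't fit in the suitcase because the _ was too big."
variants = build_cloze_variants(example, ["trophy", "suitcase"])
```

A left-to-right LM has to score each variant token by token from the left; XLNet's permutation-based factorization lets it score the blank's tokens conditioned on the context on both sides directly.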
But it makes me wonder: why has OpenAI stuck so aggressively to the classical left-to-right language model? Is GPT-3 better than anything else, or is its only advantage that it's bigger? Would a mega-XLNet blow GPT-3 out of the water?

Jul 20, 2020 · 3:43 PM UTC

Replying to @joeddav
IMO they stick to the L2R LM because they didn't mean to obtain SOTA results on various tasks, but to reveal new phenomena in training super-big models. Just look at the paper titles – GPT: one framework for all NLU tasks, GPT-2: Unsupervised Multitask Learners, GPT-3: Few-Shot Learners.
Replying to @joeddav
My best guess is pre-training time. XLNet pretraining is several times more expensive than BERT, which itself is a few times more expensive than GPT... (on a steps-needed basis, ignoring seq length etc.)
Replying to @joeddav
It would, no doubt.
Replying to @joeddav
Super interesting questions! My take is that the original GPT came out before BERT, and the benefits of bidirectionality were not as obvious then. But starting from GPT-2, they seem to have switched to a generative focus, and are using it mainly as a pure language model.
Replying to @joeddav
I had the same remark... maybe it is easier to distribute such a huge model with unidirectional context, or maybe they just had it ready to train :D (in my recent GPT-3 tests, I must say it has most of the flaws of GPT-2, even if it is much more impressive)
Replying to @joeddav
OpenAI is practical; their philosophy seems to be simplicity, reliability, and scalability. PPO was the workhorse for OpenAI Five and robotic dexterity, just as the L2R GPT is for language modeling. Same story: massive compute alongside battle-tested algorithms.