So I tried out GPT-3's trick of conditioning on training examples in the prompt (few-shot prompting) with XLNet
While it doesn't do as well as the 175B GPT-3, it does much better than the GPT-3 variant that's the same size as XLNet (0.4B)
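For anyone wondering what "conditioning on training examples" looks like in practice, here's a rough sketch of the few-shot setup on a Winogrande-style item: prepend a handful of resolved training sentences to the prompt, substitute each candidate into the blank, and pick whichever completion the model scores higher. This isn't the exact XLNet scoring pipeline (XLNet's two-stream/permutation objective needs extra plumbing); it uses a generic causal LM from HuggingFace transformers as a stand-in, and the model name and example sentences are placeholders.

```python
# Rough sketch of GPT-3-style few-shot conditioning on Winogrande.
# Assumptions: a generic causal LM ("gpt2") stands in for XLNet, and the
# few-shot examples below are made up for illustration.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")            # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# A few labeled training examples, written out with the blank already filled in.
few_shot_context = (
    "The trophy doesn't fit in the suitcase because the trophy is too big.\n"
    "Ann asked Mary what time the library closes, because Ann had forgotten.\n"
)

def sequence_logprob(text: str) -> float:
    """Sum of token log-probabilities the model assigns to `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = logprobs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp.sum().item()

def answer(sentence_with_blank: str, option1: str, option2: str) -> str:
    """Substitute each candidate into the blank, score it conditioned on the
    few-shot examples, and return the higher-scoring option."""
    scores = {
        opt: sequence_logprob(few_shot_context + sentence_with_blank.replace("_", opt))
        for opt in (option1, option2)
    }
    return max(scores, key=scores.get)

print(answer("The bird flew away from the cat because _ was scared.",
             "the bird", "the cat"))
```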
The visual below is from the GPT-3 paper's Winogrande plot; I added the squares for XLNet
But it makes me wonder: why has OpenAI stuck so aggressively to the classical left-to-right language model?
Is GPT-3 better than anything else, or is its only advantage that it's bigger? Would a mega-XLNet blow GPT-3 out of the water?
Jul 20, 2020 · 3:43 PM UTC