May 2021 Gwern.net newsletter with links on AI hardware, diffusion models, optogenetics, brain scanning.
May 2021’s Gwern.net newsletter is now out; previous, April 2021 (archives). This is a collation of links and summary of major changes, overlapping with my Changelog; brought to you by my donors on Patreon.
Note: I will be in Denver 12–13 June 2021 for a conference.
Writings
Links
AI
-
Hardware:
-
“Podracer architectures for scalable Reinforcement Learning”, et al 2021 (highly-efficient TPU pod use: eg. solving Pong in <1min at 43 million FPS on a TPUv3-2048); “Google details new TPUv4 AI accelerator chips” (2.7× TPUv3 chips; up to TPUv4-4096 pods, yielding >1 ExaFLOPS; public access later in 2021)
-
“ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning”, et al 2021 (~1 trillion parameters per 16-GPU DGX-2 node, scaling to >512 GPUs at ~40% efficiency)
-
“GSPMD: General and Scalable Parallelization for ML Computation Graphs”, et al 2021 (Google upgrade of GPipe/GShard arch to match MS DeepSpeed: “…50%–62% compute utilization on 128–2048 Cloud TPUv3 cores for models with up to one trillion parameters”)
-
“DLRM: High-performance, Distributed Training of Large-scale Deep Learning Recommendation Models”, et al 2021 (ZionEX software/hardware platform for training extremely large embeddings—while embeddings aren’t ‘real’ parameters & things like DynamicEmbedding will never learn tricks like GPT-3 no matter how big, they present similar challenges); “RecPipe: Co-designing Models and Hardware to Jointly Optimize Recommendation Quality and Performance”, et al 2021
-
“From Motor Control to Team Play in Simulated Humanoid Football”, et al 2021 (curriculum training of a single NN from raw humanoid control to coordinated team-wide soccer strategy; neat to compare with et al 2020 in terms of agent abilities)
-
“Wav2vec-U: Unsupervised Speech Recognition”, et al 2021
-
“Anthropic” public-benefit-corp/startup launched (founded by the Amodeis; $124M investment for scaling “reliable and steerable AI systems”); “Cooperative AI Foundation” (CAIF) launched
-
“MLP-Mixer: An all-MLP Architecture for Vision”, et al 2021 (another FC paper removing even more inductive biases—ponies are all you need: “Mixer improves more rapidly with data than ResNets, or even ViT, and the gap between large scale Mixer and ViT models shrinks until the performance is matched on the entire dataset…” The Bitter Lesson truly is the single bitterest lesson in ML, isn’t it? The more people tweet about how MLP-Mixer is overhyped because it is −X% worse than the ultra-hand-optimized baseline or requires Y× more FLOPS, the more they demonstrate precisely why this sort of research is so important! It also shows, incidentally, that Transformers are still under-researched, if such a fundamental fact could have been missed for so long. A minimal sketch of the Mixer block appears at the end of this section.)
-
“Data-Efficient Language-Supervised Zero-Shot Learning with Self-Distillation”, et al 2021 (CLIP-like performance scaled down to n = 3m using soft labels generated by a Conceptual Captions-pretrained model)
-
“SR3: Image Super-Resolution via Iterative Refinement”, et al 2021; “Diffusion Models Beat GANs on Image Synthesis”, 2021 (DDPMs finally surpass BigGAN-deep on ImageNet 512px images at similar compute-cost, as expected from their good scaling; see the note at the end of this newsletter for a plain-language explanation of diffusion models); “Cascaded Diffusion Models for High Fidelity Image Generation”, et al 2021
-
“Learning to summarize from human feedback”, et al 2020
-
“Grokking: Generalization Beyond Overfitting On Small Algorithmic Data Sets”, et al 2021 (discussion; new scaling effect, ‘grokking’: sudden perfect generalization emerging many epochs after training-set overfitting on algorithmic tasks when training in flat shallow loss landscapes); “Knowledge distillation: A good teacher is patient and consistent”, et al 2021 (training much smaller models merely requires hundreds of thousands or millions of epochs)
-
“Scaling End-to-End Models for Large-Scale Multilingual ASR”, et al 2021
-
“Reward is enough”, et al 2021 (a DRL manifesto: reward losses are enough, at scale of compute/parameters/tasks, to induce all important capabilities like memory/exploration/generalization/imitation/reasoning)
-
Inverse Scaling Prize: $100k prize for finding tasks that cause worse perf in large language models (deadline: 2022-08-27)
-
Scaling Down:
lazy: a tool for running processes in idle time (how to train on a GPU without destroying your GUI’s usability! lazy pauses runs briefly while you interact with your desktop, letting you do months-long runs without going crazy or resorting to Colab etc.; this enables hobbyists to go after previously-infeasible model sizes—a sketch of the pause-on-activity idea appears at the end of this section); EleutherAI releases a 6b-parameter GPT-3 model, GPT-J (are you still using GPT-2/GPT-Neo? upgrade!); “Aggregating Nested Transformers”, et al 2021 / “Less is More: Pay Less Attention in Vision Transformers”, et al 2021
-
“ByT5: Towards a token-free future with pre-trained byte-to-byte models”, et al 2021 (character models—not just feasible but desirable; we’ll get our rhyming & pun-making language models yet!)
-
“Machine Learning Attacks Against the Asirra CAPTCHA”, Golle 2008 (a look back on a decade of CV progress: months of work for 80% cat vs dog with SVM ensembles in 2008; 5min in Fast.ai for 99% accuracy in 2018; for even more perspective, Cireşan 2012)
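To make the MLP-Mixer item above concrete, here is a minimal sketch of the repeated Mixer block in PyTorch (my own illustrative code, not the paper’s released implementation; the hidden sizes are placeholder values): the whole trick is alternating one small MLP applied across patches with another applied across channels—no attention, no convolutions.

```python
# Minimal sketch of a single MLP-Mixer block (illustrative; the real model adds
# patch embedding, many stacked blocks, and a classification head).
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    def __init__(self, n_patches, n_channels, token_hidden=256, channel_hidden=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(n_channels)
        self.token_mlp = nn.Sequential(             # mixes information *across patches*
            nn.Linear(n_patches, token_hidden), nn.GELU(),
            nn.Linear(token_hidden, n_patches))
        self.norm2 = nn.LayerNorm(n_channels)
        self.channel_mlp = nn.Sequential(           # mixes information *across channels*
            nn.Linear(n_channels, channel_hidden), nn.GELU(),
            nn.Linear(channel_hidden, n_channels))

    def forward(self, x):                           # x: (batch, patches, channels)
        y = self.norm1(x).transpose(1, 2)           # -> (batch, channels, patches)
        x = x + self.token_mlp(y).transpose(1, 2)   # token mixing + residual
        return x + self.channel_mlp(self.norm2(x))  # channel mixing + residual
```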
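And a toy illustration of the pause-on-activity idea behind lazy (this is not lazy’s actual code or interface—just the mechanism, assuming a Linux desktop with the xprintidle utility installed and a hypothetical PID for the training job):

```python
# Toy sketch: suspend a long-running training job whenever the desktop is in use,
# and resume it once the user has been idle for a while. Not lazy's real implementation.
import os, signal, subprocess, time

TRAIN_PID = 12345           # hypothetical PID of the GPU training process
IDLE_THRESHOLD_MS = 60_000  # treat the desktop as idle after 60s without input

def desktop_idle_ms() -> int:
    """Milliseconds since the last keyboard/mouse input (via xprintidle)."""
    return int(subprocess.check_output(["xprintidle"]).strip())

paused = False
while True:
    user_active = desktop_idle_ms() < IDLE_THRESHOLD_MS
    if user_active and not paused:
        os.kill(TRAIN_PID, signal.SIGSTOP)   # user came back: pause training
        paused = True
    elif not user_active and paused:
        os.kill(TRAIN_PID, signal.SIGCONT)   # desktop idle again: resume training
        paused = False
    time.sleep(5)
```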
Genetics
Everything Is Heritable:
-
“The complete sequence of a human genome”, et al 2021 (media)
-
“Using DNA to predict intelligence”, von Stumm 2021 (review)
-
“Rapid Sequencing–Based Diagnosis of Thiamine Metabolism Dysfunction Syndrome” (sequence everyone!)
Engineering:
-
“Sense codon reassignment enables viral resistance and encoded polymer synthesis”, et al 2021 (“ultra-safe cells”: synthesizing an entire E. coli genome with swapped codons for complete viral immunity)
-
“In vivo CRISPR base editing of PCSK9 durably lowers cholesterol in primates”, et al 2021
-
Optogenetics: “Partial recovery of visual function in a blind patient after optogenetic therapy”, et al 2021 (media); “Wireless multilateral devices for optogenetic studies of individual and social behaviors”, et al 2021 (media)
-
“First genetically modified Oxitec mosquitoes released in the United States”
-
“Genomic characterization of world’s longest selection experiment in mouse reveals the complexity of polygenic traits”, Palma-Vera et al 2021
-
“Surrogate broodstock to enhance biotechnology research and applications in aquaculture”, et al 2021
-
“Utility of polygenic embryo screening for disease depends on the selection strategy”, et al 2021
Statistics/Meta-Science
-
“How a Publicity Blitz Created The Myth of Subliminal Advertising”, Rogers 1992 (the famous movie-theater/popcorn-sales experiment never happened—subliminal advertising was the Cambridge Analytica of the 1950s)
Politics/Religion
Psychology/Biology
-
“A connectomic study of a petascale fragment of human cerebral cortex”, Shapson-Coe et al 2021 (“…This “digital tissue” is a ~660,000× scale up of an earlier saturated reconstruction from a small region of mouse cortex, published in 2015 (Kasthuri et al 2015). Although this scaleup was difficult, it was not hundreds of thousands of times more difficult and took about the same amount of time as the previous data set (~4 years)…The rapid improvements over the past few years…argues that analyzing volumes that are even 3 orders of magnitude larger, such as an exascale whole mouse brain connectome, will likely be in reach within a decade.” See also “Accelerating progress in brain recording tech”.)
-
“Neuroimaging evidence for a network sampling theory of individual differences in human intelligence test performance”, et al 2021; “The neural basis of intelligence in fine-grained cortical topographies”, et al 2021; “Predicting intelligence from brain gray matter volume”, et al 2020 (towards the mechanistic reification of g: per P-FIT, it is global efficiency/total cognitive resources which can be spent on learning & orchestrating specialized capabilities); if we consider recent human brain imaging studies, cross-species comparisons, and deep learning as converging, I would offer as a speculation the following:
The Master Synthesis: intelligence is execution of small simplicity-weighted programs, best discovered by search over smooth loss landscapes like that of highly-overparameterized differentiable networks containing lottery-ticket subnetworks which are ensembled/averaged over, approaching Bayes-optimal reasoning in the limit (as nearest-neighbors-like high dimensional interpolation / memorization gives way to algorithmic generalization / interpolation on a more abstract level); this can be implemented by large numbers of similar neurons trained using any of the many approximations to backprop; general flexible behavior which cannot be feasibly specialized is guided by what is ‘left over’ from basic tasks (driving encephalization quotient & allometric scaling laws); human intelligence’s g is real but is the overall ‘pool’ of neural resources which derives from overall body integrity because the number of neurons, their density, their myelination, resistance to damage and infection etc, is causally downstream of all body and developmental systems, creating a huge mutational target; the brain regions specialize and differentiate, and their orchestration (or lack thereof) contributes to observed performance on tasks tapping into multiple specialized regions; as tasks rely on fewer regions or approach intrinsic ceiling, g ceases to be observable and task-specific influences matter most.
-
Why do larger animals need so many more neurons to control their bodies, when one would expect hierarchical control to be efficient?
One possibility from an ANN perspective is the tradeoff between width & depth (wide vs deep models learn different things): wide shallow nets have low latency, but tend to be parameter-inefficient compared to deeper nets (perhaps because they learn more redundant but parallel representations?). Because larger animals live in the same world as smaller ones and still need to act with reasonable latency on the millisecond-to-second time scale, they are presumably forced towards wider nets, and away from a latency-unconstrained parameter-optimal or FLOPS-optimal architecture & scaling. (A back-of-the-envelope parameter-count sketch appears at the end of this section.)
-
“MDMA-assisted therapy for severe PTSD: a randomized, double-blind, placebo-controlled phase 3 study”, et al 2021 (d = 0.9 over therapy); “Effects of Psilocybin-Assisted Therapy on Major Depressive Disorder”, et al 2021
-
“Why Animals Don’t Get Lost: Birds do it. Bees do it. Learning about the astounding navigational feats of wild creatures can teach us a lot about where we’re going” (on spectacular but still mysterious feats of animal navigation)
-
“In The Future Of Collecting, Is Anyone Having Fun?” (on Bobblehead collectors)
-
“The Best And The Rest: Revisiting The Norm Of Normality Of Individual Performance”, O’Boyle & Aguinis 2012 (performance is log-normal)
-
“A conserved strategy for inducing appendage regeneration”, et al 2021 (slight regrowth of damaged mouse limbs by drinking sugar+amino-acid-supplemented water)
-
“Know Your Amphetamines”, Scott Alexander
-
“Feeling Small: Exploring the Tactile Perception Limits [of Humans]”, et al 2013
-
“The Board Game of the Alpha Nerds: Before Risk, before Dungeons & Dragons, before Magic: The Gathering, there was Diplomacy” (WP; “I still don’t know whom I should have trusted, if anyone. All I know is that I felt stupid, stressed out, humiliated, and sad.”)
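A back-of-the-envelope illustration of the width-vs-depth point in the brain-size item above (my own toy arithmetic, not from any of the linked papers): a plain MLP with depth L and width W has roughly L·W² weights, so a net that must stay shallow for latency reasons has to get much wider—and pays quadratically in parameters for that width—to match the budget of a deeper, narrower net.

```python
# Toy parameter-count arithmetic for the width-vs-depth tradeoff: doubling width
# roughly quadruples parameters, while doubling depth only doubles them.
def mlp_params(width: int, depth: int, n_in: int = 100, n_out: int = 10) -> int:
    """Weight count of a plain MLP with `depth` hidden layers of size `width`."""
    layers = [n_in] + [width] * depth + [n_out]
    return sum(a * b for a, b in zip(layers, layers[1:]))

# Roughly the same parameter budget spent two ways:
wide_shallow = mlp_params(width=4096, depth=3)   # 3 sequential layers: low latency
deep_narrow  = mlp_params(width=1024, depth=33)  # ~same parameters, 11x the depth/latency
print(f"wide+shallow: {wide_shallow:,} weights in 3 layers")
print(f"deep+narrow:  {deep_narrow:,} weights in 33 layers")
```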
Technology
Economics
-
“RCTs to Scale: Comprehensive Evidence from 2 Nudge Units”, DellaVigna & Linos 2020 (nudge effects overestimated by ~6.2×)
-
“No causal associations between childhood family income and subsequent psychiatric disorders, substance misuse and violent crime arrests: a nationwide Finnish study of >650,000 individuals and their siblings”, et al 2021; “Parental income and mental disorders in children and adolescents: prospective register-based study”, et al 2021
-
“Everything You Might Want to Know about Whaling”, Matt Lakeman
Fiction
Miscellaneous
-
“The Strange Story of Dagobert, the Duck Tales Bandit: In the ’90s, a frustrated artist in Berlin went on a crime spree—building bombs, extorting high-end stores, and styling his persona after Scrooge McDuck. He soon became a German folk hero.” (WP; another reminder for Americans—odd as it may seem, Donald Duck is extremely popular overseas; see also the unknown-in-the-USA character John D. Rockerduck, or the beloved Scandinavian tradition From All of Us to All of You, whose 2020 airing set an all-time record of >4.5m viewers)
-
List of atmospheric optical phenomena (How many would you recognize from a distance or plane? How many have you even heard of?)
-
Baron Franz Nopcsa von Felső-Szilvás (noted geologist, paleontologist, anthropologist, homosexual, & skyjacker)
-
What is a diffusion model like DDPM? To try to explain it as simply as possible without the math:
DDPM is a neural net which is trained to fix noise in an image: it takes a noisy image and ‘sharpens’ it to produce a new image. You train it by adding dirt to a normal image, and teaching it to turn the dirty version into the original. As it gets better, it learns what the images all tend to look like, so it can ‘see through’ ever more noise, turning smudged hints of the original image into its best guess. Once it’s done training, what happens if you give it a completely dirty photo, which is pure static noise? Well, it produces a slightly less dirty ‘photo’. And if you do it again? It’s a little cleaner still. Now, what if you do this many times? It has to get cleaner each time. The end result: the static noise goes in, and a face pops out! The DDPM has hallucinated a face out of the noise. One little blob of static here turned into a nose, and another blob turned into an ear, and it went from there.
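For readers who want the two loops that note describes in code form, here is a heavily simplified sketch in PyTorch (my own illustration: a real DDPM uses a U-Net denoiser, a tuned noise schedule, and various variance corrections; `model` stands for any network that takes a noisy image plus a timestep and predicts the noise that was added):

```python
# Minimal DDPM sketch: train a network to predict the noise added to an image,
# then generate by starting from pure static and denoising it step by step.
import torch

T = 1000                                    # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)       # simple linear noise schedule
alphas_bar = torch.cumprod(1 - betas, dim=0)

def train_step(model, optimizer, x0):
    """Corrupt a clean batch x0 at a random timestep; learn to predict the noise."""
    t = torch.randint(0, T, (x0.shape[0],))
    noise = torch.randn_like(x0)
    a_bar = alphas_bar[t].view(-1, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise
    loss = ((model(x_t, t) - noise) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def sample(model, shape):
    """Start from pure static and repeatedly 'clean' it, one small step at a time."""
    x = torch.randn(shape)
    for t in reversed(range(T)):
        eps = model(x, torch.full((shape[0],), t))
        alpha, a_bar = 1 - betas[t], alphas_bar[t]
        x = (x - betas[t] / (1 - a_bar).sqrt() * eps) / alpha.sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)  # re-inject a little noise
    return x
```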