Polymath Engineer Weekly #46
Read more, learn more
Links of the week
The point, then, isn’t that you should watch less CNBC and read more Ben Graham. It’s that if you read more Ben Graham you’ll have an easier time understanding what you should or shouldn’t pay attention to on CNBC. This applies to most fields.
Take for example observational determinism (OD): does an algorithm always converge on the same answer? Let’s use the example of four threads, each nonatomically incrementing a counter once. To show that the final value of x isn’t always 4, you just have to find one trace where it’s not 4. But to show that the final value is inconsistent, you have to find two traces that get different answers! The TLA+ model checker (TLC) can do the first but not the second.
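To make the counter example concrete, here is a minimal brute-force sketch in Python. It is not TLA+ and not how TLC works; it simply assumes each thread's increment splits into a separate read step and write step, then walks every interleaving and collects the final values. The function name `final_values` and the state encoding are made up for illustration.

```python
def final_values(num_threads=4):
    """Collect every final counter value reachable when each thread
    does a non-atomic increment: read x into a local, then write local + 1."""
    results = set()

    def explore(x, pcs, local_copies):
        # pcs[i] is thread i's program counter:
        # 0 = about to read, 1 = about to write, 2 = finished.
        if all(pc == 2 for pc in pcs):
            results.add(x)
            return
        for i, pc in enumerate(pcs):
            if pc == 0:
                # Thread i reads the shared counter into its local copy.
                explore(
                    x,
                    pcs[:i] + (1,) + pcs[i + 1:],
                    local_copies[:i] + (x,) + local_copies[i + 1:],
                )
            elif pc == 1:
                # Thread i writes back its (possibly stale) local copy plus one.
                explore(local_copies[i] + 1, pcs[:i] + (2,) + pcs[i + 1:], local_copies)

    explore(0, (0,) * num_threads, (0,) * num_threads)
    return results


values = final_values(4)
print(sorted(values))
# More than one reachable final value means the outcome depends on scheduling.
print("observationally deterministic:", len(values) == 1)
```

A single trace ending in anything other than 4 is enough to refute "x is always 4", but the determinism check in the last line needs the whole set of outcomes, which is the distinction the excerpt is drawing.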
In other words, when we represent real-world objects and concepts such as images, audio recordings, news articles, user profiles, weather patterns, and political views as vector embeddings, the semantic similarity of these objects and concepts can be quantified by how close they are to each other as points in vector spaces. Vector embedding representations are thus suitable for common machine learning tasks such as clustering, recommendation, and classification.
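As a rough illustration of "closeness in vector space", the sketch below computes cosine similarity, one common similarity measure, between a few hand-made three-dimensional vectors. The vectors and their labels are invented for this example; real embeddings come from a trained model and usually have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings, chosen by hand for illustration only.
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "kitten": [0.85, 0.75, 0.2],
    "car": [0.1, 0.2, 0.95],
}

cat_kitten = cosine_similarity(embeddings["cat"], embeddings["kitten"])
cat_car = cosine_similarity(embeddings["cat"], embeddings["car"])
print(f"cat vs kitten: {cat_kitten:.3f}")
print(f"cat vs car:    {cat_car:.3f}")
```

With these made-up numbers, "cat" scores closer to "kitten" than to "car", which is the kind of signal clustering and recommendation systems build on.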
It’s easy to be led. When we first start out as developers we have little to no autonomy over our day-to-day: our manager hands us tasks, provides us with context, and adds us to meetings. We’re in the warm embrace of certainty. Once you become a high-level individual contributor (IC), being directed quickly nets you a calendar stuffed with recurring meetings, “just a quick question”, one-on-ones, and more, leaving little breathing room or time for proactivity.
Surprisingly, many who make complexity their bread-and-butter think everybody doing the same job as them shares similar traits, that they are not special.
However, not everybody can go keyboard-only on a borderless tiling window manager, be productive for months, and not burn out. Actually, most people can't.
Scratch that, most devs can't.
So as I was reading up on transformers, I got fixated on this question: where are the 175 billion parameters in the architecture? Not in the literal sense (the parameters are in the computer), but how are they “spent” between various parts of the architecture - the attention heads vs feed-forward networks, for instance. And how can one calculate the number of parameters from the architecture’s “size hyperparameters” like dimensionality and number of layers?
The goal of this post is to answer those questions, and make sense of this nice table from the GPT-3 paper, deriving the n_params column from the other columns.
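For a feel of how such a derivation can go, here is a back-of-the-envelope sketch in Python. It is not necessarily the linked post's method. It assumes a standard decoder-only transformer where each layer has four d_model x d_model attention matrices (query, key, value, output) and a feed-forward block with a 4x hidden expansion, and it ignores biases and layer-norm parameters. The hyperparameters for the largest model (96 layers, d_model of 12288, a 50257-token vocabulary, a 2048-token context) are recalled from the GPT-3 paper rather than taken from this newsletter, so treat them as assumptions.

```python
def estimate_params(n_layers, d_model, vocab_size, context_len):
    """Approximate parameter count of a decoder-only transformer,
    ignoring biases and layer-norm weights."""
    attention = 4 * d_model * d_model            # W_q, W_k, W_v, W_o
    feed_forward = 2 * d_model * (4 * d_model)   # up- and down-projection
    per_layer = attention + feed_forward         # = 12 * d_model^2
    token_embeddings = vocab_size * d_model
    position_embeddings = context_len * d_model
    return {
        "attention": n_layers * attention,
        "feed_forward": n_layers * feed_forward,
        "embeddings": token_embeddings + position_embeddings,
        "total": n_layers * per_layer + token_embeddings + position_embeddings,
    }


# Assumed hyperparameters for the largest GPT-3 model (see lead-in).
breakdown = estimate_params(n_layers=96, d_model=12288, vocab_size=50257, context_len=2048)
for name, count in breakdown.items():
    print(f"{name:>13}: {count / 1e9:7.2f}B")
```

Under these assumptions the total lands close to 175 billion, with the feed-forward blocks holding about twice as many parameters as the attention heads.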
Book of the Week
Do you have any more links our community should read? Feel free to post them in the comments.
Have a nice week. 😉