We’ve heard a lot about the singularity in recent years, but I’ve taken a more nuanced interpretation of it. Rather than interpreting the growth of technology as irreversible and uncontrollable, I’m more interested in the degree to which technological advancement is impossible for the average person to comprehend. This becomes problematic because technology is then increasingly indistinguishable from magic. So, I’ve been reading a bit about artificial intelligence (AI) to try to write about how it actually works, so we don’t get to that point. Today, we’re going to be talking about a couple of concepts in AI language models!
I started reading about this after discovering an article from Benj Edwards at Ars Technica, which raised a number of interesting questions about AI detection systems– tools that attempt to determine whether written material was likely produced by an AI. One of the models predicted that the US Constitution and the intro to the Book of Genesis (בְּרֵאשִׁית) were almost certainly written by artificial intelligence. The author explained that this was a product of similar measurements of burstiness and perplexity in these texts, since Large Language Models (LLMs) are often trained on documents like them.
What does that mean?
First, How Does It Do It?
First, we have to understand a bit more about how the LLM works. An LLM works on the basic idea of converting normal human language from the user into smaller bits, referred to as “tokens,” which are then mapped to unique identifiers. The process of tokenization is complex. Some words are part of phrases (rather than carrying individual meaning)– for example, a term like “truck stop” might be tokenized as one item instead of as two separate words. The LLM must figure out this whole mess before generating a response.
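As a toy sketch of that phrase-aware idea (the vocabulary and token IDs here are invented for illustration– real tokenizers learn subword vocabularies from data, and the details vary between models):

```python
# Toy phrase-aware tokenizer. The vocabulary and IDs are invented for
# illustration; real LLM tokenizers learn subword vocabularies from data.
VOCAB = {"truck stop": 0, "truck": 1, "stop": 2, "the": 3, "at": 4}

def tokenize(text):
    """Greedily match the longest vocabulary entry at each position."""
    words = text.lower().split()
    tokens = []
    i = 0
    while i < len(words):
        two_word = " ".join(words[i:i + 2])
        if two_word in VOCAB:            # prefer the phrase, e.g. "truck stop"
            tokens.append(VOCAB[two_word])
            i += 2
        elif words[i] in VOCAB:
            tokens.append(VOCAB[words[i]])
            i += 1
        else:
            raise KeyError(f"unknown word: {words[i]}")
    return tokens

print(tokenize("the truck stop"))  # [3, 0] -- "truck stop" is ONE token
print(tokenize("stop the truck"))  # [2, 3, 1] -- three separate tokens
```

Notice that the same three words produce a different number of tokens depending on their order– which is part of why phrasing matters to these models.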
Because words mean different things to different people, the LLM has to have some sort of “average” of meanings that allows it to understand people who communicate the same thing in different ways. This means that the LLM will not necessarily translate each word into its own token, but might create a token out of a phrase– and so on. This is a curious concept in Midjourney and ChatGPT, both of which will produce slightly different responses based on the syntax of your input (although there are differing interpretations for Midjourney about how this actually works, including among some supposed pros, whom I disagree with– but this is a story for another day).
Tokenization allows the words to be converted more readily into numbers, and this begins the process of the LLM parsing the meaning of the user input before it figures out how to create a response. The actual processing of the data is a bit more opaque to the average person, but suffice it to say, it’s complicated enough that I’ve avoided getting into it here. How complex? We don’t really know, but it’s something I’d like to get into. The user-facing side of things– the box in which you type text like, “hey, Chat, whussup”– is on the order of single kilobytes. The computational backend probably involves many gigabytes of data (i.e. numbers and text converted to numbers– but no images for an LLM) and a great deal of processing throughput. This is why you can’t run ChatGPT on your phone (yet).
The Value of Good Training
One thing you might have noticed about LLMs like ChatGPT is that they may make accuracy errors, but they don’t typically make style errors. This is the product of good training, and it speaks well for the developers in an era when the integrity of written language is increasingly devalued. Chat doesn’t write particularly well, of course, but it usually writes in a stylistically correct manner. How does it get there?
You may recall that a lot of statistics is a question of figuring out what to do with a set of data. More specifically, a lot of statistics involves figuring out how to develop a particular equation, model, or algorithm, and then figuring out how accurately it predicts or explains a specific situation. Remember variances, standard deviation, p-values, all that jazz? “Perplexity” is a related concept: a measure of uncertainty. In LLM terms, it refers to how well the model predicts the appropriate next word.
Wikipedia graciously provides us with the equation to explain perplexity:
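For a language model that assigns probability $q(x_i)$ to each of the $N$ words in a sample, the standard definition works out to:

```latex
\mathrm{PP} = b^{-\frac{1}{N}\sum_{i=1}^{N}\log_b q(x_i)}
            = \left(\prod_{i=1}^{N} q(x_i)\right)^{-1/N}
```

(So: probabilities get logged and summed– or, equivalently, multiplied– and then there’s a negative fractional exponent.)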
Not really sure what all of that means (something gets summed, something gets multiplied, and there are a bunch of weird exponents and even a log)! Anyway, perplexity can be used to quantify the amount of uncertainty or randomness in a set of data. In the context of LLMs, it is often used as a measure of how well the model predicts a sample. A lower perplexity means the model’s probability distribution is a better predictor of the sample– that is, lower uncertainty in predicting the next word in a sequence of words. Lower perplexity might mean more ordinary, generic phrases, or it might mean more accurate statements.
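A minimal sketch of that calculation in Python (the per-word probabilities below are made up for illustration– in practice they would come from the model itself):

```python
import math

def perplexity(word_probs):
    """exp of the average negative log-probability per word."""
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)

# Probabilities the model assigned to each word that actually appeared
# (invented numbers for illustration).
rarely_surprised = [0.9, 0.8, 0.95, 0.85]
often_surprised = [0.2, 0.1, 0.05, 0.3]

print(perplexity(rarely_surprised))  # low, close to 1
print(perplexity(often_surprised))   # much higher
```

The model that was rarely surprised by the text scores a much lower perplexity– which is exactly why text an LLM was trained on (like the Constitution) looks “AI-ish” to a perplexity-based detector.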
What About Burstiness?
Burstiness refers to clusters of occurrences. An LLM can analyze text to assess “burstiness”– the clustering of certain related terms– to help figure out the meaning of something complex. This can be especially valuable as a tool to understand something like tone, rather than the probability of words in a sentence as described in the previous section.
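One simple way to put a number on clustering is the Goh–Barabási burstiness coefficient, which compares the spread of the gaps between occurrences of a term to their average. I don’t know which measure the detectors actually use, so treat this as an illustrative stand-in:

```python
# Goh-Barabasi burstiness coefficient: evenly spaced occurrences score
# near -1; tightly clustered ("bursty") occurrences score toward +1.
# Whether AI detectors use this exact measure is an open question here.
def burstiness(positions):
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    mean = sum(gaps) / len(gaps)
    var = sum((g - mean) ** 2 for g in gaps) / len(gaps)
    sigma = var ** 0.5
    return (sigma - mean) / (sigma + mean)

evenly_spaced = [0, 10, 20, 30, 40]  # a word at regular intervals
clustered = [0, 1, 2, 40, 41]        # the same word in tight bursts

print(burstiness(evenly_spaced))  # -1.0: no burstiness at all
print(burstiness(clustered))      # positive: bursty
```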
I’m not entirely clear based on the Ars Technica article how perplexity and burstiness would help us determine whether text was AI-generated or not. There are some easier ways to tell, though, beyond being able to answer the question: Does this read like a freshman seminar paper written by someone who was probably not very smart, but got very good grades?
Other Indicators We Can Use To Test AI?
One point, going back to the examples of Genesis and the US Constitution, is that LLMs– especially downmarket ones– may be prone to repeat themselves or “drift” within a paragraph. LLMs store a limited number of tokens in their memory for the user’s benefit– for example, within a conversation, so the AI remembers what you’ve referred to previously. GPT-3 stores 2,048 tokens at a time, while GPT-4 can handle 8,000– or more if you can wangle it.
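A hypothetical sketch of that limited memory– the model only “sees” the most recent tokens of the conversation, so the oldest ones fall out of view (token counts here are just word counts for simplicity):

```python
# Hypothetical illustration of a fixed context window. Token counts here
# are just word counts; real models count subword tokens, and the real
# budgets are 2,048 (GPT-3) or 8,000+ (GPT-4) tokens, not 6.
CONTEXT_LIMIT = 6

def visible_context(history, limit=CONTEXT_LIMIT):
    """Keep only the most recent `limit` tokens of the conversation."""
    tokens = " ".join(history).split()
    return tokens[-limit:]

history = ["my name is Ada", "I like trains", "what is my name"]
print(visible_context(history))
# ['like', 'trains', 'what', 'is', 'my', 'name']
# "my name is Ada" has already fallen out of the window, so the model
# can no longer "remember" the name.
```

Once the budget is exhausted, whatever came first simply isn’t part of the input anymore– which is one plausible mechanism behind the repetition and drift described above.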
Perhaps a bit more esoteric is the idea of the AI using faulty logic to respond to a question, or using logic that just sounds, well, a bit bizarre. I ran into this while trying to talk to GPT-3 about my attempt to render the Cities of Tomorrow in Midjourney. It was interesting, because Chat came up with good responses, but sometimes the conclusions were, well, odd. Odd how? I thought it was interesting that ChatGPT seemed more “interested” in creating new interpretations when I’d ask “why do you suspect [x]” than it was in relying on the most basic, fundamental rationales for what I was going for, which involved questioning the nature of the training of the model. In other words, I was trying to get Chat to say that “the training wasn’t done right” in response to a specific question, but instead it suggested that the training was a product of a complex set of social factors, blah blah blah.
This was certainly interesting. Was it true? I don’t know. For now, we have the ability to stroke our proverbial beards, nod and say, “ah, yes, quite interesting!” as opposed to, say, figuring out how to defend ourselves against Skynet. The AI detectors might not be working as well as we’d like them to. But maybe that just means we need to come up with better AI detectors. After all, we have AI to develop the technology further, don’t we?
“Odd” or “bizarre” are terms we can’t quite quantify using “perplexity” or “burstiness”, though– at least not at this point. Human text just seems to have that je ne sais quoi, doesn’t it?