Understanding Gen AI

The algorithm by which Generative AI programs like ChatGPT generate text is difficult for anyone to understand fully. If you really want to understand it, expect to spend a few days, or even a week or two, of dedicated effort. Here are the best available sources on the Internet for understanding GPT (which stands for Generative Pretrained Transformer) in detail:

  • Andrej Karpathy’s nanochat, the source code for training and running a small version of ChatGPT.
  • A visual simulation and rendering of the Transformer model, from Georgia Tech.
  • The original paper that started it all, Attention is All You Need, from Google.

A hand-waving explanation of the GPT algorithm follows:

The goal of GPT is to predict the next word in a sequence of words, like “AI technology is a subfield of …”, and then the next word after that, and so on; in this case the continuations might be “computer” and then “science”. This simple rule results in the fantastic text generation capabilities of programs like ChatGPT and Google Gemini.
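This predict-and-append loop can be sketched in a few lines of Python. The “model” below is just a hypothetical lookup table standing in for the real Transformer; the point is only the loop, in which each predicted word is appended and becomes part of the context for the next prediction:

```python
# Toy illustration of autoregressive generation: the model repeatedly
# predicts the next word and appends it to the sequence. The "model"
# here is a made-up lookup table, not a real neural network.
toy_model = {
    ("subfield", "of"): "computer",
    ("of", "computer"): "science",
}

def generate(words, steps):
    words = list(words)
    for _ in range(steps):
        context = tuple(words[-2:])      # last two words as context
        next_word = toy_model.get(context)
        if next_word is None:            # no prediction available
            break
        words.append(next_word)          # the prediction joins the context
    return words

print(generate(["AI", "technology", "is", "a", "subfield", "of"], 2))
# -> ['AI', 'technology', 'is', 'a', 'subfield', 'of', 'computer', 'science']
```

A real model replaces the lookup table with the Transformer described below, but the outer loop is exactly this.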

The way this is done, for production Large Language Models (the programs that use GPTs to generate text), is that, ideally, all the text on the Internet is fed into a program: a black box that observes and learns how the partial sentences in that text are completed by their subsequent words.

Now, these programs do not act on text directly. The text is converted into numbers, the numbers are run through the core of the algorithm, and at the end the numbers are converted back into text. A word or fragment of a word is represented by a unique number from a dictionary of all possible words or word fragments; this number is called a “token”. Each token is then mapped to a fixed-length array of numbers, called a “vector” in the language of mathematics. There are two types of these vectors in Gen AI: one represents the “meaning” of a word or word fragment (word fragments don’t have meaning to us, but this is how a Large Language Model reads meaning into its inputs), and the other encodes the position of a token in a passage. The first type is called a “token embedding”, and the second a “positional embedding”; the two are added together to form the model’s input.
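Here is a minimal sketch of that text-to-vectors step. The vocabulary, the embedding sizes, and the numbers filling the embedding tables are all invented for illustration; in a real model the tables are learned during training:

```python
# Illustrative only: a tiny vocabulary mapping tokens to ids.
vocab = {"AI": 0, "tech": 1, "##nology": 2, "is": 3}

d_model = 4   # embedding size (real models use hundreds or thousands)

# One vector per token id (token embeddings); values are arbitrary here.
token_emb = [[0.1 * (i + j) for j in range(d_model)] for i in range(len(vocab))]
# One vector per position in the passage (positional embeddings).
pos_emb = [[0.01 * (p + j) for j in range(d_model)] for p in range(8)]

def embed(tokens):
    ids = [vocab[t] for t in tokens]
    # Each token's input vector is its token embedding plus the
    # embedding of the position it occupies.
    return [
        [te + pe for te, pe in zip(token_emb[i], pos_emb[p])]
        for p, i in enumerate(ids)
    ]

vectors = embed(["AI", "tech", "##nology", "is"])
```

The `##nology` entry mimics the way real tokenizers split rare words into fragments; the exact splitting scheme varies by model.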

These embeddings are then run through a sequence of structurally identical “Transformer Blocks”; a Transformer is simply such a sequence of blocks. Each Transformer Block consists of two components: a complicated component, which I’ll explain next, and a simpler one.
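That stacking can be sketched as a simple pipeline. The two components inside each block are placeholder identity functions here, purely to show the structure; their real workings are described next:

```python
# Placeholder components, just to show the shape of the pipeline.
def multi_head_self_attention(vectors):
    return vectors   # stand-in: the complicated component

def feed_forward(vectors):
    return vectors   # stand-in: the simpler component

def transformer_block(vectors):
    vectors = multi_head_self_attention(vectors)
    vectors = feed_forward(vectors)
    return vectors

def transformer(vectors, n_blocks):
    # A Transformer is just the same block applied repeatedly.
    for _ in range(n_blocks):
        vectors = transformer_block(vectors)
    return vectors
```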

The complicated component is called “multi-headed self-attention”. The “self-attention” part deals with the fact that, for the algorithm to understand a sentence completely, the relationship of every token in the sentence to every other token must be represented and learned. This is done by creating three vectors for every token: one that represents what the given token is looking for from the other tokens in the sentence, one that represents what the given token contains, and one that represents what the given token can share with other tokens. The first vector is called a “query”, the second a “key”, and the third a “value”. In this way a relationship between every token (word/word fragment) in a sentence and every other token is captured.

Because tokens can relate to other tokens in multiple contexts, for example verb-object or pronoun-noun relations, the self-attention is computed multiple times, in multiple “heads”, hence multi-headed self-attention. These heads can run in parallel because each attention head is independent of the others, which is why GPUs (Graphics Processing Units, such as Nvidia’s), which are essentially parallel processing chips for computations on numerical arrays, are so effective here and in such high demand. This entire multi-headed self-attention algorithm aims to capture all relevant relationships between all the words in a sentence, and judging by the success of ChatGPT, it is pretty good at doing so.
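A single attention head can be written out in plain Python. The weight matrices that produce the queries, keys, and values are supplied by the caller here; in a real model they are learned during training, and a multi-headed layer simply runs several such heads with different weights and concatenates the results:

```python
import math

def matvec(M, v):
    # Multiply matrix M (list of rows) by vector v.
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(xs):
    # Turn raw scores into weights that sum to 1.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(xs, Wq, Wk, Wv):
    # Project each token vector into its query, key, and value.
    Q = [matvec(Wq, x) for x in xs]
    K = [matvec(Wk, x) for x in xs]
    V = [matvec(Wv, x) for x in xs]
    d = len(Q[0])
    out = []
    for q in Q:
        # Score this token's query against every token's key...
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]
        weights = softmax(scores)
        # ...then mix all the values according to those weights.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

Note that every token’s query is scored against every token’s key: that all-pairs comparison is exactly the “every token relates to every other token” idea above, and it is also what makes the computation so parallelizable.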

The simpler component, which follows the multi-headed self-attention component, is just an ordinary neural network, a very simple digital version of an animal’s network of neurons. It expands its input vectors internally, transforms them in a “non-linear” way (meaning that if the input is doubled the output is not simply doubled, but becomes something qualitatively different, which is how patterns get recognized), compresses the expanded, transformed vector back to the original size, and passes it on to the next Transformer Block.
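The expand, transform non-linearly, compress sequence looks like this in miniature. The weight values and the ReLU non-linearity are illustrative choices (real models learn the weights and often use GELU instead):

```python
def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def relu(v):
    # A simple non-linearity: negative values become zero, so
    # doubling the input does not simply double the output.
    return [max(0.0, x) for x in v]

d_model, d_hidden = 4, 16   # real models also expand by roughly 4x

W1 = [[0.1] * d_model for _ in range(d_hidden)]   # expand: 4 -> 16
W2 = [[0.05] * d_hidden for _ in range(d_model)]  # compress: 16 -> 4

def feed_forward(x):
    return matvec(W2, relu(matvec(W1, x)))
```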

For generating text, at the end of the whole stack of Transformer Blocks there is a layer of units that calculates the probability, or likelihood, of each word in the dictionary being the next word in the given sequence. Either the word with the highest probability is chosen, or one of the higher-probability words is chosen, depending on the settings.
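That final step, sketched below with made-up scores: the model’s raw per-word scores (called logits) are turned into probabilities with a softmax, and then a word is chosen either greedily or by sampling:

```python
import math
import random

# Made-up vocabulary and scores, standing in for the model's real output.
vocab = ["computer", "science", "biology", "art"]
logits = [2.5, 1.0, 0.3, -1.0]

def softmax(xs):
    # Convert raw scores into probabilities that sum to 1.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

probs = softmax(logits)

# Greedy decoding: always take the highest-probability word.
greedy = vocab[max(range(len(probs)), key=probs.__getitem__)]

# Sampling: draw a word in proportion to its probability, which is
# roughly what a non-zero "temperature" setting does in practice.
sampled = random.choices(vocab, weights=probs, k=1)[0]
```

Greedy decoding always continues “AI technology is a subfield of” the same way; sampling is what lets the same prompt produce different answers on different runs.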

That is how ChatGPT works.

The relevance of this for using, or prompting, ChatGPT is that while ChatGPT will always give an answer to any prompt, the precision and relevance of the answer will vary depending on the prompt: on the sequence and content of the words that constitute it. To maximize your chances of getting a good answer, you should include all the content that is relevant to your question (prompts are sometimes pages and pages long), while pursuing the somewhat contradictory goal of keeping your prompt concise, in the sense of not including anything irrelevant. This follows from the core of the Transformer mechanism described above, the multi-headed self-attention mechanism: every word you type in your prompt matters, since every word in the prompt attends to every other word in that prompt. All this applies if you’re seeking goal-oriented answers. If your prompt is “Write a poem about love” and you’re just exploring ChatGPT or your favourite Gen AI program for the sake of it, there’s nothing wrong with that either.
