The subtle biases of LLM training are difficult to detect but can manifest themselves in unexpected places. I call this the ‘Lone Banana Problem’ of AI.

Daniel Hook, CEO, Digital Science

Think of all the tools that the information age has brought us…

Think of all the people whose diseases have become manageable or have been cured. Think of all the economic benefits that have been generated through increased productivity. Think of all the routes that we have to express ourselves and share our creativity. Think of all the skills that were needed to power that revolution: the hardware engineering, the software engineering, and the development of whole new fields of knowledge and understanding.

Then think of all the wealth disparity that has been introduced into our world.  Think of the social anxiety of always being online.  Think of the undermining of our democratic institutions.

The effects of new technology are seldom either solely positive or solely negative and, as such, there is a responsibility that sits with those that develop technology to consider how it will be used. Underlying that responsibility sits a need to deeply understand technology. The recent rise of Large Language Models (LLMs) and their rapid adoption raise many questions about whether we understand the technology and whether we understand how it will impact us at a cultural, societal, or economic level.

It is clear that the experience that we have of programming and existing technologies will give way to different skills with this new technology.  The command of a programmer mindset and skill with languages such as C++ and Python will give way to the need to understand a dynamic meta language that is drawn from the patterns of online human interaction.  The new skill for the present technology is “speaking” language in the way that AI determines that language to be spoken from the inputs that it has consumed.

In that context, it is important to know that LLMs are not producing something new or creative; rather, they are producing (or reproducing) the statistical average of the inputs that they have consumed, in the context of the question they have been asked. Understanding that AIs do not understand us the way that we think they do is an important step in taking responsibility for building new tools on top of these technologies.

What follows attempts to illustrate this point.

Going Bananas

When I have written before on this topic, I have noted that it is not the large, obvious biases that concern me most; rather, it is the subtle biases that are difficult to detect that I think should be of more concern. I recently found what I regard as an excellent illustration of the phrase that I have chosen to describe the output of AI. I call this the ‘Lone Banana Problem’.

Bananas are attractive fruits, they have a jolly colour, and they taste great. I have had a running joke for several years with a friend of mine that due to his love of bananas he should use them more significantly in the branding of his business. When I signed up to Midjourney I saw the perfect opportunity to generate an idealised banana image for him to use. I started with a simple prompt: “A single banana casting a shadow on a grey background”. The result is shown in Figure 1.


Figure 1: Four initial outputs generated by Midjourney in response to the prompt “A single banana casting a shadow on a grey background”.

Now, the more astute amongst you may notice an issue with the outputs that Midjourney has produced in Figure 1. While the bananas are beautiful and look extremely tasty in their artistic pose casting their shadow on the grey background, you may notice that I asked for a single solitary banana, on its own, but none of the variants that I received contained just one banana. Of course, I thought, the error must be mine, I clearly must not have been sufficiently precise in my prompt. So, I tried variants – from “a perfect ripe banana on a pure grey background casting a light shadow, hyperrealistic”, to the more specific “a single perfect ripe banana alone on a pure grey background casting a light shadow, hyperrealistic photographic”, and even to the emphatic (even pleading) “ONE perfect banana alone on a uniform light grey surface, shot from above, hyperrealistic photographic”.

The Invisible Monkey Problem?

I mentioned this challenge to a friend of mine who has much more of a programmer’s brain than I. He asked me whether I had tried getting Midjourney to render a monkey with a banana and then asking it to imagine that the monkey was invisible. (You see what I mean about a programmer’s brain?) He (my friend, not the monkey) was surmising that the data around monkeys holding bananas, or bananas in a different context, might yield different results. The depressing result is included in Figure 2.


Figure 2: One of the outputs of the experimental prompt: “An invisible monkey with a single banana”.

You are quite right, that monkey (in Figure 2) should look sheepish! Firstly, he should be invisible and is conspicuous by well…his conspicuousness.  Secondly, he is holding not one but two bananas!  The results were the same with aliens holding bananas and other animals.  Slightly bizarrely, several of the monkeys ended up wearing bananas or being banana-coloured.

Every image that I asked Midjourney to produce contained two (or more) bananas, seemingly no matter how I asked.

I began to suspect that bananas, like quarks in the Standard Model of physics, might not naturally occur on their own; perhaps, through some obscure binding principle, they might only occur in pairs. I checked the kitchen. Experimental evidence suggested that bananas can definitely appear individually. Phew! But the fact remained that I couldn’t get an individual banana as an output from the AI. So, what was going on?

Bias Training

One of the problems of generative AI is that understanding what is going on inside the machine’s brain is almost impossible. There are interesting approaches such as TCAV that attempt to give us more of an insight, but as with a human brain, we don’t fully understand the process that goes on inside a deep learning algorithm.  

Testing and understanding the outputs for a given input is critical when deciding how this type of technology can be applied in real-world applications. Anyone who has studied chaos theory or who has heard of the butterfly effect will know that naturally complex systems are often highly sensitive to initial conditions. The Large Language Models (LLMs) that we are building have highly complex inputs in the form of vast amounts of data that are used to train these intelligences. However, like the Mandelbrot set and other complex-looking fractals, the rules that are then applied to the data to do the training are deceptively simple.
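To make the "deceptively simple rule" point concrete, here is a minimal Python sketch of the rule behind the Mandelbrot set: repeatedly apply z → z² + c and see whether z runs away. The function name `escape_time`, the iteration budget, and the two sample inputs are my own illustrative choices. It says nothing about how any AI model is trained; it only shows how a tiny change in input can flip the outcome of a very simple rule.

```python
def escape_time(c: complex, max_iter: int = 1000) -> int:
    """Iterate z -> z*z + c from z = 0.

    Returns the step at which |z| first exceeds 2, or max_iter if that
    does not happen within the iteration budget.
    """
    z = 0j
    for step in range(1, max_iter + 1):
        z = z * z + c
        if abs(z) > 2:
            return step
    return max_iter


# Two inputs that differ by only 0.01 behave in qualitatively different ways.
stays_bounded = escape_time(0.25)  # creeps towards 0.5 and never escapes
runs_away = escape_time(0.26)      # escapes well within the budget

print(stays_bounded, runs_away)
```

The one-line update rule is the whole "algorithm"; the complexity comes from applying it over and over.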

The bias to two bananas in a picture is, I believe, an example of a subtle bias (OK, it’s not that subtle, but it is more subtle than many of the more concerning news-grabbing biases that we regularly read about). A naïve explanation may be that in the training dataset there have been many pictures of bananas added to Midjourney’s database that have been labelled “banana” but not labelled “two bananas”. It may also be that Midjourney has never seen an individual banana, so it doesn’t know that a single banana is possible.
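The naïve explanation above can be sketched as a toy in Python. Everything here is an assumption made for illustration: the captions and banana counts are invented, the helper names are hypothetical, and a real image model does not work by looking up counts in a table. The sketch only shows how a caption word that always co-occurs with several bananas leaves a purely statistical learner with no evidence that one banana is an option.

```python
from collections import Counter

# Hypothetical toy "training set": (caption, bananas actually in the image).
# These pairs are invented for illustration; they are NOT Midjourney's data.
TOY_TRAINING_SET = [
    ("banana", 2),
    ("banana", 3),
    ("banana", 2),
    ("bananas on a table", 5),
    ("banana", 4),
    ("a single banana", 2),  # a loosely captioned photo of a pair
    ("banana", 2),
]


def count_distribution(keyword, dataset):
    """Empirical distribution of banana counts over captions containing the word."""
    counts = Counter(n for caption, n in dataset if keyword in caption.split())
    total = sum(counts.values())
    return {n: c / total for n, c in sorted(counts.items())}


def most_likely_count(keyword, dataset):
    """The banana count most often seen alongside the keyword."""
    dist = count_distribution(keyword, dataset)
    return max(dist, key=dist.get)


dist = count_distribution("banana", TOY_TRAINING_SET)
print(dist)                                            # no entry for a count of 1
print(most_likely_count("banana", TOY_TRAINING_SET))   # the pair wins
```

In this toy world, even the caption "a single banana" points at a picture of two, so asking for "single" does not help.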

The danger here is that due to the convincing nature of our interactions with AIs, we begin to believe that they understand the world in the way that we do.  They don’t.  AIs, at their current level of development, don’t perceive objects in the way that we do – they understand commonly occurring patterns.  Their reality is fundamentally different to ours – it is not born in the physical world but in a logical world. Certainly, as successive generations of AI develop, it is easy for us to have interactions with them that suggest that they do understand. Some of the results of textual analysis that I’ve done with ChatGPT definitely give the impression of understanding. And yet, without a sense of the physical world, an AI has a problem with the concept of a single banana.

Of course, the point of this article is not to be vexed about the lack of individual bananas in the AI’s virtual world. It is to point out that even though this technology is developing rapidly and its output is impressive, there are still gaps and, while they are not always immediately noticeable, they are not small. Ethical and responsible use of AI is easy to forget when faced with the speed of innovation and the constant press hype around AI. The lone banana problem is, in some sense, a less scary version of HAL, the AI in Arthur C. Clarke’s 2001: A Space Odyssey. Instead of an AI killing the crew of the Discovery, I have merely discovered a virtual universe in which bananas only appear in pairs.

While humans are amazing pattern matchers, that skill is augmented by common sense (in many but not all cases), context and an evolved and subtle understanding of the physical world around us. AIs don’t yet have those augmentations – they are pure pattern matching power. Hence, they are only as good as the data that we put into the training set, and can be no more than the statistical average of those inputs. In the lone banana problem, the statistics suggested that bananas only appear in twos (or more), so the AI could not imagine a single banana: on average, the data and the parametric tuning that had gone on didn’t allow it to consider that possibility.

Existential Questions

But this line of thinking does raise certain uncomfortable questions. Is human intelligence just the result of pattern matching in the context of an enriched relationship with a physical world? Is human morality simply the result of a pattern matching type feature with a sense of its own mortality? Or is there something deeper going on? Is inspiration or intuition something that an AI will be able to master, either through its learnt experiences or a richer relationship with the physical world? In a philosophical sense, what is creativity? And does human creativity differ fundamentally from machine creativity?

Certainly human creativity has limits (and, perhaps paradoxically, appears to become more limited with age – precisely as we have been exposed to more experiences). Some of those limits are, for example, the amount of data that we can perceive and process. Others appear in the extent to which we can imagine beyond our everyday experiences. Machine creativity appears to have different limits – not ones of data processing, but rather limits on what can be perceived to be relevant or important, and similar to human experience, on imagining beyond experience. Despite the alluring implication of the command “imagine” used to initiate a new prompt when getting Midjourney to create a set of images, to what extent is the AI able to imagine beyond the patterns that have been fed to it?

It also points to a deeper issue of the underlying nature of this technology. When programming became a more mainstream job in the 1980s and people started studying for computer science degrees, it was noted that programming required a certain way of thinking. It is the same for prompt engineering, but the type of thinking behind prompt engineering is not the same as for programming. Prompt engineering requires a deep understanding of language and, beyond that, a deep understanding of how a large language model understands language. Just as deep learning algorithms have a deep understanding of chess and Go and consequently make surprising moves due to their perception of the game, so large language models will surprise us due to their perception of the world. Their use of language and interpretation will be considerably more nuanced than ours, loaded with a myriad of cultural references that none of us can possibly have assimilated.

While I appreciate that AIs are a good deal more complex than my simple example shows, and indeed my example may even just be a facet of clumsy prompt engineering, it still demonstrates that you have to be incredibly careful about your assumptions when using AI. What it tells you may not be what you think it is telling you and it may not be for the reasons that you think either.

At Digital Science, we believe that we have a responsibility to ensure that the technologies that we release are well tested and well understood. The use cases where we deploy AI have to be appropriate for the level at which we know the AI can perform and any functionality needs to come with a “health warning” so that people know what they need to look for – when they can trust an AI and when they shouldn’t.

Postlog

After two weeks in despair of ever finding a lone banana, I tried a different style of prompt in Midjourney (either it has learnt some new tricks, or I might be getting better at “prompt thinking”). The prompt “A single banana on its own casting a shadow on a grey background” yielded Figure 3.

This shows two things. Firstly, the output can be very sensitive to initial conditions, by which I mean the prompt given: the difference between the phrases “A single banana casting a shadow on a grey background” and “A single banana on its own casting a shadow on a grey background” is not large, either in the semantics or the words used. However, the outcome is significantly different in its level of accuracy. Secondly, even with this improved prompt formulation, coupled with whatever upgrades may have been implemented by the Midjourney team in the prior two weeks, there is still one output that contains two bananas and, if you look closely, one attempted banana is trying to split itself in two!
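One way to take the guesswork out of this kind of comparison is to score prompt variants side by side rather than eyeballing them. The Python sketch below is a hedged illustration only: `single_object_rate` is a hypothetical helper, and `generate` and `count_objects` are stand-ins that you would have to supply yourself (for example, a call to an image generator and a person or detector counting bananas). The numbers in the usage example are placeholders that loosely mirror the figures described in this article, not measured results.

```python
from typing import Callable, Dict, Sequence


def single_object_rate(
    prompts: Sequence[str],
    generate: Callable[[str], Sequence[object]],
    count_objects: Callable[[object], int],
) -> Dict[str, float]:
    """For each prompt, return the fraction of outputs showing exactly one object."""
    rates = {}
    for prompt in prompts:
        images = generate(prompt)
        hits = sum(1 for image in images if count_objects(image) == 1)
        rates[prompt] = hits / len(images) if images else 0.0
    return rates


# Placeholder stand-ins: each "image" is simply the number of bananas in it.
PLACEHOLDER_OUTPUTS = {
    "A single banana casting a shadow on a grey background": [2, 2, 3, 2],
    "A single banana on its own casting a shadow on a grey background": [1, 1, 2, 1],
}

rates = single_object_rate(
    list(PLACEHOLDER_OUTPUTS),
    generate=PLACEHOLDER_OUTPUTS.__getitem__,  # replays the canned outputs
    count_objects=lambda image: image,         # the "image" is already a count
)
for prompt, rate in rates.items():
    print(f"{rate:.2f}  {prompt}")
```

Passing the generator and the counter in as arguments keeps the scoring logic independent of any particular tool, which matters when the tool itself changes from one week to the next.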


Figure 3: Four initial outputs generated by Midjourney in response to the prompt “A single banana on its own casting a shadow on a grey background”.
