What is generative AI?

Whether you’re reading a coherent article written by an AI or looking at a stunning image generated from a simple description, you might wonder: how do these models manage to produce such sophisticated content?

Generative AI refers to models designed to create new content—like text, images, music, or even video—by learning patterns from vast amounts of existing data. Unlike traditional AI models that classify, predict, or recognize patterns, generative models focus on creating something new by imitating what they’ve learned.

The most popular examples include text models like GPT, which can generate essays, stories, or answers to questions based on a prompt, and image models like DALL-E or Stable Diffusion, which generate images from textual descriptions. These models aren’t just parroting what they’ve seen; they’re learning underlying patterns that allow them to craft unique responses. To create lifelike text or images, generative AI models “learn” patterns, structures, and relationships in data, building the foundation they’ll later use to generate new content.

Training

GPT is trained on billions of sentences from books, websites, articles, and more. It “reads” this text to understand how words relate to one another, which words often appear together, and what kind of language is appropriate in different contexts. The model isn’t learning facts or memorising sentences. It is picking up on statistical patterns that represent the structure and style of human language.

For image generation models like DALL-E, training involves millions of images paired with descriptions. The model learns what “mountains,” “sunset,” or “ocean” typically look like, as well as the relationships between different visual elements. Over time, it builds an understanding of colours, shapes, textures, and composition, so it can recreate these elements in new, generated images.

Both text and image generative models are based on a type of neural network called a transformer. In a transformer model, data moves through multiple layers, with each layer focusing on different features or patterns. In a model like GPT, each layer of neurons processes the text data, focusing on a different aspect of language. The first layers might capture basic relationships between words, like grammar and syntax. The middle layers could focus on sentence structure, identifying how ideas are typically organised. The final layers focus on high-level relationships, like tone, context, and style. This layering allows GPT to understand not only how to form a sentence but also how to mimic different tones, styles, and levels of formality.
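The idea of data passing through stacked layers can be sketched in a few lines of code. This is only a toy illustration, not a real transformer: each layer here is just a linear transform followed by a non-linearity, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(x, weights):
    """One toy layer: a linear transform followed by a ReLU non-linearity."""
    return np.maximum(0, x @ weights)

# A toy "embedding" of a 4-token input, with 8 features per token.
x = rng.normal(size=(4, 8))

# Three stacked layers; a real transformer layer would also contain
# attention and normalisation, omitted here for brevity.
layers = [rng.normal(size=(8, 8)) for _ in range(3)]

for w in layers:
    x = layer(x, w)

print(x.shape)  # the representation keeps its shape, (4, 8),
                # but each pass re-mixes the features
```

The point of the sketch is the shape of the computation: the same input is transformed repeatedly, so each layer works on the output of the one before it rather than on the raw data.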

For image generation models, the layers have similar functions but are designed to analyse visual information. Early layers capture low-level details, like edges and colours, while deeper layers detect shapes, patterns, and objects. This hierarchy of layers allows the model to recognise and generate complex scenes, like a bustling street or a serene landscape.

Generating content

Once trained, a generative AI model can start generating content by using the patterns it has learned. This involves a process called sampling, where the model predicts the next word (in text) or the next refinement of the pixels (in images) based on what it has seen so far. When you give GPT a prompt, it generates responses word by word. For each word, the model calculates probabilities for possible next words based on what it has learned.

For example, if you start with “The sun sets over the…,” GPT might predict words like “ocean,” “hills,” or “city” as the most likely continuations, depending on the context. The model chooses words based on these probabilities, often picking the one with the highest likelihood, though it can introduce randomness to keep responses creative and avoid sounding repetitive.

In diffusion-based image generation (as in Stable Diffusion), the model starts with a “noise” image (essentially static) and gradually refines it into a coherent picture, removing noise step by step guided by the prompt. The model uses what it has learned about textures, shapes, and colours to generate pixels that make sense together, creating an image that matches the description you provided.
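The word-by-word sampling described above can be sketched in a few lines. The candidate words and their probabilities here are invented for illustration, and the `temperature` knob is the usual way the randomness of the choice is controlled.

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_next_word(candidates, probs, temperature=1.0):
    """Pick the next word from a probability distribution.

    temperature < 1 sharpens the distribution (more predictable),
    temperature > 1 flattens it (more varied / creative).
    """
    logits = np.log(np.asarray(probs))
    scaled = logits / temperature
    p = np.exp(scaled - scaled.max())
    p /= p.sum()  # renormalise so the probabilities sum to 1
    return rng.choice(candidates, p=p)

# Hypothetical probabilities for continuing "The sun sets over the ..."
candidates = ["ocean", "hills", "city", "horizon"]
probs = [0.45, 0.25, 0.20, 0.10]

word = sample_next_word(candidates, probs, temperature=0.7)
print(word)  # usually, but not always, the most likely word
```

With a very low temperature the model almost always picks the single most likely word; with a higher temperature, less likely words get a real chance, which is where the “creative” variation in responses comes from.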

One of the keys to human-like text and image generation is contextual awareness. Transformer models have an attention mechanism that helps them focus on relevant parts of the data, allowing them to produce coherent and contextually relevant content. For GPT, context is essential. If you ask it to write a story in the style of a fairy tale, it picks up on the language and style typical of that genre. This attention to context allows it to maintain consistency, whether that’s a tone, theme, or ongoing storyline.

For instance, if GPT is generating a conversation between two characters, it will remember what each character has said so far, maintaining personality traits or storyline elements over multiple interactions.

In image generation, context is about creating consistency within a scene. If you ask for a “red apple on a blue table,” the model needs to maintain the colour, size, and position of each element to create a coherent image. Attention mechanisms help the model “remember” and focus on these details to deliver a cohesive result.
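The attention mechanism behind this can be sketched in simplified form as scaled dot-product attention. The queries, keys, and values below are random stand-ins for learned representations; the sketch only shows the mechanics of how each position weighs every other position.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: each position builds its output as a
    weighted mix of all values, with weights given by how well its query
    matches every key."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    # softmax over the keys, so each row of weights sums to 1
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))  # 3 "tokens", 4-dimensional representations
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))

out, weights = attention(Q, K, V)
print(weights.round(2))  # each row shows how much one token "attends"
                         # to every other token
```

The weight matrix is the useful part: it makes explicit which earlier tokens (or image regions) the model is focusing on when it produces each part of its output.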

Limitations of 'Gen AI'

By learning from vast datasets, understanding patterns, and predicting outputs, these models can produce impressive, human-like text and images. However, they still lack true understanding, which means we need to use them with awareness of their limitations. As generative AI continues to improve, it will open up even more possibilities, but it will remain a tool.

True understanding
Generative models don’t “understand” language or images like humans do. They rely on statistical patterns rather than real comprehension, which can lead to errors or odd results, especially with complex or abstract prompts.
Bias
Since models like GPT and DALL-E are trained on vast datasets sourced from the internet, they can inherit biases or inaccuracies present in the data. This can affect the quality or fairness of their output.
Inability to reason
Text-based generative models don’t fact-check. If you ask GPT for information, it might confidently present incorrect details, simply because it has seen similar patterns in the data. This can make it risky to rely on AI-generated text for critical information.
Complex scenes
Image generation models may struggle with complex scenes or intricate details, especially if they haven’t seen similar examples in training. For instance, a prompt involving very specific scenarios (like “a unicorn riding a skateboard at sunset in the style of an oil painting”) may yield results that are visually impressive but lack precision.

'Big tech', for the everyday business

Our mission is to ensure that all businesses, regardless of size, can take advantage of the 'big tech' and AI revolution. Get started on Breezy for free and scale as you grow. Time to see how AI can help your business.

Try Breezy