AI Multimodal Models are Coming

What happens when you can generate and interpret text, video, audio and more from one tool

[The image above was generated by Midjourney. The prompt I used to create it is listed at the end of this email.]

Generative AI, with its ability to create content ranging from text to images, has already made significant strides in transforming our digital experiences. However, the emergence of multimodal models is set to usher in a new era, fundamentally reshaping how we interact with AI systems.

What are Multimodal AI Models?

Multimodal AI models are a type of machine learning model that can understand, interpret, and generate content across different data types or modalities. Traditional AI models are unimodal, meaning they are designed to handle a single type of data, be it text, images, or audio. In contrast, multimodal models can simultaneously process information from text, images, audio, and even video, giving them a more holistic understanding of the data they're analyzing.

Why are Multimodal Models Important?

Multimodal models have emerged as a crucial technology in artificial intelligence, enabling a comprehensive understanding of complex scenarios by processing multiple data types. These models uniquely integrate information from various modalities, such as audio, video, and text, offering a more holistic perspective. From sentiment analysis in videos to medical diagnostics, the importance of multimodal models lies in their capability to provide nuanced insights, streamline processes, and enhance accuracy.

  • Holistic Understanding: These models can provide a more comprehensive and nuanced understanding of complex scenarios by processing multiple data types. For instance, understanding the sentiment of video content might require analyzing both the spoken words (audio) and the visual cues (video).

  • Versatility: A single multimodal model can replace multiple unimodal models, leading to more efficient and streamlined systems. For businesses, this means reduced costs and complexity in deploying AI solutions.

  • Improved Accuracy: Combining information from multiple sources can lead to better decision-making. For example, analyzing textual patient records and medical images in medical diagnostics can result in more accurate diagnoses.
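To make the idea of combining modalities a little more concrete, here is a minimal sketch of one simple approach, sometimes called late fusion: each modality is scored separately, and the scores are then blended with a weighted average. The function name, the score range of -1 to 1, and the example numbers are all assumptions made for illustration; production multimodal models typically learn how to combine modalities rather than using fixed weights.

```python
def fuse_sentiment(scores, weights=None):
    """Combine per-modality sentiment scores (assumed to be in [-1, 1])
    into one score using a weighted average."""
    if not scores:
        raise ValueError("at least one modality score is required")
    if weights is None:
        # Default assumption: every modality is trusted equally.
        weights = {name: 1.0 for name in scores}
    total = sum(weights[name] for name in scores)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[name] * weights[name] for name in scores) / total


# Hypothetical scores: the spoken words sound positive, the visual cues do not.
scores = {"audio": 0.6, "video": -0.2}

print(fuse_sentiment(scores))                                # equal trust in both
print(fuse_sentiment(scores, {"audio": 1.0, "video": 3.0}))  # trust video more
```

Shifting the weights changes the verdict, which is the point of the bullet above: neither the audio nor the video alone tells the whole story.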

The significance of multimodal models cannot be overstated. These models offer a holistic understanding of complex scenarios by combining different data types, leading to more nuanced insights and better decision-making. Moreover, their versatility allows the replacement of multiple unimodal models, reducing costs and complexity in deploying AI solutions. Ultimately, the improved accuracy achieved through integrating information from multiple sources has the potential to revolutionize various industries, ranging from healthcare to customer service, making multimodal models an essential component of the AI landscape.

Unified User Experience

Imagine querying an AI system with a spoken question, showing it a picture, and receiving a written response - all within a single platform. Multimodal models can seamlessly integrate various data types, offering a unified and fluid user experience. Users will no longer need to switch between different tools or platforms based on the data type they are working with.

Enhanced Creativity and Content Generation

Generative AI's capabilities will be significantly expanded with multimodal models. For instance, users could provide a textual description and receive a generated image that matches the description, or vice versa. This opens avenues for enhanced creativity, from custom artwork and music generation to interactive storytelling, where the narrative can shift based on visual or auditory inputs.

Contextual Understanding

One of the challenges with current AI models is their limited context. With the ability to process multiple data types, multimodal models can have a more holistic understanding of user inputs. For example, when shown a picture of a beach and asked about the weather, the model can respond more accurately by combining visual cues from the image with textual information.

Accessibility and Inclusivity

Multimodal models can cater to diverse user needs, making AI interactions more accessible. Those with visual impairments can use auditory inputs, while hearing-impaired users can rely on visual or textual data. This inclusivity ensures that a broader audience can benefit from the advancements in generative AI.

Google's Gemini: Pioneering the Multimodal Revolution

In May 2023, at the Google I/O developer conference, Google CEO Sundar Pichai unveiled the company's ambitious AI project, Gemini. Developed by Google DeepMind, the unit formed by merging Google's Brain team with DeepMind, Gemini is poised to be a game-changer in the AI space.

  • Multimodal Capabilities: Gemini is designed to be inherently multimodal, processing various data types, including text and images, to enhance its conversational abilities.

  • Advanced Features: Pichai hinted at groundbreaking features for Gemini, such as memory and planning, which could empower the AI to perform tasks requiring intricate reasoning.

  • Industry Response: The buzz around Gemini is palpable. Notably, Elon Musk hinted at keen competition in AI, suggesting that Gemini might be a formidable rival to existing models.

  • The Bigger Picture: With its advanced features and multimodal capabilities, Gemini represents a major leap in natural language processing. If it lives up to its promise, it could redefine interactive AI, aligning with Google's mission to bring AI to billions responsibly.

OpenAI's OpenModal: A Formidable Contender

As 2023 progresses, OpenAI is not far behind in the multimodal race with its project, OpenModal. Eager to lead the AI revolution, OpenAI is gearing up to launch its next-generation large language model with multimodal capabilities.

  • The Multimodal Race: OpenAI is in a tight race with Google, aiming to integrate its most advanced LLM, GPT-4, with multimodal features that could rival Google's Gemini.

  • GPT-Vision: Six months after the GPT-4 launch, OpenAI is preparing to roll out GPT-Vision, indicating a broader application of their multimodal capabilities.

  • The Bigger Picture: OpenAI's endeavors with OpenModal showcase its commitment to pushing AI boundaries. As they compete with Google, the future of AI promises innovations and breakthroughs.

Real-world Applications

The potential applications of multimodal models are vast. In healthcare, doctors could receive diagnostic suggestions by showing a model patient scans and describing symptoms verbally. In education, students could learn through interactive modules that respond to written queries and visual demonstrations.

The rise of multimodal models is not just an incremental step in the evolution of AI; it's a paradigm shift. These models promise a more integrated, intuitive, and enriching interaction with generative AI by breaking down the barriers between different data types. As we stand on the cusp of this new era, it's exciting to envision a future where the limitations of data types do not confine our digital interactions but are instead elevated by the boundless possibilities of multimodal intelligence.

Real-world Applications

  1. Healthcare: Multimodal AI can analyze patient data, including medical history, X-rays, and MRI scans, to provide more accurate diagnoses and treatment recommendations.

  2. Entertainment: In movies and gaming, these models can generate real-time content that responds to user inputs, creating a more immersive experience.

  3. E-commerce: By analyzing product images and descriptions, multimodal AI can offer better product recommendations to users.

  4. Education: These models can provide a more comprehensive assessment of students by analyzing written essays, oral presentations, and visual projects.

  5. Security: Multimodal AI can be used in surveillance systems, analyzing visual footage and audio data to detect potential threats.

Challenges and Considerations

While the potential of multimodal AI is vast, there are challenges to consider:

  1. Data Privacy: As these models process diverse data types, ensuring the privacy and security of this data becomes paramount.

  2. Complexity: Training multimodal models requires vast amounts of diverse data and significant computational power.

  3. Interpretability: Understanding how these models make decisions can be challenging, which is crucial for applications in sectors like healthcare and finance.


The rise of multimodal AI models heralds a new era in artificial intelligence. Their ability to process and understand multiple data types simultaneously offers a more holistic and nuanced perspective, opening doors to numerous applications across sectors. As with any technological advancement, it's essential to approach its adoption thoughtfully, considering its potential benefits and challenges. Nevertheless, the future of AI looks promising with the advent of multimodal models.

Tip of the Week: Creating and Editing Images with AI using Adobe Firefly

Adobe Firefly is a part of Adobe's Creative Cloud suite and is designed to harness the power of artificial intelligence in creating artwork. The tool has been developed with generative AI technology, allowing users to produce stunning visual content.

Key Features of Adobe Firefly:

  1. Text to Image: Generate images based on a text prompt. The user-friendly interface lets you input a text prompt and generate art with a click.

  2. Generative Fill: This feature lets you remove objects or paint in new ones using text prompts.

  3. Text Effects: Apply various styles and textures to text based on a prompt.

  4. Generative Recolor: This tool allows you to create color variations of your vector art from a text prompt.

Tips for Creating Stunning AI Art with Adobe Firefly:

  1. Start With an Idea: Begin with a clear concept of what you want to create.

  2. Use Style Modifiers: Enhance your images by including style modifiers in your prompts.

  3. Add a Boost Word: Words like "beautiful" or "detailed" can enhance the quality of your output.

  4. Be Clear Yet Descriptive: Provide detailed instructions for best results.

  5. Add Repetitions: Repeating specific modifiers can help the AI better understand your vision.
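If you like to keep your prompts organized, the five tips above can be sketched as a small helper that assembles a prompt from its parts. This is only an illustration: `build_prompt` and its parameters are hypothetical names, the comma-separated layout is an assumption, and the finished string is meant to be pasted into Firefly's prompt box by hand rather than sent to any API.

```python
def build_prompt(idea, style_modifiers=(), boost_words=(), repeat=None, repeat_count=1):
    """Assemble a text-to-image prompt from an idea, style modifiers,
    boost words, and an optional modifier to repeat for emphasis."""
    parts = [idea.strip()]                                # Tip 1: start with an idea
    parts.extend(m.strip() for m in style_modifiers)      # Tip 2: style modifiers
    parts.extend(w.strip() for w in boost_words)          # Tip 3: boost words
    if repeat:
        parts.extend([repeat.strip()] * repeat_count)     # Tip 5: repetitions
    # Tip 4: clear yet descriptive, so drop any empty pieces before joining.
    return ", ".join(p for p in parts if p)


prompt = build_prompt(
    idea="a lighthouse on a rocky coast at sunset",
    style_modifiers=["watercolor", "soft lighting"],
    boost_words=["detailed"],
    repeat="watercolor",
)
print(prompt)
```

Keeping the parts separate makes it easy to swap one style modifier for another and compare the results side by side.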

Unique Features of Adobe Firefly:

  1. Optional Tags and Modifiers: Unlike other text-to-image generators that require manual text input for instructions, Firefly offers a wide variety of tags and modifiers through easy-to-use menus. This feature simplifies the process and enhances the final image output.

  2. Aspect Ratio Choices: Firefly provides four aspect ratio options: square, landscape, portrait, and widescreen. Your choice will significantly influence the final image, especially regarding the amount of blank space the AI must consider.

  3. Content Type, Color, Tone, and More: Dropdown menus allow users to specify content type (Photo, Graphic, or Art), color and tone, lighting, and composition. For more creative outputs, setting the Content-Type to Art is recommended.

  4. Styles Menu: With 63 aesthetic toggles (as of the time of writing), users can apply various styles to their prompts. However, it's essential to ensure that the chosen styles complement each other for the best results.

Mastering the Art of Prompting in Firefly:

  1. Influences from Historical Artists: Firefly is trained on images owned by Adobe, so adding "in the style of [artist name]" may not be as helpful as it is in other models like DALL-E or Midjourney. Don't ask for something contemporary that might be copyrighted, like a street art painting in the style of Banksy or a photo in the style of Annie Leibovitz.

  2. Reference Images: One of Firefly's standout features is the ability to use a generated image as a "reference image." Users can adjust the emphasis between the reference image and the prompt, allowing more control over the AI's output.

  3. Navigating Your Work: Firefly is browser-based; users can easily navigate their prompt history using browser navigation buttons.

  4. Text Effects and Recolor Vectors: Apart from the primary text-to-image service, Firefly offers tools for text effects and recoloring vectors. The text effects tool is up-and-coming, allowing users to apply visual effects to AI-generated lettering. Visual m

A Firefly Hack

Adobe Firefly's prompts can be enhanced using "flags" or variables. For instance, Adobe staff have demonstrated using [avoid=background] in Firefly's text-to-image prompts. While some of these variables are described as "buggy," knowledge of their format can help users frame their prompts more effectively.

Adobe Firefly Features Integrated into Adobe Creative Cloud

Additionally, Adobe Firefly is being integrated into various Adobe apps, allowing users to make transformations in seconds. Some of the integrations include:

  • Generative Fill and Generative Expand in Photoshop: Users can use a simple text prompt to add or remove content from any image, or drag beyond the image border to seamlessly fill the expanded canvas with matching content.

  • Generative Recolor in Illustrator: This feature unlocks endless color combinations with simple text prompts.

  • Text to Image and Text Effects in Adobe Express: Users can easily create social posts, videos, flyers, banners, and cards with these features.


Adobe Firefly isn’t bad. I prefer Midjourney, but that’s a personal preference. Firefly’s most significant advantage is its integration into Adobe's apps; if you are already a Photoshop wizard, you will probably benefit more from the native integrations than from the stand-alone app.

AI Tools I am Evaluating

  • MyMind - Save anything with a click and stay in the flow. mymind understands what it is and remembers the necessary details, so you don't have to.

  • HeyGen Video Translate - Translate your videos seamlessly with one click, using a natural voice clone and authentic speaking style.

  • Hubspot Campaign Assistant - Generate copy for landing pages, emails, or ads in just a few clicks so you can focus on tasks that need a human touch.

Midjourney Prompt for Header Image

For every issue of the Artificially Intelligent Enterprise, I include the Midjourney prompt I used to create the header image for that edition.

Photography of a photorealistic visualization of audio, visual, and text-based AI, with 3D projections of sound waves, images, and digital text, set against a backdrop of a high-tech control room, wide-angle shot capturing the room's advanced equipment, lit by cool blue and purple lights, photographed by Jimmy Nelson, inspired by the dance of digital and physical realms, highlighting the AI's real-time analytics, with sharp focus on the projections' clarity, award-winning, featured on Behance --ar 16:9
