

Creating a Music Video with AI: How We Brought Tommy Ramm’s ‘Knöllchen, Liebe, Pfefferspray’ to Life

[Embedded YouTube video: the finished music video for “Knöllchen, Liebe, Pfefferspray”]

Generative AI systems have seen explosive growth in the past year, offering users a wide range of tools. By now, most people are familiar with platforms like ChatGPT and Midjourney, which have become staples of AI-driven text and image generation. However, generative AI has extended its reach beyond text and images to encompass audio, video, and much more. As you read this article, new platforms like Picsagon.com are emerging to bridge the gaps between content creation, editing, and publishing.

In this article, we’ll take you behind the scenes of how we used AI to create the music video for German Schlager star Tommy Ramm’s hit single “Knöllchen, Liebe, Pfefferspray” (translated as Speeding Tickets, Love, Pepper Spray).

Overcoming Challenges in Traditional Music Video Production

Our initial plan to shoot the music video traditionally quickly hit roadblocks. Tommy’s schedule was tight, and the unpredictable rainy weather in Germany left us with little time and even fewer opportunities to shoot. To make matters worse, we would have needed city permits, costumes, power supplies, catering, actors, and a crew, all of which were simply out of reach given the time and budget constraints. Faced with these challenges, we decided to take an unconventional approach.

We captured a handful of photos featuring Tommy and an actress we were familiar with. From there, we transitioned to digital production. Tommy was enthusiastic about this creative direction from the beginning and visited our studio daily to check on our progress.

After some experimentation, we chose the newly released Flux model, running in ComfyUI, to generate the visuals for our video. The song’s narrative centers on a man who repeatedly gets speeding tickets in an attempt to catch the attention of a policewoman. Initially, we envisioned Tommy driving an expensive luxury car, but since he owns a VW in real life, he suggested we aim for something more relatable and true to his character.

Adopting an AI-Driven Approach

Luckily for us, the new generative image model “Flux” by Black Forest Labs had been released a couple of weeks earlier, and we decided to build our approach around it. While other AI models struggle to replicate realistic logos without extensive additional training (for example, via a LoRA), Flux delivered what we needed right away. Training a LoRA for Flux would have further improved the car’s appearance, but our focus remained on the song rather than the brand, so a consistent visual representation was all we needed.
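
For readers who want to try something similar outside of ComfyUI, here is a minimal sketch of a Flux render using Hugging Face’s diffusers library. The prompt, resolution, and sampler settings are illustrative examples, not our production values:

```python
# Minimal sketch: rendering a single Flux still with Hugging Face diffusers.
# We worked inside ComfyUI; this standalone script only illustrates the idea.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # lets the model fit on a consumer GPU

image = pipe(
    prompt="A German Schlager singer leaning on a VW, sunny street, photorealistic",
    height=768,
    width=1344,  # roughly 16:9, matching video framing
    guidance_scale=3.5,
    num_inference_steps=28,
    generator=torch.Generator("cpu").manual_seed(42),  # reproducible takes
).images[0]
image.save("tommy_vw_scene.png")
```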

Building Digital Avatars: Bringing Tommy and Nathalie to Life

Next, we focused on creating a digital avatar of Nathalie, the actress featured in the video. She was both impressed and slightly unnerved at how easily we could create a digital twin capable of performing any action we wanted.

Managing Ethical Concerns: Protecting Digital Content Rights

In one scene, Tommy fantasizes about Nathalie wearing various provocative outfits. Naturally, this raised concerns about how such footage might be used in other contexts, so we included contractual clauses specifying that only pre-approved scenes would be used.

On the flip side, Nathalie had a good laugh when we showed her some outtakes that didn’t make the final cut.

Overcoming the Challenges of Generative AI: Handling Imperfections and Model Quirks

With both Tommy and Nathalie’s digital avatars ready, we generated over a hundred images for various scenes. However, not every image was flawless; even Flux had its off days, producing characters with more than five fingers or poses that defied physics. Ensuring that every image accurately depicted either Tommy or Nathalie was an additional challenge, as our workflow in ComfyUI occasionally mixed things up, requiring us to re-render certain scenes.
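
Our practical answer to these quirks was volume: render several seeded candidates per scene and keep only the good ones. A rough sketch of such a loop (prompts, folder names, and the number of candidates are examples, not our exact setup):

```python
# Sketch: render multiple seeded variants per scene so flawed outputs
# (extra fingers, impossible poses) can simply be discarded during review.
from pathlib import Path

import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()

Path("renders").mkdir(exist_ok=True)

scene_prompts = {  # illustrative scene prompts
    "ticket_scene": "A policewoman writing a ticket at a car window, golden hour",
    "vw_scene": "A man in a VW waving at a police car, German city street",
}

for scene, prompt in scene_prompts.items():
    for seed in range(4):  # a handful of candidates per scene
        image = pipe(
            prompt=prompt,
            generator=torch.Generator("cpu").manual_seed(seed),
        ).images[0]
        # Keeping the seed in the filename makes a good take reproducible.
        image.save(f"renders/{scene}_seed{seed}.png")
```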

Animating AI-Generated Scenes: Turning Stills into Dynamic Visuals

Once the images were finalized, we moved on to animation. Initially, we experimented with the AnimateDiff model for Stable Diffusion, but the output was too inconsistent to be reliable. Instead, we used Runway to create scenes lasting between five and ten seconds. As we assembled the footage in our editing tool, the scenes naturally aligned with different parts of the song, allowing us to fine-tune music samples for each specific segment.
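
We drove Runway through its web interface, but the image-to-video step can also be scripted. The sketch below assumes Runway’s official Python SDK (the runwayml package) and its Gen-3 turbo image-to-video model; treat the parameters as illustrative rather than as our production settings:

```python
# Sketch: image-to-video via Runway's Python SDK (we used the web UI).
# Assumes a RUNWAYML_API_SECRET environment variable and a hosted input image.
import time

from runwayml import RunwayML

client = RunwayML()  # picks up RUNWAYML_API_SECRET from the environment

task = client.image_to_video.create(
    model="gen3a_turbo",
    prompt_image="https://example.com/renders/vw_scene_seed2.png",  # placeholder URL
    prompt_text="The man smiles and waves; the camera slowly pushes in",
    duration=5,  # our clips ran between five and ten seconds
    ratio="1280:768",
)

# Poll until the render finishes, then print the resulting video URL(s).
while (result := client.tasks.retrieve(task.id)).status not in ("SUCCEEDED", "FAILED"):
    time.sleep(10)
print(result.status, result.output)
```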

Mastering AI-Driven Lip Syncing: Finding the Right Tools and Techniques

One of the most complex tasks was achieving realistic lip-syncing. At first, we aimed to sync all shots, but it quickly became clear that this would be overwhelming. We narrowed our focus to about 20 key shots where lip-syncing would have the most impact. We initially tried using Runway for this, but its lip-sync feature was subpar, making characters appear stiff and the sync itself blurry. Only one scene in the final video was done using Runway’s lip-syncing, and even that was barely acceptable.

Leveraging Advanced Animation Tools: The Power of ‘Hallo’ and Custom Portraits

For the rest of the shots, we opted for a more intricate solution. We planned to use LivePortrait in ComfyUI, but we knew additional animation would be necessary. Animating in LivePortrait is tricky, however: while it’s easy to create footage using the default driving videos, recording our own driving takes would have meant facing the camera directly, which feels unnatural and would have required multiple retakes. To avoid this hassle, we turned to an alternative tool called “Hallo,” available on GitHub. Hallo allowed us to animate portraits with various emotions and lip-sync them to a WAV file, even enabling them to “sing,” a perfect fit for our needs. We created 20 “driving portraits” for the key lip-syncing parts using a character built in Picsagon.com.
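
Hallo ships as a research repository rather than a library, so it is driven from the command line. A sketch of how such a batch run can look; the script path and flag names follow the project’s README at the time of writing and may differ between versions:

```python
# Sketch: batch-running Hallo's inference script over our driving portraits.
# Folder names are hypothetical; flags follow the repo's README and may change.
import subprocess
from pathlib import Path

portraits = sorted(Path("driving_portraits").glob("*.png"))  # built in Picsagon.com
vocal_track = "tommy_vocals.wav"  # isolated vocal stem of the song

Path("hallo_out").mkdir(exist_ok=True)
for portrait in portraits:
    subprocess.run(
        [
            "python", "scripts/inference.py",
            "--source_image", str(portrait),
            "--driving_audio", vocal_track,
            "--output", f"hallo_out/{portrait.stem}.mp4",
        ],
        check=True,  # fail loudly if a render breaks
    )
```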

We then used each of these driving portraits in ComfyUI to control Tommy’s facial expressions in the lip-sync scenes.
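
Inside ComfyUI this step is a node graph, but the standalone LivePortrait repository exposes the same idea through its documented command line: a source image plus a driving video. A rough equivalent of our per-scene loop, with hypothetical folder names:

```python
# Sketch: transferring a Hallo-generated performance onto a rendered scene
# still of Tommy. Mirrors LivePortrait's documented CLI (-s source, -d driving);
# in production we did this through ComfyUI nodes instead.
import subprocess
from pathlib import Path

scene_stills = sorted(Path("lipsync_scenes").glob("*.png"))
driving_clips = sorted(Path("hallo_out").glob("*.mp4"))

for still, driving in zip(scene_stills, driving_clips):
    subprocess.run(
        [
            "python", "inference.py",
            "-s", str(still),    # the rendered scene image of Tommy
            "-d", str(driving),  # the Hallo clip driving mouth and expression
        ],
        check=True,
    )
```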

Fine-Tuning AI Workflows: Tackling Technical Hurdles in Video Production

As we refined the lip-syncing, we noticed that some videos needed rework due to excessive mouth movement or exaggerated facial expressions. LivePortrait tends to amplify these expressions by overlaying driver motions onto the original footage, sometimes resulting in overly dramatic looks. One such exaggerated scene remains in the final cut when Tommy asks, “Hast du heute frei, ich hab für uns gekocht” (Are you free today? I cooked for us). Although it wasn’t perfect, Tommy found his desperate expression amusing and insisted on keeping it.

We faced several technical challenges, particularly with LivePortrait and ComfyUI: adding driving sound to the original footage often shortened the clips by five frames. We suspect this is a bug in ComfyUI and hope it will be addressed in future updates; a sketch of a possible workaround follows below. After five days of intensive digital production and editing, we completed the music video, and Tommy finally got the visuals for his catchy song, which you can watch on YouTube.
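
Until then, the frame loss can at least be detected and compensated after the fact. This sketch uses standard ffprobe/ffmpeg options; the file names and the 25 fps project rate are examples, not taken from our project files:

```python
# Sketch: detect clips that lost frames after the audio pass and pad them
# by freezing the last frame. Uses standard ffprobe/ffmpeg options.
import subprocess

def frame_count(path: str) -> int:
    """Count the video frames in a file with ffprobe."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "v:0",
         "-count_frames", "-show_entries", "stream=nb_read_frames",
         "-of", "csv=p=0", path],
        capture_output=True, text=True, check=True,
    )
    return int(out.stdout.strip())

before = frame_count("scene_07_silent.mp4")
after = frame_count("scene_07_with_audio.mp4")

if after < before:  # in our case the difference was consistently five frames
    missing_seconds = (before - after) / 25  # 25 fps project rate (example)
    subprocess.run(
        ["ffmpeg", "-y", "-i", "scene_07_with_audio.mp4",
         "-vf", f"tpad=stop_mode=clone:stop_duration={missing_seconds}",
         "-c:a", "copy", "scene_07_padded.mp4"],
        check=True,
    )
```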

The Future of AI in Creative Production: Lessons Learned and Next Steps

This project highlighted the transformative potential of generative AI in creative production. In just five days, we were able to produce a fully AI-driven music video, overcoming challenges that would have been insurmountable using traditional methods. Despite some technical hurdles, the creative possibilities these tools offer are groundbreaking.

Looking ahead, we’re excited to further explore how AI can enhance digital storytelling. Tommy’s video has opened the door to more innovative projects where AI plays a central role, pushing the limits of what’s possible in content creation.

This is just the beginning. As these technologies evolve, so will our approach to harnessing them, blending creativity with cutting-edge AI to shape the future of media production.
