
The AI underground has me feeling nostalgic

You probably haven't heard of the most interesting ways creative people are building with AI. It's fascinating.


One of my favorite memories using software was learning Blender for the first time. I was a freshman in high school, and our final English project was to research and do a presentation on history's most famous playwright, Shakespeare. I wouldn't say I was the biggest Shakespeare fan at the time, but I figured I'd make the most of having to slog through the assignment. I ended up building a very rough 3D model of the Globe Theatre, the popular theater of Shakespeare's time.

We take it for granted now, but software wasn't as easy to use then as it is today. There wasn't glTF or USDZ (3D file packaging formats) to make 3D models easily viewable across different platforms. I had to figure out how to burn my project onto a CD and then get it to open on the ancient computer in my English class. Luckily it all worked out and, from what I recall, my teacher liked the project.

I've been spending some time learning about the lesser-known software and workflows for AI-powered media generation, and I can't help but feel a bit nostalgic. It feels like those pre-glTF and USDZ days when you had to do heavy finagling to get things done. There are amazing tools in the mainstream, like Veo 3 and Sora 2 for video generation, Firefly for image generation, and Nano Banana for photo editing. These tools are incredible in what they can produce with just a prompt, but when working creatively, a chatbot interface is far too small a canvas to paint with.

Beyond the chat box, there's a range of tools and workflows that more adventurous AI enthusiasts have been using, and what stood out to me is the depth of work required to produce their desired outputs. I was surprised to see that the technical work involved is comparable in depth to some of the filmmaking post-production processes I've seen in my career, like compositing and other VFX work.

These AI enthusiasts are deploying advanced workflows and creating custom tools to craft what they desire, unbounded by the limits imposed by OpenAI, Google, and the other big tech companies that provide state-of-the-art foundation models. I'll break down a little bit of what I'm seeing.

GPT-OSS

First, a little bit about open versus closed models. Most of the AI models we interface with on a regular basis are closed models. These are the best-in-class models from the big tech companies: GPT-5, Claude Sonnet 4.5, Gemini 2.5, and Grok. Millions of dollars were spent training and fine-tuning these models before they made their way to our browsers and devices. In some cases we get to access these models for free, and in other cases we pay for the privilege.

Either way, we are bounded by the rules set by their providers in terms of what we can prompt these models to produce, and we can typically only interface with these models through APIs or the application frontends that companies provide. In other words, we can only use the models insofar as the companies allow us to.

There are also open-weight models. These models are sometimes released by big companies like Google (Gemma 3) and OpenAI (GPT-OSS), but also by smaller or foreign companies. If you remember the news from some months back, there was a big splash when the China-based DeepSeek model was released. It made waves in part because of how powerful it was as an open model and because it apparently used new techniques to make its training less expensive.

While these models are capable, they're generally less powerful than the closed models we use when we access ChatGPT or Gemini. But what's traded for raw power is accessibility. With these models, users aren't bound to the policies and restrictions set by providers. And with a powerful enough computer, users can even download and run these models locally. This unlocks whole new possibilities for expression.

On the video generation side, the Wan class of open-weight models, developed by Alibaba, is particularly powerful. For simplicity's sake, I would say these models are comparable to Veo 2 in the fidelity of the videos they can generate, but not as advanced at producing the complex actions and scenes that state-of-the-art models such as Sora 2 can handle. They can generate impressive videos, and what's fascinating is that the majority of users experimenting with Wan do so from their personal PCs. This means that users can experiment and explore far more than what's allowed when using Sora 2 and Veo 3.

The latest Wan series of models comes in three flavors. First, the base Wan 2.2 model for typical text and image prompt-driven video generation. Second, Wan 2.2-S2V (Speech-to-Video), which allows you to supply the audio, like dialog, that will be used in the video generation. Finally, there's Wan 2.2-Animate, which allows for impressive character animations and character replacements.

These three model variants in combination offer more capabilities than what users can achieve through the Sora 2 or Veo 3 apps. Want to provide an image and prompt for a video? The base Wan 2.2 model can do that. Want to generate a scene where you're providing the exact audio for the voice-over? The Wan 2.2-S2V model can do that. Want to have a character in your video doing a precise custom dance you just came up with? The Wan 2.2-Animate model can help you with that.

So the range of what's possible with these models is wider than the more popular state-of-the-art models. But that's not what really impressed me when I was observing the AI enthusiast community. What impressed me was how the community was using LoRAs to shape their generations.

LoRAs (Low-Rank Adaptations) are kind of like stackable plugins that can be applied in workflows using open-weight models. They're basically a way to efficiently fine-tune model outputs. Let's say you want your generated video to have a particular lighting style. If you were using Veo 3 or Sora 2, you'd have to prompt and pray that the model would understand the lighting style you want. This is wildly unpredictable and runs into myriad consistency issues.

The alternative is to build a workflow around an open-weight model like Wan and train a LoRA on the lighting style you want. You would run images with the desired style through a training algorithm (this takes some time and sourcing work). Once the training is done and the LoRA is ready, you apply it to your video generation workflow, and voilà! You've now standardized the lighting across your generations! There's more to it than what I'm explaining here, but you hopefully get the gist.
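Under the hood, a LoRA doesn't retrain the full weight matrix; it learns two small low-rank matrices whose product gets added onto the frozen base weights. Here's a minimal NumPy sketch of that update — the dimensions, scaling, and variable names are illustrative, not Wan's actual internals:

```python
import numpy as np

d, r = 1024, 8          # model dimension vs. LoRA rank (r << d)
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))         # frozen base weight matrix
A = rng.standard_normal((r, d)) * 0.01  # trainable "down" projection
B = np.zeros((d, r))                    # trainable "up" projection (starts at zero)
alpha = 16.0                            # scaling hyperparameter

# Effective weight with the LoRA applied: W' = W + (alpha / r) * (B @ A)
W_adapted = W + (alpha / r) * (B @ A)

# Because B starts at zero, an untrained LoRA leaves the model unchanged
assert np.allclose(W_adapted, W)

# Parameter cost: two thin matrices instead of a full d x d update
full_update = d * d
lora_update = d * r + r * d
print(f"trainable params: {lora_update} vs {full_update} "
      f"({100 * lora_update / full_update:.1f}% of full fine-tuning)")
```

That last line is the whole appeal: training two thin matrices is a tiny fraction of the cost of fine-tuning the full model, which is why a hobbyist GPU can produce a shareable LoRA in hours. Stacking LoRAs is just summing several of these low-rank updates onto the same frozen weights.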

People have been creating LoRAs to augment model generations as they see fit, from generating different character styles and visual aesthetics to upscaling outputs and controlling character poses and actions, to endless other use cases. Since this process involves training, whatever you can train, you can have the model do. LoRAs are so popular in the community now that there are dedicated websites where users can share and trade them, while other users guard their LoRAs like a secret family recipe.

With LoRAs and other workflow integrations I don’t have the space to cover here, AI enthusiasts are building complex workflows to generate the images and videos they desire. A popular tool for orchestrating the settings and processes for image and video generation with open-weight models is ComfyUI.


ComfyUI is a node-based workflow orchestrator that allows users to build incredibly complex pipelines. Seeing some of the advanced workflows that users share, I'm reminded of software like Nuke, which is used for professional-grade VFX compositing for films and commercials. In some ways, when I see what people are building with ComfyUI, I think I'm peeking into the future of media generation.
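Under the hood, a ComfyUI workflow is just this node graph serialized to JSON: in the API export format, each node declares its class and inputs, and an input can point at another node's output by id. The fragment below is a heavily simplified, hypothetical sketch — the node class names follow ComfyUI's stock nodes, but the ids, filenames, and values are made up, and a real graph needs more nodes (negative prompt, latent source, decoder) to run:

```json
{
  "1": { "class_type": "CheckpointLoaderSimple",
         "inputs": { "ckpt_name": "wan2.2_t2v.safetensors" } },
  "2": { "class_type": "LoraLoader",
         "inputs": { "model": ["1", 0], "clip": ["1", 1],
                     "lora_name": "my_lighting_style.safetensors",
                     "strength_model": 0.8, "strength_clip": 0.8 } },
  "3": { "class_type": "CLIPTextEncode",
         "inputs": { "clip": ["2", 1],
                     "text": "a rain-soaked street at dusk, neon reflections" } },
  "4": { "class_type": "KSampler",
         "inputs": { "model": ["2", 0], "positive": ["3", 0],
                     "seed": 42, "steps": 20, "cfg": 7.0 } }
}
```

Notice where the LoRA sits: between the checkpoint loader and the sampler, rewriting the model on its way through the graph. Chaining several of these loader nodes is how users stack LoRAs, and it's this composability that makes the comparison to compositing tools like Nuke feel apt.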

The thing about models like Wan, generation workflow tools like LoRAs, and orchestration tools like ComfyUI is that their combination compresses the image and video production process. There's still a lot of complexity in the work it takes to get a desired output, but these tools are allowing AI enthusiasts to precisely pipeline pre-production, production, and post-production into one extremely capable production pipeline. It goes far beyond just typing a prompt into a chat box, and I can already see the building blocks of what's coming once compute no longer limits how fast ideas can move through these pipelines.

It's still rough around the edges, though, and that’s why it feels nostalgic. For all the complex workflows I've seen, there are just as many people not getting the results they want. There are workarounds people have to find to get their generations the way they want them, from combining different LoRAs to trying different workflow orchestrations. But this is why it feels a bit like the old days, when I was in high school trying to run Blender from a CD in my English teacher's classroom. Things didn't just work. You had to figure out how to make them work.

The AI Underground—this open-weight model space—is the most under-appreciated part of the AI enthusiast community because its members aren't just waiting for the next state-of-the-art model from Google or OpenAI to drop. The expression truly feels like it's bound only by the user's GPU and the time they are willing to put into training and orchestration. This is opening tremendous opportunities for storytelling, and I can’t wait to dive deeper and start crafting with some of these tools.

Video Credits:

Dominic Hawgood

Mohanad Aboalatta