Most agencies describing their “AI video workflow” mean something like this: write a prompt, generate an image, maybe run it through a video model, call it content.
That's not a pipeline. That's a coin flip with extra steps.
What we built for this project is different. It's a production system — with a defined panel grammar, a motion language layer, an emotional performance layer, a compositing surface, a quality gate at every step, and a single aesthetic rule that overrides everything else:
It has to look like a real person shot it on their phone in a real kitchen.
Not cinematic. Not commercial. Not AI-gorgeous. iPhone.
Here's exactly how we did it — and why the aesthetic constraint turned out to be the most important engineering decision in the whole pipeline.
The Problem With AI Food Video
The default output of every major AI image and video model is beautiful in the wrong way.
You ask for a steak on a cutting board and you get something that looks like a shoot for a Michelin three-star restaurant — deep bokeh, dramatic side-lighting, hyperreal glistening surfaces, perfect char patterns that no real grill ever produced. It's visually arresting and completely inauthentic.
The audience recognizes it instantly. Not consciously — most viewers couldn't tell you why it feels off. But the engagement data says they feel it: lower retention, lower sharing, lower return rate. The content looks expensive and reads as fake.
The iPhone aesthetic is the counter-move. Real social cooking content, the kind that actually builds audiences, looks like someone grabbed their phone and filmed while they were cooking: flat phone-sensor depth of field, whatever light came through the window, a fruit bowl, fridge magnets, a paper towel roll at the frame edge.
That authenticity isn't a budget compromise. It's a trust signal that tells the viewer: this is a real person, in a real kitchen, making real food.
The problem: AI models have no idea what this looks like. Their training data skews toward polished photography. Left to their defaults, they produce the opposite of what we need. So the first engineering problem was not “how do we generate recipe video” — it was “how do we systematically prevent the models from doing what they naturally want to do.”
The Architecture: Three Tools, One Job Each
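We settled on a three-tool architecture after testing various combinations. Each tool has exactly one job: GPT-Image 2 generates the panels, Magnific Spaces composites them, and Seedance turns them into motion.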
“Four Magnific node types are permanently banned: Upscaler Creative, Upscaler Precision, the Mystic Image Generator, and Relight/Style Transfer. Every one of them pulls the output toward commercial polish: aesthetic weapons pointed at the thing we're protecting. If a panel needs refinement, we regenerate in GPT-Image 2 with a better prompt. We never upscale, never relight, never style-transfer.”
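A ban like this is easier to keep when it is checked mechanically rather than remembered. The sketch below is one way to do that, assuming a workflow can be exported as a plain list of node-type names. That export format and the `Inpaint` node in the usage note are hypothetical; only the four banned names come from the text above.

```python
# Banned node types, named exactly as in the article.
BANNED_NODES = frozenset({
    "Upscaler Creative",
    "Upscaler Precision",
    "Mystic Image Generator",
    "Relight/Style Transfer",
})


def find_banned_nodes(workflow_nodes):
    """Return the banned node types present in a workflow, in first-seen order."""
    seen = []
    for node in workflow_nodes:
        if node in BANNED_NODES and node not in seen:
            seen.append(node)
    return seen


def assert_workflow_clean(workflow_nodes):
    """Raise ValueError if any banned node type appears in the workflow."""
    banned = find_banned_nodes(workflow_nodes)
    if banned:
        raise ValueError("Banned node types in workflow: " + ", ".join(banned))
```

Calling `assert_workflow_clean(["Inpaint", "Upscaler Creative"])` raises before the workflow runs, while a list with only allowed nodes passes silently.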
The Panel Prompt Grammar
Every individual panel in the pipeline is generated with the same seven-element prompt structure. In order:
The lesson from building this block: explicit artifact prohibition outperforms generic avoidance. Telling the model “avoid AI look” does nothing. Naming the specific things — starburst glints on granite, perfectly symmetrical slice arrangements, cinematic skin rendering — produces reliable suppression.
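As a rough illustration of that lesson, the sketch below builds a prohibition clause from named artifacts and flags prompts that fall back on generic avoidance. The three artifact names and the phrase “avoid AI look” are taken from the paragraph above. The clause wording and both function names are assumptions for illustration, not the production prompt text.

```python
# Specific artifacts named in the article.
NAMED_ARTIFACTS = [
    "starburst glints on granite",
    "perfectly symmetrical slice arrangements",
    "cinematic skin rendering",
]

# Generic phrasing that, per the article, does nothing on its own.
GENERIC_AVOIDANCE = ("avoid ai look",)


def prohibition_clause(artifacts):
    """Turn a list of named artifacts into one explicit prohibition sentence."""
    if not artifacts:
        raise ValueError("Name at least one specific artifact")
    return "No " + "; no ".join(artifacts) + "."


def has_generic_avoidance(prompt):
    """True if the prompt leans on a generic avoidance phrase."""
    lowered = prompt.lower()
    return any(phrase in lowered for phrase in GENERIC_AVOIDANCE)


print(prohibition_clause(NAMED_ARTIFACTS))
```

The list can grow as new failure modes show up in review; the point is that every entry names something a reviewer could circle on a panel.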
Laban Movement Analysis as a Prompt Layer
One of the less obvious engineering decisions: we added a formal movement grammar layer between the production plan and the panel prompts. Laban Movement Analysis is a framework from dance for describing and notating movement. We use its Effort component, which describes how a body moves along four factors: Weight (Strong/Light), Time (Sudden/Sustained), Space (Direct/Indirect), and Flow (Bound/Free). Every recurring cooking action gets a tag and a four-part signature.
| Action tag | Cooking action | Laban signature |
|---|---|---|
| PRODUCT_UNCAP | Twisting open a jar lid | Strong-Sudden-Direct-Bound |
| COATING_RUB | Working marinade into protein by hand | Strong-Sudden-Indirect-Free |
| PLATE_GARNISH | Drizzling sauce, scattering herbs | Light-Sustained-Indirect-Free |
| GRILL_FLIP | Quick flip with tongs | Strong-Sudden-Direct-Bound |
In a panel prompt, this turns a vague verb into a precise action. “A hand stirring sauce” becomes “a right hand performing MIX_STIR [Strong-Sustained-Indirect-Bound] — wooden spoon moving in a continuous circular path against thick simmering sauce, visible resistance against the sauce body, spoon staying in constant contact with the bottom of the pan.”
The same tags feed directly into the Seedance motion prompts. Seedance reads Laban signatures as motion-quality instructions — producing more authentic cooking movement than a vague verb would generate.
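The tag table can be held as a small lookup so that panel prompts and motion prompts draw on the same signatures. This is a minimal sketch: the tags and signatures are copied from the table and the MIX_STIR example above, while the `LabanSignature` class and `tag_fragment` helper are hypothetical names introduced here.

```python
from dataclasses import dataclass

# Allowed values for each of the four Laban factors.
_AXES = {
    "weight": ("Strong", "Light"),
    "time": ("Sudden", "Sustained"),
    "space": ("Direct", "Indirect"),
    "flow": ("Bound", "Free"),
}


@dataclass(frozen=True)
class LabanSignature:
    weight: str
    time: str
    space: str
    flow: str

    def __post_init__(self):
        # Reject values outside the two poles of each factor.
        for axis, allowed in _AXES.items():
            value = getattr(self, axis)
            if value not in allowed:
                raise ValueError(f"{axis} must be one of {allowed}, got {value!r}")

    def __str__(self):
        return "-".join((self.weight, self.time, self.space, self.flow))


# Signatures as listed in the article.
ACTIONS = {
    "PRODUCT_UNCAP": LabanSignature("Strong", "Sudden", "Direct", "Bound"),
    "COATING_RUB": LabanSignature("Strong", "Sudden", "Indirect", "Free"),
    "PLATE_GARNISH": LabanSignature("Light", "Sustained", "Indirect", "Free"),
    "GRILL_FLIP": LabanSignature("Strong", "Sudden", "Direct", "Bound"),
    "MIX_STIR": LabanSignature("Strong", "Sustained", "Indirect", "Bound"),
}


def tag_fragment(hand, action_tag):
    """Render the opening of an action description for a prompt."""
    return f"a {hand} hand performing {action_tag} [{ACTIONS[action_tag]}]"


print(tag_fragment("right", "MIX_STIR"))
```

The physical description that follows the tag (spoon path, resistance, contact with the pan) still has to be written per shot; the lookup only keeps the signature consistent between the still and the motion prompt.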
FACS + Valence-Arousal for Face-Visible Shots
The second grammar layer handles emotional performance on the roughly 20% of shots where the chef's face is visible. FACS (Facial Action Coding System) describes expressions as discrete muscle movements — Action Units. Valence-Arousal is a 2D emotional coordinate system: how positive/negative (Valence) and how activated/calm (Arousal) the state is.
The critical constraint for cooking content: real chefs do not emote like actors. The cooking-authentic emotional range is narrow. Valence: −0.3 to +0.7 (mostly positive, never deeply negative). Arousal: −0.5 to +0.5 (mostly low to moderate, never extreme). Any face beat outside that rectangle reads as performance, not presence.
“Every smile must include AU6 (cheek raiser) paired with AU12 (lip corner puller). AU12 alone — the lip corner pull without the cheek raise — is a social marketing smile. It reads as fake. AU6+AU12 is a Duchenne smile, the involuntary marker of genuine positive emotion. Naked AU12 is an influencer face.”
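Both face rules are simple enough to check before a face beat reaches a prompt. The sketch below assumes a beat is described by a valence value, an arousal value, and a set of Action Unit labels such as `"AU6"`. That representation and the function name are hypothetical, and treating the range limits as inclusive is also an assumption.

```python
# Cooking-authentic emotional rectangle from the article.
VALENCE_RANGE = (-0.3, 0.7)
AROUSAL_RANGE = (-0.5, 0.5)


def check_face_beat(valence, arousal, action_units):
    """Return a list of rule violations for one face beat (empty if it passes)."""
    problems = []
    if not VALENCE_RANGE[0] <= valence <= VALENCE_RANGE[1]:
        problems.append(f"valence {valence} outside {VALENCE_RANGE}")
    if not AROUSAL_RANGE[0] <= arousal <= AROUSAL_RANGE[1]:
        problems.append(f"arousal {arousal} outside {AROUSAL_RANGE}")
    # A lip corner pull (AU12) without the cheek raiser (AU6) is not a Duchenne smile.
    if "AU12" in action_units and "AU6" not in action_units:
        problems.append("AU12 without AU6")
    return problems


print(check_face_beat(0.4, 0.1, {"AU6", "AU12"}))  # passes: empty list
print(check_face_beat(0.9, 0.0, {"AU12"}))         # two violations
```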
The emotional arc across a full recipe follows a gentle wave: opening prep is WORKING_CONCENTRATION (neutral valence, low arousal), hero product introductions get a brief HERO_REVEAL_PROUD beat, the main cook is ATTENTIVE_WATCH or METHODICAL_FLOW, and the payoff is TASTING_SATISFIED or RESOLUTION_BEAT. A wave, not a flat line — and not a rollercoaster.
The Demo Recipe: Tandoori Tomahawk
The pipeline was validated on a specific recipe: Tandoori Tomahawk Steak with Heritage Tomato Salad and Green Tikka Yogurt Dip, produced as a Gymkhana × Caraway branded commercial. A bone-in tomahawk ribeye coated in Gymkhana Classic Tandoori Marinade, rested two hours, grilled over direct heat, and plated with Caraway cookware throughout.
The structure: 28 shots across three batches of approximately nine panels each.
The chef appears in six of twenty-eight shots (21%) — at narrative beats only, not craft beats. This matches real reference cooking content. Brand integration hits at roughly every seven shots — Gymkhana Classic jar (shot 3), Caraway knife (shot 9), Gymkhana Green Tikka jar (shot 16), Caraway cutting board (shot 25). Even rhythm, no crowding.
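The rhythm numbers are easy to verify from the shot list. This short calculation uses only the figures quoted above; the variable names are ours.

```python
TOTAL_SHOTS = 28
CHEF_SHOTS = 6
BRAND_BEATS = [3, 9, 16, 25]  # shot numbers of the four brand integrations

# Share of shots in which the chef's face is visible.
chef_share = CHEF_SHOTS / TOTAL_SHOTS

# Number of shots between consecutive brand beats.
gaps = [later - earlier for earlier, later in zip(BRAND_BEATS, BRAND_BEATS[1:])]
mean_gap = sum(gaps) / len(gaps)

print(f"chef share: {chef_share:.0%}")   # 21%
print(f"brand gaps: {gaps}")             # [6, 7, 9]
print(f"mean gap: {mean_gap:.1f} shots") # 7.3 shots
```

The gaps of 6, 7, and 9 shots average a little over seven, which is where “roughly every seven shots” comes from.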
One of the efficiency decisions that makes this pipeline scalable: we designed the sequence around four reusable backgrounds rather than a unique environment per shot.
| Background | Description | Used in |
|---|---|---|
| BG-1 | Overhead quartz countertop | All overhead prep action |
| BG-2 | Oblique quartz, working perspective | Working-angle prep, all plating/payoff |
| BG-3 | Apartment kitchen wide | Establishing shots, brand hero reveals |
| BG-4 | Outdoor backyard grill | All grilling action |
Four backgrounds for 28 shots. The variation comes from the action and framing, not from constantly building new environments.
What the Storyboard Sheet Delivers
The output of each batch isn't just a video prompt. It's a complete storyboard sheet — the format our production team actually works from. Every panel in the sheet has:
The sheet closes with the same tagline on every batch, every recipe — iPhone aesthetic: real kitchen, natural daylight, no studio polish, no dark cinematic moodiness. That line isn't decorative. It's a production contract. Every person who touches a panel reads it before they do anything.
The One Thing That Changes Everything
We tested a lot of approaches before landing on the grammar above. The single decision that produced the biggest improvement wasn't the Laban layer or the FACS layer or the compositing surface. It was this:
The prompt opens by naming what the output is, not what it looks like. “Still frame from an iPhone-shot food video” — not “cinematic food photography” or “professional cooking video.”
When the model understands it's generating a frame from a phone video, every other decision it makes downstream changes. Depth of field, color treatment, framing instincts, lighting character — they all shift toward the right register before the rest of the prompt does any work.
“The medium anchor is the load-bearing element. Tell the model it's shooting a still from an iPhone food video before anything else, and depth of field, color, framing, and lighting all snap to the authentic register on their own. Everything else in the grammar is refinement.”
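One way to make the anchor impossible to forget is to have the prompt builder put it first on every panel. The sketch below assumes the remaining grammar elements arrive as a list of strings and treats them as opaque. The anchor text is quoted from above; the function names and the sentence-joining format are hypothetical.

```python
# The medium anchor, quoted from the article.
MEDIUM_ANCHOR = "Still frame from an iPhone-shot food video"


def build_panel_prompt(remaining_elements):
    """Join prompt elements with the medium anchor forced into first position."""
    cleaned = [e.strip().rstrip(".") for e in remaining_elements if e.strip()]
    return ". ".join([MEDIUM_ANCHOR, *cleaned]) + "."


def opens_with_anchor(prompt):
    """True if a prompt starts by naming the medium."""
    return prompt.lstrip().startswith(MEDIUM_ANCHOR)


prompt = build_panel_prompt(["Overhead quartz countertop", "Natural window daylight"])
print(prompt)
```

`opens_with_anchor` can then serve as a cheap check on hand-edited prompts, where the anchor is most likely to get pushed down or deleted.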
The Pipeline in Summary
| Stage | Tool | Job |
|---|---|---|
| Genre classification | Planning | DNA doc, aesthetic anchors, shot rhythm |
| Phase 0 plan | Claude | Batch map, continuity locks, brand beats, finished-dish North Star |
| Panel generation | GPT-Image 2 | 9 panels per batch, 7-element grammar + references |
| Compositing | Magnific Spaces | Hand/background/food inpaint fixes; Seedance stitch |
| Video generation | Seedance Pro | ~10–15 sec per batch, keyframe-to-motion interpolation |
| Quality gate | 3 eval skills | Panel QC before anything goes to Seedance |
- 01 The medium anchor: name the output as an iPhone food frame before anything else
- 02 Three tools, one job each: GPT-Image 2 generates, Spaces composites, Seedance moves
- 03 The banned nodes: no upscale, no relight, no style-transfer, ever
- 04 Explicit artifact prohibition: name the specific failure, don't say “avoid AI look”
- 05 Laban motion grammar: precise action signatures the video model can read
- 06 FACS + V-A performance: Duchenne smiles only, a narrow authentic emotional range
- 07 Four reusable backgrounds: variation from action and framing, not new environments
- 08 The storyboard sheet as contract: the iPhone tagline closes every batch
Three batches. ~30 seconds of finished recipe video. Two hero brands integrated at natural story beats. One aesthetic rule that everything else serves.
None of this required a bigger model or a larger budget. It required a different architecture — and the discipline to stop the tools from doing what they naturally want to do.