How We Built an AI Recipe Video Pipeline That Looks Like an iPhone Shot It | The Lab

Most agencies describing their “AI video workflow” mean something like this: write a prompt, generate an image, maybe run it through a video model, call it content.

That's not a pipeline. That's a coin flip with extra steps.

What we built for this project is different. It's a production system — with a defined panel grammar, a motion language layer, an emotional performance layer, a compositing surface, a quality gate at every step, and a single aesthetic rule that overrides everything else:

It has to look like a real person shot it on their phone in a real kitchen.

Not cinematic. Not commercial. Not AI-gorgeous. iPhone.

Here's exactly how we did it — and why the aesthetic constraint turned out to be the most important engineering decision in the whole pipeline.

The Problem With AI Food Video

The default output of every major AI image and video model is beautiful in the wrong way.

You ask for a steak on a cutting board and you get something that looks like a shoot for a Michelin three-star restaurant — deep bokeh, dramatic side-lighting, hyperreal glistening surfaces, perfect char patterns that no real grill ever produced. It's visually arresting and completely inauthentic.

The audience recognizes it instantly. Not consciously — most viewers couldn't tell you why it feels off. But the engagement data says they feel it: lower retention, lower sharing, lower return rate. The content looks expensive and reads as fake.

iPhone

The counter-move to AI-gorgeous food video

Flat phone-sensor depth of field. Whatever light came through the window. A fruit bowl, fridge magnets, a paper towel roll at the frame edge. That authenticity isn't a budget compromise — it's a trust signal.

The iPhone aesthetic is the counter-move. Real social cooking content — the content that actually builds audiences — looks like someone grabbed their phone and filmed while they were cooking. That authenticity tells the viewer: this is a real person, in a real kitchen, making real food.

The problem: AI models have no idea what this looks like. Their training data skews toward polished photography. Left to their defaults, they produce the opposite of what we need. So the first engineering problem was not “how do we generate recipe video” — it was “how do we systematically prevent the models from doing what they naturally want to do.”

The Architecture: Three Tools, One Job Each

We settled on a three-tool architecture after testing various combinations. Each tool has exactly one job.

GPT-Image 2 — panel generation

The validated baseline for producing individual storyboard panels. With reference images attached and the right prompt grammar, GPT-Image 2 consistently produces iPhone-aesthetic output. It doesn't try to be cinematic, doesn't hallucinate production polish, and responds well to specific negative constraints.

Magnific Spaces — compositing and cleanup

The assembly surface. Hand swaps (inpainting with a hands reference when a panel has broken fingers), background swaps, batch cleanup across all nine panels — and the Seedance video stitch. Spaces is not the panel generator.

Seedance Pro (via Spaces) — video generation

The nine cleaned panels from each batch feed a Seedance Pro node as sequential keyframes; it interpolates motion between them. Output is roughly 10–15 seconds of continuous recipe video per batch, so three batches of nine panels yield roughly 30 seconds of final content at the low end of that range.
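The batch arithmetic is easy to sanity-check. A minimal Python sketch, assuming the 10–15 second range holds for every batch; `estimate_runtime` and the constant names are illustrative and not part of any tool in the pipeline:

```python
# Hypothetical helper: estimate stitched runtime from the per-batch figures in the text.
PANELS_PER_BATCH = 9
SECONDS_PER_BATCH = (10, 15)  # rough low/high range quoted for one Seedance batch


def estimate_runtime(batches: int) -> tuple[int, int]:
    """Return (low, high) seconds of video for the given number of batches."""
    low, high = SECONDS_PER_BATCH
    return batches * low, batches * high


if __name__ == "__main__":
    low, high = estimate_runtime(3)
    print(f"3 batches, {3 * PANELS_PER_BATCH} panels: {low}-{high} seconds")
```

For three batches this gives a range of 30 to 45 seconds.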

The Banned Nodes

“Four Magnific node types are permanently banned: Upscaler Creative, Upscaler Precision, the Mystic Image Generator, and Relight/Style Transfer. Every one of them pulls the output toward commercial polish. They are aesthetic weapons pointed at the thing we're protecting. If a panel needs refinement, we regenerate in GPT-Image 2 with a better prompt. We never upscale, never relight, never style-transfer.”
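One way to make the ban enforceable rather than cultural is a check that runs before a Spaces graph is executed. A minimal sketch, assuming the graph can be summarized as a list of node-type names; that export format and the `find_banned_nodes` helper are hypothetical, and none of this is Magnific's API:

```python
# Node-type names are taken from the ban list above; the matching logic is an assumption.
BANNED_NODE_TYPES = {
    "Upscaler Creative",
    "Upscaler Precision",
    "Mystic Image Generator",
    "Relight/Style Transfer",
}


def find_banned_nodes(node_types: list[str]) -> list[str]:
    """Return every node type in the graph that matches the ban list (case-insensitive)."""
    banned = {name.lower() for name in BANNED_NODE_TYPES}
    return [node for node in node_types if node.strip().lower() in banned]


if __name__ == "__main__":
    graph = ["Inpaint", "Upscaler Creative", "Seedance Pro"]
    offenders = find_banned_nodes(graph)
    if offenders:
        print("Refusing to run, banned nodes present:", offenders)
```

Returning the offending names, rather than a bare boolean, tells the operator which node to delete before the panel is regenerated.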

The Panel Prompt Grammar

Every individual panel in the pipeline is generated with the same seven-element prompt structure. In order:

Medium anchor

Opens the prompt and locks the aesthetic mode — e.g. “Still frame from an iPhone-shot food video, vertical 9:16.” What never appears here: Kodak Portra, 35mm film, film grain, halation. Those phrases engage filmic rendering and are the single most common failure mode in AI food prompting.

Shot type and framing

Default to “hip-to-chest height POV, slight upward tilt, phone-held-close perspective” — the position a person naturally holds their phone while cooking. Not eye-level portrait. Not overhead drone.

Subject and action

What's happening, with exact wardrobe and anatomy locked to reference. This is where the two motion grammar layers plug in — Laban for the body, FACS + Valence-Arousal for the face.

Environment

Specific kitchen details. Real surfaces. Everyday clutter. Not staged styling.

Lighting

Natural language, never camera specs. “Bright natural daylight from off-frame window camera-left, soft directional shadows” — not “f/1.8, 1/200, ISO 200,” which imply a DSLR and produce DSLR aesthetics.

Color and texture

“Natural saturation, slight computational HDR sharpening typical of modern smartphone cameras, flat phone-sensor depth-of-field with everything roughly in focus — NOT shallow cinematic bokeh.”

Constraints

An explicit negative block, every item earned: avoid film grain, analog halation, Kodak Portra color science, studio polish, cinematic moodiness, portrait-mode bokeh, AI-glossy food surfaces, over-saturation, commercial advertising aesthetic, staged styling, model-perfect skin, posed expression, text or watermarks, extra hands or fingers.

The lesson from building this block: explicit artifact prohibition outperforms generic avoidance. Telling the model “avoid AI look” does nothing. Naming the specific things — starburst glints on granite, perfectly symmetrical slice arrangements, cinematic skin rendering — produces reliable suppression.
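The seven-element grammar lends itself to a small template so no element is skipped and the order never drifts. A minimal sketch, assuming prompts are assembled as plain strings; the `PanelPrompt` class, the `filmic_terms_in` helper, and the example wording for subject and environment are illustrative, not our production code:

```python
from dataclasses import dataclass, fields

# Phrases the grammar says must never appear in the medium anchor.
FILMIC_TERMS = ("kodak portra", "35mm", "film grain", "halation")


@dataclass
class PanelPrompt:
    """Seven prompt elements, declared in the order they are emitted."""
    medium_anchor: str
    shot_framing: str
    subject_action: str
    environment: str
    lighting: str
    color_texture: str
    constraints: str

    def render(self) -> str:
        """Join the elements in declaration order; refuse empty elements."""
        parts = [getattr(self, f.name).strip() for f in fields(self)]
        if not all(parts):
            raise ValueError("every one of the seven elements is required")
        return " ".join(parts)


def filmic_terms_in(anchor: str) -> list[str]:
    """Return any banned filmic phrases found in a medium anchor."""
    lowered = anchor.lower()
    return [term for term in FILMIC_TERMS if term in lowered]


example = PanelPrompt(
    medium_anchor="Still frame from an iPhone-shot food video, vertical 9:16.",
    shot_framing="Hip-to-chest height POV, slight upward tilt, phone-held-close perspective.",
    subject_action="A right hand twisting open a jar lid.",
    environment="Quartz countertop with a paper towel roll at the frame edge.",
    lighting="Bright natural daylight from off-frame window camera-left, soft directional shadows.",
    color_texture="Natural saturation, flat phone-sensor depth-of-field, everything roughly in focus.",
    constraints="Avoid film grain, portrait-mode bokeh, studio polish, extra hands or fingers.",
)

if __name__ == "__main__":
    print(example.render())
```

The filmic check runs on the medium anchor only. The constraints block names those same phrases on purpose, as things to avoid.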

Laban Movement Analysis as a Prompt Layer

One of the less obvious engineering decisions: we added a formal movement grammar layer between the production plan and the panel prompts. Laban Movement Analysis is a framework from dance for describing how a body moves. Its Effort category rates movement on four factors: Weight (Strong/Light), Time (Sudden/Sustained), Space (Direct/Indirect), and Flow (Bound/Free).

Action tag	Cooking action	Laban signature
PRODUCT_UNCAP	Twisting open a jar lid	Strong-Sudden-Direct-Bound
COATING_RUB	Working marinade into protein by hand	Strong-Sudden-Indirect-Free
PLATE_GARNISH	Drizzling sauce, scattering herbs	Light-Sustained-Indirect-Free
GRILL_FLIP	Quick flip with tongs	Strong-Sudden-Direct-Bound

In a panel prompt, this turns a vague verb into a precise action. “A hand stirring sauce” becomes “a right hand performing MIX_STIR [Strong-Sustained-Indirect-Bound] — wooden spoon moving in a continuous circular path against thick simmering sauce, visible resistance against the sauce body, spoon staying in constant contact with the bottom of the pan.”

The same tags feed directly into the Seedance motion prompts. Seedance reads Laban signatures as motion-quality instructions — producing more authentic cooking movement than a vague verb would generate.
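Because each signature is just four two-valued factors, the tag library can be typed so a misspelled quality fails before it reaches a prompt. A minimal sketch using the tags quoted above; the class names, `parse_signature`, and `tag_phrase` are illustrative assumptions:

```python
from dataclasses import dataclass
from enum import Enum


class Weight(Enum):
    STRONG = "Strong"
    LIGHT = "Light"


class Time(Enum):
    SUDDEN = "Sudden"
    SUSTAINED = "Sustained"


class Space(Enum):
    DIRECT = "Direct"
    INDIRECT = "Indirect"


class Flow(Enum):
    BOUND = "Bound"
    FREE = "Free"


@dataclass(frozen=True)
class LabanSignature:
    weight: Weight
    time: Time
    space: Space
    flow: Flow

    def __str__(self) -> str:
        return "-".join(q.value for q in (self.weight, self.time, self.space, self.flow))


def parse_signature(text: str) -> LabanSignature:
    """Parse 'Weight-Time-Space-Flow'; raises ValueError on unknown or missing qualities."""
    weight, time, space, flow = text.split("-")
    return LabanSignature(Weight(weight), Time(time), Space(space), Flow(flow))


# Tag-to-signature pairs as listed in the table and the MIX_STIR example.
ACTION_TAGS = {
    "PRODUCT_UNCAP": parse_signature("Strong-Sudden-Direct-Bound"),
    "COATING_RUB": parse_signature("Strong-Sudden-Indirect-Free"),
    "PLATE_GARNISH": parse_signature("Light-Sustained-Indirect-Free"),
    "GRILL_FLIP": parse_signature("Strong-Sudden-Direct-Bound"),
    "MIX_STIR": parse_signature("Strong-Sustained-Indirect-Bound"),
}


def tag_phrase(tag: str) -> str:
    """Render the bracketed form used inside a panel or motion prompt."""
    return f"{tag} [{ACTION_TAGS[tag]}]"


if __name__ == "__main__":
    print(f"a right hand performing {tag_phrase('MIX_STIR')}")
```

The same `tag_phrase` output can be pasted into both the panel prompt and the Seedance motion prompt, which keeps the two in step.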

FACS + Valence-Arousal for Face-Visible Shots

The second grammar layer handles emotional performance on the roughly 20% of shots where the chef's face is visible. FACS (Facial Action Coding System) describes expressions as discrete muscle movements — Action Units. Valence-Arousal is a 2D emotional coordinate system: how positive/negative (Valence) and how activated/calm (Arousal) the state is.

The critical constraint for cooking content: real chefs do not emote like actors. The cooking-authentic emotional range is narrow. Valence: −0.3 to +0.7 (mostly positive, never deeply negative). Arousal: −0.5 to +0.5 (mostly low to moderate, never extreme). Any face beat outside that rectangle reads as performance, not presence.
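The cooking-authentic rectangle is simple enough to check mechanically. A minimal sketch, assuming face beats carry numeric valence and arousal values and that the bounds are inclusive; both are assumptions, and `in_cooking_range` is a hypothetical name:

```python
# Bounds quoted in the text; treating them as inclusive is an assumption.
VALENCE_RANGE = (-0.3, 0.7)
AROUSAL_RANGE = (-0.5, 0.5)


def in_cooking_range(valence: float, arousal: float) -> bool:
    """True when a face beat sits inside the cooking-authentic rectangle."""
    valence_ok = VALENCE_RANGE[0] <= valence <= VALENCE_RANGE[1]
    arousal_ok = AROUSAL_RANGE[0] <= arousal <= AROUSAL_RANGE[1]
    return valence_ok and arousal_ok


if __name__ == "__main__":
    print(in_cooking_range(0.0, -0.4))  # calm concentration
    print(in_cooking_range(0.9, 0.8))   # actor-level delight
```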

The Duchenne Rule

“Every smile must include AU6 (cheek raiser) paired with AU12 (lip corner puller). AU12 alone — the lip corner pull without the cheek raise — is a social marketing smile. It reads as fake. AU6+AU12 is a Duchenne smile, the involuntary marker of genuine positive emotion. Naked AU12 is an influencer face.”
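The rule reduces to one condition on a set of Action Units. A minimal sketch, assuming each face beat lists its AUs as integers; the function name is hypothetical:

```python
def violates_duchenne_rule(action_units: set[int]) -> bool:
    """True when a smile (AU12) appears without the cheek raiser (AU6)."""
    return 12 in action_units and 6 not in action_units


if __name__ == "__main__":
    print(violates_duchenne_rule({12}))     # naked AU12
    print(violates_duchenne_rule({6, 12}))  # Duchenne smile
```

A beat with no AU12 at all passes, since the rule only constrains smiles.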

The emotional arc across a full recipe follows a gentle wave: opening prep is WORKING_CONCENTRATION (neutral valence, low arousal), hero product introductions get a brief HERO_REVEAL_PROUD beat, the main cook is ATTENTIVE_WATCH or METHODICAL_FLOW, and the payoff is TASTING_SATISFIED or RESOLUTION_BEAT. A wave, not a flat line — and not a rollercoaster.

The Demo Recipe: Tandoori Tomahawk

The pipeline was validated on a specific recipe: Tandoori Tomahawk Steak with Heritage Tomato Salad and Green Tikka Yogurt Dip, produced as a Gymkhana × Caraway branded commercial. A bone-in tomahawk ribeye coated in Gymkhana Classic Tandoori Marinade, rested two hours, grilled over direct heat, and plated with Caraway cookware throughout.

The structure: 28 shots across three batches of approximately nine panels each.

Batch 1 — Indoor Prep

Jar hero → marinade application → rest → tomato prep → knife work → dip mix.

Batch 2 — Outdoor Grill

Fire-up → raw steak present → grill place → sear, baste, flip → doneness check → grill lift.

Batch 3 — Rest + Plate

Steak rest → slice reveal → interior reveal → dip garnish → final plate → chef payoff.

The chef appears in six of twenty-eight shots (21%), at narrative beats only, not craft beats. This matches real reference cooking content. Brand integration lands roughly every seven shots: Gymkhana Classic jar (shot 3), Caraway knife (shot 9), Gymkhana Green Tikka jar (shot 16), Caraway cutting board (shot 25). Even rhythm, no crowding.
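The shot-rhythm numbers can be verified in a few lines. A minimal sketch using the shot numbers quoted above; `chef_share` and `beat_gaps` are illustrative helper names:

```python
# Shot numbers and products as listed in the text.
BRAND_BEATS = {
    3: "Gymkhana Classic jar",
    9: "Caraway knife",
    16: "Gymkhana Green Tikka jar",
    25: "Caraway cutting board",
}


def chef_share(chef_shots: int, total_shots: int) -> float:
    """Fraction of shots in which the chef's face is visible."""
    return chef_shots / total_shots


def beat_gaps(beats: dict[int, str]) -> list[int]:
    """Number of shots between consecutive brand beats."""
    shots = sorted(beats)
    return [later - earlier for earlier, later in zip(shots, shots[1:])]


if __name__ == "__main__":
    print(f"chef share: {chef_share(6, 28):.0%}")
    print("gaps between brand beats:", beat_gaps(BRAND_BEATS))
```

The gaps come out as 6, 7, and 9 shots, which averages a little over seven.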

One of the efficiency decisions that makes this pipeline scalable: we designed the sequence around four reusable backgrounds rather than a unique environment per shot.

Background	Description	Used in
BG-1	Overhead quartz countertop	All overhead prep action
BG-2	Oblique quartz, working perspective	Working-angle prep, all plating/payoff
BG-3	Apartment kitchen wide	Establishing shots, brand hero reveals
BG-4	Outdoor backyard grill	All grilling action

Four backgrounds for 28 shots. The variation comes from the action and framing, not from constantly building new environments.
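In planning terms the background table is a lookup from shot kind to plate. A minimal sketch, assuming each shot is labeled with a kind; the kind names on the left are hypothetical labels for the "Used in" column:

```python
# Keys are illustrative shot-kind labels; values are the background IDs from the table.
SHOT_KIND_TO_BACKGROUND = {
    "overhead_prep": "BG-1",
    "working_angle_prep": "BG-2",
    "plating": "BG-2",
    "payoff": "BG-2",
    "establishing": "BG-3",
    "brand_hero": "BG-3",
    "grilling": "BG-4",
}


def background_for(shot_kind: str) -> str:
    """Look up the reusable background for a shot kind; KeyError if unmapped."""
    return SHOT_KIND_TO_BACKGROUND[shot_kind]


if __name__ == "__main__":
    for kind in ("overhead_prep", "brand_hero", "grilling"):
        print(kind, "->", background_for(kind))
```

Letting an unmapped kind raise an error is deliberate in this sketch: a new shot kind should force a decision about which existing plate it reuses.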

What the Storyboard Sheet Delivers

The output of each batch isn't just a video prompt. It's a complete storyboard sheet — the format our production team actually works from. Every panel in the sheet has:

✓ Cell title: short, all caps, 1–3 words (e.g. HERO PRODUCT / MARINADE RUB / GRILL POSITION)

✓ Caption: one line, all caps terse style (~10–15 words describing the shot)

✓ Action tag: the Laban label (e.g. COATING_RUB, PRODUCT_UNCAP, GRILL_FLIP)

✓ Face tag: the V-A library beat name, only on face-visible panels; omitted on hand-only POV shots

✓ Notes / continuity block: 3–5 bullets summarizing batch intent and continuity locks
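The per-panel fields translate directly into a record with a few checkable rules. A minimal sketch, assuming panels are stored as structured data; the `Panel` class, its field names, and `panel_problems` are hypothetical, and the approximate caption length is left unchecked:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Panel:
    title: str
    caption: str
    action_tag: str
    face_tag: Optional[str] = None
    face_visible: bool = False


def panel_problems(panel: Panel) -> list[str]:
    """Return a list of rule violations; an empty list means the panel passes."""
    problems = []
    if not 1 <= len(panel.title.split()) <= 3:
        problems.append("title must be 1-3 words")
    if panel.title != panel.title.upper():
        problems.append("title must be all caps")
    if panel.caption != panel.caption.upper():
        problems.append("caption must be all caps")
    if not panel.action_tag:
        problems.append("action tag is required")
    if panel.face_visible and not panel.face_tag:
        problems.append("face-visible panel needs a face tag")
    if panel.face_tag and not panel.face_visible:
        problems.append("hand-only panel must omit the face tag")
    return problems


if __name__ == "__main__":
    panel = Panel(
        title="MARINADE RUB",
        caption="HANDS WORK TANDOORI MARINADE INTO THE TOMAHAWK ON THE QUARTZ COUNTER",
        action_tag="COATING_RUB",
    )
    print(panel_problems(panel))
```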

The sheet closes with the same tagline on every batch, every recipe — iPhone aesthetic: real kitchen, natural daylight, no studio polish, no dark cinematic moodiness. That line isn't decorative. It's a production contract. Every person who touches a panel reads it before they do anything.

The One Thing That Changes Everything

We tested a lot of approaches before landing on the grammar above. The single decision that produced the biggest improvement wasn't the Laban layer or the FACS layer or the compositing surface. It was this:

The prompt opens by naming what the output is, not what it looks like. “Still frame from an iPhone-shot food video” — not “cinematic food photography” or “professional cooking video.”

When the model understands it's generating a frame from a phone video, every other decision it makes downstream changes. Depth of field, color treatment, framing instincts, lighting character — they all shift toward the right register before the rest of the prompt does any work.

Snackable Cut

“The medium anchor is the load-bearing element. Tell the model it's shooting a still from an iPhone food video before anything else, and depth of field, color, framing, and lighting all snap to the authentic register on their own. Everything else in the grammar is refinement.”

The Pipeline in Summary

Stage	Tool	Job
Genre classification	Planning	DNA doc, aesthetic anchors, shot rhythm
Phase 0 plan	Claude	Batch map, continuity locks, brand beats, finished-dish North Star
Panel generation	GPT-Image 2	9 panels per batch, 7-element grammar + references
Compositing	Magnific Spaces	Hand/background/food inpaint fixes; Seedance stitch
Video generation	Seedance Pro	~10–15 sec per batch, keyframe-to-motion interpolation
Quality gate	3 eval skills	Panel QC before anything goes to Seedance
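The quality gate is the step that decides whether a batch may reach Seedance at all. A minimal sketch of that decision, assuming each panel is a dictionary of QC results and each check is a function; the check names are placeholders and do not describe the three eval skills themselves:

```python
from typing import Callable

Panel = dict
Check = Callable[[Panel], bool]


def failing_panels(panels: list[Panel], checks: list[Check]) -> list[int]:
    """Return the indexes of panels that fail at least one check."""
    return [i for i, panel in enumerate(panels) if not all(check(panel) for check in checks)]


def ready_for_video(panels: list[Panel], checks: list[Check], expected: int = 9) -> bool:
    """A batch proceeds only when it is complete and every panel passes."""
    return len(panels) == expected and not failing_panels(panels, checks)


# Placeholder checks for illustration only.
def hands_ok(panel: Panel) -> bool:
    return panel.get("hands_ok", False)


def no_bokeh(panel: Panel) -> bool:
    return not panel.get("shallow_bokeh", False)


if __name__ == "__main__":
    batch = [{"hands_ok": True} for _ in range(9)]
    batch[4]["hands_ok"] = False  # e.g. broken fingers, needs a hand swap
    print("regenerate or fix panels:", failing_panels(batch, [hands_ok, no_bokeh]))
    print("ready:", ready_for_video(batch, [hands_ok, no_bokeh]))
```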

What actually made it work

01 The medium anchor: name the output as an iPhone food frame before anything else
02 Three tools, one job each: GPT-Image 2 generates, Spaces composites, Seedance moves
03 The banned nodes: no upscale, no relight, no style-transfer, ever
04 Explicit artifact prohibition: name the specific failure, don't say “avoid AI look”
05 Laban motion grammar: precise action signatures the video model can read
06 FACS + V-A performance: Duchenne smiles only, a narrow authentic emotional range
07 Four reusable backgrounds: variation from action and framing, not new environments
08 The storyboard sheet as contract: the iPhone tagline closes every batch

Three batches. ~30 seconds of finished recipe video. Two hero brands integrated at natural story beats. One aesthetic rule that everything else serves.

None of this required a bigger model or a larger budget. It required a different architecture — and the discipline to stop the tools from doing what they naturally want to do.

Scott Ownbey

Founder & Creative Director · Animatic Media

Scott has led Animatic Media for 28 years, delivering 10,000+ projects for Fortune 500 brands and agencies worldwide. The Gymkhana × Caraway pipeline was developed as an internal production experiment to validate AI-driven recipe storyboard methodology for CPG branded food content.
