Every few weeks a familiar Reddit post appears on r/comfyui. An architecture student has been generating images for months: they have the tool installed, they've watched the tutorials, and they still can't turn their hand sketches into renders that look like their actual building. The replies they get are mostly wrong, often from people who haven't actually tried to do this work.

This article documents the workflow Vista Studios uses for sketch-to-photoreal conversion. It has been running stably since FLUX.1 Dev landed last year, and it gets us reliable architectural output from hand-drawn input. It's also the workflow we'd point that student to.

Two things up front. First, this is the local-ComfyUI workflow, so you're not paying per render. Second, if you have access to Nano Banana 2 through Veras or the Chaos API, you'll get better results. NB2 is genuinely better at this. But it's not free, and if you're a student or a small practice running on what you already have, this workflow gets you 80% of the way for $0 of inference cost.

What you need first

Hardware: 16GB+ VRAM is comfortable (RTX 4080 / 4090 / 5080). 12GB works with quantized FLUX. 8GB cards should run SDXL instead: same workflow logic, weaker output.

Models:

Custom nodes: ComfyUI Manager (you should already have this), the standard ControlNet Auxiliary Preprocessors pack, and the Inspire Pack for some convenience nodes. Nothing exotic.

The four-stage workflow

Most tutorials show a single chain: sketch in, render out. That's the source of most failures. The workflow that actually works has four distinct stages, each with a job.

Stage 1: Pre-process the sketch. Run your hand sketch (or rough digital sketch) through both the Scribble and the Canny preprocessors. This gives you two control signals: Scribble captures the loose intent of your line work, Canny captures the hard structural lines. Save both as image outputs.

Stage 2: Stacked ControlNet conditioning. Feed both control signals into your conditioning chain at moderate strength. This is the step most tutorials skip or do wrong.

Stage 3: Sampling with the architectural LoRA. Run a standard FLUX KSampler with the LoRA loaded. Specific numbers below.

Stage 4: Detail upscale and post. A second pass through an upscaler (we use 4x-UltraSharp through ComfyUI's standard upscale node), then optional latent re-encoding for final detail.

Each stage has settings that matter. Let's go through them.

Stage 1: Sketch preprocessing

Start with a clean sketch. Resolution doesn't need to be high; 1024x1024 is plenty. Pen-and-paper sketches scanned at 300dpi work fine. The model wants to see your line logic; it does not want to see the texture of your paper.

Pre-clean the sketch. Threshold it in Photoshop or similar so you have black lines on white, no shading, no texture. If your hand sketch has shading, lose it: Scribble will read it as additional structural intent and confuse the output. We've seen students keep their pencil shading in and then wonder why the AI rendered a building with a giant gray cloud over it. The cloud was their shadow.
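If you would rather script the clean-up than open Photoshop, the thresholding step can be sketched roughly as below. To stay dependency-free it works on a grayscale image held as a plain list of rows (0 is black, 255 is white). The `threshold` name and the cutoff of 128 are assumptions for illustration, not values from our pipeline; in practice you would load and save the pixels with an image library.

```python
def threshold(gray_rows, cutoff=128):
    """Map every pixel to pure black (0) or pure white (255).

    gray_rows: list of rows, each a list of 0-255 grayscale values.
    cutoff: hypothetical default; raise it to keep fainter pencil lines,
            lower it to drop more shading.
    """
    return [[0 if px < cutoff else 255 for px in row] for row in gray_rows]


# A tiny 3x4 "scan": dark ink (20), mid-gray pencil shading (150), paper (240).
scan = [
    [240, 20, 150, 240],
    [240, 20, 150, 240],
    [240, 20, 240, 240],
]

cleaned = threshold(scan)
for row in cleaned:
    print(row)
```

The ink column survives as black and the shading column is pushed to white, which is the point: only line logic reaches the preprocessors.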

Run the cleaned sketch through both preprocessors, Scribble and Canny, and save each result as an image output.

Stage 2: Stacked ControlNet

Here are the actual numbers, which most tutorials skip:

ControlNet Apply (Scribble)
  strength: 0.75
  start_percent: 0.0
  end_percent: 0.6

ControlNet Apply (Canny)
  strength: 0.65
  start_percent: 0.0
  end_percent: 0.7
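For readers who drive ComfyUI through its API-format JSON, the two settings blocks above might be wired roughly as follows. This is a hedged sketch, not our exported workflow: it assumes the `ControlNetApplyAdvanced` node class (the variant that exposes `start_percent` and `end_percent`), and every node id is a hypothetical placeholder for a node defined elsewhere in the graph (encoded prompts, ControlNet loaders, preprocessed images).

```python
import json

# Hypothetical ids: "6"/"7" are the positive/negative prompt encoders,
# "10"/"11" the Scribble/Canny ControlNet loaders, "12"/"13" the
# preprocessed Scribble/Canny images.
POSITIVE, NEGATIVE = ["6", 0], ["7", 0]


def apply_controlnet(positive, negative, control_net, image,
                     strength, end_percent):
    """Build one ControlNetApplyAdvanced node in API-format JSON."""
    return {
        "class_type": "ControlNetApplyAdvanced",
        "inputs": {
            "positive": positive,
            "negative": negative,
            "control_net": control_net,
            "image": image,
            "strength": strength,
            "start_percent": 0.0,
            "end_percent": end_percent,
        },
    }


fragment = {
    # Scribble is applied to the raw prompt conditioning.
    "20": apply_controlnet(POSITIVE, NEGATIVE, ["10", 0], ["12", 0], 0.75, 0.6),
    # Canny is applied to Scribble's outputs; this chaining is the "stack".
    "21": apply_controlnet(["20", 0], ["20", 1], ["11", 0], ["13", 0], 0.65, 0.7),
}

print(json.dumps(fragment, indent=2))
```

The detail to notice is that the second node takes its conditioning from the first node's outputs rather than from the prompt encoders, so the sampler downstream sees both control signals.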

Why two? Single-ControlNet workflows force you to choose: hold to the loose intent (Scribble alone, building changes shape) or hold to the hard lines (Canny alone, output is over-constrained and looks artificial). Stacking both at moderate strength gives the model two signals to weigh against each other. The output keeps your overall massing logic and your hard structural lines while having room to interpret materials, lighting, and detail.

The end_percent values matter. Cutting both ControlNets off before sampling is complete (0.6 and 0.7 of total steps) lets the model do its detail work in the final 30-40% of inference without being forced back to the line input. This is where the photorealism actually emerges.

The mental model: ControlNets enforce the building's bones in the early steps; the model itself adds the flesh and skin in the late steps. Don't let the bones override the skin.
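To make the cutoffs concrete, here is a small sketch that converts the end_percent values into step counts for a 28-step run. It assumes the percentage maps linearly onto the step index, which is an approximation: ComfyUI resolves these percentages against the sampling schedule, so the real cutoff step can shift slightly with the scheduler. The `controlnet_window` helper is hypothetical.

```python
def controlnet_window(total_steps, start_percent, end_percent):
    """Return (first_step, last_step_exclusive) under a linear approximation."""
    first = round(total_steps * start_percent)
    last = round(total_steps * end_percent)
    return first, last


TOTAL_STEPS = 28
windows = {
    "scribble": controlnet_window(TOTAL_STEPS, 0.0, 0.6),
    "canny": controlnet_window(TOTAL_STEPS, 0.0, 0.7),
}

for name, (first, last) in windows.items():
    free = TOTAL_STEPS - last
    print(f"{name}: active for steps {first}-{last - 1}, "
          f"{free} steps unconstrained")
```

Under that approximation Scribble lets go at around step 17 and Canny at around step 20, leaving roughly the last 8 steps for the model to work with no line input at all.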

Stage 3: The sampler settings

FLUX sampling for architecture is forgiving but specific:

KSampler (FLUX)
  sampler: euler
  scheduler: simple
  steps: 28
  cfg: 3.8
  denoise: 1.0

LoRA strength: 0.6 - 0.8 depending on LoRA

Prompt structure:
  [building type], [materials], [time of day],
  [photographic style], [optional: street view /
  axon / etc.], professional architectural
  photography, sharp focus, natural lighting

Three numbers to internalize: CFG 3.8, steps 28, LoRA 0.7. The CFG range that works is roughly 3.5-4.5. Above 5.0, the model starts ignoring the ControlNets and overfitting to your prompt. Below 3.0, it ignores your prompt and renders generic mush.

For prompts: lead with the most specific architectural detail, not with style words. "three-story timber-frame house, larch cladding, board-and-batten, golden hour, 35mm photograph" works better than "beautiful architectural photo of a modern house". The model responds to specificity. Vague prompts produce vague output.
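If you generate many variants, the prompt structure above is easy to template. The helper below is a minimal sketch with a hypothetical `build_prompt` function; the field order and the fixed tail come from the template, while the function and parameter names are our own invention for illustration.

```python
def build_prompt(building, materials, time_of_day, photo_style, view=None):
    """Assemble a prompt with the specific architectural detail first."""
    parts = [building, *materials, time_of_day, photo_style]
    if view:
        # Optional: "street view", "axon", etc.
        parts.append(view)
    # Fixed tail taken from the prompt template.
    parts += [
        "professional architectural photography",
        "sharp focus",
        "natural lighting",
    ]
    return ", ".join(parts)


prompt = build_prompt(
    building="three-story timber-frame house",
    materials=["larch cladding", "board-and-batten"],
    time_of_day="golden hour",
    photo_style="35mm photograph",
)
print(prompt)
```

Making `building` and `materials` required arguments is deliberate: it forces you to state the specifics before any style words get in.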

Stage 4: Upscale and refine

FLUX outputs at 1024x1024 read fine on screen but lack detail at print size. The final stage adds resolution.

Standard pattern: 4x-UltraSharp upscaler through ComfyUI's Upscale Image (using Model) node. Then optionally re-encode to latent and run a low-denoise pass (around 0.25) through FLUX again at the upscaled resolution. This second pass adds material micro-detail (the grain in concrete, the texture of brick mortar, the irregularity of weathered metal) that wasn't visible at 1024.

Don't denoise above 0.35 in this second pass or you'll start re-interpreting the building. Stay low, let it polish.
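The resolution argument is simple arithmetic, sketched below. The 300 dpi print target is an assumption (a common figure for print work, not something this workflow dictates), and both helper functions are hypothetical.

```python
BASE_PX = 1024        # FLUX output edge length used in this article
UPSCALE_FACTOR = 4    # 4x-UltraSharp
PRINT_DPI = 300       # assumption: a common print resolution target


def print_size_inches(pixels, dpi=PRINT_DPI):
    """Edge length in inches when printed at the given dpi."""
    return pixels / dpi


def check_refine_denoise(denoise, ceiling=0.35):
    """True if the second-pass denoise stays at or under the 0.35 ceiling."""
    return denoise <= ceiling


upscaled_px = BASE_PX * UPSCALE_FACTOR
print(f"before: {BASE_PX}px is {print_size_inches(BASE_PX):.1f} in "
      f"at {PRINT_DPI} dpi")
print(f"after:  {upscaled_px}px is {print_size_inches(upscaled_px):.1f} in "
      f"at {PRINT_DPI} dpi")
print("denoise 0.25 ok:", check_refine_denoise(0.25))
```

At that print target a 1024-pixel edge covers only a few inches, while the 4096-pixel upscale covers over a foot, which is why the second stage exists at all.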

The full node graph

Conceptually:

Sketch → [Preprocess: Scribble] ─┐
       └→ [Preprocess: Canny]    ─┤
                                  ├→ Stacked ControlNet →
Prompt → [CLIP Text Encode]      ─┤    Conditioning
                                  │
LoRA Loader (FLUX) ───────────────┤
                                  │
FLUX Model + VAE ─────────────────┴→ KSampler →
                                                 Image Decode →
                                                 Upscale Model →
                                                 (optional 2nd-pass refine) →
                                                 Save Image
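One way to sanity-check the diagram before building it is to write it down as a dependency map and let Python order it. The node names below mirror the diagram and are not real ComfyUI node classes. The edge from the model loader into the LoRA loader is an assumption about how the graph is wired in practice, since the diagram draws them as parallel inputs.

```python
from graphlib import TopologicalSorter

# Each key lists the nodes it depends on.
graph = {
    "preprocess_scribble": {"sketch"},
    "preprocess_canny": {"sketch"},
    "clip_text_encode": {"prompt"},
    "stacked_controlnet": {
        "preprocess_scribble",
        "preprocess_canny",
        "clip_text_encode",
    },
    # Assumption: the LoRA loader patches the loaded FLUX model.
    "lora_loader": {"flux_model_vae"},
    "ksampler": {"stacked_controlnet", "lora_loader"},
    "image_decode": {"ksampler", "flux_model_vae"},
    "upscale_model": {"image_decode"},
    "refine_pass": {"upscale_model"},   # optional second pass
    "save_image": {"refine_pass"},
}

# static_order() raises CycleError if the graph is not a valid DAG.
order = list(TopologicalSorter(graph).static_order())
for step, node in enumerate(order, start=1):
    print(f"{step:2d}. {node}")
```

If the sort succeeds, every node has its inputs available before it runs, and the sampler always sits downstream of both the stacked conditioning and the LoRA-patched model.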

If you want a working .json workflow file matching this exactly, it'll be in the next workflow drop on the ArchiGen AI Workflows page. For now, building it from this description in 20-30 minutes is straightforward if you're comfortable with ComfyUI.

What goes wrong

The four most common failure modes and how to fix them:

Output ignores the building entirely. Your CFG is too high (above 5.0) or your ControlNet strengths are too low (below 0.5). Increase ControlNet strengths first; lower CFG second.

Output looks like a sketch with color filled in. Your end_percent values are too high, so you're holding the ControlNets through the entire sampling process. Drop them to 0.6/0.7. Let the model actually render the late steps.

Output is photorealistic but the building changed. Your sketch is too loose for what you're asking. Add a Canny pass with cleaner line input. Or accept that the model needed more structural information than you gave it and tighten the sketch.

Materials look wrong. Your prompt isn't specific enough about materiality, or your architectural LoRA was trained on a different style. Be explicit in the prompt about exact materials. If a LoRA keeps pulling outputs toward "sleek modernist glass house" regardless of what you ask for, that LoRA is overpowered for general use: drop it to 0.4 strength or swap it.
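The numeric parts of these failure modes can be checked before you queue a render. The `diagnose` function below is a hypothetical pre-flight helper, not part of ComfyUI; its thresholds are the ones quoted in this article, and it cannot catch the sketch-quality or prompt-specificity problems, which need a human eye.

```python
def diagnose(cfg, scribble_strength, canny_strength,
             scribble_end, canny_end, lora_strength):
    """Return warnings for settings outside the article's ranges."""
    warnings = []
    if cfg > 5.0:
        warnings.append("CFG above 5.0: output may ignore the building")
    elif cfg < 3.0:
        warnings.append("CFG below 3.0: output may ignore the prompt")
    if min(scribble_strength, canny_strength) < 0.5:
        warnings.append("ControlNet strength below 0.5: raise it first")
    if scribble_end > 0.6 or canny_end > 0.7:
        warnings.append("end_percent above 0.6/0.7: risk of a colored-in sketch")
    if lora_strength > 0.8:
        warnings.append("LoRA above 0.8: style may override the prompt")
    return warnings


# The article's defaults should pass cleanly.
print(diagnose(3.8, 0.75, 0.65, 0.6, 0.7, 0.7))

# A misconfigured run: high CFG, weak Scribble, ControlNets never released.
for warning in diagnose(6.0, 0.4, 0.65, 1.0, 1.0, 0.7):
    print("-", warning)
```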

When not to use this workflow

This is the right tool for: turning hand sketches into client-presentable visualization, exploring design directions in early schematic design (SD), generating quick atmosphere shots from rough massing studies, and doing what students and small practices need most: getting from concept to image without paying per render.

It's the wrong tool for: rendering an actual developed BIM model (use Veras), doing client-final hero shots that need to be defensible to scrutiny (still use Veras or a real renderer), or for projects where the building has to be exactly correct and the AI output will be measured against drawings (don't use any AI tool for this; use V-Ray or similar).

Scope what you're using AI for. ComfyUI sketch-to-render is excellent for the conceptual end of the spectrum. It is not a substitute for a developed rendering pipeline.

What's next

Two things to watch this year. First, FLUX.2, if it ships: Black Forest Labs has hinted at architectural fine-tunes in the roadmap, which could remove the need for community LoRAs. Second, the ControlNet ecosystem itself is consolidating; expect a unified architectural ControlNet from one of the major labs by end of year that does what Scribble + Canny + custom training do today.

The workflow above will keep working through both transitions. The principles (stack ControlNets at moderate strength, keep CFG low, let the model interpret your sketch in the late steps) translate to whatever model you swap in. That's the whole point of working in ComfyUI rather than in a closed product. The graph is the asset.


Tested by Vista Studios on FLUX.1 Dev with RTX 4080. All node settings replicable in ComfyUI 1.x.