In plain English
Prompting an image model means writing a short description — usually one to three sentences — and letting the model turn it into a picture. Unlike a chatbot where you have a conversation, image models treat your prompt as a set of descriptive commands: every word acts like a dial, nudging the output toward certain colors, styles, and subjects.
The everyday analogy is giving a creative brief to a very fast illustrator who has seen every image on the internet. You don't need to explain how to draw; you just need to specify what you want — the subject, the mood, the visual style, and any constraints. "A red fox in a forest clearing, golden hour light, watercolor painting style" gives the model four clear signals. "A fox" gives it almost none.
Three tools dominate the space. DALL-E 3 (from OpenAI, also inside ChatGPT) understands natural language well and renders text in images reliably. Midjourney (accessed via Discord or its web app) excels at stylized, artistic results with short, punchy prompts. Stable Diffusion is an open-source family of models you run locally or via APIs — it accepts both natural-language sentences and comma-separated tag lists, and supports the widest range of fine-tuning and extensions.
Why it matters
A well-crafted prompt can be the difference between a generic stock-photo look and a precise, usable asset. The same model — same weights, same settings — will produce strikingly different output depending on whether you wrote "a city street" or "a rainy Tokyo alley at midnight, neon reflections on wet cobblestones, cinematic shot, 35mm film grain."
For product designers, marketers, and developers, this matters practically: generating 10 on-brand concept images in minutes replaces a half-day of stock-photo hunting or briefing a contractor. For developers building AI-powered apps, understanding prompt structure is the layer between your user's intent and the API call that delivers a good result.
Prompting skill also transfers across models. Midjourney and DALL-E differ in syntax, but the underlying vocabulary — subject, medium, style, lighting, composition — is universal. Learn the vocabulary once and you can adapt it to any image model, including the newest ones that ship after this article was written.
How prompts shape the output
When your text prompt enters an image model, it is first encoded into a numeric representation by a text encoder (usually a model like CLIP or T5). That representation steers the denoising process at every step — the model keeps sampling toward pixels that match the description. More specific, higher-signal words in your prompt produce stronger steering. Vague or contradictory words produce weaker or mixed steering, which is why "beautiful amazing stunning epic photo" adds almost nothing.
The prompt anatomy
A strong prompt generally has four layers, ordered from most to least important:
- Subject — the main thing being depicted. Be specific: "a tabby cat asleep on a wooden desk" beats "a cat."
- Style or medium — how it should look: "oil painting", "photorealistic DSLR photo", "flat vector illustration", "isometric 3D render", "1970s movie poster."
- Lighting and mood — "golden hour", "dramatic side lighting", "soft diffuse overcast", "neon-lit night scene", "low-key noir."
- Composition or framing — "close-up portrait", "wide establishing shot", "bird's eye view", "rule of thirds", "shallow depth of field."
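The four layers above can be assembled mechanically. The following is a minimal sketch, not a required tool: `PromptParts` and `build_prompt` are hypothetical names introduced here for illustration, and the comma-joined output is just one reasonable convention.

```python
from dataclasses import dataclass


@dataclass
class PromptParts:
    """The four prompt layers; only the subject is required."""
    subject: str
    style: str = ""
    lighting: str = ""
    composition: str = ""


def build_prompt(parts: PromptParts) -> str:
    """Join the non-empty layers in order of importance, comma-separated."""
    layers = [parts.subject, parts.style, parts.lighting, parts.composition]
    return ", ".join(layer.strip() for layer in layers if layer.strip())


fox = PromptParts(
    subject="a red fox in a forest clearing",
    style="watercolor painting",
    lighting="golden hour light",
    composition="wide establishing shot",
)
print(build_prompt(fox))
```

Keeping the layers as separate fields makes it easy to swap one layer later while leaving the others fixed.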
Platform syntax differences
Each platform has its own dialect:
| Platform | Preferred style | Special syntax |
|---|---|---|
| DALL-E 3 | Full natural-language sentences | None needed — describe intent conversationally; use quotes for literal text in images |
| Midjourney | Short, high-signal comma-separated phrases | --ar 16:9 aspect ratio, --style raw for less stylization, --v 7 model version, --no X for negatives |
| Stable Diffusion | Comma-separated tags or natural language (model-dependent) | (word:1.3) to boost weight; [word] to reduce; separate negative prompt field |
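To make the dialect differences concrete, here is a hedged sketch that renders the same idea three ways. The function names are hypothetical, the output is plain text you would paste into each tool (no API is called), and the exact phrasing of the DALL-E sentence is only one of many that would work.

```python
def for_midjourney(phrases, aspect="16:9", avoid=None, version=None):
    """Short phrases followed by trailing parameters (--ar, --no, --v)."""
    prompt = ", ".join(phrases) + f" --ar {aspect}"
    if avoid:
        prompt += " --no " + ", ".join(avoid)
    if version is not None:
        prompt += f" --v {version}"
    return prompt


def for_stable_diffusion(phrases, avoid=None, emphasis=None):
    """Return (positive, negative) strings; emphasis maps phrase -> weight."""
    emphasis = emphasis or {}
    tags = [f"({p}:{emphasis[p]})" if p in emphasis else p for p in phrases]
    return ", ".join(tags), ", ".join(avoid or [])


def for_dalle(phrases, avoid=None):
    """One plain sentence; things to avoid are stated inline."""
    sentence = ", ".join(phrases)
    if avoid:
        sentence += ", with no " + " or ".join(avoid)
    return sentence[0].upper() + sentence[1:] + "."


phrases = ["a rainy Tokyo alley at midnight", "neon reflections", "35mm film grain"]
avoid = ["people", "cars"]

print(for_midjourney(phrases, avoid=avoid, version=7))
print(for_stable_diffusion(phrases, avoid, {"neon reflections": 1.3}))
print(for_dalle(phrases, avoid))
```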
Negative prompts, aspect ratios, and inpainting
Negative prompts
A negative prompt tells the model what to steer away from. Stable Diffusion and many other local models have a dedicated negative-prompt field. Midjourney uses --no X. DALL-E 3 handles negatives inline in natural language: "a forest with no buildings."
Effective negative prompts address two categories of common failure: quality artifacts (blurry, low resolution, watermark, jpeg artifacts, cropped, poorly drawn) and subject errors (extra fingers, two heads, disfigured, bad anatomy). A standard starter negative prompt for Stable Diffusion:
blurry, low quality, watermark, signature, text, extra fingers, extra limbs,
fused fingers, bad anatomy, disfigured, poorly drawn, cropped, out of frame,
low resolution, jpeg artifacts, grainy
Aspect ratios
Every platform supports non-square outputs. Getting the ratio right matters because the model distributes its compositional attention across the canvas. A portrait prompt on a landscape canvas will add unwanted background filler; a landscape scene in a square crop loses the sense of expanse.
| Use case | Ratio | Midjourney | DALL-E 3 / API | SD (width × height) |
|---|---|---|---|---|
| Social media square | 1:1 | --ar 1:1 | size: 1024x1024 | 1024 × 1024 |
| Landscape / wallpaper | 16:9 | --ar 16:9 | size: 1792x1024 | 1344 × 768 |
| Portrait / phone | 9:16 | --ar 9:16 | size: 1024x1792 | 768 × 1344 |
| Classic photo | 4:3 | --ar 4:3 | not supported natively (crop from 1024x1024 or 1792x1024) | 1024 × 768 |
| Cinematic wide | 2.39:1 | --ar 239:100 (whole numbers only) | not supported natively | 1536 × 640 |
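For Stable Diffusion, width and height can be derived from a target ratio rather than memorized. The sketch below is illustrative only: `dims_for_ratio` is a hypothetical helper, and both the roughly one-megapixel budget and the snapping to multiples of 64 are assumptions that suit SDXL-class models; other checkpoints may prefer different values. With these defaults it reproduces the 1:1, 16:9, and 9:16 rows of the table above.

```python
import math


def dims_for_ratio(ratio_w, ratio_h, target_pixels=1024 * 1024, multiple=64):
    """Pick (width, height) near target_pixels, snapped to a size multiple."""
    aspect = ratio_w / ratio_h
    height = math.sqrt(target_pixels / aspect)
    width = height * aspect

    def snap(value):
        # Round to the nearest multiple, never below one multiple.
        return max(multiple, round(value / multiple) * multiple)

    return snap(width), snap(height)


for ratio in [(1, 1), (16, 9), (9, 16)]:
    print(ratio, dims_for_ratio(*ratio))
```

Because of the snapping, the result only approximates the requested ratio; 1344 × 768, for example, is 1.75:1 rather than exactly 16:9.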
Inpainting
Inpainting lets you edit a region inside an existing image without regenerating the whole thing. You supply an image, draw a mask over the area to change, and write a prompt describing what should replace it. Everything outside the mask stays untouched; the model fills the masked area conditioned on both the surrounding pixels and your prompt.
Practical uses: removing an unwanted object ("empty wooden table" over a cluttered surface), swapping one element (replace a grey sky with "dramatic sunset clouds"), or fixing a generation artifact (mask just the bad hand and reprompt). For DALL-E 3 images, ChatGPT offers a select-and-edit tool; the OpenAI API's image edit endpoint accepts an image and a mask, but it was introduced for DALL-E 2 and model support has varied, so check the current API reference before relying on it with DALL-E 3. Stable Diffusion has a dedicated inpainting pipeline in the diffusers library and in tools like Automatic1111 and ComfyUI. Midjourney has a Vary (Region) feature on any generated image.
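The guarantee that pixels outside the mask stay untouched can be shown with a toy compositing step. This is a conceptual sketch on tiny grayscale grids, not how any particular tool implements inpainting (real pipelines typically work in latent space and feather the mask edge); `composite` is a hypothetical name.

```python
def composite(original, generated, mask):
    """Keep original pixels where mask is 0; take generated pixels where it is 1."""
    return [
        [g if m else o for o, g, m in zip(o_row, g_row, m_row)]
        for o_row, g_row, m_row in zip(original, generated, mask)
    ]


original = [[10, 10, 10],
            [10, 99, 10],   # 99 stands in for the unwanted object
            [10, 10, 10]]
generated = [[55, 55, 55],
             [55, 12, 55],  # the model's proposal for every pixel
             [55, 55, 55]]
mask = [[0, 0, 0],
        [0, 1, 0],          # only the center pixel is editable
        [0, 0, 0]]

for row in composite(original, generated, mask):
    print(row)
```

Only the masked center pixel changes; every other value comes straight from the original image.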
Style keywords and iterating efficiently
Style keywords are the highest-leverage part of a prompt after the subject. The right two words can shift the entire look. Below are reliable categories and examples:
| Category | Example keywords |
|---|---|
| Fine-art medium | oil painting, watercolor, charcoal sketch, pencil drawing, gouache, linocut |
| Photography | DSLR photo, 35mm film, Kodak Portra 400, telephoto compression, macro lens, bokeh |
| Digital art styles | concept art, matte painting, digital illustration, 3D render, Unreal Engine, ray tracing |
| Historical / era | Renaissance, Art Deco, 1970s magazine ad, Bauhaus, Soviet constructivist poster |
| Lighting mood | golden hour, blue hour, hard rim light, soft box, neon, candlelight, overcast |
| Mood / atmosphere | ethereal, gritty, whimsical, melancholic, cinematic, minimalist, maximalist |
Iterating without starting over
Don't try to nail everything in one prompt. A productive workflow:
- Lock the subject first. Get the core content right before adding style words.
- Fix the seed. Once you find a composition you like, note the seed. Changing prompt words while keeping the seed lets you explore variations of the same "skeleton" image.
- Add one variable at a time. Change lighting, then style, then framing — one pass per variable, so you know what moved the needle.
- Use image-to-image for small adjustments. Feed a generation back into the model at a low "denoising strength" (0.3–0.5) to nudge it without losing the structure. Higher strength = more change.
- Document what works. Keep a text file with the exact prompt + seed combinations that produced usable results.
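The workflow above (fixed seed, one variable per pass, a written log) can be sketched in a few lines. Everything here is hypothetical scaffolding: the `Run` fields, `vary`, and `log_line` are names invented for this example, and the JSON-lines format is just one convenient choice for a prompt log.

```python
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class Run:
    subject: str
    style: str
    lighting: str
    seed: int

    def prompt(self) -> str:
        return ", ".join([self.subject, self.style, self.lighting])


def vary(base: Run, field: str, options):
    """Yield copies of base that differ in exactly one field; the seed is kept."""
    for option in options:
        yield replace(base, **{field: option})


def log_line(run: Run) -> str:
    """Serialize one run as a JSON line for a plain-text prompt log."""
    return json.dumps({**asdict(run), "prompt": run.prompt()})


base = Run("a tabby cat asleep on a wooden desk", "oil painting", "golden hour", seed=1234)
for run in vary(base, "lighting", ["golden hour", "low-key noir"]):
    print(log_line(run))
```

Appending each printed line to a text file gives a record of exactly which prompt and seed produced which result.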
Going deeper
Once you're comfortable with basic prompting, a few more techniques open up finer control:
Prompt weighting (Stable Diffusion): Wrap a phrase in parentheses with a multiplier — (oil painting:1.4) — to increase its influence, or [blurry] to reduce it. Numbers above 1 boost, below 1 suppress. The default weight of any word is 1.0. Avoid going above 1.5 or below 0.5; extreme weights produce artifacts. Not all frontends support this syntax.
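To see what a frontend does with this syntax, here is a deliberately simplified parser. It is a sketch under stated assumptions: it handles only `(phrase:weight)` and `[phrase]`, ignores nesting and escapes, and the 1/1.1 factor for square brackets follows the Automatic1111 convention as commonly described; other frontends may differ. `parse_weights` and `out_of_range` are hypothetical names.

```python
import re

# Matches "(phrase:1.4)" or "[phrase]"; everything else is treated as plain text.
TOKEN = re.compile(r"\(([^():]+):([0-9]*\.?[0-9]+)\)|\[([^\[\]]+)\]")


def parse_weights(prompt, bracket_weight=1 / 1.1):
    """Split a prompt into (text, weight) pairs. Plain text gets weight 1.0."""
    pairs, pos = [], 0
    for match in TOKEN.finditer(prompt):
        plain = prompt[pos:match.start()].strip(" ,")
        if plain:
            pairs.append((plain, 1.0))
        if match.group(1) is not None:
            pairs.append((match.group(1).strip(), float(match.group(2))))
        else:
            pairs.append((match.group(3).strip(), bracket_weight))
        pos = match.end()
    tail = prompt[pos:].strip(" ,")
    if tail:
        pairs.append((tail, 1.0))
    return pairs


def out_of_range(pairs, low=0.5, high=1.5):
    """Return phrases whose weight falls outside the 0.5 to 1.5 range."""
    return [text for text, weight in pairs if not low <= weight <= high]


pairs = parse_weights("a harbor at dawn, (oil painting:1.4), [blurry], fog")
print(pairs)
print(out_of_range(parse_weights("(neon:1.8), street")))
```

A check like `out_of_range` is a cheap way to catch extreme weights before spending time on a generation.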
Reference images and image prompts: Midjourney lets you paste an image URL at the start of a prompt to use it as an image prompt that influences content, composition, and style (--iw 0.5 controls how much weight the image gets relative to the text); its separate --sref parameter targets style alone. Stable Diffusion's image-to-image mode takes a starting image and evolves it toward your prompt at a controllable denoising_strength. With DALL-E 3 in ChatGPT, you can attach an image to the same message and ask for a new image based on it; through the OpenAI API, a vision-capable model can first describe the image, and that description can then be folded into the DALL-E prompt.
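One way to build intuition for the image-to-image strength setting: in the diffusers img2img pipelines, as far as I understand their behavior, strength decides how many of the scheduled denoising steps actually run on the noised input image. The helper below is a hypothetical illustration of that arithmetic, not a call into any library, and other frontends may compute it differently.

```python
def img2img_steps(num_inference_steps: int, strength: float) -> int:
    """Approximate number of denoising steps applied to the input image."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    # Assumption: steps scale linearly with strength and are truncated.
    return min(int(num_inference_steps * strength), num_inference_steps)


for strength in (0.25, 0.5, 1.0):
    print(strength, img2img_steps(40, strength))
```

Fewer steps means less of the original structure is overwritten, which is why low strength values nudge an image while high values largely replace it.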
ControlNet (Stable Diffusion): ControlNet feeds the model a spatial condition — an edge map, a pose skeleton, a depth map, a scribble — that hard-constrains the composition without restricting the style prompt. Want to regenerate a scene in a new art style but keep the same character poses? Extract pose keypoints from a reference image, feed them to ControlNet, and change only the style words in your prompt. It's the most powerful compositional tool in the open-source stack.
Text rendering: Stable Diffusion and earlier DALL-E versions notoriously botched readable text inside images. DALL-E 3 improved this dramatically — use double quotation marks around the exact string you want rendered: a billboard that says "SALE TODAY". Midjourney V7 also has improved text rendering. For Stable Diffusion, specialized models and ControlNet with OCR conditions produce the most reliable results. If readable text matters for your use case, DALL-E 3 is currently the strongest off-the-shelf choice.
The guidance scale / CFG tradeoff: Stable Diffusion exposes a guidance scale (often labeled CFG) directly; DALL-E 3 sets it internally and gives you no control over it. Low values (~3–5) give the model creative latitude but may drift from your prompt; high values (~10–15) force literal adherence but can produce oversaturated, "crunchy" artifacts. A value of 7 to 8 is a reliable default. Inpainting often benefits from a slightly lower guidance scale so the filled region blends smoothly with unchanged surroundings.
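The arithmetic behind the guidance scale is short enough to write out. The sketch below shows the standard classifier-free guidance combination on plain Python lists; real implementations apply the same formula to noise-prediction tensors at every denoising step. `cfg_combine` is a hypothetical name, and the numbers are toy values chosen only to make the effect visible.

```python
def cfg_combine(uncond, cond, scale):
    """Classifier-free guidance: start from the unconditional prediction and
    move toward the prompt-conditioned one, scaled by the guidance value."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


uncond = [0.0, 1.0]   # prediction with an empty (or negative) prompt
cond = [1.0, 3.0]     # prediction with your prompt

print(cfg_combine(uncond, cond, 1.0))
print(cfg_combine(uncond, cond, 7.5))
```

At a scale of 1 the result equals the conditioned prediction; larger scales extrapolate past it, which is why very high values overshoot into oversaturated output. In common Stable Diffusion frontends the negative prompt takes the place of the empty prompt in `uncond`, so the same formula pushes the result away from whatever the negative prompt describes.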
FAQ
What is the best prompt structure for AI image generation?
Lead with your subject, then add the visual medium or style (oil painting, DSLR photo), then lighting and mood (golden hour, dramatic rim light), then composition details (wide shot, shallow depth of field). Four layers of specificity in that order will outperform a vague sentence or a list of buzzwords.
How do negative prompts work in Stable Diffusion?
Stable Diffusion has a separate negative-prompt field. Whatever you list there steers the denoising away from those visual qualities at every step. Useful defaults include blurry, low quality, extra fingers, bad anatomy, watermark, and cropped. Keep the list focused — 15 to 20 terms is enough.
What aspect ratio should I use for Midjourney or DALL-E?
Match the ratio to the intended use. Use --ar 1:1 (Midjourney) or 1024x1024 (DALL-E API) for square social media, --ar 16:9 or 1792x1024 for landscape/wallpaper, and --ar 9:16 or 1024x1792 for portrait or mobile-first content. Composing with the right ratio avoids filler and keeps the model's attention on the main subject.
What is inpainting and when should I use it?
Inpainting lets you edit a specific region of an image while keeping everything else intact. You mask the area you want to change and write a prompt for what should replace it. It's the right tool for removing objects, fixing generation artifacts (like a bad hand), or swapping an element without regenerating the whole image.
Why does DALL-E 3 handle text better than Stable Diffusion?
DALL-E 3 was trained on highly descriptive captions, which OpenAI reports improved how closely images follow the prompt, and that likely extends to short strings of text. Stable Diffusion's earlier open models were trained on noisier web captions without that focus, so letters often merge or distort. Wrapping the exact string in double quotes in a DALL-E prompt signals that it should be treated as literal text.
What is the guidance scale (CFG) in image generation?
Guidance scale controls how tightly the model follows your prompt versus exercising its own creativity. Low values (3–5) allow more variation but may drift from your description; high values (10–15) stick closely to the words but can cause oversaturation and artifacts. A value of 7 to 8 is a reliable starting default for most models.