Both Stable Diffusion XL 1.0 (SDXL 1.0) and DALL-E 3 AI image generators create amazing quality images, but which AI follows the prompt instructions better?
This is an important question for anyone interested in producing a range of similar-style images, or where the prompt engineer wants something highly specific. If we ask for an old dog but the AI generates a cute, bright-eyed puppy, it’s a failed interpretation of the text-to-image prompt.
My previous article discussed how inconsistent the results are for short text-to-image prompts, so for this AI generator test we’ll be using relatively long prompts, since they tend to produce more consistent results.
AI Generated Images for Old Dog Sitting Looking at a War Memorial in the Rain
The top 3 images above are from DALL-E 3 and the bottom 3 are from SDXL 1.0.
DALL-E 3 vs SDXL 1.0 Prompt Interpretation Test
Test AI text-to-image prompt: “Watercolor, old dog sitting looking at a war memorial in the rain, his back facing us, over the shoulder shot, rainy gloomy atmosphere. Red poppies. Soft watercolor, complex contrast, pastel colours, masterpiece.”
Above is one of the DALL-E 3 creations and below is an SDXL 1.0 creation. Both generated an image relevant to the prompt, though DALL-E respected the “soft watercolor” part of the prompt better than Stable Diffusion. This was true for all the images created for this test: all the DALL-E 3 images were soft watercolors, while Stable Diffusion’s were more of a vibrant watercolor.
For Stable Diffusion, I sometimes included the NightCafe Studio default negative prompts, though I don’t think they made much difference.
What I wanted from the AI image generators were watercolors of an old dog that was looking at a war memorial with red poppies in a rainy environment.
For DALL-E 3 I ran the prompt (via BING Designer) 5 times. Each run generated 4 images, so DALL-E generated 20 images.
For Stable Diffusion XL 1.0 I ran the prompt (via NightCafe Studio) 12 times. Each run generated 1 image, so SDXL 1.0 generated 12 images.
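If you want to double-check the totals, the run counts above reduce to simple arithmetic. Here’s a minimal sketch in Python (the run and batch numbers are the ones from this test; the variable names are just my own):

```python
# Total images per generator = runs × images per run
# (numbers taken from the test described in this article)
runs = {
    "DALL-E 3": (5, 4),   # 5 runs, 4 images each
    "SDXL 1.0": (12, 1),  # 12 runs, 1 image each
}

totals = {tool: n_runs * per_run for tool, (n_runs, per_run) in runs.items()}
print(totals)                # {'DALL-E 3': 20, 'SDXL 1.0': 12}
print(sum(totals.values()))  # 32 images in total
```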
Both DALL-E 3 and SDXL 1.0 gave good results, but overall DALL-E 3 followed the prompt instructions better, interpreting the prompt with more accuracy than SDXL 1.0.
You can see from the 32 images generated (see the two combined screenshots below) that there are some really good results. You wouldn’t struggle to choose a few good DALL-E 3 or SDXL 1.0 images, so both AI image generators created usable images.
My plan was to add a few of the best DALL-E images and a few of the best Stable Diffusion XL images to this article, but I had a real problem selecting only a few: there are a LOT of good AI creations. The first two images (posted earlier), titled “DALL-E 3 Watercolor of an Old Dog Staring at a War Memorial 1” and “SDXL 1.0 Watercolor of an Old Dog Staring at a War Memorial 1”, are two of my favorites. The DALL-E creation shows a really good war memorial and the SDXL creation looks like the dog is with its owner.
Sadly I had to choose images which showed something helpful to readers of this article, rather than the best ones, so the majority of the remaining images aren’t necessarily the best.
The next SDXL 1.0 image is representative of an issue with Stable Diffusion XL 1.0: prompt part mixing. What I mean by prompt part mixing (I’m sure there’s a better term for this?) is separate parts of the prompt mushing together.
For this text prompt, the parts referencing the dog having its back turned:
“his back facing us, over the shoulder shot”
result in the dog facing away (as we want), but they also tend to make any soldier, person or animal on the war memorial face away as well. 8 of the SDXL images had a soldier, person or animal as part of the war memorial, and 7 of the 8 faced away! Only 1 creation looked towards the ‘camera’. Two of the images looked a bit stupid, with soldiers facing the war memorial (example above).
In comparison, of the 20 DALL-E 3 creations, 17 had soldiers, people or animals as part of the war memorial, and in 14 of those either the figure faced the ‘camera’, or there were multiple soldiers/people and some faced the ‘camera’. The DALL-E 3 creations were far more realistic in this regard. The example below was a mixed one.
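These tallies are easier to compare as proportions. A quick sketch (the counts are the ones reported in this test; the function name is just mine):

```python
# Fraction of images, among those whose memorial includes a figure,
# where the figure(s) face away from the 'camera'.
# Counts come from this test: SDXL 7 of 8, DALL-E 3 only 3 of 17.
def facing_away_rate(faced_away: int, with_figures: int) -> float:
    return faced_away / with_figures

sdxl = facing_away_rate(7, 8)          # 7 of 8 SDXL memorial figures faced away
dalle = facing_away_rate(17 - 14, 17)  # 3 of 17 DALL-E 3 figures faced away

print(f"SDXL: {sdxl:.0%}, DALL-E 3: {dalle:.0%}")  # SDXL: 88%, DALL-E 3: 18%
```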
The DALL-E 3 creation above is really good: you could imagine an old veteran sitting with his old dog, looking at a war memorial. The old soldier looks like he’s wearing old army-issue boots. It invokes strong emotions; very impressive for an AI-generated image.
A few of the Stable Diffusion creations had poorly placed dogs/war memorials (usually double war memorials). In one image the dog sat directly in front of the war memorial and the memorial’s cross intersected with the dog’s head. Of course this could happen in real life, but I don’t believe any watercolor artist would deliberately paint an image like that. In two images there were two memorials and the dogs were either placed behind or too close to one of them. Again, this could happen in reality, but if this were a photograph it would be a poorly thought-through shot. Example below.
Looking through the 20 DALL-E images, there are no major problems with any of them. The worst image generated wasn’t bad per se, but the war memorial had an umbrella, which is highly unlikely to exist in reality. If you were looking for a serious war memorial image, this would be your last choice. Looking for a funny one, maybe.
For the old dog at a war memorial prompt, both SDXL and DALL-E did a good job generating relevant images, but IMHO DALL-E 3 was overall better than Stable Diffusion XL 1.0 at interpreting the prompt correctly.
SDXL 1.0 Prompt Interpretation Fail
My wife also creates AI images using SDXL 1.0 and DALL-E 3, and she had a major problem with a “Tornado beast” prompt. She could not get a good image out of Stable Diffusion XL.
Text-to-Image Prompt: “A Tornado beast. The tornado is a beast made from black clouds, a black tornado beast with claws and sharp teeth, fills the entire space, feelings of fear, feeling of horror, a devastated city and people running away in fear. Epic cinematic brilliant stunning intricate meticulously detailed dramatic atmospheric maximalist digital matte painting.”
With Stable Diffusion the settings were left at their defaults with no negative prompt.
You can see from the image above that SDXL 1.0 failed to interpret the prompt correctly. Stable Diffusion XL basically generated a city experiencing a devastating tornado: they are good creations, but NOT what the prompt asked for. There’s very little indication of a “Tornado beast” or “a black tornado beast with claws and sharp teeth”.
In comparison, DALL-E 3 interpreted the prompt accurately. All 4 DALL-E creations depicted a tornado beast with sharp teeth and claws. DALL-E also better depicted people running away in fear in 3 of the 4 images, whilst SDXL only had 2 images with people, and it didn’t depict them “running away in fear”.
DALL-E 3 succeeded where SDXL 1.0 failed. Let me know what you think in the comments section below!