Procedural Aesthetics | Emerging Styles by Ensembling Procedurally-Anchored Models

Generating Procedural Aesthetics - Emerging Styles without Art Datasets

Current text-to-image models like Stable Diffusion are incredible, but they carry an inherent limitation: their stylistic range is strongly shaped by what they’ve encountered in large-scale training data, which can include copyrighted material. This creates:

Legal & Ethical Risks: Dependence on copyrighted artwork.
Data Bias: Styles are limited to what's in the datasets (online / digitized archives).

This gallery presents flower images that look painted, printed, stitched, or otherwise artistically stylized, yet the system that generated them was never trained on paintings, illustrations, or any art dataset. It also uses no pre-trained checkpoints derived from any art style or art dataset.

Research Questions:

(1) How much can a generative model internalize a recognizable visual style from procedural supervision alone, without ever seeing human-made artworks?

(2) Can an ensemble of several models, each procedurally anchored to a different art style, produce emergent yet visually coherent art styles?

To explore this, I trained a diffusion-based text-to-image model entirely from scratch using only real flower photographs (Oxford Flower-102). No pre-trained checkpoints. No inherited aesthetic priors. Any “artistic” qualities here are not copied from art history; they are grown through constraint.

The method is called procedural style anchoring. After learning to generate flowers from text, the model is fine-tuned on the same photographs after they have been transformed by deterministic procedural filters, such as posterization, gouache-watercolor, pointillist sampling, mosaic tiling, cross-hatching fields, low-poly abstraction, anisotropic Kuwahara smoothing, and more. These procedures act like a kind of visual physics: they impose consistent rules about edges, texture, and color, and the model internalizes those rules as a stable aesthetic.
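To make "deterministic procedural filter" concrete, here is a minimal posterization sketch. It is an illustration only, not the filter used in this project: the function name `posterize`, the `levels` parameter, and the synthetic gradient standing in for a flower photograph are all assumptions.

```python
import numpy as np


def posterize(image: np.ndarray, levels: int = 4) -> np.ndarray:
    """Quantize a uint8 image to `levels` evenly spaced intensities per channel.

    Hypothetical helper for illustration; the project's own filters may differ.
    """
    if levels < 2:
        raise ValueError("levels must be at least 2")
    # Map 0..255 onto bin indices 0..levels-1.
    bins = np.floor(image.astype(np.float32) / 256.0 * levels)
    # Spread the bin indices back over the full 0..255 range.
    out = bins * (255.0 / (levels - 1))
    return np.round(out).astype(np.uint8)


# A horizontal gradient stands in for a flower photograph.
gradient = np.tile(np.arange(256, dtype=np.uint8), (8, 1))
flat = posterize(gradient, levels=4)
```

Because the filter has no random component, the same photograph always maps to the same stylized target, which is what lets the fine-tuned model pick up a consistent rule rather than noise.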

The highlight of this work — and the engine of its creative surprise — is ensembling. Instead of relying on one style model at a time, I combine multiple procedurally anchored models during generation using weighted soft blending, sometimes block-wise across the network. Different “style instincts” can influence different parts of the diffusion process, allowing hybrid aesthetics to emerge: halftone structure with fauvist color, mosaic geometry with painterly softness, felt-like texture fused with poster-flat tonal fields. The result is not a simple average, but an interference pattern where new visual dialects appear between established styles.

A more detailed description of how the ensemble is built follows the first Gallery section. I have kept the technical detail light, since this page focuses on the art and aesthetics.

The implementation code is available on GitHub.

"Can a model grow a recognizable “art style” without learning from human artworks — purely from rules, constraints, and the dynamics of generation? Browse slowly. Some images demonstrate single anchored styles; others embrace the ensemble hybrids where the most unexpected forms bloom. Enjoy viewing!"

Gallery

The prompt is standardized across all images in this gallery for a fair benchmark.

Prompt used: "a {flower_name} in the wild, in {art_style} painting style"

Here we compare several different procedural styles per flower class.

Oxeye Daisy

Models Ensemble: Oil + Poster + Felt, Mix Plan={"down": [0, 1, 0, 2], "mid": 2, "up": [1, 2, 0, 1]}

Oil 20 / Poster 20 / Lowpoly 60 — Models Ensemble: Soft-blend Oil 0.20/Poster 0.20/Lowpoly 0.60

Oil 30 / Poster 25 / Felt 45 — Models Ensemble: Soft-blend Oil 0.30/Poster 0.25/Felt 0.45

Watercolor 30 / Pointillism 40 / Mosaic 30 — Models Ensemble: Soft-blend Watercolor 0.30/Pointillism 0.40/Mosaic 0.30

Watercolor 40 / Lowpoly 10 / Pointillism 50 — Models Ensemble: Soft-blend Watercolor 0.40/Lowpoly 0.10/Pointillism 0.50

Mix Plan: Lowpoly + Mosaic + Pointillism — Models Ensemble: Lowpoly + Mosaic + Pointillism, Mix Plan={"down": [1, 0, 1, 2], "mid": 2, "up": [1, 1, 0, 1]}

Want to know more about the ensembling techniques?

Two techniques were used to combine the models block by block: soft-blend and hard voting with a mix plan.

Soft-blend

Only blocks with cross-attention (e.g., CrossAttnDownBlock2D, mid_block if cross-attn, and CrossAttnUpBlock2D) were soft-blended. For example, in the case of [Soft-blend Oil 0.30/Poster 0.25/Felt 0.45]:

0.45 Felt (dominant): pushes the attention-modulated features toward material-like texture statistics, e.g., fibrous, tactile, craft-like surfaces.
0.30 Oil (secondary): contributes painterly continuity, smooth pigment transitions, thicker “stroke-like” coherence.
0.25 Poster (supporting): tends to reinforce graphic structure, simplified tonal regions, sharper contour intent.

By restricting soft blending to cross-attention blocks, we are mixing models primarily in the components that perform text-guided feature modulation (semantic/style binding), while leaving the rest of the convolutional/residual machinery largely intact (often from a single base model).
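The idea can be sketched as a weighted average restricted to selected blocks. This is a minimal sketch under stated assumptions, not the project's implementation: it assumes the blend is applied to parameters (the description above does not say whether weights or activations are blended), it assumes diffusers-style parameter names such as `down_blocks.0.`, and the function `soft_blend`, the `blend_prefixes` argument, and the toy values are hypothetical. Plain floats stand in for tensors.

```python
from typing import Dict, Sequence


def soft_blend(
    state_dicts: Sequence[Dict[str, float]],
    weights: Sequence[float],
    blend_prefixes: Sequence[str],
    base_index: int = 0,
) -> Dict[str, float]:
    """Average parameters whose names start with one of `blend_prefixes`.

    All other parameters are copied unchanged from the base model.
    Hypothetical helper; values may be floats or tensor-like objects.
    """
    if len(state_dicts) != len(weights):
        raise ValueError("need exactly one weight per model")
    if abs(sum(weights) - 1.0) > 1e-6:
        raise ValueError("weights should sum to 1")
    prefixes = tuple(blend_prefixes)
    blended = dict(state_dicts[base_index])  # start from the base model
    for name in blended:
        if name.startswith(prefixes):
            blended[name] = sum(w * sd[name] for w, sd in zip(weights, state_dicts))
    return blended


# Toy "models": one parameter in a cross-attention block, one in a plain block.
oil = {"down_blocks.0.attn.weight": 1.0, "down_blocks.3.conv.weight": 1.0}
poster = {"down_blocks.0.attn.weight": 2.0, "down_blocks.3.conv.weight": 2.0}
felt = {"down_blocks.0.attn.weight": 4.0, "down_blocks.3.conv.weight": 4.0}

blended = soft_blend(
    [oil, poster, felt],
    weights=[0.30, 0.25, 0.45],
    blend_prefixes=["down_blocks.0."],  # assumed to be a cross-attention block
    base_index=0,
)
```

In this toy case the cross-attention parameter becomes a 0.30/0.25/0.45 mix of the three models, while the parameter in the untouched block keeps the base model's value.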

Mix-Plan (hard-voting)

All the UNets share the same architecture, with 4 down blocks, 1 mid block, and 4 up blocks: Ũ = {Down1, Down2, Down3, Down4, Mid, Up1, Up2, Up3, Up4}. The mix plan is a block-wise parameter grafting rule that builds a single composite UNet, Ũ, by selecting (hard mixing) which trained model supplies the parameters of each block. For example, in the case of [Post-Impressionism + Poster + Felt, Mix Plan={"down": [0, 1, 0, 2], "mid": 2, "up": [1, 2, 0, 1]}]:

First, the plan refers to three independently trained UNets:

Model index-0: Post-Impressionism style
Model index-1: Poster style
Model index-2: Felt style

Then, the mix plan specifies how to combine these models across different blocks:

Down path (encoder): "down": [0, 1, 0, 2] (each index indicates which model supplies that block):
- Earlier down blocks (Down1, Down2) operate at higher spatial resolutions, shaping edges, local brush marks, and early compositional cues.
- Later down blocks (Down3, Down4) operate at lower resolutions (more abstract), encoding global structure and coarse style statistics.
- So this plan says: start with Post-Imp low-level flavor, inject Poster stylization in the second down stage, return to Post-Imp mid-encoder features, then push the deepest encoder representation toward Felt’s coarse “material” prior.
Bottleneck: "mid": 2:
- The mid block is a high-receptive-field transform operating on the most compressed latent representation. In practice it strongly influences global texture regime and long-range coherence. Assigning it to Felt biases the global representation toward material-like, fibrous, and craft-texture patterns.
Up path (decoder): "up": [1, 2, 0, 1]:
- Up blocks reconstruct detail while fusing skip connections from the encoder. Later up blocks (closer to output) tend to control fine textures, edge crispness, and rendering style.
- This plan says: use Poster to begin decoding (likely encouraging flatter, graphic structure), then Felt to introduce material texture, then Post-Imp to reintroduce painterly color/softness, and finally Poster again to enforce crisp contours / posterization at the output stage.
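The grafting rule described above can be sketched in a few lines. This is a minimal sketch, not the project's code: it assumes diffusers-style parameter names (`down_blocks.N.`, `mid_block.`, `up_blocks.N.`), and it assumes that parameters outside the nine blocks (for example an input convolution) come from model 0, which the description above does not specify. The names `block_owner` and `graft` are hypothetical, and string labels stand in for tensors.

```python
from typing import Dict, List, Sequence, Union

MixPlan = Dict[str, Union[int, List[int]]]


def block_owner(name: str, mix_plan: MixPlan, default: int = 0) -> int:
    """Return the index of the model that supplies the parameter `name`."""
    parts = name.split(".")
    if parts[0] == "down_blocks":
        return mix_plan["down"][int(parts[1])]
    if parts[0] == "up_blocks":
        return mix_plan["up"][int(parts[1])]
    if parts[0] == "mid_block":
        return mix_plan["mid"]
    return default  # assumption: everything else comes from the default model


def graft(state_dicts: Sequence[Dict[str, object]], mix_plan: MixPlan,
          default: int = 0) -> Dict[str, object]:
    """Build one composite state dict by picking each block from one model."""
    return {
        name: state_dicts[block_owner(name, mix_plan, default)][name]
        for name in state_dicts[0]
    }


# Toy parameter names; each value is just a label saying which model it came from.
names = (["conv_in.weight"]
         + [f"down_blocks.{i}.w" for i in range(4)]
         + ["mid_block.w"]
         + [f"up_blocks.{i}.w" for i in range(4)])
models = [{n: label for n in names} for label in ("post_imp", "poster", "felt")]

plan = {"down": [0, 1, 0, 2], "mid": 2, "up": [1, 2, 0, 1]}
composite = graft(models, plan)
```

Reading `composite` back in block order gives Post-Imp, Poster, Post-Imp, Felt for the down path, Felt for the mid block, and Poster, Felt, Post-Imp, Poster for the up path, matching the walkthrough above.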