AI Art Generation Handbook/Training

What is training?

Training is a method of introducing a new concept to a model so that the model can "learn" it. Currently there are several established training methods:

(a) Dreambooth

Dreambooth is a method of training a model on a specific topic while keeping its original features intact. Class Images help to retain the information from the base model. The output is a new model. It works well for all kinds of concepts (both Subjects and Styles), but the downside is the size of the output file.

(b) Fine-tuning

Fine-tuning is similar to Dreambooth but it modifies the model’s knowledge of the trained topic. No Class Images are needed because the goal is to overwrite the existing concepts. The output is also a new model.

It is best suited to specific niche concepts (for example, generating Egyptian or Norse mythological creatures) that current Stable Diffusion versions cannot produce. The downside is that it may cause catastrophic "memory loss" of the previously trained weights, because it overwrites existing concepts.

(c) LoRA (Low-Rank Adaptation)

LoRA training is a lighter alternative to Dreambooth that uses fewer resources and produces smaller files. It uses a technique called "low-rank approximation" of the weight matrix, which reduces the number of parameters that have to be trained and may improve generalization. However, some quality may be compromised in exchange for these advantages. The output is a LoRA model (a small add-on file rather than a full model) that can be used with compatible base models.
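To give a feel for why a low-rank update produces smaller files, here is a minimal sketch that only counts parameters. It assumes a single weight matrix of size 768 x 768 and a rank of 8; both numbers, and the function names, are illustrative choices and not values taken from any particular trainer.

```python
def full_update_params(d_out, d_in):
    """Parameters needed to store a full update of a d_out x d_in weight matrix."""
    return d_out * d_in


def lora_update_params(d_out, d_in, rank):
    """Parameters needed for a low-rank update B (d_out x rank) times A (rank x d_in)."""
    return rank * (d_out + d_in)


# Illustrative sizes (assumptions, not values from a specific model or tool).
d_out, d_in, rank = 768, 768, 8

full = full_update_params(d_out, d_in)
lora = lora_update_params(d_out, d_in, rank)

print("full update parameters:", full)
print("low-rank update parameters:", lora)
print("fraction of full size:", lora / full)
```

The lower the rank, the fewer parameters are stored, which is also why a very low rank may not capture fine details.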

It is good for styles (especially anime styles) but not as good for subjects such as realistic faces.

LyCORIS - Lora beYond Conventional methods, Other Rank adaptation Implementations for Stable diffusion

LyCORIS is a newer family of training methods that tries to find better ways of fine-tuning Stable Diffusion models with fewer parameters. It builds on LoRA models, which are small add-on files rather than full Stable Diffusion models, so they train faster and use less memory. LyCORIS experiments with different ways of constructing the LoRA update to fit different tasks or domains, such as using a mathematical operation called the Hadamard product (element-wise multiplication) to combine low-rank matrices.
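The Hadamard product mentioned above is simply element-wise multiplication of two matrices of the same shape. The sketch below builds two small low-rank products and multiplies them element by element; the matrices and helper names are made up for illustration and are not taken from the LyCORIS code.

```python
def matmul(a, b):
    """Plain matrix multiplication for lists of lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def hadamard(a, b):
    """Element-wise (Hadamard) product of two matrices of the same shape."""
    return [[x * y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


# Two illustrative low-rank factor pairs (hypothetical values), each giving a 2x2 matrix.
b1, a1 = [[1.0], [2.0]], [[1.0, 0.0]]
b2, a2 = [[1.0], [1.0]], [[3.0, 4.0]]

# The weight update is the element-wise product of the two low-rank products.
delta = hadamard(matmul(b1, a1), matmul(b2, a2))
print(delta)
```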

(d) Textual Inversion (TI)

Textual Inversion training allows you to train a person/object/style as a separate token that can be applied to any model (depending on the web-ui). The output is an Embedding that contains the trained token.

(e) Hypernetwork

A hypernetwork is a way of fine-tuning a model’s style by attaching a small network to it. The small network modifies the cross-attention module of the model while the main model is frozen, so it is fast and efficient. This makes it possible to train without altering the main model's weights.

At the moment, many of the training extensions do not work well in Automatic1111 (due to dependency conflicts, Gradio bugs or PyTorch bugs), so it is preferable to use an alternative app such as StableTuner, Kohya-ss or EveryDream 2 for training.

Rules of thumb:

If you want to train a single concept (such as a type of face or a type of car), it is recommended to train a LoRA. Otherwise, use Dreambooth or fine-tuning.

What is meant by a concept?

A concept is a word, together with the idea associated with it, that you want the model to learn, provided you have a large enough dataset to train the model with.

Concepts can be divided into two types: Subjects and Styles.

(a) Subjects are the main objects or entities that you want the model to recognize or generate, such as faces, animals (e.g. certain dog breeds), cars (e.g. vintage racing cars), etc.

(b) Styles are the visual characteristics or features that you want the model to apply or modify, such as colors, textures, shapes, etc. (e.g. a certain art style such as Hokusai art or paper cut).

There is no fixed limit on the number of concepts, but the rule of thumb is: the more concepts you train, the higher the GPU VRAM usage and the more time-consuming the training will be.

Since models are observed not to increase in size during training, we can infer that the longer you train, the more likely you are to lose some quality in what your model already knew before the training.

What is meant by weight?

In neural networks, "weights" are the values that are multiplied by the inputs of each neuron. They represent how important each input is to the output of the neuron. Weights are initialized randomly at first and then updated during training to minimize the error between the predicted output and the actual output.

The weights in Stable Diffusion are the values that represent what the model has retained. These numbers drive the choices Stable Diffusion makes while generating a picture and following a prompt: they determine how the model transforms the input image or noise into the output image. The weights are learned during training by minimizing the error between the generated images and the target images.
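As a toy illustration of the idea (a single neuron, not Stable Diffusion itself), the sketch below multiplies inputs by weights and then nudges the weights to shrink the error. The starting weights are fixed instead of random so the run is repeatable, and the learning rate and all values are arbitrary assumptions.

```python
def neuron_output(weights, inputs):
    """Weighted sum of the inputs: each input is multiplied by its weight."""
    return sum(w * x for w, x in zip(weights, inputs))


def train_step(weights, inputs, target, learning_rate=0.1):
    """One gradient-descent update that reduces the squared error."""
    error = neuron_output(weights, inputs) - target
    # The gradient of 0.5 * error**2 with respect to each weight is error * input.
    return [w - learning_rate * error * x for w, x in zip(weights, inputs)]


# Fixed starting weights (normally random) and one made-up training example.
weights = [0.5, -0.5]
inputs = [1.0, 2.0]
target = 1.0

error_before = abs(neuron_output(weights, inputs) - target)
for _ in range(20):
    weights = train_step(weights, inputs, target)
error_after = abs(neuron_output(weights, inputs) - target)

print("error before training:", error_before)
print("error after training:", error_after)
```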

Because models do not grow in size during training, the longer you train, the more the quality of what the model knew before the training is harmed. Training on other data as well (regularization/class images) can help reduce this effect to some extent.
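One common way to train on regularization data alongside the new concept is to add a second loss term for the class images. The sketch below is only an illustration of that idea: the function names, the weighting parameter prior_weight and the numbers are assumptions, and real trainers compute these losses on image batches, not on short lists.

```python
def mean_squared_error(predicted, target):
    """Average squared difference between two equally long lists of numbers."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


def combined_loss(instance_loss, class_loss, prior_weight=1.0):
    """Loss on the new concept plus a weighted loss on the regularization/class data."""
    return instance_loss + prior_weight * class_loss


# Made-up predictions and targets, purely for illustration.
instance_loss = mean_squared_error([0.2, 0.4], [0.0, 0.0])  # error on the new concept
class_loss = mean_squared_error([0.1, 0.1], [0.0, 0.0])     # error on the class images

total = combined_loss(instance_loss, class_loss, prior_weight=1.0)
print("total loss:", total)
```

A larger prior_weight pushes training to preserve more of what the model already knew, at the cost of learning the new concept more slowly.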

What are the prerequisites for model training?

Here are the most important prerequisites for model training (regardless of which type of training you choose):

(a) A computer system with a powerful GPU that has at least 10GB of VRAM. Preferably an NVIDIA GPU, as many training methods support CUDA out of the box.

Stable Diffusion training is a computationally intensive task that requires a lot of memory to store and process large amounts of data. VRAM stands for video random access memory, which is the memory that a GPU uses to store and access data. The more VRAM a GPU has, the more data it can store and process at once, which can speed up the training process and allow for larger models and higher resolutions (e.g. SD1.5 processes images up to 512*512px with 4GB of VRAM, while SD2.1 processes images up to 768*768px).
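A rough back-of-the-envelope estimate shows why training needs so much VRAM. The sketch below assumes a model with about 860 million trainable parameters (an approximate figure used here as an assumption), 32-bit weights and gradients, and an optimizer that keeps two extra 32-bit values per parameter. It ignores activations and image batches, so real usage will differ.

```python
def training_memory_gib(n_params, weight_bytes=4, gradient_bytes=4, optimizer_bytes=8):
    """Rough memory for weights, gradients and optimizer state, in GiB."""
    total_bytes = n_params * (weight_bytes + gradient_bytes + optimizer_bytes)
    return total_bytes / 1024 ** 3


def weights_only_gib(n_params, weight_bytes=2):
    """Rough memory for just holding the weights (16-bit by default), in GiB."""
    return n_params * weight_bytes / 1024 ** 3


# Approximate parameter count, used as an assumption for illustration only.
n_params = 860_000_000

print("full training estimate (GiB):", round(training_memory_gib(n_params), 2))
print("weights only, 16-bit (GiB):", round(weights_only_gib(n_params), 2))
```

Trainers usually reduce this figure with techniques such as mixed precision or 8-bit optimizers, and methods like LoRA train far fewer parameters, which is why they fit on smaller GPUs.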

(b) High-quality images prepared as a dataset

To be regarded as high quality, the images should meet the following guidelines:

(i) Refer to any stock image website to get a sense of how the images should look (composition, lighting, contrast, etc.).

(ii) Preferably high-resolution images (at least 512*512px is recommended; more than 768*768px is best). If the images then need to be downscaled to 512*512px, not much detail will be lost.

(iii) If training a subject, the subject should preferably be the main focus of each image. There should not be many other subjects in the picture that may interfere with training.

(iv) Diversity of the images. The images of the subject should be as diverse as possible (taken in different lighting conditions, different places, different poses, different expressions, ...).
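The resolution guideline above can be checked before training. The sketch below works only on image sizes (width and height in pixels) and computes a centered square crop; the function names and the example sizes are hypothetical, and loading or resizing the actual files is left to whatever image tool you use.

```python
def is_large_enough(width, height, min_side=512):
    """True if the shorter side has at least min_side pixels."""
    return min(width, height) >= min_side


def center_crop_box(width, height):
    """Return (left, top, right, bottom) of the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return (left, top, left + side, top + side)


# Hypothetical image sizes for illustration.
image_sizes = {"portrait.jpg": (768, 1024), "landscape.jpg": (1024, 768), "small.jpg": (400, 900)}

for name, (width, height) in image_sizes.items():
    if is_large_enough(width, height):
        print(name, "crop box:", center_crop_box(width, height))
    else:
        print(name, "is below the recommended resolution")
```

A centered crop can cut off the subject if it is near the edge of the picture, so it is worth reviewing the cropped images by eye.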