No outputs (weights) or validation result images are created during training #11437

Closed
dedoogong opened this issue Apr 28, 2025 · 1 comment
Labels
bug Something isn't working

Comments


dedoogong commented Apr 28, 2025

Describe the bug

I'm running train_controlnet_flux.py on an A100 with the following command:
accelerate launch train_controlnet_flux.py
--pretrained_model_name_or_path="black-forest-labs/FLUX.1-dev"
--conditioning_image_column=conditioning_image
--image_column=image
--caption_column=text
--output_dir="output"
--mixed_precision="bf16"
--resolution=512
--learning_rate=1e-5
--max_train_steps=150000000
--validation_steps=100
--checkpointing_steps=200
--validation_image "./1.jpg" "./2.jpg" "./3.jpg" "./4.jpg"
--validation_prompt "outdoor, building, car, city street, person, pavement, sidewalk, pole, road, traffic light, street corner, street scene, street sign, urban, tree" "outdoor, building, car, city street, person, pavement, sidewalk, pole, road, traffic light, street corner, street scene, street sign, urban, tree" "outdoor, building, car, city street, person, pavement, sidewalk, pole, road, traffic light, street corner, street scene, street sign, urban, tree" "outdoor, building, car, city street, person, pavement, sidewalk, pole, road, traffic light, street corner, street scene, street sign, urban, tree"
--train_batch_size=1
--gradient_accumulation_steps=1
--lr_scheduler="cosine"
--num_double_layers=4
--num_single_layers=0
--seed=42
--jsonl_for_train=./train.jsonl
Even after 10000/3661161 steps, nothing has been written to the output folder. I expected something to appear every checkpointing_steps (200) steps or every validation_steps (100) steps, such as generated validation images.
I didn't set report_to, so it defaults to tensorboard.

I also inserted some print() calls for debugging under if accelerator.distributed_type == DistributedType.DEEPSPEED or accelerator.is_main_process: but nothing was printed. How is that possible?
Please help me.
Thank you!

Reproduction

accelerate launch train_controlnet_flux.py
--pretrained_model_name_or_path="black-forest-labs/FLUX.1-dev"
--conditioning_image_column=conditioning_image
--image_column=image
--caption_column=text
--output_dir="output"
--mixed_precision="bf16"
--resolution=512
--learning_rate=1e-5
--max_train_steps=150000000
--validation_steps=100
--checkpointing_steps=200
--validation_image "./1.jpg" "./2.jpg" "./3.jpg" "./4.jpg"
--validation_prompt "outdoor, building, car, city street, person, pavement, sidewalk, pole, road, traffic light, street corner, street scene, street sign, urban, tree" "outdoor, building, car, city street, person, pavement, sidewalk, pole, road, traffic light, street corner, street scene, street sign, urban, tree" "outdoor, building, car, city street, person, pavement, sidewalk, pole, road, traffic light, street corner, street scene, street sign, urban, tree" "outdoor, building, car, city street, person, pavement, sidewalk, pole, road, traffic light, street corner, street scene, street sign, urban, tree"
--train_batch_size=1
--gradient_accumulation_steps=1
--lr_scheduler="cosine"
--num_double_layers=4
--num_single_layers=0
--seed=42
--jsonl_for_train=./train.jsonl

Logs

Detected kernel version 4.19.93, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
04/28/2025 07:04:13 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: bf16

You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
04/28/2025 07:04:14 - INFO - accelerate.utils.modeling - We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).
Downloading shards: 100%|███████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 11634.69it/s]
04/28/2025 07:04:15 - INFO - accelerate.utils.modeling - We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████| 2/2 [00:13<00:00,  6.99s/it]
04/28/2025 07:04:30 - INFO - accelerate.utils.modeling - We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).
All model checkpoint weights were used when initializing AutoencoderKL.

All the weights of AutoencoderKL were initialized from the model checkpoint at black-forest-labs/FLUX.1-dev.
If your task is similar to the task the model of the checkpoint was trained on, you can already use AutoencoderKL for predictions without further training.
Fetching 3 files: 100%|█████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 51569.31it/s]
{'out_channels', 'axes_dims_rope'} was not found in config. Values will be initialized to default values.
04/28/2025 07:04:31 - INFO - accelerate.utils.modeling - We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████| 3/3 [00:31<00:00, 10.51s/it]
All model checkpoint weights were used when initializing FluxTransformer2DModel.

All the weights of FluxTransformer2DModel were initialized from the model checkpoint at black-forest-labs/FLUX.1-dev.
If your task is similar to the task the model of the checkpoint was trained on, you can already use FluxTransformer2DModel for predictions without further training.
04/28/2025 07:05:03 - INFO - __main__ - Initializing controlnet weights from transformer
{'num_mode', 'conditioning_embedding_channels', 'axes_dims_rope'} was not found in config. Values will be initialized to default values.
04/28/2025 07:05:13 - INFO - __main__ - all models loaded successfully
{'use_exponential_sigmas', 'use_beta_sigmas', 'invert_sigmas', 'time_shift_type', 'stochastic_sampling', 'shift_terminal', 'use_karras_sigmas'} was not found in config. Values will be initialized to default values.
[2025-04-28 07:05:14,299] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Map:   3%|█▉                                                                         | 9300/366414 [48:14<30:52:00,  3.21 examples/s]
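
The last log line looks like a dataset `Map` preprocessing progress bar (counted in examples), not a training step counter. As a rough sketch, the snippet below parses that line and estimates how long preprocessing still has to run before the first training step. The helper names `parse_progress` and `remaining_hours` are hypothetical and not part of the training script, and the regular expression assumes the tqdm-style layout shown in the log above.

```python
import re


def parse_progress(line):
    """Extract (done, total, rate) from a tqdm-style 'Map' progress line.

    Assumes the layout '<done>/<total> [<elapsed><<eta>, <rate> examples/s]'.
    """
    match = re.search(
        r"(\d+)/(\d+)\s*\[.*?,\s*([\d.]+)\s*examples/s\]", line
    )
    if match is None:
        raise ValueError("not a recognised progress line")
    return int(match.group(1)), int(match.group(2)), float(match.group(3))


def remaining_hours(done, total, rate):
    """Hours left if the remaining examples are processed at a constant rate."""
    return (total - done) / rate / 3600


# The final line from the Logs section (progress bar shortened).
log_line = "Map:   3%|█▉   | 9300/366414 [48:14<30:52:00,  3.21 examples/s]"
done, total, rate = parse_progress(log_line)
hours_left = remaining_hours(done, total, rate)
print(f"{done}/{total} examples mapped, about {hours_left:.1f} hours remaining")
```

At the logged rate this works out to roughly 31 more hours of preprocessing, which matches the remaining-time field in the progress bar itself. Until that finishes, no checkpoint or validation step can be reached.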

System Info

Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.

  • 🤗 Diffusers version: 0.34.0.dev0
  • Platform: Linux-4.19.93-1.nbp.el7.x86_64-x86_64-with-glibc2.35
  • Running on Google Colab?: No
  • Python version: 3.10.12
  • PyTorch version (GPU?): 2.6.0+cu124 (True)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Huggingface_hub version: 0.29.2
  • Transformers version: 4.49.0
  • Accelerate version: 1.6.0
  • PEFT version: 0.14.0
  • Bitsandbytes version: not installed
  • Safetensors version: 0.5.3
  • xFormers version: 0.0.29.post2
  • Accelerator: NVIDIA A100-SXM-80GB, 81251 MiB
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?: no. I'm using 1 gpu for test

Who can help?

it's you! or @sayakpaul @PromeAIpro

dedoogong added the bug label Apr 28, 2025

PromeAIpro (Contributor) commented

You are still at the data preprocessing step; training has not started yet. Make a smaller dataset and test again.
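
One possible way to make a smaller dataset is to copy the first few hundred records of the training JSONL into a new file and point the script at that. The sketch below is only an illustration: the function name `write_jsonl_subset`, the output file name `train_small.jsonl`, and the row count are assumptions, and it assumes each record in train.jsonl sits on its own line.

```python
def write_jsonl_subset(src, dst, max_rows):
    """Copy the first `max_rows` non-empty lines of a JSONL file to `dst`.

    Returns the number of records actually written.
    """
    written = 0
    with open(src, encoding="utf-8") as fin, \
            open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            if written >= max_rows:
                break
            if not line.strip():
                continue  # skip blank lines
            # Make sure every record ends with exactly one newline.
            fout.write(line.rstrip("\n") + "\n")
            written += 1
    return written


# Example call (hypothetical file name and size):
# write_jsonl_subset("train.jsonl", "train_small.jsonl", 500)
```

With a subset like this, passing --jsonl_for_train=./train_small.jsonl (and a much smaller --max_train_steps) should let the preprocessing finish quickly, so the first validation and checkpoint steps can be checked before committing to the full dataset.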
