Describe the bug
I'm running train_controlnet_flux.py on an A100 with the following command:
accelerate launch train_controlnet_flux.py \
  --pretrained_model_name_or_path="black-forest-labs/FLUX.1-dev" \
  --conditioning_image_column=conditioning_image \
  --image_column=image \
  --caption_column=text \
  --output_dir="output" \
  --mixed_precision="bf16" \
  --resolution=512 \
  --learning_rate=1e-5 \
  --max_train_steps=150000000 \
  --validation_steps=100 \
  --checkpointing_steps=200 \
  --validation_image "./1.jpg" "./2.jpg" "./3.jpg" "./4.jpg" \
  --validation_prompt "outdoor, building, car, city street, person, pavement, sidewalk, pole, road, traffic light, street corner, street scene, street sign, urban, tree" "outdoor, building, car, city street, person, pavement, sidewalk, pole, road, traffic light, street corner, street scene, street sign, urban, tree" "outdoor, building, car, city street, person, pavement, sidewalk, pole, road, traffic light, street corner, street scene, street sign, urban, tree" "outdoor, building, car, city street, person, pavement, sidewalk, pole, road, traffic light, street corner, street scene, street sign, urban, tree" \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --lr_scheduler="cosine" \
  --num_double_layers=4 \
  --num_single_layers=0 \
  --seed=42 \
  --jsonl_for_train=./train.jsonl
Even after 10000/3661161 steps have passed, there is nothing in the output folder. I expected something to appear after every checkpointing_steps (200) steps or every validation_steps (100) steps, for example generated validation images.
I didn't set report_to, so it defaults to tensorboard.
I also inserted some print() calls for debugging under the condition if accelerator.distributed_type == DistributedType.DEEPSPEED or accelerator.is_main_process: but nothing was printed. How is that possible?
Please help me.
Thank you!
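The expectation described above (some output every checkpointing_steps and every validation_steps) can be sketched as follows. This is only a minimal illustration of step-gated actions under the settings from the command, not the actual code of train_controlnet_flux.py; the function name due_actions is hypothetical.

```python
def due_actions(global_step: int,
                checkpointing_steps: int = 200,
                validation_steps: int = 100) -> list[str]:
    """Return the periodic actions expected at a given optimizer step.

    Hypothetical helper for illustration; the defaults mirror the
    --checkpointing_steps and --validation_steps values in the command.
    """
    actions = []
    # A checkpoint is expected whenever the step is a positive multiple of 200.
    if global_step > 0 and global_step % checkpointing_steps == 0:
        actions.append("checkpoint")
    # A validation run is expected whenever the step is a positive multiple of 100.
    if global_step > 0 and global_step % validation_steps == 0:
        actions.append("validation")
    return actions


if __name__ == "__main__":
    for step in (50, 100, 200):
        print(step, due_actions(step))
```

Under this reading, both actions should have fired many times within the first few hundred real training steps, provided the training loop itself has started counting steps.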
Reproduction
accelerate launch train_controlnet_flux.py \
  --pretrained_model_name_or_path="black-forest-labs/FLUX.1-dev" \
  --conditioning_image_column=conditioning_image \
  --image_column=image \
  --caption_column=text \
  --output_dir="output" \
  --mixed_precision="bf16" \
  --resolution=512 \
  --learning_rate=1e-5 \
  --max_train_steps=150000000 \
  --validation_steps=100 \
  --checkpointing_steps=200 \
  --validation_image "./1.jpg" "./2.jpg" "./3.jpg" "./4.jpg" \
  --validation_prompt "outdoor, building, car, city street, person, pavement, sidewalk, pole, road, traffic light, street corner, street scene, street sign, urban, tree" "outdoor, building, car, city street, person, pavement, sidewalk, pole, road, traffic light, street corner, street scene, street sign, urban, tree" "outdoor, building, car, city street, person, pavement, sidewalk, pole, road, traffic light, street corner, street scene, street sign, urban, tree" "outdoor, building, car, city street, person, pavement, sidewalk, pole, road, traffic light, street corner, street scene, street sign, urban, tree" \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --lr_scheduler="cosine" \
  --num_double_layers=4 \
  --num_single_layers=0 \
  --seed=42 \
  --jsonl_for_train=./train.jsonl
Logs
Detected kernel version 4.19.93, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
04/28/2025 07:04:13 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
Mixed precision type: bf16
You set`add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
04/28/2025 07:04:14 - INFO - accelerate.utils.modeling - We will use 90% of the memory on device 0 forstoring the model, and 10% for the buffer to avoid OOM. You can set `max_memory`in to a higher value to use more memory (at your own risk).
Downloading shards: 100%|███████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 11634.69it/s]
04/28/2025 07:04:15 - INFO - accelerate.utils.modeling - We will use 90% of the memory on device 0 forstoring the model, and 10% for the buffer to avoid OOM. You can set `max_memory`in to a higher value to use more memory (at your own risk).
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████| 2/2 [00:13<00:00, 6.99s/it]
04/28/2025 07:04:30 - INFO - accelerate.utils.modeling - We will use 90% of the memory on device 0 forstoring the model, and 10% for the buffer to avoid OOM. You can set `max_memory`in to a higher value to use more memory (at your own risk).
All model checkpoint weights were used when initializing AutoencoderKL.
All the weights of AutoencoderKL were initialized from the model checkpoint at black-forest-labs/FLUX.1-dev.
If your task is similar to the task the model of the checkpoint was trained on, you can already use AutoencoderKL for predictions without further training.
Fetching 3 files: 100%|█████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 51569.31it/s]
{'out_channels', 'axes_dims_rope'} was not found in config. Values will be initialized to default values.
04/28/2025 07:04:31 - INFO - accelerate.utils.modeling - We will use 90% of the memory on device 0 forstoring the model, and 10% for the buffer to avoid OOM. You can set `max_memory`in to a higher value to use more memory (at your own risk).
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████| 3/3 [00:31<00:00, 10.51s/it]
All model checkpoint weights were used when initializing FluxTransformer2DModel.
All the weights of FluxTransformer2DModel were initialized from the model checkpoint at black-forest-labs/FLUX.1-dev.
If your task is similar to the task the model of the checkpoint was trained on, you can already use FluxTransformer2DModel for predictions without further training.
04/28/2025 07:05:03 - INFO - __main__ - Initializing controlnet weights from transformer
{'num_mode', 'conditioning_embedding_channels', 'axes_dims_rope'} was not found in config. Values will be initialized to default values.
04/28/2025 07:05:13 - INFO - __main__ - all models loaded successfully
{'use_exponential_sigmas', 'use_beta_sigmas', 'invert_sigmas', 'time_shift_type', 'stochastic_sampling', 'shift_terminal', 'use_karras_sigmas'} was not found in config. Values will be initialized to default values.
[2025-04-28 07:05:14,299] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Map: 3%|█▉ | 9300/366414 [48:14<30:52:00, 3.21 examples/s]
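The last log line is a progress bar labelled Map, showing 9300 of 366414 examples processed at 3.21 examples/s with 30:52:00 remaining. As a rough sanity check of that remaining-time figure, assuming the rate stays constant, the arithmetic can be sketched as below; the function name remaining_hours is hypothetical.

```python
def remaining_hours(done: int, total: int, rate_per_s: float) -> float:
    """Estimate the hours left for a progress bar, assuming a constant rate."""
    remaining_examples = total - done
    return remaining_examples / rate_per_s / 3600


if __name__ == "__main__":
    # Values copied from the Map progress line in the log above.
    hours = remaining_hours(done=9300, total=366414, rate_per_s=3.21)
    print(f"about {hours:.1f} hours remaining")
```

The result is roughly 31 hours, which agrees with the remaining time printed in the log.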
System Info
Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.
Who can help?
it's you! or @sayakpaul @PromeAIpro