
【Hackathon 8th No.13】 Reproduction of the data_efficient_nopt paper #1111

Open
wants to merge 8 commits into base: develop

Conversation

xiaoyewww

PR types

New Features

PR changes

Others

Describe

Support data_efficient_nopt.


paddle-bot bot commented Mar 23, 2025

Thanks for your contribution!

@xiaoyewww
Author

@wangguan1995
Contributor

https://github.com/delta-lab-ai/data_efficient_nopt/blob/main/pretrain_basic.py#L262 @wangguan1995 Does Paddle currently have no implementation of gaussian_blur?

[screenshot]
Not at the moment.

@wangguan1995
Contributor

If you have reproducible accuracy results, post a log screenshot on GitHub and upload the log file, and we can start testing on our side.

@xiaoyewww
Author

https://github.com/delta-lab-ai/data_efficient_nopt/blob/main/pretrain_basic.py#L262 @wangguan1995 Does Paddle currently have no implementation of gaussian_blur?

[screenshot] Not at the moment.

OK, I have implemented gaussian_blur in Paddle by following that reference.
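For reference, a minimal sketch of a Paddle gaussian_blur built from a depthwise conv2d; the function name and parameters here are illustrative, not necessarily the exact implementation added in this PR:

```python
import paddle
import paddle.nn.functional as F

def gaussian_blur(x, kernel_size=5, sigma=1.0):
    """Blur an [N, C, H, W] tensor with a Gaussian kernel via a depthwise conv (sketch)."""
    # Build a normalized 1D Gaussian, then its outer product for the 2D kernel.
    half = (kernel_size - 1) / 2.0
    coords = paddle.arange(kernel_size, dtype="float32") - half
    g = paddle.exp(-(coords ** 2) / (2.0 * sigma ** 2))
    g = g / g.sum()
    kernel2d = g.unsqueeze(1) * g.unsqueeze(0)  # [k, k]
    # One copy of the kernel per channel, applied with groups=C (depthwise).
    channels = x.shape[1]
    weight = kernel2d.reshape([1, 1, kernel_size, kernel_size]).tile([channels, 1, 1, 1])
    return F.conv2d(x, weight, padding=kernel_size // 2, groups=channels)

# Example: blur a batch shaped like the training data used here.
x = paddle.randn([128, 4, 64, 64])
blurred = gaussian_blur(x, kernel_size=5, sigma=1.0)
```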

@xiaoyewww
Author

xiaoyewww commented Apr 2, 2025

If you have reproducible accuracy results, post a log screenshot on GitHub and upload the log file, and we can start testing on our side.

[screenshot] I have reproduced the Poisson FNO pretraining. The random seed is not fixed in either Paddle or PyTorch, so the loss differs in the early steps, but the trends match after a few hundred steps.

The reproduced results differ slightly from those in the paper; I suspect some hyperparameter differs, but I did not find a relevant description in the paper:
[screenshot]
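If the early-step curves need tighter alignment, a minimal sketch of fixing the seeds on both sides (standard seeding calls, not part of this PR; the seed value is illustrative):

```python
import random
import numpy as np
import paddle

SEED = 42  # illustrative value, not a setting from this PR

# Paddle side
random.seed(SEED)
np.random.seed(SEED)
paddle.seed(SEED)

# PyTorch side, equivalent calls for the reference implementation:
#   import torch
#   random.seed(SEED); np.random.seed(SEED)
#   torch.manual_seed(SEED)
#   torch.cuda.manual_seed_all(SEED)
```

Note that even with the same seed the two frameworks draw from different RNG streams, so exact step-level agreement would also require loading identical initial weights.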

@xiaoyewww
Author

Poisson FNO inference results, using the officially provided weights:

# torch
RMSE: 0.25861763998531323 RMSE (normalized) 0.14146761527157586 R2: 0.9765389656726264 Slope: 0.9752451781576813

# paddle
RMSE: 0.25861764924824066 RMSE (normalized) 0.14146758505387425 R2: 0.9765389632378143 Slope: 0.9752452311012886
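For context, a generic sketch of how RMSE, normalized RMSE, R2, and slope can be computed from flattened predictions and targets; normalizing RMSE by the target standard deviation is an assumption here and may differ from the repository's definition:

```python
import numpy as np

def eval_metrics(pred, target):
    pred, target = np.ravel(pred), np.ravel(target)
    rmse = np.sqrt(np.mean((pred - target) ** 2))
    rmse_norm = rmse / np.std(target)            # assumed normalization
    ss_res = np.sum((target - pred) ** 2)
    ss_tot = np.sum((target - np.mean(target)) ** 2)
    r2 = 1.0 - ss_res / ss_tot                   # coefficient of determination
    slope = np.polyfit(target, pred, 1)[0]       # slope of pred-vs-target linear fit
    return rmse, rmse_norm, r2, slope
```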

@xiaoyewww
Author

xiaoyewww commented Apr 6, 2025

If you have reproducible accuracy results, post a log screenshot on GitHub and upload the log file, and we can start testing on our side.

[screenshot] I have reproduced the Poisson FNO pretraining. The random seed is not fixed in either Paddle or PyTorch, so the loss differs in the early steps, but the trends match after a few hundred steps.
The reproduced results differ slightly from those in the paper; I suspect some hyperparameter differs, but I did not find a relevant description in the paper: [screenshot]

Comparison of the first 10 steps:
[figure: poisson_fno_combined_train_loss_10steps]

paddle:

Epoch 1 Batch 0 Train Loss 0.3359823226928711 train_l2 loss 1.0018577575683594 train_rmse loss 0.7787399888038635
Total Times. Global step: 0, Batch: 0, Rank: 0, Data Shape: [128, 4, 64, 64], Data time: 1.274991750717163, Forward: 0.5737528800964355, Backward: 0.18942928314208984, Optimizer: 0.023642539978027344
Epoch 1 Batch 1 Train Loss 0.3453991115093231 train_l2 loss 0.9957258105278015 train_rmse loss 0.7938657999038696
Total Times. Global step: 1, Batch: 1, Rank: 0, Data Shape: [128, 4, 64, 64], Data time: 0.08668327331542969, Forward: 0.019646406173706055, Backward: 0.012578487396240234, Optimizer: 0.02925419807434082
Epoch 1 Batch 2 Train Loss 0.33508121967315674 train_l2 loss 0.9866492748260498 train_rmse loss 0.7812024354934692
Total Times. Global step: 2, Batch: 2, Rank: 0, Data Shape: [128, 4, 64, 64], Data time: 0.08733677864074707, Forward: 0.016697168350219727, Backward: 0.011176347732543945, Optimizer: 0.032007694244384766
Epoch 1 Batch 3 Train Loss 0.3373328149318695 train_l2 loss 0.9720052480697632 train_rmse loss 0.7818474769592285
Total Times. Global step: 3, Batch: 3, Rank: 0, Data Shape: [128, 4, 64, 64], Data time: 0.08364057540893555, Forward: 0.01732492446899414, Backward: 0.011551856994628906, Optimizer: 0.0324702262878418
Epoch 1 Batch 4 Train Loss 0.3260154128074646 train_l2 loss 0.9649569988250732 train_rmse loss 0.7634750604629517
Total Times. Global step: 4, Batch: 4, Rank: 0, Data Shape: [128, 4, 64, 64], Data time: 0.08973145484924316, Forward: 0.017143964767456055, Backward: 0.011417150497436523, Optimizer: 0.0321955680847168
Epoch 1 Batch 5 Train Loss 0.33446627855300903 train_l2 loss 0.9470340609550476 train_rmse loss 0.7787452936172485
Total Times. Global step: 5, Batch: 5, Rank: 0, Data Shape: [128, 4, 64, 64], Data time: 0.08418393135070801, Forward: 0.015747785568237305, Backward: 0.010698318481445312, Optimizer: 0.03490447998046875
Epoch 1 Batch 6 Train Loss 0.31356751918792725 train_l2 loss 0.9271667003631592 train_rmse loss 0.7467784285545349
Total Times. Global step: 6, Batch: 6, Rank: 0, Data Shape: [128, 4, 64, 64], Data time: 0.08341240882873535, Forward: 0.015986919403076172, Backward: 0.010836601257324219, Optimizer: 0.03387713432312012
Epoch 1 Batch 7 Train Loss 0.32571274042129517 train_l2 loss 0.918164074420929 train_rmse loss 0.7629383206367493
Total Times. Global step: 7, Batch: 7, Rank: 0, Data Shape: [128, 4, 64, 64], Data time: 0.08514046669006348, Forward: 0.01694965362548828, Backward: 0.011275529861450195, Optimizer: 0.033127784729003906
Epoch 1 Batch 8 Train Loss 0.3198857605457306 train_l2 loss 0.8946603536605835 train_rmse loss 0.7452360391616821
Total Times. Global step: 8, Batch: 8, Rank: 0, Data Shape: [128, 4, 64, 64], Data time: 0.08858108520507812, Forward: 0.01703476905822754, Backward: 0.011278867721557617, Optimizer: 0.032628536224365234
Epoch 1 Batch 9 Train Loss 0.28028005361557007 train_l2 loss 0.8539849519729614 train_rmse loss 0.6707751154899597
Total Times. Global step: 9, Batch: 9, Rank: 0, Data Shape: [128, 4, 64, 64], Data time: 0.08603501319885254, Forward: 0.017177581787109375, Backward: 0.011240959167480469, Optimizer: 0.032387495040893555
Epoch 1 Batch 10 Train Loss 0.303079217672348 train_l2 loss 0.8385870456695557 train_rmse loss 0.7334427833557129
Total Times. Global step: 10, Batch: 10, Rank: 0, Data Shape: [128, 4, 64, 64], Data time: 0.07994580268859863, Forward: 0.01645064353942871, Backward: 0.011035442352294922, Optimizer: 0.03404521942138672

torch:

Epoch 1 Batch 0 Train Loss 0.3359190821647644
Total Times. Batch: 0, Rank: 0, Data Shape: torch.Size([128, 4, 64, 64]), Data time: 1.5983459949493408, Forward: 2.262549877166748, Backward: 0.5132086277008057, Optimizer: 0.012012720108032227
Epoch 1 Batch 1 Train Loss 0.34538960456848145
Total Times. Batch: 1, Rank: 0, Data Shape: torch.Size([128, 4, 64, 64]), Data time: 0.03945159912109375, Forward: 0.03422045707702637, Backward: 0.045007944107055664, Optimizer: 0.010618925094604492
Epoch 1 Batch 2 Train Loss 0.33507877588272095
Total Times. Batch: 2, Rank: 0, Data Shape: torch.Size([128, 4, 64, 64]), Data time: 0.03759288787841797, Forward: 0.011688470840454102, Backward: 0.06759905815124512, Optimizer: 0.010719060897827148
Epoch 1 Batch 3 Train Loss 0.3374229967594147
Total Times. Batch: 3, Rank: 0, Data Shape: torch.Size([128, 4, 64, 64]), Data time: 0.03585243225097656, Forward: 0.011916399002075195, Backward: 0.06729936599731445, Optimizer: 0.010690450668334961
Epoch 1 Batch 4 Train Loss 0.32614579796791077
Total Times. Batch: 4, Rank: 0, Data Shape: torch.Size([128, 4, 64, 64]), Data time: 0.03865694999694824, Forward: 0.011200428009033203, Backward: 0.06806373596191406, Optimizer: 0.010601520538330078
Epoch 1 Batch 5 Train Loss 0.33475780487060547
Total Times. Batch: 5, Rank: 0, Data Shape: torch.Size([128, 4, 64, 64]), Data time: 0.03668999671936035, Forward: 0.011771440505981445, Backward: 0.06752943992614746, Optimizer: 0.010657072067260742
Epoch 1 Batch 6 Train Loss 0.3140023946762085
Total Times. Batch: 6, Rank: 0, Data Shape: torch.Size([128, 4, 64, 64]), Data time: 0.0358736515045166, Forward: 0.011792898178100586, Backward: 0.0673990249633789, Optimizer: 0.010664939880371094
Epoch 1 Batch 7 Train Loss 0.3263723850250244
Total Times. Batch: 7, Rank: 0, Data Shape: torch.Size([128, 4, 64, 64]), Data time: 0.03751945495605469, Forward: 0.011594295501708984, Backward: 0.06769108772277832, Optimizer: 0.010838031768798828
Epoch 1 Batch 8 Train Loss 0.3208008110523224
Total Times. Batch: 8, Rank: 0, Data Shape: torch.Size([128, 4, 64, 64]), Data time: 0.03836321830749512, Forward: 0.011111259460449219, Backward: 0.06809735298156738, Optimizer: 0.010608196258544922
Epoch 1 Batch 9 Train Loss 0.28148674964904785
Total Times. Batch: 9, Rank: 0, Data Shape: torch.Size([128, 4, 64, 64]), Data time: 0.03702974319458008, Forward: 0.01204824447631836, Backward: 0.06731843948364258, Optimizer: 0.010616540908813477
Epoch 1 Batch 10 Train Loss 0.30438655614852905
Total Times. Batch: 10, Rank: 0, Data Shape: torch.Size([128, 4, 64, 64]), Data time: 0.034720659255981445, Forward: 0.011367082595825195, Backward: 0.06775617599487305, Optimizer: 0.010625123977661133
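A minimal sketch for extracting the per-step Train Loss from logs like the ones above and plotting the first 10 steps; paddle.log and torch.log are hypothetical file names:

```python
import re
import matplotlib.pyplot as plt

def read_losses(path):
    """Collect the 'Train Loss' value from each 'Epoch ... Batch ...' line."""
    losses = []
    with open(path) as f:
        for line in f:
            m = re.search(r"Batch (\d+) Train Loss ([0-9.]+)", line)
            if m:
                losses.append(float(m.group(2)))
    return losses

pd_loss = read_losses("paddle.log")  # hypothetical log file names
pt_loss = read_losses("torch.log")

n = 11  # steps 0..10
plt.plot(range(n), pd_loss[:n], marker="o", label="paddle")
plt.plot(range(n), pt_loss[:n], marker="s", label="torch")
plt.xlabel("step")
plt.ylabel("train loss")
plt.legend()
plt.savefig("poisson_fno_combined_train_loss_10steps.png")
```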

@xiaoyewww
Author

The helmholtz_64 FNO matches the poisson_64 FNO and uses the same model structure.
