❓ Questions and Help
What is your question?
I hope to fine-tune the xm_transformer_unity pre-trained model so that its mBART decoder can recognize some new words.
I followed https://github.com/facebookresearch/fairseq/blob/ust/examples/speech_to_speech/docs/enhanced_direct_s2st_discrete_units.md.
But is it normal to see such high loss values during training? The loss reached 50 and the multitask loss reached 1000. Here are some console outputs:
2024-08-07 15:20:51 | INFO | dev | epoch 163 | valid on 'dev' subset | loss 50.401 | nll_loss 18.932 | multitask_target_letter_loss 918.037 | ppl 500238 | wps 0 | wpb 470 | bsz 2 | multitask_target_letter_loss_weight 8 | num_updates 652
2024-08-07 15:20:51 | INFO | fairseq_cli.train | end of epoch 163 (average epoch stats below)
2024-08-07 15:20:51 | INFO | train | epoch 163 | loss 65.175 | nll_loss 19.787 | total None | n_correct None | multitask_target_letter_loss 1303.92 | ppl 904912 | wps 791.8 | ups 0.55 | wpb 1440 | bsz 6.2 | num_updates 652 | multitask_target_letter_loss_weight 8 | lr 7.066e-08 | gnorm 1667.83 | clip 100 | loss_scale None | train_wall 7 | gb_free None | cuda_gb_allocated 16.9 | cuda_gb_reserved 22.1 | cuda_gb_free 22.5 | wall 0
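For what it's worth, the huge ppl does not look like a separate problem: fairseq reports base-2 perplexity, so ppl ≈ 2^nll_loss, and 2^18.932 ≈ 500238 matches the dev log above (likewise 2^19.787 ≈ 904912 for train). The perplexity will only come down if nll_loss does.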
Additionally, I also encountered some errors, such as
AssertionError: Optimizer does not match; please reset the optimizer (--reset-optimizer). FP16Optimizer vs FairseqAdam
and
exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1) RuntimeError: The size of tensor a (755724736) must match the size of tensor b (931956160) at non-singleton dimension 0
and
gradient overflow
I guess there may be some problems with my custom datasets? But when I don't use --fp16 in the running command, it works. In fact, I'm not sure if my steps are correct, so I hope to get some help. Thank you!
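To understand the FP16Optimizer vs FairseqAdam mismatch, I inspected the saved optimizer state. This is only a minimal diagnostic sketch; it assumes the standard fairseq checkpoint layout with "model" and "last_optimizer_state" entries:

import torch

# Load the checkpoint on CPU; fairseq checkpoints keep the model weights
# under "model" and the saved optimizer state under "last_optimizer_state".
ckpt = torch.load("/root/autodl-tmp/code/trained_model/checkpoint_last.pt",
                  map_location="cpu")

# Total parameter count of the saved model.
n_params = sum(t.numel() for t in ckpt["model"].values())
print("model params:", n_params)

# fairseq's FP16Optimizer keeps one flat FP32 copy of all parameters, while
# FairseqAdam keeps per-parameter exp_avg/exp_avg_sq buffers, so a state
# saved by one cannot be loaded by the other -- which would explain both
# the assertion and the tensor-size mismatch inside exp_avg.mul_(...).
print("optimizer state keys:", list(ckpt.get("last_optimizer_state", {})))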
Code
This is my command:
fairseq-train /home/s2ut/FormattingData/DATA_ROOT \
  --config-yaml /home/s2ut/FormattingData/DATA_ROOT/config.yaml \
  --multitask-config-yaml /home/s2ut/FormattingData/DATA_ROOT/multitask_config.yaml \
  --task speech_to_text --arch xm_transformer_t2 \
  --criterion speech_to_unit_translatotron2 --label-smoothing 0.1 \
  --share-decoder-input-output-embed --adaptor-n-layers 1 --normalize \
  --dropout 0.1 --attention-dropout 0.1 --relu-dropout 0.1 \
  --train-subset train --valid-subset dev \
  --load-pretrained-decoder-from /root/autodl-tmp/code/trained_model/checkpoint_last.pt \
  --w2v-path /root/autodl-tmp/code/trained_model/checkpoint_last.pt \
  --mask-prob 0.3 --mask-channel-length 32 --mask-channel-prob 0.25 \
  --save-dir /root/autodl-tmp/code/trained_model --checkpoint-activations --encoder-proj \
  --lr 0.00000001 --dropout 0.1 --attention-dropout 0.1 --lr-scheduler inverse_sqrt \
  --warmup-init-lr 1e-7 --warmup-updates 2000 \
  --optimizer adam --adam-betas "(0.9,0.98)" --clip-norm 10.0 \
  --max-update 80000 --max-tokens 5000 --max-tokens-valid 5000 --max-source-positions 5000 \
  --max-target-positions 5000 --update-freq 1 \
  --seed 1234 --num-workers 1 \
  --reset-dataloader --reset-optimizer --batch-size 16 --max-epoch 1000 --save-interval 1000
What have you tried?
Firstly, I prepare the manifest file with
python examples/wav2vec/wav2vec_manifest.py /home/s2ut/TGT_AUDIO/train --dest /home/s2ut/TGT_AUDIO/train --ext wav --valid-percent 0
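As a quick sanity check of the resulting manifest, I use the sketch below. It assumes the standard wav2vec manifest layout: the first line is the audio root directory, then one tab-separated pair of relative path and sample count per line.

import os

with open("/home/s2ut/TGT_AUDIO/train/train.tsv") as f:
    root = f.readline().strip()          # first line: audio root directory
    for line in f:
        rel_path, n_samples = line.rstrip("\n").split("\t")
        assert os.path.exists(os.path.join(root, rel_path)), rel_path
        assert int(n_samples) > 0, rel_path
print("manifest root:", root)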
Secondly, I run the command
python examples/textless_nlp/gslm/speech2unit/clustering/quantize_with_kmeans.py \
  --feature_type hubert \
  --kmeans_model_path /home/s2ut/mhubert_base_vp_en_es_fr_it3_L11_km1000.bin \
  --acoustic_model_path /home/s2ut/mhubert_base_vp_en_es_fr_it3.pt \
  --layer 11 --manifest_path /home/s2ut/TGT_AUDIO/train/train.tsv \
  --out_quantized_file_path /home/s2ut/TGT_AUDIO/train.txt --extension ".wav"
to extract units with the mhubert_base_vp_en_es_fr_it3_L11_km1000 model released in https://github.com/facebookresearch/fairseq/blob/ust/examples/speech_to_speech/docs/textless_s2st_real_data.md. Then, I format the data with
python examples/speech_to_speech/preprocessing/prep_s2ut_data.py \
  --source-dir /home/s2ut/SRC_AUDIO --target-dir /home/s2ut/TGT_AUDIO \
  --data-split train dev --output-root /home/s2ut/FormattingData/DATA_ROOT \
  --reduce-unit --vocoder-checkpoint /home/s2ut/g_00500000 \
  --vocoder-cfg /home/s2ut/vocoder_code_hifigan_mhubert_vp_en_es_fr_it3_400k_layer11_km1000_config.json
to get a config.yaml.
My task data format is as follows:
id audio n_frames tgt_text tgt_n_frames
26 /home/s2ut/SRC_AUDIO/train/26.wav 547 864 497 248
16 /home/s2ut/SRC_AUDIO/train/16.wav 445 39 6 54 192 232
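As a self-check on this file, I run the sketch below. It assumes the formatted split is DATA_ROOT/train.tsv and that tgt_text holds reduced unit IDs from the 1000-cluster K-means model, so every ID should lie in [0, 1000):

import csv

with open("/home/s2ut/FormattingData/DATA_ROOT/train.tsv") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        units = [int(u) for u in row["tgt_text"].split()]
        assert all(0 <= u < 1000 for u in units), row["id"]
        assert int(row["n_frames"]) > 0, row["id"]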
I used BPE to generate subwords, looked up each subword's ID in en_zh_spm.dict, and wrote these token IDs into the tgt_text field of the multitask data.
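Roughly like this (a minimal sketch; en_zh_spm.model is a hypothetical filename for the SentencePiece model that produced en_zh_spm.dict):

import sentencepiece as spm

# Hypothetical path: the .model file that generated en_zh_spm.dict.
sp = spm.SentencePieceProcessor(model_file="en_zh_spm.model")
pieces = sp.encode("some new words", out_type=str)  # subword strings
ids = sp.encode("some new words", out_type=int)     # raw SentencePiece IDs
# Note: fairseq's Dictionary prepends special symbols (<s>, <pad>, </s>, <unk>),
# so IDs written to tgt_text must follow the dict file's own numbering,
# which is offset from the raw SentencePiece IDs.
print(pieces, ids)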
My multitask data format is as follows:
id tgt_text
26 3476765 2692239 80799 68322236
16 36544 38935 372148
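To double-check those IDs, I compare them against the dictionary size (a hedged sketch; it assumes the multitask split lives at target_letter/train.tsv, guessed from the data: field in my multitask file below):

# A fairseq dict file has one "<symbol> <count>" pair per line.
with open("/home/s2ut/FormattingData/DATA_ROOT/en_zh_spm.dict") as f:
    vocab_size = sum(1 for _ in f)
print("dict entries:", vocab_size)

with open("/home/s2ut/FormattingData/DATA_ROOT/target_letter/train.tsv") as f:
    next(f)  # skip the "id tgt_text" header
    for line in f:
        sample_id, tgt_text = line.rstrip("\n").split("\t")
        bad = [int(tok) for tok in tgt_text.split() if int(tok) >= vocab_size]
        if bad:
            print(sample_id, "has out-of-range ids:", bad)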
To recognize new words, I replaced some of the original words in the dict. May I ask whether this tgt_text should contain raw text or token IDs?
My task file:
input_channels: 1
input_feat_per_channel: 80
specaugment:
  freq_mask_F: 27
  freq_mask_N: 1
  time_mask_N: 1
  time_mask_T: 100
  time_mask_p: 1.0
  time_wrap_W: 0
transforms:
  '*':
  _train:
vocoder:
  checkpoint: /home/s2ut/g_00500000
  config: /home/s2ut/vocoder_code_hifigan_hubert_base_100_lj_config.json
  type: code_hifigan
decoder_type: transformer
decoder_layer: 2
encoder_layer: 1
loss_weight: 8.0
prepend_bos_and_append_tgt_lang_tag: true
eos_token: lang:en
rdrop_alpha: 10.0
tgt_lang: lang:en
dict: /home/s2ut/FormattingData/DATA_ROOT/dict.txt
standardize_audio: true
use_audio_input: false
apply_ucmvn: true
My multitask file:
target_letter:
  target_type: text
  decoder_type: transformer
  encoder_layer: 1
  loss_weight: 8.0
  prepend_bos_and_append_tgt_lang_tag: true
  eos_token: "[en_XX]"
  rdrop_alpha: 10.0
  data: /home/s2ut/FormattingData/DATA_ROOT/target_letter
  tgt_lang: lang:en
  src_lang: lang:hok
  dict: /home/s2ut/FormattingData/DATA_ROOT/en_zh_spm.dict
  standardize_audio: true
  use_audio_input: true
  apply_ucmvn: true
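Before launching training, I also sanity-check that both YAML files parse and that the dictionaries they reference exist (a minimal sketch using PyYAML, with the paths from my setup above):

import os
import yaml

for p in ["/home/s2ut/FormattingData/DATA_ROOT/config.yaml",
          "/home/s2ut/FormattingData/DATA_ROOT/multitask_config.yaml"]:
    with open(p) as f:
        cfg = yaml.safe_load(f)
    print(p, "->", sorted(cfg))

assert os.path.exists("/home/s2ut/FormattingData/DATA_ROOT/dict.txt")
assert os.path.exists("/home/s2ut/FormattingData/DATA_ROOT/en_zh_spm.dict")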
What's your environment?
I use Python 3.8 with fairseq 0.12.0 (ust branch) on Linux.