formula 22 in DeepSeek V3 technical report #238
Comments
I have to disagree with your claim, so let me explain my view. Take loss = LLM(input_ids=input_ids, labels=labels).loss: its internal loss function applies a shift so that each token is predicted only from the tokens before it. To simplify the reasoning, we can see that if you want to do next-token prediction, your If you agree with this, let's proceed to the MTP scenario. In a batch of tokens, if you want to use exactly the same loss = TRM(input_ids=input_ids[:-1], labels=labels[1:]) In this implementation, the first token of And for Next$^k$ token prediction you need to shift more, thus making the formula like
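To make the shift concrete, here is a minimal sketch using plain Python lists. The helper name `shifted_pairs` and the `shift` parameter are hypothetical, chosen for illustration only; this does not reproduce the exact offset convention of the technical report, it only shows how slicing inputs with `[:-shift]` and labels with `[shift:]` shortens the usable sequence.

```python
def shifted_pairs(token_ids, shift=1):
    """Pair each input token with the label `shift` positions ahead of it.

    shift=1 corresponds to ordinary next-token prediction:
    inputs = token_ids[:-1], labels = token_ids[1:].
    """
    if shift < 1:
        raise ValueError("shift must be at least 1")
    inputs = token_ids[:-shift]   # drop the last `shift` tokens (they have no label)
    labels = token_ids[shift:]    # drop the first `shift` tokens (nothing predicts them)
    return list(zip(inputs, labels))


token_ids = [10, 11, 12, 13]
print(shifted_pairs(token_ids, shift=1))  # 3 pairs from 4 tokens
print(shifted_pairs(token_ids, shift=2))  # 2 pairs from 4 tokens
```

Each extra step of shift removes one more usable position, which is the general reason the index range in the formula gets shorter as the prediction depth grows.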
I understand your claim. The main point of my explanation is that your
In this figure, the tokens on the left and right sides are omitted; what actually happens is what I described. Formula 22 describes the condition that you need to shift the hidden states as
Thanks for the great model.
I have one question about formula 22 below; could you help? Thanks.
Suppose k=1, which is MTP module 1, in the red circle of the figure below, and T is 4 in the example. Then T-k=4-1=3, so h(1:T-k) is h(1:3). My question is why it is 1:3 rather than 1:4. In the figure below there are 4 outputs of MTP module 1. Is it a typo for h(1:T)?
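The index range in the question can be checked with a small sketch. It assumes (my reading of the report, not something confirmed in this thread) that MTP module k at position i combines the hidden state h_i with the embedding of the future token t_{i+k}. The function name `mtp_input_pairs` is hypothetical. Under that assumption, a sequence of exactly T tokens only supplies a partner token for positions 1 through T-k.

```python
def mtp_input_pairs(tokens, k):
    """List (i, t_{i+k}) for every 1-based position i that has a partner token.

    Assumption: module k at position i needs the token k steps ahead,
    so position i is valid only while i + k <= T.
    """
    T = len(tokens)
    # tokens is 0-based, so t_{i+k} lives at index i + k - 1.
    return [(i, tokens[i + k - 1]) for i in range(1, T - k + 1)]


tokens = ["t1", "t2", "t3", "t4"]  # T = 4, as in the example
for i, partner in mtp_input_pairs(tokens, k=1):
    print(f"h_{i} is combined with Emb({partner})")
```

With T=4 and k=1 this yields positions 1 to 3 only, matching h(1:T-k) = h(1:3). A figure can still show four outputs if it depicts a window of a longer sequence, with tokens on either side omitted, as the reply above suggests.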