I haven't taken a closer look at ReMoE yet, but if ReLU is used as the expert selection mechanism, it seems possible that every expert could receive a negative "score" early in training, so that "no experts are selected" at all.
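For intuition, here's a minimal, hypothetical sketch of ReLU-based routing (not the actual ReMoE implementation; the `router` layer and the dimensions are made up). Unlike softmax top-k, a ReLU gate has nothing that structurally guarantees at least one active expert per token:

```python
# Hypothetical sketch of ReLU routing, NOT the real ReMoE code.
import torch
import torch.nn as nn

torch.manual_seed(0)
num_experts, d_model = 8, 16
router = nn.Linear(d_model, num_experts)  # illustrative router layer

x = torch.randn(4, d_model)      # a batch of 4 token embeddings
gates = torch.relu(router(x))    # ReLU gate instead of softmax top-k

# Count active experts per token: if all router logits for a token happen
# to be negative (plausible with near-zero init early in training), ReLU
# zeros them all and that token activates no experts.
active = (gates > 0).sum(dim=-1)
print(active)  # nothing prevents an entry here from being 0
```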
And happy New Year, Benjamin! As always, I look forward to more of your posts in the New Year!
As for merging the fine-tuned and base model: I've seen some geeks mention it on Reddit, and it seems the point is to prevent forgetting.
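For reference, the recipe usually amounts to a parameter-wise linear interpolation between the two checkpoints (the "model soup" / WiSE-FT idea). A hedged sketch, assuming plain PyTorch state dicts of the same architecture; `merge_state_dicts` and `alpha` are illustrative names, not any library's API:

```python
# Hedged sketch: blend a fine-tuned checkpoint back toward its base model.
import torch

def merge_state_dicts(base_sd, finetuned_sd, alpha=0.5):
    """Return alpha * finetuned + (1 - alpha) * base, parameter-wise."""
    merged = {}
    for name, base_param in base_sd.items():
        merged[name] = (1 - alpha) * base_param + alpha * finetuned_sd[name]
    return merged

# Usage (assumes two checkpoints of the *same* architecture):
# base = MyModel();  base.load_state_dict(torch.load("base.pt"))
# tuned = MyModel(); tuned.load_state_dict(torch.load("finetuned.pt"))
# base.load_state_dict(
#     merge_state_dicts(base.state_dict(), tuned.state_dict(), alpha=0.5)
# )
```

Intuitively, keeping part of the base weights pulls the merged model back toward the base model's behavior, which is the "prevent forgetting" effect people describe.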
Yes, I have read the same, several times. It was a Reddit recipe: we don't know why it works, but it works. It's nice to have a "scientific" study confirming it.
Happy New Year to you too!