3 Comments

I haven't taken a closer look at ReMoE yet, but if ReLU is used as the expert selection mechanism, it seems possible that every expert has a negative "score" early in training, so no experts would be selected.
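The failure mode described above can be sketched in a few lines. This is a hypothetical illustration of ReLU-based routing, not the actual ReMoE implementation; the function name and shapes are assumptions:

```python
import numpy as np

# Hypothetical sketch of ReLU-based expert routing: an expert is
# considered active only if its ReLU-gated router score is > 0.
def relu_route(logits):
    scores = np.maximum(logits, 0.0)  # ReLU gate zeroes negative scores
    active = scores > 0               # mask of selected experts
    return scores, active

# Early in training, all router logits for a token may be negative,
# in which case the ReLU zeroes everything and no expert is selected.
logits = np.array([-0.3, -1.2, -0.05, -0.8])
scores, active = relu_route(logits)
print(int(active.sum()))  # 0 -> no experts selected
```

Whether ReMoE needs a special initialization or auxiliary loss to avoid this dead-routing case at the start of training is exactly the open question raised here.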

And happy New Year, Benjamin! As always, I look forward to more of your posts in the new year!


I've seen some geeks on Reddit mention merging the fine-tuned and base models; the point seems to be to prevent forgetting.
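The merging trick mentioned here usually amounts to a linear interpolation of the two models' weights. A minimal sketch, assuming both models share the same architecture (the dict layout and the `alpha` parameter are illustrative assumptions, not a specific recipe from the post):

```python
import numpy as np

# Hypothetical sketch of weight-space merging: interpolate each parameter
# tensor between the base and fine-tuned checkpoints.
def merge_weights(base, finetuned, alpha=0.5):
    # alpha=0 recovers the base model, alpha=1 the fine-tuned model;
    # intermediate values trade task performance against forgetting.
    return {k: (1 - alpha) * base[k] + alpha * finetuned[k] for k in base}

base = {"layer.w": np.array([1.0, 2.0])}
ft = {"layer.w": np.array([3.0, 6.0])}
merged = merge_weights(base, ft, alpha=0.5)
print(merged["layer.w"])  # [2. 4.]
```

The intuition often given is that the merged weights stay close to the base model, so capabilities learned in pretraining are less likely to be overwritten.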


Yes, I have read the same thing several times. It was a Reddit recipe: we don't know why it works, but it works. It's nice to have a "scientific" study confirming it.

Happy New Year to you too!
