
It is generally better to split the transformer's input into queries, keys, and values (and into separate heads) before passing it through the linear layers. The reason is that the linear projections act on the full embedding of every token, whereas each attention head only needs its own lower-dimensional slice of that embedding. Splitting before the linear layers therefore lets each head use a smaller projection, which is cheaper to compute and avoids feeding a head dimensions it will never use. It also means attention is applied to the different slices independently, so each head can focus on a different aspect of the input, which can help in tasks like translation where different parts of a sentence call for different amounts of attention.
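To make the ordering concrete, here is a minimal PyTorch sketch of what "splitting first" can look like. The class name `SplitFirstAttention`, the shapes, and the per-head `nn.Linear` layers are illustrative assumptions for this answer, not a reference implementation:

```python
import torch
import torch.nn as nn

class SplitFirstAttention(nn.Module):
    """Illustrative sketch: split the model dimension into heads *before*
    the Q/K/V projections, so each head projects only its own slice."""

    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # One small projection per head instead of one d_model x d_model matrix.
        self.q_proj = nn.ModuleList([nn.Linear(self.d_head, self.d_head) for _ in range(num_heads)])
        self.k_proj = nn.ModuleList([nn.Linear(self.d_head, self.d_head) for _ in range(num_heads)])
        self.v_proj = nn.ModuleList([nn.Linear(self.d_head, self.d_head) for _ in range(num_heads)])
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); split the last dim into per-head slices.
        slices = x.chunk(self.num_heads, dim=-1)
        head_outputs = []
        for h, chunk in enumerate(slices):
            q = self.q_proj[h](chunk)                        # (batch, seq_len, d_head)
            k = self.k_proj[h](chunk)
            v = self.v_proj[h](chunk)
            scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
            attn = scores.softmax(dim=-1)                    # (batch, seq_len, seq_len)
            head_outputs.append(attn @ v)                    # (batch, seq_len, d_head)
        # Concatenate the heads back to d_model and mix them.
        return self.out_proj(torch.cat(head_outputs, dim=-1))

x = torch.randn(2, 10, 64)                     # (batch, seq_len, d_model)
out = SplitFirstAttention(d_model=64, num_heads=8)(x)   # (2, 10, 64)
```

Because each per-head projection is only d_head x d_head rather than d_model x d_head, the Q/K/V projections here use num_heads times fewer parameters and multiply-adds than projecting the full embedding for every head; the trade-off is that each head only ever sees its own slice of the embedding.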