Is it better to split the transformer's queries, keys, and values before or after going through the linear layers?

answered 2023-05-16 02:25:02 +0000

pufferfish

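Both orderings work, but they are not equivalent, and the standard design splits after the linear layers. Note first that the split is along the feature (embedding) dimension, not along the sequence: the linear layers act on each position separately, and attention is what mixes information across positions. If you split the embedding into per-head chunks before the linear layers, each head projects only its own slice of size d_model / h. That is cheaper, since each of the Q, K, and V projections needs d_model^2 / h weights instead of d_model^2, but each head then sees only its slice of the input features. The original Transformer and most library implementations (for example PyTorch's nn.MultiheadAttention) instead apply the full-width linear layers first and split the result into heads afterwards, so every head can draw on all input features. Splitting first is equivalent to forcing those projection matrices to be block-diagonal. In either case the heads attend independently, which is what lets different heads focus on different relationships in tasks like translation. Unless you specifically need the parameter savings, splitting after the linear layers is the safer default.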
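To make the two orderings concrete, here is a minimal sketch in plain Python with no framework. All function names here are made up for illustration and do not come from any library. Under those assumptions, it checks that splitting the input before the linear layer gives the same result as projecting first with a block-diagonal weight matrix and then splitting, and it compares the number of weights in one bias-free projection.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def split_heads(x, num_heads):
    """Split a (seq_len, d_model) matrix along the feature axis into num_heads chunks."""
    d_head = len(x[0]) // num_heads
    return [[row[h * d_head:(h + 1) * d_head] for row in x] for h in range(num_heads)]

def project_then_split(x, w, num_heads):
    """Standard ordering: full-width projection, then split into heads."""
    return split_heads(matmul(x, w), num_heads)

def split_then_project(x, head_weights):
    """Alternative ordering: slice the input, then project each slice on its own."""
    chunks = split_heads(x, len(head_weights))
    return [matmul(chunk, w_h) for chunk, w_h in zip(chunks, head_weights)]

def block_diagonal(blocks):
    """Assemble square blocks into one block-diagonal matrix."""
    size = sum(len(b) for b in blocks)
    out = [[0.0] * size for _ in range(size)]
    offset = 0
    for b in blocks:
        for i, row in enumerate(b):
            for j, v in enumerate(row):
                out[offset + i][offset + j] = v
        offset += len(b)
    return out

def count_params(d_model, num_heads, split_first):
    """Number of weights in one bias-free projection (e.g. only the query projection)."""
    if split_first:
        return num_heads * (d_model // num_heads) ** 2
    return d_model * d_model

random.seed(0)
seq_len, d_model, num_heads = 3, 4, 2
d_head = d_model // num_heads
x = [[random.random() for _ in range(d_model)] for _ in range(seq_len)]
head_weights = [
    [[random.random() for _ in range(d_head)] for _ in range(d_head)]
    for _ in range(num_heads)
]

# Split-before result, one (seq_len, d_head) matrix per head.
before = split_then_project(x, head_weights)
# Same weights arranged block-diagonally, applied in the standard ordering.
after = project_then_split(x, block_diagonal(head_weights), num_heads)

same = all(
    abs(p - q) < 1e-12
    for hb, ha in zip(before, after)
    for rb, ra in zip(hb, ha)
    for p, q in zip(rb, ra)
)
print("orderings agree under block-diagonal weights:", same)
print("weights, split first:", count_params(512, 8, True))
print("weights, project first:", count_params(512, 8, False))
```

With d_model = 512 and 8 heads, the split-first projection has 32768 weights versus 262144 for the full-width one, a factor equal to the number of heads. The saving comes from zeroing out every off-diagonal block, which is exactly what stops a head from reading features outside its own slice.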
