Is it better to split the transformer's queries, keys, and values before or after going through the linear layers?

asked 2023-05-16 02:08:02 +0000

woof

1 Answer

answered 2023-05-16 02:25:02 +0000

pufferfish

Splitting before the linear layers is the cheaper of the two options, but it is not equivalent to splitting afterwards. If the model width is d_model and there are h heads, splitting first means each head projects only its own d_model/h slice of the input. Each of the query, key, and value projections then needs d_model^2/h weights instead of d_model^2, and each head works on its portion of the input independently of the others. The cost is that a head's queries, keys, and values can never draw on features outside its slice. Splitting after the linear layers, which is what the standard multi-head attention formulation does, lets every head's projection see the full input vector before attention is applied per head. So split before if the savings in parameters and compute matter most, and split after if you want the standard, more expressive behavior.
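To make the two orderings concrete, here is a minimal NumPy sketch of both. It assumes that "split" means splitting the feature dimension into attention heads, and it ignores batching, masking, biases, and the output projection. The function names (`project_then_split`, `split_then_project`, `block_diag`) and the toy sizes are made up for illustration and are not taken from any library.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # q, k, v: (heads, seq, d_head); scaled dot-product attention per head
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def split_heads(x, n_heads):
    # (seq, d_model) -> (heads, seq, d_head)
    seq, d_model = x.shape
    return x.reshape(seq, n_heads, d_model // n_heads).transpose(1, 0, 2)

def project_then_split(x, w_q, w_k, w_v, n_heads):
    # Split AFTER the linear layers: each w_* is (d_model, d_model),
    # so every head's slice is computed from the full input vector.
    q, k, v = (split_heads(x @ w, n_heads) for w in (w_q, w_k, w_v))
    return attention(q, k, v)

def split_then_project(x, w_q, w_k, w_v, n_heads):
    # Split BEFORE the linear layers: each w_* is (heads, d_head, d_head),
    # so every head only ever sees its own slice of the input.
    xh = split_heads(x, n_heads)
    q, k, v = (xh @ w for w in (w_q, w_k, w_v))  # batched matmul over heads
    return attention(q, k, v)

def block_diag(blocks):
    # Hypothetical helper: place per-head (d, d) blocks on the diagonal
    # of a (h*d, h*d) matrix, with zeros everywhere else.
    h, d, _ = blocks.shape
    out = np.zeros((h * d, h * d))
    for i in range(h):
        out[i * d:(i + 1) * d, i * d:(i + 1) * d] = blocks[i]
    return out

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 8, 2   # toy sizes, chosen arbitrarily
d_head = d_model // n_heads
x = rng.normal(size=(seq_len, d_model))

full = [rng.normal(size=(d_model, d_model)) for _ in range(3)]
per_head = [rng.normal(size=(n_heads, d_head, d_head)) for _ in range(3)]

out_after = project_then_split(x, *full, n_heads)
out_before = split_then_project(x, *per_head, n_heads)

params_after = sum(w.size for w in full)       # 3 * 8 * 8 = 192
params_before = sum(w.size for w in per_head)  # 3 * 2 * 4 * 4 = 96

# Splitting first is the special case of splitting afterwards in which
# the full weight matrices are block-diagonal.
out_equiv = project_then_split(x, *[block_diag(w) for w in per_head], n_heads)
```

Both variants return an array of shape (heads, seq, d_head). In this sketch the split-first version uses h times fewer projection weights, and `out_equiv` matches `out_before`, which shows that splitting first is a constrained version of splitting afterwards rather than a different mechanism.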

Last updated: May 16 '23