Splitting the input before the linear layers can make sense, but it is a trade-off rather than a strict improvement. The linear layers act on each position's feature vector independently; the attention step is what mixes information across positions. If you first slice each feature vector into one chunk per head and give every head its own small linear layers for queries, keys, and values, each projection needs d_model^2 / h weights in total instead of d_model^2 (for h heads, ignoring biases). That makes it cheaper in both parameters and compute, and each head attends using only its own slice of the features. This independence can be useful for tasks like translation, where different parts of the representation may call for different attention patterns. The cost is that a head can no longer draw on features outside its slice. Note that the original Transformer does it the other way around: it applies full-width query, key, and value projections and only then splits the result into heads.
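To make the comparison concrete, here is a minimal NumPy sketch of the two layouts for a single projection (say, the queries). The function names, shapes, and random weights are illustrative assumptions, not taken from any particular library. It also checks that splitting first behaves like a full projection whose weight matrix is restricted to be block-diagonal.

```python
import numpy as np

def project_then_split(x, w, num_heads):
    """Full-width linear layer first, then reshape the output into heads."""
    seq_len, d_model = x.shape
    head_dim = d_model // num_heads
    return (x @ w).reshape(seq_len, num_heads, head_dim)

def split_then_project(x, w_heads):
    """Slice the features into heads first, then one small linear layer per head."""
    num_heads, head_dim, _ = w_heads.shape
    seq_len = x.shape[0]
    x_heads = x.reshape(seq_len, num_heads, head_dim)
    # For each head h: x_heads[:, h, :] @ w_heads[h]
    return np.einsum("shd,hde->she", x_heads, w_heads)

# Hypothetical sizes chosen only for illustration.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 8, 2
head_dim = d_model // num_heads

x = rng.normal(size=(seq_len, d_model))
w_full = rng.normal(size=(d_model, d_model))                 # d_model^2 weights
w_heads = rng.normal(size=(num_heads, head_dim, head_dim))   # d_model^2 / num_heads weights

q_full = project_then_split(x, w_full, num_heads)
q_split = split_then_project(x, w_heads)

# Splitting first is equivalent to a block-diagonal full projection:
# every weight that would mix features across head slices is fixed to zero.
w_block = np.zeros((d_model, d_model))
for h in range(num_heads):
    s = slice(h * head_dim, (h + 1) * head_dim)
    w_block[s, s] = w_heads[h]
q_block = project_then_split(x, w_block, num_heads)

print(w_full.size, w_heads.size)        # parameter counts of the two layouts
print(np.allclose(q_split, q_block))    # equivalence check
```

Both layouts produce a tensor of shape (seq_len, num_heads, head_dim) that can be fed to the attention step. The difference is only in how many weights the projection has and whether a head can see features outside its own slice.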
Asked: 2023-05-16 02:08:02 +0000
Last updated: May 16 '23