Unanswered Questions:
- Difference between Float16 vs BFloat16 vs TensorFloat-32?
- Vector Databases: HNSW vs IVF?
- Difference between vector DBs and the FAISS library (by Meta)?
  - From my current knowledge both are the same, but then why is everyone behind vector DBs instead of using FAISS directly?
- Null hypothesis test » p-values » calculated using a t-test or z-test
Weeb Union, Umar Jamil, Cohere's embedding v3
RetNet –> saw it spoken of highly –> removes the softmax from the Transformer network and adds exponential-moving-average-style decay weights, by which it gives more importance to recent tokens (toy sketch below)
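A rough numpy toy of that idea as I wrote it down (my own illustration, not the actual RetNet retention formula): drop the softmax attention weights and mix past token values with fixed exponentially decaying weights, so recent tokens count more.

```python
import numpy as np

def decay_mixing(values, gamma=0.9):
    """Softmax-free causal mixing: output at position n is a sum of past values
    weighted by gamma**(n - m), so more recent positions get larger weights."""
    seq_len, _ = values.shape
    out = np.zeros_like(values)
    for n in range(seq_len):
        weights = gamma ** (n - np.arange(n + 1))  # decays as tokens get older
        out[n] = weights @ values[: n + 1]
    return out

tokens = np.random.randn(6, 4)      # 6 positions, hidden size 4
print(decay_mixing(tokens).shape)   # (6, 4)
```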
Why can't an LLM be used for embedding generation in RAG? –> Current understanding: just look at Titan, the embedding quality is bad and the token length is high –> which suggests that when a lot of information is compressed into a single vector (embedding), it loses its meaning –> LLMs have very high token lengths, so they can suffer the same fate –> the second reason is that LLMs are decoder-only (causal language models), which do not have two-way (bidirectional) information flow –> hence the embeddings might be good for next-token prediction but not good for retrieval
Speculative sampling: two different LLMs, one small and one big (e.g. LLaMA-2 7B and 70B) –> the autoregressive generation of the next "N" tokens is done by the smaller model –> once that draft is done, the bigger model verifies and corrects the "N" tokens in a single pass (toy sketch below)
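A toy sketch of that flow, with deterministic stand-in "models" (nothing here is a real LLM API; with real models the check in step 2 happens in one forward pass over the whole drafted sequence):

```python
VOCAB_SIZE = 50

def make_toy_model(seed):
    """Stand-in for a greedy LLM decoder: token sequence -> next token id."""
    def next_token(seq):
        return (sum(seq) * 31 + seed) % VOCAB_SIZE
    return next_token

draft_model = make_toy_model(seed=1)    # plays the role of the small model (e.g. 7B)
target_model = make_toy_model(seed=2)   # plays the role of the big model (e.g. 70B)

def speculative_step(seq, n_draft=4):
    # 1) the small model drafts the next n_draft tokens autoregressively
    draft = []
    for _ in range(n_draft):
        draft.append(draft_model(seq + draft))
    # 2) the big model checks the draft, keeps the longest prefix it agrees with,
    #    and substitutes its own token at the first disagreement
    accepted = []
    for tok in draft:
        big_choice = target_model(seq + accepted)
        if big_choice == tok:
            accepted.append(tok)
        else:
            accepted.append(big_choice)
            break
    return seq + accepted

print(speculative_step([3, 7, 11]))
```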
Distributed training:
- Data parallelism
- Model parallelism
- Pipeline parallelism: cut layer by layer
- Model parallelism: cut vertically
NEED TO READ: FastViT, RetVisionNet, float16 vs bfloat16 vs TF32 (TensorFloat-32)
https://kipp.ly/transformer-inference-arithmetic/
questions:
- How many samples are required for fine-tuning? –> Lamini deep learning course: at least 1000 samples to start with
- Hugging Face Trainer for causal LM vs SFTTrainer?
- Orca method LLM
- YaRN method for RoPE for extending the context length far beyond the training length
perplexity of an LLM = exp of (negative average per-token log-likelihood)
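Quick numeric check of that formula (made-up probabilities, just to show the computation):

```python
import math

# probabilities the model assigned to each actual next token (made-up numbers)
token_probs = [0.25, 0.10, 0.50, 0.05]

neg_avg_log_likelihood = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(neg_avg_log_likelihood)
print(perplexity)  # ~6.3: on average the model was about as uncertain as a 6-way uniform choice
```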
Please help me understand whether my questions below (A, B, and C) are correct:
I am instruction fine-tuning an LLM. I am pre-processing my text data, which also includes tokenization. My prompt structure is something like: """Task: {task}, Context: {context}, Response: {response}""" where {response} is the text which will eventually get generated by the LLM.
Let's say I am padding every sample to 1024 tokens. Is my following understanding correct for processing a single data sample (input + output):
-> """Task: {task}, Context: {context}, Response: """ –> let's say this took 256 tokens
-> """{response}""" –> took another 256 tokens
A.) Hence I will pad this with 512 more tokens to make the total 1024 tokens, right?
B.) During label creation (loss is cross-entropy, framework is PyTorch) for instruction fine-tuning, my labels will look something like:
-> """Task: {task}, Context: {context}, Response: """ –> "-100" 256 times, to avoid calculating cross-entropy loss on the instruction part
-> """{response}""" –> 256 tokens of the response, corresponding to the actual labels
-> """padding""" –> "-100" 512 times, to not calculate any loss on the 512 pad tokens
Is that right?
C.) For the corresponding attention mask, it should be "1" for the first 512 positions (task + context + response) and "0" for the last 512 positions (padding), right?
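A sketch of how I think (A), (B), and (C) fit together for one sample, with made-up token counts and placeholder ids standing in for the real tokenizer output (pad_id and the 256/256 split are assumptions, not actual values):

```python
# One sample padded to max_len = 1024, assuming the prompt part tokenizes to 256 ids
# and the response to 256 ids (placeholder ids instead of a real tokenizer).
max_len = 1024
pad_id = 0          # stand-in for tokenizer.pad_token_id
IGNORE = -100       # ignored by PyTorch's CrossEntropyLoss (ignore_index=-100)

prompt_ids = list(range(1, 257))       # 256 ids for 'Task: ..., Context: ..., Response: '
response_ids = list(range(300, 556))   # 256 ids for the response text

n_pad = max_len - len(prompt_ids) - len(response_ids)                       # (A) 512 pad tokens
input_ids = prompt_ids + response_ids + [pad_id] * n_pad                    # 1024 ids total
labels = [IGNORE] * len(prompt_ids) + response_ids + [IGNORE] * n_pad       # (B) loss only on the response
attention_mask = [1] * (len(prompt_ids) + len(response_ids)) + [0] * n_pad  # (C) 512 ones, 512 zeros

assert len(input_ids) == len(labels) == len(attention_mask) == max_len
```

(With Hugging Face `*ForCausalLM` models the shift for next-token prediction happens inside the model, so labels can stay position-aligned with input_ids as above.)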