Unanswered Questions


  • Difference between Float16 vs BFloat16 vs TensorFloat-32 (TF32)? (see the dtype sketch after this list)

  • Vector databases: HNSW vs IVF? (see the FAISS sketch after this list)

  • Difference between vector DBs and the FAISS library (by Meta)?

    • From my current knowledge both are the same, so why is everyone behind vector DBs instead of using FAISS directly?
  • Null hypothesis test » p-values » calculated using a t-test or z-test (see the SciPy sketch after this list)
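
For the dtype question, a quick way to compare the first two formats in PyTorch (a sketch; I am assuming "Tensor-Float16" refers to NVIDIA's TF32, which in PyTorch is a matmul execution mode rather than a storage dtype):

```python
import torch

# float16:  1 sign, 5 exponent, 10 mantissa bits -> small range, decent precision
# bfloat16: 1 sign, 8 exponent, 7 mantissa bits  -> float32 range, less precision
print(torch.finfo(torch.float16))    # max ~65504,  eps ~9.8e-4
print(torch.finfo(torch.bfloat16))   # max ~3.4e38, eps ~7.8e-3

# TF32 (8 exponent, 10 mantissa bits) is enabled as a tensor-core matmul/conv mode:
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
```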
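
For the HNSW vs IVF and FAISS questions, a minimal FAISS sketch with toy data. My current understanding, stated tentatively: vector DBs (Milvus, Qdrant, etc.) typically wrap ANN indexes like these and add persistence, metadata filtering, and replication on top, which is the practical difference from using the library directly.

```python
import faiss
import numpy as np

d = 128
xb = np.random.rand(10_000, d).astype("float32")   # toy database vectors
xq = np.random.rand(5, d).astype("float32")        # toy queries

# IVF: k-means the vectors into 100 lists, then search only `nprobe` lists per query.
quantizer = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(quantizer, d, 100)
ivf.train(xb)
ivf.add(xb)
ivf.nprobe = 8
D, I = ivf.search(xq, 5)

# HNSW: graph-based index, no training step; different recall/latency trade-off.
hnsw = faiss.IndexHNSWFlat(d, 32)                  # 32 = graph connectivity (M)
hnsw.add(xb)
D, I = hnsw.search(xq, 5)
```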
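
And for the p-value note, a minimal SciPy sketch (one-sample t-test; a z-test would be used instead when the population variance is known or the sample is large):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.55, scale=0.1, size=30)

# H0: the true mean is 0.5; the t-test returns the statistic and its p-value.
t_stat, p_value = stats.ttest_1samp(sample, popmean=0.5)
print(t_stat, p_value)   # reject H0 at the 5% level if p_value < 0.05
```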



Weeb Union, Umar Jamil, Cohere's embedding v3

RetNet –> saw it on a high note –> removes the softmax from the Transformer network and adds exponential-decay weights (similar to an exponential moving average), which give more importance to recent tokens (see the sketch below)
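
Not the official implementation; a minimal single-head sketch of the retention idea as described above (QKᵀ scores with no softmax, multiplied by a causal exponential-decay mask). The shapes and gamma are arbitrary, and the real RetNet adds multi-scale per-head decay, among other details.

```python
import torch

def retention(q, k, v, gamma=0.97):
    """Simplified parallel-form retention: attention-like scores without softmax,
    weighted by a causal decay mask D[n, m] = gamma ** (n - m) for n >= m."""
    T = q.shape[-2]
    scores = q @ k.transpose(-2, -1)                 # (T, T), no softmax
    n = torch.arange(T).unsqueeze(1)                 # query positions
    m = torch.arange(T).unsqueeze(0)                 # key positions
    decay = (gamma ** (n - m).float()) * (n >= m)    # recent tokens weigh more
    return (scores * decay) @ v

q, k, v = (torch.randn(8, 16) for _ in range(3))
out = retention(q, k, v)                             # (8, 16)
```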

Why can't an LLM be used for embedding generation in RAG? –> Current understanding: just look at Titan, the embedding quality is bad and the token length is high –> which suggests that when a lot of information is compressed into a single vector (embedding), it loses its meaning –> LLMs have very high token lengths, so they can suffer the same fate –> the second reason is that LLMs are decoder-only (causal language models), which do not have two-way information –> hence the embeddings might be good for NSP but not good for retrieval

Speculative sampling: two different LLMs, one small and one big (e.g. LLaMA-2 7B and 70B) –> the autoregressive generation of the next "N" tokens is done by the smaller model –> once that inference is done, the bigger model verifies/corrects those "N" tokens in a single forward pass (see the sketch below)
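
A minimal greedy-only sketch of that loop (the actual method uses rejection sampling against the draft distribution; here I assume Hugging Face-style models that return `.logits` and a batch size of 1):

```python
import torch

@torch.no_grad()
def speculative_greedy(draft_model, target_model, ids, n_draft=4):
    """One simplified round: the small draft model proposes n_draft tokens
    autoregressively, the big target model scores them all in one forward pass,
    and we keep the prefix of draft tokens the target model agrees with."""
    # 1) draft: small model generates n_draft tokens one by one
    draft_ids = ids
    for _ in range(n_draft):
        logits = draft_model(draft_ids).logits[:, -1, :]
        next_id = logits.argmax(dim=-1, keepdim=True)
        draft_ids = torch.cat([draft_ids, next_id], dim=-1)

    # 2) verify: big model scores the whole drafted sequence in a single pass
    target_logits = target_model(draft_ids).logits
    target_choice = target_logits[:, -n_draft - 1:-1, :].argmax(dim=-1)
    proposed = draft_ids[:, -n_draft:]

    # 3) accept the longest matching prefix; on the first mismatch,
    #    take the target model's token instead and stop
    accepted = ids
    for t in range(n_draft):
        if torch.equal(proposed[:, t], target_choice[:, t]):
            accepted = torch.cat([accepted, proposed[:, t:t + 1]], dim=-1)
        else:
            accepted = torch.cat([accepted, target_choice[:, t:t + 1]], dim=-1)
            break
    return accepted
```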

Distributed training:
  • Data parallelism (see the DDP sketch below)
  • Pipeline parallelism: cut the model layer by layer (horizontally)
  • Model (tensor) parallelism: cut the model vertically, within a layer
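
A minimal data-parallelism sketch with PyTorch DDP; pipeline and tensor parallelism usually need a framework such as DeepSpeed or Megatron-LM. Assumes a single node with one GPU per process, launched via torchrun.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes launch with: torchrun --nproc_per_node=<num_gpus> train.py
dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda()
model = DDP(model, device_ids=[local_rank])   # one full replica per GPU

x = torch.randn(32, 1024).cuda()              # this rank's shard of the global batch
loss = model(x).pow(2).mean()
loss.backward()                               # DDP all-reduces (averages) gradients here
```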

NEED TO READ: FastViT, RetVisionNet, float16 vs bfloat16 vs tfloat16

https://kipp.ly/transformer-inference-arithmetic/


questions:

  • How many samples are required for fine-tuning? –> Lamini deep learning course: at least 1000 samples to start with
  • Hugging Face CLMTrainer vs SFT?
  • Orca method LLM
  • YaRN method for RoPE, for extending the context length far beyond the training length

Perplexity of an LLM = exp(average negative log-likelihood per token) = exp(cross-entropy loss) (see the sketch below)
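
A minimal sketch with toy tensors:

```python
import torch
import torch.nn.functional as F

# Toy logits/labels from a causal LM; perplexity = exp(mean negative log-likelihood).
logits = torch.randn(1, 10, 32000)           # (batch, seq_len, vocab_size)
targets = torch.randint(0, 32000, (1, 10))   # next-token labels

nll = F.cross_entropy(logits.view(-1, 32000), targets.view(-1))  # avg NLL per token
perplexity = torch.exp(nll)
print(perplexity)
```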



Please help me understand whether my assumptions in questions A.), B.), and C.) below are correct:

I am instruction fine-tuning an LLM. I am pre-processing my text data, which also includes tokenization. My prompt structure is something like: """Task: {task}, Context: {context}, Response: {response}""", where {response} is the text that will eventually be generated by the LLM.

Let's say I am padding every sample to 1024 tokens. Is the following understanding correct for processing a single data sample (input + output)?
-> """Task: {task}, Context: {context}, Response: """ –> let's say this took 256 tokens
-> """{response}""" –> took another 256 tokens
A.) Hence I will pad this with 512 tokens to make it a total of 1024 tokens, right?

B.) During label creation (loss used is cross-entropy, framework is PyTorch) for instruction fine-tuning, my labels will look something like:
-> """Task: {task}, Context: {context}, Response: """ –> "-100" 256 times, to avoid calculating cross-entropy loss on the instruction part
-> """{response}""" –> 256 tokens of the response corresponding to the actual labels
-> """padding""" –> "-100" 512 times, to not calculate any loss on the 512 pad tokens
Right?

C.) For the corresponding attention mask, it should be "1" for the first 512 positions (task + context + response) and "0" for the last 512 (padding), right? (A sketch of all three is below.)
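
A minimal sketch of what A.), B.), and C.) describe for one sample, assuming a Hugging Face-style tokenizer with a pad token and right padding; `build_sample`, the prompt strings, and the 1024 max length are placeholders, and truncation of over-long samples is not handled:

```python
import torch

MAX_LEN = 1024
IGNORE = -100   # PyTorch cross-entropy ignore_index

def build_sample(tokenizer, prompt, response):
    """Builds input_ids / labels / attention_mask for one sample,
    masking the prompt and the padding out of the loss (questions B and C)."""
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response, add_special_tokens=False)["input_ids"]
    response_ids = response_ids + [tokenizer.eos_token_id]

    input_ids = prompt_ids + response_ids
    labels = [IGNORE] * len(prompt_ids) + response_ids   # loss only on the response
    attention_mask = [1] * len(input_ids)

    pad_len = MAX_LEN - len(input_ids)                    # question A: pad to 1024
    input_ids += [tokenizer.pad_token_id] * pad_len
    labels += [IGNORE] * pad_len                          # no loss on padding
    attention_mask += [0] * pad_len                       # don't attend to padding

    return {
        "input_ids": torch.tensor(input_ids),
        "labels": torch.tensor(labels),
        "attention_mask": torch.tensor(attention_mask),
    }
```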