Benchmarking Inference with TorchServe #
Last Edited: 05/01/2024
PyTorch default - g4dn.xlarge #
Notes: #
- Instance Type: ml.g4dn.xlarge
- GPU: NVIDIA T4
- vCPUs: 4
- CPU memory: 16 GB
- GPU memory: 16 GB
- Max RPS achieved: 32
- With various configurations ranging from min/max workers = 1 to 4 and batch size 4 to 32, the maximum RPS achievable was only 32.
- Locust configuration: Max Users: 200, Spawn Rate: 10 (a minimal Locust sketch follows the configuration below)
- Max response time at 95th percentile: ~5-6 sec
Configuration: #
enable_envvars_config=true
load_models=all
model_store=./model_store
models={\
"vit_l_16": {\
"1.0": {\
"defaultVersion": true,\
"marName": "vit_l_16.mar",\
"minWorkers": 4,\
"maxWorkers": 4,\
"batchSize": 16,\
"maxBatchDelay": 50\
}\
}\
}
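For reference, a minimal Locust script matching the load profile above (200 users, spawn rate 10) might look like the sketch below. The host, the sample.jpg payload, and the wait time are assumptions, not the exact harness used for these numbers.

from locust import HttpUser, task, between

class InferenceUser(HttpUser):
    # Short think time between requests per simulated user (assumption).
    wait_time = between(0.1, 0.5)

    @task
    def predict(self):
        # TorchServe inference API: POST /predictions/<model_name>
        with open("sample.jpg", "rb") as f:
            self.client.post("/predictions/vit_l_16", data=f.read())

Run it with something like: locust -f locustfile.py --host http://<endpoint>:8080 -u 200 -r 10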
PyTorch default - g4dn.2xlarge #
Notes: #
- Instance Type: ml.g4dn.2xlarge
- GPU: NVIDIA T4
- vCPUs: 8
- CPU memory: 32 GB
- GPU memory: 16 GB
- Max RPS achieved: 32
- With various configurations ranging from min/max workers = 1 to 4 and batch size 4 to 64, the maximum RPS achievable was only around 32.
- Locust configuration: Max Users: 200, Spawn Rate: 10
- Max response time at 95th percentile: ~5-6 sec
- Note that GPU utilization sits at 100% while GPU memory is underutilized and the vCPUs are mostly idle (a sketch for sampling these numbers follows the configuration below).
- Nevertheless, changing the number of model workers, the batch size, or the batch delay does not change the results or the utilization numbers.
Configuration: #
enable_envvars_config=true
load_models=all
model_store=./model_store
models={\
"vit_l_16": {\
"1.0": {\
"defaultVersion": true,\
"marName": "vit_l_16.mar",\
"minWorkers": 1,\
"maxWorkers": 1,\
"batchSize": 64,\
"maxBatchDelay": 200,\
"responseTimeout": 240\
}\
}\
}
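A minimal sketch of how the GPU-utilization-vs-GPU-memory observation above can be sampled during a run, assuming the pynvml (nvidia-ml-py) package; the one-second polling loop and device index 0 are illustrative, not the exact tooling used here.

import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # single T4 on g4dn instances

try:
    while True:
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        # Prints compute utilization (%) and fraction of GPU memory in use.
        print(f"gpu={util.gpu}% mem={mem.used / mem.total:.0%}")
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()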
Does Dynamic Batching Really Help? (PS: It does) #
Instance used: g4dn.2xlarge
- I can imagine two scenarios to compare against dynamic batching (a handler sketch illustrating batching follows this list):
- workers set to 1 and batch size 1
- workers > 1 (at least 4) and batch size 1
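For context on what the batching knobs do: with batching enabled, TorchServe collects up to batchSize requests (waiting at most maxBatchDelay ms) and passes them to the handler as a single list, so one forward pass serves several clients. The sketch below shows where that batch arrives in a custom handler; the built-in image_classifier handler already implements equivalent logic, and the transform choices here are assumptions.

import io
import torch
from PIL import Image
from torchvision import transforms
from ts.torch_handler.base_handler import BaseHandler

class BatchedVitHandler(BaseHandler):
    transform = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
    ])

    def preprocess(self, data):
        # `data` is a list of up to batchSize requests gathered within maxBatchDelay.
        images = []
        for row in data:
            payload = row.get("data") or row.get("body")
            images.append(self.transform(Image.open(io.BytesIO(payload))))
        return torch.stack(images).to(self.device)

    def postprocess(self, output):
        # One prediction per request in the batch, in request order.
        return output.argmax(dim=1).tolist()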
Batch-size 1 and workers 1: #
- Instance Type: ml.g4dn.2xlarge
- GPU: NVIDIA T4
- vCPUs: 8
- CPU memory: 32 GB
- GPU memory: 16 GB
- Max RPS achieved: 21
- With min/max workers = 1 and batch size = 1, the maximum RPS achievable was only around 21.
- Locust configuration: Max Users: 200, Spawn Rate: 10
- Max response time at 95th percentile: ~5 sec (close to 4.9 sec)
- This is slightly lower than with dynamic batching, since TorchServe does not wait the extra batch-delay time to assemble batches.
- Note: GPU utilization is also not at 100%.
Configuration: #
enable_envvars_config=true
load_models=all
model_store=./model_store
models={\
"vit_l_16": {\
"1.0": {\
"defaultVersion": true,\
"marName": "vit_l_16.mar",\
"minWorkers": 1,\
"maxWorkers": 1,\
"batchSize": 1,\
"maxBatchDelay": 200,\
"responseTimeout": 240\
}\
}\
}
Batch-size 1 and workers 4: #
- Note: For some reason, with 4 workers the GPU utilization reaches 100% but the RPS stays the same (workers can also be scaled at runtime via the management API, sketched at the end of this section).
- Instance Type: ml.g4dn.2xlarge
- GPU: NVIDIA T4
- vCPUs: 8
- CPU memory: 32 GB
- GPU memory: 16 GB
- Max RPS achieved: 21
- With min/max workers = 4 and batch size = 1, the maximum RPS achievable was only around 21.
- Locust configuration: Max Users: 200, Spawn Rate: 10
- Max response time at 95th percentile: ~5 sec (close to 4.9 sec)
- This is slightly lower than with dynamic batching, since TorchServe does not wait the extra batch-delay time to assemble batches.
Configuration: #
enable_envvars_config=true
load_models=all
model_store=./model_store
models={\
"vit_l_16": {\
"1.0": {\
"defaultVersion": true,\
"marName": "vit_l_16.mar",\
"minWorkers": 1,\
"maxWorkers": 1,\
"batchSize": 1,\
"maxBatchDelay": 200,\
"responseTimeout": 240\
}\
}\
}
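As a side note, the worker counts used in these experiments can also be changed at runtime through the TorchServe management API (default port 8081) instead of editing config.properties and restarting. A hedged sketch, assuming a local endpoint:

import requests

MANAGEMENT = "http://localhost:8081"

# Ask TorchServe to run 4 workers for vit_l_16 and wait for the scale-up to finish.
resp = requests.put(
    f"{MANAGEMENT}/models/vit_l_16",
    params={"min_worker": 4, "max_worker": 4, "synchronous": "true"},
)
print(resp.status_code, resp.text)

# Inspect the current worker count and batch settings for the model.
print(requests.get(f"{MANAGEMENT}/models/vit_l_16").json())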