Benchmarking Inference with TorchServe #

Last Edited 05/01/2024

PyTorch default - g4dn.xlarge #

Notes: #

  • Instance Type: ml.g4dn.xlarge

    • GPU: Nvidia T4
    • vCPU no: 4
    • CPU memory: 16 GB
    • GPU memory: 16 GB
  • Max RPS achieved: 32

    • With various configurations ranging from min/max workers = 1 to 4 and batch size 4 to 32, the max RPS achievable was only 32.
      • Locust Configuration: Max Users: 200, Spawn Rate: 10 (a minimal Locust sketch follows after this list)
  • Max Response time at 95th percentile: ~5-6 sec
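
A minimal sketch of the Locust side of these runs (not the exact locustfile used here; the image file name, host, and port are assumptions, while /predictions/vit_l_16 is TorchServe's standard inference path):

from locust import HttpUser, task, constant

# Load the request payload once; any test image will do.
with open("kitten.jpg", "rb") as f:
    IMAGE_BYTES = f.read()

class TorchServeUser(HttpUser):
    # No wait time: each simulated user issues requests back to back.
    wait_time = constant(0)

    @task
    def predict(self):
        self.client.post(
            "/predictions/vit_l_16",
            data=IMAGE_BYTES,
            headers={"Content-Type": "application/octet-stream"},
        )

Run with something like locust -f locustfile.py --host http://<endpoint>:8080 -u 200 -r 10 --headless to match the 200-user / spawn-rate-10 settings above.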



Configuration: #

enable_envvars_config=true
load_models=all
model_store=./model_store
models={\
  "vit_l_16": {\
    "1.0": {\
        "defaultVersion": true,\
        "marName": "vit_l_16.mar",\
        "minWorkers": 4,\
        "maxWorkers": 4,\
        "batchSize": 16,\
        "maxBatchDelay": 50\
    }\
  }\
}
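
The sweeps above were done by editing config.properties and restarting TorchServe, but the same knobs can also be changed at runtime through the management API (port 8081 by default). A rough sketch with the requests library, assuming a local server and the same archive name as in the config (batch size is fixed at registration time, so the simplest way to change it is to re-register):

import requests

MGMT = "http://localhost:8081"

# Re-register the model with a new batch size / batch delay / worker count.
requests.delete(f"{MGMT}/models/vit_l_16")
requests.post(
    f"{MGMT}/models",
    params={
        "url": "vit_l_16.mar",
        "batch_size": 16,
        "max_batch_delay": 50,
        "initial_workers": 4,
        "synchronous": "true",
    },
)

# Scale only the workers on an already-registered model.
requests.put(
    f"{MGMT}/models/vit_l_16",
    params={"min_worker": 4, "max_worker": 4, "synchronous": "true"},
)

# Check the effective settings before starting the next Locust run.
print(requests.get(f"{MGMT}/models/vit_l_16").json())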


PyTorch default - g4dn.2xlarge #

Notes: #

  • Instance Type: ml.g4dn.2xlarge

    • GPU: Nvidia T4
    • vCPU no: 8
    • CPU memory: 32 GB
    • GPU memory: 16 GB
  • Max RPS achieved: 32

    • With various configurations ranging from min/max workers = 1 to 4 and batch size 4 to 64, the max RPS achievable was still only around 32.
      • Locust Configuration: Max Users: 200, Spawn Rate: 10
  • Max Response time at 95th percentile: ~5-6 sec



  • Note that GPU utilization is at 100%, while GPU memory is underutilized and the vCPUs are mostly idle (a utilization-sampling sketch follows below).
  • Nevertheless, changing the number of model workers, the batch size, or the batch delay does not change the results or the utilization numbers.
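
A small sketch of how these utilization numbers can be sampled during a run (assumes nvidia-smi and psutil are available on the instance):

import subprocess
import time

import psutil

def sample(interval_s=1.0, duration_s=60.0):
    """Print GPU utilization, GPU memory, and CPU utilization once per interval."""
    end = time.time() + duration_s
    while time.time() < end:
        gpu = subprocess.check_output(
            [
                "nvidia-smi",
                "--query-gpu=utilization.gpu,memory.used,memory.total",
                "--format=csv,noheader,nounits",
            ],
            text=True,
        ).strip()
        cpu = psutil.cpu_percent(interval=None)
        print(f"gpu(util%, memMB used, memMB total) = {gpu} | cpu% = {cpu}")
        time.sleep(interval_s)

if __name__ == "__main__":
    sample()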


Configuration: #

enable_envvars_config=true
load_models=all
model_store=./model_store
models={\
  "vit_l_16": {\
    "1.0": {\
        "defaultVersion": true,\
        "marName": "vit_l_16.mar",\
        "minWorkers": 1,\
        "maxWorkers": 1,\
        "batchSize": 64,\
        "maxBatchDelay": 200,\
        "responseTimeout": 240\
    }\
  }\
}
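
The RPS and latency figures above come from Locust, but TorchServe also publishes its own Prometheus-format metrics (per-model queue time and inference time) on the metrics port, 8082 by default. A quick way to pull them mid-run, sketched below (metric names can vary slightly across TorchServe versions):

import requests

# Default metrics endpoint; adjust if metrics_address was changed in config.properties.
metrics = requests.get("http://localhost:8082/metrics").text

# Keep only the queue/inference latency lines for the model under test.
for line in metrics.splitlines():
    if "vit_l_16" in line and ("ts_queue_latency" in line or "ts_inference_latency" in line):
        print(line)

Comparing queue latency with inference latency is a cheap way to tell whether the RPS ceiling comes from the model itself or from requests waiting in the frontend queue.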


Does Dynamic Batching Really Help? (PS: It does) #

Instance used: g4dn.2xlarge

  • Two scenarios are worth comparing (a sketch of a batch-aware handler follows this list):
    1. workers set to 1 and batch size 1
    2. workers > 1 (at least 4) and batch size 1
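
Dynamic batching only pays off if the handler processes the whole batch in a single forward pass. The handler used for these runs is not shown here; below is a rough sketch (class name and preprocessing are assumptions) of how a batch-aware TorchServe handler consumes the list of requests it receives when batchSize > 1:

import io

import torch
from PIL import Image
from torchvision import transforms
from ts.torch_handler.base_handler import BaseHandler

class ViTBatchHandler(BaseHandler):
    # Standard ImageNet-style preprocessing; adjust to the actual model.
    image_processing = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
    ])

    def preprocess(self, data):
        # `data` holds up to batchSize requests, collected within maxBatchDelay ms.
        images = []
        for row in data:
            body = row.get("data") or row.get("body")
            image = Image.open(io.BytesIO(body)).convert("RGB")
            images.append(self.image_processing(image))
        # One stacked tensor -> one forward pass for the whole batch.
        return torch.stack(images).to(self.device)

    def postprocess(self, inference_output):
        # Must return exactly one entry per request in the batch.
        return inference_output.argmax(dim=1).tolist()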

Batch-size 1 and workers 1: #

  • Instance Type: ml.g4dn.2xlarge

    • GPU: Nvidia T4
    • vCPU no: 8
    • CPU memory: 32 GB
    • GPU memory: 16 GB
  • Max RPS achieved: 21

    • With min/max workers = 1 and batch size = 1, the max RPS achievable was only around 21.
      • Locust Configuration: Max Users: 200, Spawn Rate: 10
  • Max Response time at 95th percentile: ~5 sec (close to 4.9 sec)

    • It is a bit lower than with dynamic batching, since TorchServe does not wait the extra maxBatchDelay for batches to form.


  • Note: The GPU utilization is also not saturated in this setup.


Configuration: #

enable_envvars_config=true
load_models=all
model_store=./model_store
models={\
  "vit_l_16": {\
    "1.0": {\
        "defaultVersion": true,\
        "marName": "vit_l_16.mar",\
        "minWorkers": 1,\
        "maxWorkers": 1,\
        "batchSize": 1,\
        "maxBatchDelay": 200,\
        "responseTimeout": 240\
    }\
  }\
}


Batch-size 1 and workers 4: #

  • Note: For some reason, with 4 workers the GPU utilization reaches 100%, but the RPS stays the same.

  • Instance Type: ml.g4dn.2xlarge

    • GPU: Nvidia T4
    • vCPU no: 8
    • CPU memory: 32 GB
    • GPU memory: 16 GB
  • Max RPS achieved: 21

    • With min/max workers = 4 and batch size = 1, the max RPS achievable was still only around 21.
      • Locust Configuration: Max Users: 200, Spawn Rate: 10
  • Max Response time at 95th percentile: ~5 sec (close to 4.9 sec)

    • It is a bit lower than with dynamic batching, since TorchServe does not wait the extra maxBatchDelay for batches to form.


Configuration: #

enable_envvars_config=true
load_models=all
model_store=./model_store
models={\
  "vit_l_16": {\
    "1.0": {\
        "defaultVersion": true,\
        "marName": "vit_l_16.mar",\
        "minWorkers": 1,\
        "maxWorkers": 1,\
        "batchSize": 1,\
        "maxBatchDelay": 200,\
        "responseTimeout": 240\
    }\
  }\
}