Maximize the utilization of your inference hardware by running multiple models in parallel on the same device.
Increase inference throughput by running multiple instances of your model on the same device.
Running large batch sizes? Boost inference speed by starting the next inference without waiting for the previous prediction to return.
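As a rough illustration of that idea, the sketch below overlaps requests with a plain Python thread pool; the model callable, the list of batches, and the worker count are placeholders rather than this library's API, and a real engine would pipeline the work on the device itself.

from concurrent.futures import ThreadPoolExecutor

def run_concurrent(model, batches, max_workers=4):
    # Submit every batch up front so the next inference is dispatched
    # without waiting for the previous prediction to come back.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(model, batch) for batch in batches]
        return [future.result() for future in futures]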
Use the best practices of every inference API, bundled in a single, always-up-to-date Python library.
Save time and hassle by getting a containerized inference engine that can be easily plugged into your production environment.
Easily switch between inference frameworks with zero code changes. Compatible with multiple frameworks and hardware types.
Discover your model’s bottlenecks by inspecting inference performance per layer.
Get a deep analysis of your models’ performance. Automatically measure runtime statistics on your target hardware, controlling parameters such as batch size, input dimensions, warmup calls, and repetitions.
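To make those statistics concrete, here is a minimal hand-rolled latency benchmark in PyTorch; it is only a sketch that assumes a CUDA device, and the parameters (batch size, input dimensions, warmup calls, repetitions) mirror the ones listed above.

import time
import torch

def benchmark(model, input_dims=(3, 224, 224), batch_size=8, warmup=10, repetitions=100):
    # Move the model to the GPU and build a dummy input of the requested shape.
    device = torch.device("cuda")
    model = model.eval().to(device)
    dummy = torch.randn(batch_size, *input_dims, device=device)

    with torch.no_grad():
        # Warmup calls let clocks and caches settle before timing starts.
        for _ in range(warmup):
            model(dummy)
        torch.cuda.synchronize()

        start = time.perf_counter()
        for _ in range(repetitions):
            model(dummy)
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start

    latency_ms = elapsed / repetitions * 1000
    throughput = batch_size * repetitions / elapsed
    return latency_ms, throughput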
from transformers import AutoFeatureExtractor, AutoModelForImageClassification

# Load the preprocessing pipeline and the pretrained ResNet-50 classifier.
extractor = AutoFeatureExtractor.from_pretrained("microsoft/resnet-50")
model = AutoModelForImageClassification.from_pretrained("microsoft/resnet-50")
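To complete the example, the snippet below runs a single prediction with the objects loaded above; the random image is a stand-in for real input data.

import numpy as np
import torch
from PIL import Image

# Placeholder input: a random 224x224 RGB image stands in for a real photo.
image = Image.fromarray(np.uint8(np.random.rand(224, 224, 3) * 255))

# Preprocess into the tensor layout ResNet-50 expects and run a forward pass.
inputs = extractor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Map the highest-scoring logit back to its ImageNet label.
predicted_class = logits.argmax(-1).item()
print(model.config.id2label[predicted_class])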