NT2I.ONNX: User Guide

Documentation of the architectures and configurations supported by the solution. Each architecture can be used along 4 orthogonal axes:

Axis	Values
Cardinality	`Single Image` (1 image) or `Image Batch` (N images)
Mode	`Solo` (standalone engine) or `Hub` (image shared between several models)
Provider	`CPU` or `GPU` (CUDA / TensorRT / DirectML)
Input format	`Interleaved BGR` (packed) or `Planar R/G/B`

The examples below draw from the 16 cells of this matrix for the 4 main architectures: YOLO Detection, RF-DETR Detection, RF-DETR Segmentation, SAM2.
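
As a quick check of the arithmetic, the four two-valued axes multiply out to 2 × 2 × 2 × 2 = 16 combinations. The sketch below only enumerates them; the string labels mirror the table and are illustrative, not library types.

```csharp
using System;
using System.Linq;

// Illustrative labels only: they mirror the axis table and are not NT2I.ONNX types.
string[] cardinalities = { "Single Image", "Image Batch" };
string[] modes         = { "Solo", "Hub" };
string[] providers     = { "CPU", "GPU" };
string[] formats       = { "BGR packed", "Planar R/G/B" };

// Cartesian product of the four axes.
var cells = (from c in cardinalities
             from m in modes
             from p in providers
             from f in formats
             select $"{c} / {m} / {p} / {f}").ToList();

foreach (var cell in cells)
    Console.WriteLine(cell);
Console.WriteLine($"{cells.Count} combinations");
```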

Each example is self-contained: copy-paste is enough, provided you have the image-loading helper presented in §1.3.

1. Common concepts

1.1 Architecture lifecycle

All architectures (except SAM2, which has 2 engines) follow the same pattern:

// 1. Construction (with or without immediate Initialize)
using var engine = new YoloDetection(options, preprocessor, postprocessor);

// 2. Warm-up (compiles the CUDA kernels / fixes the batch dimension for dynamic-batch models)
engine.WarmUp(batchSize: 1);

// 3. Set the input, using 1 of the 5 paths:
//    a) Single Image BGR packed
engine.SetInputImageBgr(bgrBytes, new ImageSize(w, h), inputIndex: 0);
//    b) Image Batch BGR packed
engine.SetInputBatchImageBgr(arrayOfBgrBytes, arrayOfSizes, inputIndex: 0);
//    c) Single Image Planar R/G/B
engine.SetInputImagePlanar(rArr, gArr, bArr, size, inputIndex: 0);
//    d) Image Batch Planar R/G/B
engine.SetInputBatchImagePlanar(rArrs, gArrs, bArrs, sizes, inputIndex: 0);
//    e) Hub (Single or Batch depending on the supplied context)
engine.BindFromContext(sharedContext);

// 4. Inference
await engine.RunInferenceAsync(clearInputAfterRun: false);

// 5. Read the outputs (typed per architecture)
var detections = engine.GetOutputDetection(0.5f);     // YOLO / RF-DETR
var segs       = engine.GetOutputSegmentation(0.5f);  // RF-DETR Seg
var mask       = await sam2.GetDetections(objectId);  // SAM2

// 6. Cleanup (Dispose or using block)

1.2 OnnxSessionOptions and OutputBindingTarget

using NT2I.ONNX.Engine;
using NT2I.ONNX.Abstractions.Configuration;
using NT2I.ONNX.Abstractions.Enumerations;

var options = new OnnxSessionOptions(
    config:          OnnxConfigEnum.GPU_CUDA_FP32,  // or CPU, GPU_TRT_FP16_ENGINE, etc.
    modelData:       File.ReadAllBytes("model.onnx"),
    modelName:       "MyEngine",
    calibrationData: null,
    gpuDeviceId:     0,
    enableProfiling: false,
    outputDevice:    OutputBindingTarget.Host);

// Alternative: point to a file instead of a byte[]
var optsFromFile = new OnnxFileSessionOptions(
    config:    OnnxConfigEnum.CPU,
    modelPath: "model.onnx",
    modelName: "MyEngine");

OutputBindingTarget:

Host (default): outputs are copied back to RAM; always valid.
ExecutionDevice: outputs stay in VRAM (GPU); only valid with a GPU provider. A zero-copy optimization that is useful when the output is consumed by another GPU engine (for example SAM2 encoder → SAM2 predictor).

OnnxConfigEnum values: CPU, DirectML, GPU_CUDA_FP32, GPU_TRT_FP32_ENGINE, GPU_TRT_FP32_TIMING, GPU_TRT_FP16_ENGINE, GPU_TRT_FP16_TIMING, GPU_TRT_INT8_ENGINE, GPU_TRT_INT8_TIMING.

1.3 Loading a BGR image from a file

The library expects byte[] buffers in interleaved BGR layout (B0,G0,R0, B1,G1,R1, …) of size width * height * 3. This is OpenCV's native layout and the convention used by all SetInputImageBgr* methods and by Hub contexts.

No imaging dependency is imposed: any library that produces this buffer works (OpenCvSharp, System.Drawing + LockBits, ImageSharp, etc.). Here is a portable helper based on SixLabors.ImageSharp (NuGet package SixLabors.ImageSharp):

using SixLabors.ImageSharp;
using SixLabors.ImageSharp.PixelFormats;
using SixLabors.ImageSharp.Processing;

// Loads a file (JPG/PNG/BMP) and returns the BGR packed buffer + dimensions.
public static byte[] LoadBgrPacked(string path, out int width, out int height)
{
    using var image = Image.Load<Bgr24>(path);   // ImageSharp handles JPG/PNG/BMP/etc.
    width  = image.Width;
    height = image.Height;
    var buffer = new byte[width * height * 3];
    image.CopyPixelDataTo(buffer);                // native BGR24 layout → BGR packed
    return buffer;
}

// Planar R/G/B variant (3 separate buffers of size w*h).
public static (byte[] R, byte[] G, byte[] B) LoadPlanarRgb(string path, out int width, out int height)
{
    using var image = Image.Load<Rgb24>(path);
    width  = image.Width;
    height = image.Height;
    int n = width * height;
    var r = new byte[n]; var g = new byte[n]; var b = new byte[n];
    image.ProcessPixelRows(accessor =>
    {
        for (int y = 0, k = 0; y < accessor.Height; y++)
        {
            var row = accessor.GetRowSpan(y);
            for (int x = 0; x < accessor.Width; x++, k++)
            {
                r[k] = row[x].R;
                g[k] = row[x].G;
                b[k] = row[x].B;
            }
        }
    });
    return (r, g, b);
}
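
When a BGR packed buffer is already in memory, the planar form can also be derived from it without reloading the file. The following is a minimal sketch of that index arithmetic; `SplitBgrPacked` is a hypothetical helper written for this guide, not part of NT2I.ONNX.

```csharp
using System;

// Hypothetical helper (not part of NT2I.ONNX): splits a BGR packed buffer into planar R/G/B.
static (byte[] R, byte[] G, byte[] B) SplitBgrPacked(byte[] bgr, int width, int height)
{
    int n = width * height;
    if (bgr.Length != n * 3)
        throw new ArgumentException("Expected width * height * 3 bytes.", nameof(bgr));

    var r = new byte[n]; var g = new byte[n]; var b = new byte[n];
    for (int k = 0; k < n; k++)
    {
        // Pixel k occupies bytes 3k (B), 3k+1 (G), 3k+2 (R).
        b[k] = bgr[3 * k];
        g[k] = bgr[3 * k + 1];
        r[k] = bgr[3 * k + 2];
    }
    return (r, g, b);
}

// 2x1 image: pixel 0 is pure blue, pixel 1 is pure red.
byte[] packed = { 255, 0, 0,   0, 0, 255 };
var (rPlane, gPlane, bPlane) = SplitBgrPacked(packed, width: 2, height: 1);
Console.WriteLine($"R=[{string.Join(",", rPlane)}] G=[{string.Join(",", gPlane)}] B=[{string.Join(",", bPlane)}]");
// prints: R=[0,255] G=[0,0] B=[255,0]
```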

Internal tests: the project uses NT2I_ImageConversion.WPF (a private, Windows-only library) through the TestMethodHelper.LoadBgr24ImageFromRessource(byte[]) helper. For third-party application code, the ImageSharp helper above is the recommended solution (cross-platform, .NET 8+).

1.4 The SharedImageCoordinator Hub

The Hub pre-processes a source image only once and distributes it to several architectures (each model receives the tensor normalized/resized according to its own Requirements).

Important: coord.CreateContext(...) automatically calls BindFromContext on every registered model, so there is no need to call it manually.

using NT2I.ONNX.Hub;

// CPU (default)
var coord = new SharedImageCoordinator();
coord.RegisterModel(yolo);     // YoloDetection implements IImageConsumingArchitecture
coord.RegisterModel(rfdetr);
coord.RegisterModel(sam2);

// Single Image: 1 source image
using var ctxSingle = coord.CreateContext(bgrBytes, width, height);
// BindFromContext is called automatically on all 3 models.

// Image Batch: N source images (sizes may be heterogeneous)
using var ctxBatch = coord.CreateContext(
    new[] { img0, img1 },
    new[] { 1920, 1280 },
    new[] { 1080, 720  });

1.5 Choosing CPU vs GPU with the Coordinator

The target device is determined by the injected ISharedImageContextFactory. The two cases are symmetric:

using NT2I.ONNX.Hub;
using NT2I.ONNX.Hub.Cuda;   // GPU path only: additional NuGet package (C++/CLI, depends on CUDA 12.x)

// CPU (default), already included in NT2I.ONNX.Hub
var coordCpu = new SharedImageCoordinator();
// equivalent to: new SharedImageCoordinator(new CpuSharedImageContextFactory())

// GPU CUDA: requires the additional NuGet package NT2I.ONNX.Hub.Cuda
var coordGpu = new SharedImageCoordinator(new CudaSharedImageContextFactory());

// Same public API; the context produced is CpuSharedImageContext or CudaSharedImageContext
// depending on the factory.
using var ctx = coordGpu.CreateContext(bgrBytes, width, height);
// ctx.Device == DataHandlingDeviceEnum.GPU

NuGet packages:

NT2I.ONNX.Hub: coordinator + CPU path. Always required.
NT2I.ONNX.Hub.Cuda: CUDA factory (CudaSharedImageContextFactory + CudaSharedImageContext). Requires an NVIDIA driver and the CUDA 12.x runtime.
NT2I.ONNX.DataHandling.Cuda: CUDA pre/post-processing kernels (referenced transitively by Hub.Cuda).

1.6 WarmUp via the Hub

Warms up all registered models in a single call, which helps stabilize the ONNX sessions (notably TensorRT compilation) before the first real frame.

SharedImageCoordinator.WarmUp iterates over the registered models and calls IImageConsumingArchitecture.WarmUp(batchSize, iterations) on each of them.

using NT2I.ONNX.Hub;
using NT2I.ONNX.Hub.Cuda;

var coord = new SharedImageCoordinator(new CudaSharedImageContextFactory());
coord.RegisterModel(yolo);
coord.RegisterModel(rfdetr);

// Synchronous: propagates to all registered models
coord.WarmUp(batchSize: 1, iterations: 1);

// Asynchronous: Task.WhenAll under the hood
await coord.WarmUpAsync(batchSize: 1);
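
To make the "Task.WhenAll under the hood" remark concrete, here is a minimal sketch of the fan-out pattern using stand-in delegates. The delegates, the `MakeWarmUp` name and the simulated delay are assumptions for illustration; this is not the coordinator's actual code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Stand-ins for registered models: each delegate plays the role of one model's warm-up.
var warmedUp = new List<string>();
var gate = new object();

Func<int, Task> MakeWarmUp(string name) => async batchSize =>
{
    await Task.Delay(10);                 // simulates kernel compilation / first run
    lock (gate) warmedUp.Add($"{name}(batch={batchSize})");
};

var registered = new[] { MakeWarmUp("yolo"), MakeWarmUp("rfdetr") };

// Fan out, then wait for every model before accepting the first real frame.
await Task.WhenAll(registered.Select(warmUp => warmUp(1)));

Console.WriteLine($"{warmedUp.Count} models warmed up");
```

Waiting on the combined task means the first real frame is only processed once every model has finished its warm-up, whichever one is slowest.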

1.7 MultiGrab (video pipeline): LoadFrame and RebindAll

The "MultiGrab" pattern refers to the successive re-acquisition of frames from an industrial camera or a video stream, where every frame must be processed by the same models. Instead of allocating a new context (and therefore full cudaMalloc calls) per frame, a single context is reused as long as the resolution is stable:

using NT2I.ONNX.Hub;
using NT2I.ONNX.Hub.Cuda;

var coord = new SharedImageCoordinator(new CudaSharedImageContextFactory());
coord.RegisterModel(yolo);
coord.RegisterModel(rfdetr);
coord.WarmUp(batchSize: 1);

// 1st frame: allocates the master buffers + preprocessed caches
using var ctx = coord.CreateContext(frame0, 1920, 1080);
// CreateContext already binds the models automatically.

await yolo.RunInferenceAsync(false);

// Subsequent frames: nothing is reallocated as long as the resolution is stable.
foreach (var frame in nextFrames)
{
    ctx.LoadFrame(frame, 1920, 1080);   // cudaMemcpy H->D into the existing master buffer
    coord.RebindAll(ctx);               // re-binds all models (cache invalidated)

    await yolo.RunInferenceAsync(false);
    var spans = yolo.GetOutputDetectionAsSpan(0.3f, outputIndex: 0);
    for (int slot = 0; slot < spans.BatchSize; slot++)
        Console.WriteLine($"Slot {slot} : {spans[slot].Count} détections");

    await rfdetr.RunInferenceAsync(false);
}

If the dimensions change, LoadFrame automatically reallocates the master buffer (free + new cudaMalloc on GPU, or a new managed allocation on CPU); the preprocessed tensor cache is always invalidated so that it stays consistent with the new pixels.

For batches, use ctx.LoadFrames(byte[][], int[] widths, int[] heights). The batch size is fixed when the context is created; only the dimensions of each slot may change.
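
The reuse rule described above (reallocate only when the required size changes, always invalidate the cache) can be sketched in plain C#. This is an illustration of the pattern with a managed array standing in for the master buffer; the local `LoadFrame` function and its variables are hypothetical and do not reflect the library's implementation.

```csharp
using System;

// Illustration of the reuse rule only; this is not the library's implementation.
byte[] master = Array.Empty<byte>();   // stands in for the master buffer (RAM or VRAM)
int allocations = 0;
bool cacheValid = false;

void LoadFrame(byte[] frame, int width, int height)
{
    int required = width * height * 3;  // BGR packed
    if (frame.Length != required)
        throw new ArgumentException("Frame size does not match width * height * 3.");

    if (master.Length != required)      // required size changed: reallocate
    {
        master = new byte[required];
        allocations++;
    }
    Buffer.BlockCopy(frame, 0, master, 0, required);  // same-size frames only pay for the copy
    cacheValid = false;                 // preprocessed tensors must be rebuilt for the new pixels
}

LoadFrame(new byte[4 * 2 * 3], 4, 2);   // first frame: allocates
LoadFrame(new byte[4 * 2 * 3], 4, 2);   // same resolution: reuses the buffer
LoadFrame(new byte[8 * 4 * 3], 8, 4);   // new resolution: reallocates
Console.WriteLine($"allocations = {allocations}, cacheValid = {cacheValid}");
// prints: allocations = 2, cacheValid = False
```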

Reference tests:

Tests/NT2I.ONNX.Test.Hub/Hub/MultiGrabTests.cs: CPU mechanical invariants (buffer reuse, cache invalidation, guard rails).
Tests/NT2I.ONNX.Test.Hub.Cuda/CudaMultiGrabTests.cs: GPU variant + Zidane→Horses→Zidane loop with semantic IoU validation.

2. YOLO Detection

Model: YOLOv7 / YOLOv12 in ONNX format with embedded NMS (output [N, 7] = [batchIdx, x1, y1, x2, y2, classId, confidence]).
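
To illustrate the [N, 7] row layout, the sketch below groups a flat output buffer by batch index and drops rows under a confidence threshold. `GroupByBatch` is a hypothetical helper working on made-up values; in practice GetOutputDetection / GetOutputDetectionAsSpan already perform this decoding.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: groups flat [N, 7] rows by batch index, keeping rows above the threshold.
static Dictionary<int, List<float[]>> GroupByBatch(float[] flat, float minConfidence)
{
    const int Stride = 7;   // [batchIdx, x1, y1, x2, y2, classId, confidence]
    if (flat.Length % Stride != 0)
        throw new ArgumentException("Length must be a multiple of 7.", nameof(flat));

    var result = new Dictionary<int, List<float[]>>();
    for (int i = 0; i < flat.Length; i += Stride)
    {
        if (flat[i + 6] < minConfidence) continue;      // confidence is the last column
        int batchIdx = (int)flat[i];
        if (!result.TryGetValue(batchIdx, out var rows))
            result[batchIdx] = rows = new List<float[]>();
        rows.Add(flat[i..(i + Stride)]);                // copy of the 7-value row
    }
    return result;
}

// Made-up values for illustration.
float[] output =
{
    0, 10, 20, 110, 220, 0,  0.90f,   // slot 0, class 0
    0, 15, 25,  60,  80, 27, 0.30f,   // slot 0, below threshold
    1, 50, 60, 300, 400, 17, 0.75f,   // slot 1, class 17
};
var bySlot = GroupByBatch(output, minConfidence: 0.5f);
foreach (var (slot, rows) in bySlot)
    Console.WriteLine($"Slot {slot}: {rows.Count} detection(s)");
```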

2.1 Solo Single Image CPU

using System.IO;
using NT2I.ONNX.Abstractions;
using NT2I.ONNX.Abstractions.Configuration;
using NT2I.ONNX.Architectures.YOLO;
using NT2I.ONNX.DataHandling.Cpu.Yolo.Detection;
using NT2I.ONNX.Engine;

// --- Load the image (see §1.3) ---
byte[] bgr = LoadBgrPacked("frame.png", out int width, out int height);

// --- ONNX configuration ---
var options = new OnnxSessionOptions(
    OnnxConfigEnum.CPU,
    File.ReadAllBytes("yolov7.onnx"),
    modelName: "Yolo_CPU");

// --- Initialisation + WarmUp ---
using var yolo = new YoloDetection(
    options,
    new YoloV7DetectionPreprocessor(),
    new YoloV7DetectionPostprocessor());

yolo.WarmUp(batchSize: 1);

// --- Inference ---
yolo.SetInputImageBgr(bgr, new ImageSize(width, height), inputIndex: 0);
await yolo.RunInferenceAsync(clearInputAfterRun: false);

// --- Read the results ---
var batches = yolo.GetOutputDetection(0.5f).ToList();
foreach (var box in batches[0])
    Console.WriteLine($"class={box.ClassId} conf={box.Confidence:F2} " +
                      $"({box.X},{box.Y},{box.Width},{box.Height})");

2.2 Solo Single Image GPU

using System.IO;
using NT2I.ONNX.Abstractions;
using NT2I.ONNX.Abstractions.Configuration;
using NT2I.ONNX.Architectures.YOLO;
using NT2I.ONNX.DataHandling.Cpu.Yolo.Detection;          // postprocessor CPU
using NT2I.ONNX.DataHandling.Gpu.Yolo.Detection;          // preprocessor CUDA
using NT2I.ONNX.Engine;

// --- Load the image as planar R/G/B (see §1.3) ---
var (r, g, b) = LoadPlanarRgb("frame.png", out int width, out int height);

// --- ONNX configuration (CUDA provider) ---
var options = new OnnxSessionOptions(
    OnnxConfigEnum.GPU_CUDA_FP32,
    File.ReadAllBytes("yolov7.onnx"),
    modelName: "Yolo_CUDA");

// --- Initialisation + WarmUp ---
// The CUDA preprocessor performs the H->D upload + letterbox + /255 directly on the GPU.
using var yolo = new YoloDetection(
    options,
    new NT2I.ONNX.DataHandling.Gpu.Yolo.Detection.YoloV7DetectionPreprocessor(),
    new NT2I.ONNX.DataHandling.Cpu.Yolo.Detection.YoloV7DetectionPostprocessor());

yolo.WarmUp(batchSize: 1);

// --- Inference ---
yolo.SetInputImagePlanar(r, g, b, new ImageSize(width, height), inputIndex: 0);
await yolo.RunInferenceAsync(clearInputAfterRun: false);

// --- Read the results ---
var detections = yolo.GetOutputDetection(0.5f).First().ToList();
Console.WriteLine($"{detections.Count} détections");
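
The GPU example above mentions that the preprocessor applies a letterbox followed by a division by 255. The geometry of a letterbox can be sketched as follows, assuming a centered padding and a 640×640 model input; both are assumptions, and the `Letterbox` and `ToSource` helpers are hypothetical rather than the library's kernels.

```csharp
using System;

// Hypothetical helper: computes the resize scale and padding of a centered letterbox.
static (float Scale, int NewW, int NewH, int PadX, int PadY) Letterbox(int srcW, int srcH, int dstW, int dstH)
{
    // Uniform scale that fits the source inside the destination without distortion.
    float scale = Math.Min((float)dstW / srcW, (float)dstH / srcH);
    int newW = (int)Math.Round(srcW * scale);
    int newH = (int)Math.Round(srcH * scale);
    return (scale, newW, newH, (dstW - newW) / 2, (dstH - newH) / 2);
}

// Maps a coordinate from model-input space back to source-image space.
static float ToSource(float v, int pad, float scale) => (v - pad) / scale;

var lb = Letterbox(1920, 1080, 640, 640);   // 640x640 is an assumed model input size
Console.WriteLine($"scale={lb.Scale:F4} resized={lb.NewW}x{lb.NewH} pad=({lb.PadX},{lb.PadY})");
Console.WriteLine($"y=140 in the model input maps to y={ToSource(140, lb.PadY, lb.Scale):F0} in the source");
```

The same scale and padding are what a postprocessor needs in order to map boxes back to source-image coordinates.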

2.3 Solo Image Batch CPU

using System.IO;
using NT2I.ONNX.Abstractions;
using NT2I.ONNX.Abstractions.Configuration;
using NT2I.ONNX.Architectures.YOLO;
using NT2I.ONNX.DataHandling.Cpu.Yolo.Detection;
using NT2I.ONNX.Engine;

// --- Load 2 images of different sizes (see §1.3) ---
byte[] img0 = LoadBgrPacked("zidane.jpg",  out int w0, out int h0);
byte[] img1 = LoadBgrPacked("horses.jpg",  out int w1, out int h1);

byte[][]    bgrs  = { img0, img1 };
ImageSize[] sizes = { new(w0, h0), new(w1, h1) };

// --- ONNX configuration ---
var options = new OnnxSessionOptions(
    OnnxConfigEnum.CPU,
    File.ReadAllBytes("yolov7.onnx"),
    modelName: "Yolo_CPU_B2");

// --- Initialisation + WarmUp sized for batch=2 ---
using var yolo = new YoloDetection(
    options,
    new YoloV7DetectionPreprocessor(),
    new YoloV7DetectionPostprocessor());

yolo.WarmUp(batchSize: 2);

// --- Inference ---
yolo.SetInputBatchImageBgr(bgrs, sizes, inputIndex: 0);
await yolo.RunInferenceAsync(clearInputAfterRun: false);

// --- Read the results per slot ---
var batches = yolo.GetOutputDetection(0.5f).ToList();
Console.WriteLine($"Slot 0 : {batches[0].Count()} détections");
Console.WriteLine($"Slot 1 : {batches[1].Count()} détections");

2.4 Solo Image Batch GPU

using System.IO;
using NT2I.ONNX.Abstractions;
using NT2I.ONNX.Abstractions.Configuration;
using NT2I.ONNX.Architectures.YOLO;
using NT2I.ONNX.DataHandling.Cpu.Yolo.Detection;          // postprocessor CPU
using NT2I.ONNX.DataHandling.Gpu.Yolo.Detection;          // preprocessor CUDA
using NT2I.ONNX.Engine;

// --- Load 2 images of different sizes (see §1.3) ---
byte[] img0 = LoadBgrPacked("zidane.jpg", out int w0, out int h0);
byte[] img1 = LoadBgrPacked("horses.jpg", out int w1, out int h1);

byte[][]    bgrs  = { img0, img1 };
ImageSize[] sizes = { new(w0, h0), new(w1, h1) };

// --- ONNX configuration (CUDA provider) ---
var options = new OnnxSessionOptions(
    OnnxConfigEnum.GPU_CUDA_FP32,
    File.ReadAllBytes("yolov7.onnx"),
    modelName: "Yolo_CUDA_B2");

// --- Initialisation + WarmUp ---
using var yolo = new YoloDetection(
    options,
    new NT2I.ONNX.DataHandling.Gpu.Yolo.Detection.YoloV7DetectionPreprocessor(),
    new NT2I.ONNX.DataHandling.Cpu.Yolo.Detection.YoloV7DetectionPostprocessor());

yolo.WarmUp(batchSize: 2);

// --- Inference ---
yolo.SetInputBatchImageBgr(bgrs, sizes, inputIndex: 0);
await yolo.RunInferenceAsync(clearInputAfterRun: false);

// --- Read the results as spans ---
// Span layout: [batchIdx, x1, y1, x2, y2, classId, confidence]
var spans = yolo.GetOutputDetectionAsSpan(0.5f, outputIndex: 0);
for (int slot = 0; slot < spans.BatchSize; slot++)
    Console.WriteLine($"Slot {slot} : {spans[slot].Count} détections");

2.5 Hub Single Image CPU

using System.IO;
using NT2I.ONNX.Abstractions.Configuration;
using NT2I.ONNX.Architectures.YOLO;
using NT2I.ONNX.DataHandling.Cpu.Yolo.Detection;
using NT2I.ONNX.Engine;
using NT2I.ONNX.Hub;

// --- Load the image (see §1.3) ---
byte[] bgr = LoadBgrPacked("frame.png", out int width, out int height);

// --- ONNX configuration ---
var options = new OnnxSessionOptions(
    OnnxConfigEnum.CPU,
    File.ReadAllBytes("yolov7.onnx"),
    modelName: "Yolo_CPU_Hub");

// --- Model initialisation ---
using var yolo = new YoloDetection(
    options,
    new YoloV7DetectionPreprocessor(),
    new YoloV7DetectionPostprocessor());

// --- Hub: register the model, then centralized warm-up ---
var coord = new SharedImageCoordinator();   // CPU by default
coord.RegisterModel(yolo);
coord.WarmUp(batchSize: 1);

// --- Inference through the shared context ---
using var ctx = coord.CreateContext(bgr, width, height);
// BindFromContext is called automatically by CreateContext.

await yolo.RunInferenceAsync(clearInputAfterRun: false);

// --- Read the results ---
var detections = yolo.GetOutputDetection(0.5f).First().ToList();
Console.WriteLine($"{detections.Count} détections");

2.6 Hub Single Image GPU

using System.IO;
using NT2I.ONNX.Abstractions.Configuration;
using NT2I.ONNX.Architectures.YOLO;
using NT2I.ONNX.DataHandling.Cpu.Yolo.Detection;          // postprocessor CPU
using NT2I.ONNX.DataHandling.Gpu.Yolo.Detection;          // preprocessor CUDA
using NT2I.ONNX.Engine;
using NT2I.ONNX.Hub;
using NT2I.ONNX.Hub.Cuda;                                  // factory GPU

// --- Load the image (see §1.3) ---
byte[] bgr = LoadBgrPacked("frame.png", out int width, out int height);

// --- ONNX configuration (CUDA provider) ---
var options = new OnnxSessionOptions(
    OnnxConfigEnum.GPU_CUDA_FP32,
    File.ReadAllBytes("yolov7.onnx"),
    modelName: "Yolo_CUDA_Hub");

// --- Model initialisation ---
using var yolo = new YoloDetection(
    options,
    new NT2I.ONNX.DataHandling.Gpu.Yolo.Detection.YoloV7DetectionPreprocessor(),
    new NT2I.ONNX.DataHandling.Cpu.Yolo.Detection.YoloV7DetectionPostprocessor());

// --- GPU Hub: CUDA factory + centralized WarmUp ---
var coord = new SharedImageCoordinator(new CudaSharedImageContextFactory());
coord.RegisterModel(yolo);
coord.WarmUp(batchSize: 1);

// --- Inference through the shared VRAM context ---
using var ctx = coord.CreateContext(bgr, width, height);
await yolo.RunInferenceAsync(clearInputAfterRun: false);

// --- Read the results ---
var detections = yolo.GetOutputDetection(0.5f).First().ToList();

2.7 Hub Image Batch CPU

using System.IO;
using NT2I.ONNX.Abstractions;
using NT2I.ONNX.Abstractions.Configuration;
using NT2I.ONNX.Architectures.YOLO;
using NT2I.ONNX.DataHandling.Cpu.Yolo.Detection;
using NT2I.ONNX.Engine;
using NT2I.ONNX.Hub;

// --- Load 2 images (see §1.3) ---
byte[] img0 = LoadBgrPacked("zidane.jpg", out int w0, out int h0);
byte[] img1 = LoadBgrPacked("horses.jpg", out int w1, out int h1);

// --- ONNX configuration ---
var options = new OnnxSessionOptions(
    OnnxConfigEnum.CPU,
    File.ReadAllBytes("yolov7.onnx"),
    modelName: "Yolo_CPU_Hub_B2");

// --- Initialisation ---
using var yolo = new YoloDetection(
    options,
    new YoloV7DetectionPreprocessor(),
    new YoloV7DetectionPostprocessor());

// --- Hub + WarmUp sized for batch=2 ---
var coord = new SharedImageCoordinator();
coord.RegisterModel(yolo);
coord.WarmUp(batchSize: 2);

// --- Batch context (heterogeneous sizes are fine) ---
using var ctx = coord.CreateContext(
    new[] { img0, img1 },
    new[] { w0,   w1   },
    new[] { h0,   h1   });

await yolo.RunInferenceAsync(clearInputAfterRun: false);

// --- Read per slot ---
var batches = yolo.GetOutputDetection(0.5f).ToList();
Console.WriteLine($"Slot 0 : {batches[0].Count()} détections");
Console.WriteLine($"Slot 1 : {batches[1].Count()} détections");

2.8 Hub Image Batch GPU

using System.IO;
using NT2I.ONNX.Abstractions;
using NT2I.ONNX.Abstractions.Configuration;
using NT2I.ONNX.Architectures.YOLO;
using NT2I.ONNX.DataHandling.Cpu.Yolo.Detection;          // postprocessor CPU
using NT2I.ONNX.DataHandling.Gpu.Yolo.Detection;          // preprocessor CUDA
using NT2I.ONNX.Engine;
using NT2I.ONNX.Hub;
using NT2I.ONNX.Hub.Cuda;

// --- Load 2 images (see §1.3) ---
byte[] img0 = LoadBgrPacked("zidane.jpg", out int w0, out int h0);
byte[] img1 = LoadBgrPacked("horses.jpg", out int w1, out int h1);

// --- ONNX configuration (CUDA provider) ---
var options = new OnnxSessionOptions(
    OnnxConfigEnum.GPU_CUDA_FP32,
    File.ReadAllBytes("yolov7.onnx"),
    modelName: "Yolo_CUDA_Hub_B2");

// --- Initialisation ---
using var yolo = new YoloDetection(
    options,
    new NT2I.ONNX.DataHandling.Gpu.Yolo.Detection.YoloV7DetectionPreprocessor(),
    new NT2I.ONNX.DataHandling.Cpu.Yolo.Detection.YoloV7DetectionPostprocessor());

// --- GPU Hub + WarmUp sized for batch=2 ---
var coord = new SharedImageCoordinator(new CudaSharedImageContextFactory());
coord.RegisterModel(yolo);
coord.WarmUp(batchSize: 2);

// --- Batch context in VRAM ---
using var ctx = coord.CreateContext(
    new[] { img0, img1 },
    new[] { w0,   w1   },
    new[] { h0,   h1   });

await yolo.RunInferenceAsync(clearInputAfterRun: false);

// --- Read as spans ---
var spans = yolo.GetOutputDetectionAsSpan(0.5f, outputIndex: 0);
for (int slot = 0; slot < spans.BatchSize; slot++)
    Console.WriteLine($"Slot {slot} : {spans[slot].Count} détections");

3. RF-DETR Detection

Model: RF-DETR (Roboflow), 560×560 or 640×640 input, ImageNet normalization, no NMS (the transformer assigns 1 object per query). Outputs: boxes [B, N, 4] and logits [B, N, C].
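
For readers unfamiliar with ImageNet normalization, here is a minimal sketch: each channel is scaled to [0, 1], then the standard ImageNet per-channel mean is subtracted and the result divided by the standard deviation. The `NormalizeImageNet` helper is hypothetical, it skips the resize to the model input size, and the planar RGB (CHW) output order is an assumption.

```csharp
using System;

// Hypothetical helper: BGR packed bytes -> planar RGB floats (CHW) with ImageNet normalization.
static float[] NormalizeImageNet(byte[] bgr, int width, int height)
{
    float[] mean = { 0.485f, 0.456f, 0.406f };   // R, G, B
    float[] std  = { 0.229f, 0.224f, 0.225f };   // R, G, B
    int n = width * height;
    var chw = new float[3 * n];
    for (int k = 0; k < n; k++)
    {
        for (int c = 0; c < 3; c++)
        {
            // Channel c of RGB sits at offset (2 - c) inside a BGR pixel.
            float v = bgr[3 * k + (2 - c)] / 255f;
            chw[c * n + k] = (v - mean[c]) / std[c];
        }
    }
    return chw;
}

byte[] onePixel = { 0, 128, 255 };               // B=0, G=128, R=255
float[] t = NormalizeImageNet(onePixel, 1, 1);
Console.WriteLine($"R={t[0]:F3} G={t[1]:F3} B={t[2]:F3}");
```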

GetOutputDetectionAsSpan returns IBatchDetections<float>, where each detection is a ReadOnlySpan<float> of 6 values: [x, y, w, h, confidence, classId].
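
Because each query yields at most one object, a confidence and a class id can be derived per query without NMS. The sketch below assumes a sigmoid activation on the logits and a simple best-class-per-query rule, which is common in DETR-family models; the library's postprocessor may differ (for example, a top-k selection over all query/class pairs). `ScoreQueries` is a hypothetical helper.

```csharp
using System;

static float Sigmoid(float x) => 1f / (1f + MathF.Exp(-x));

// Hypothetical helper: for one image, picks the best class of each query from logits [N, C].
static (int ClassId, float Confidence)[] ScoreQueries(float[] logits, int numQueries, int numClasses)
{
    var result = new (int, float)[numQueries];
    for (int q = 0; q < numQueries; q++)
    {
        int best = 0;
        for (int c = 1; c < numClasses; c++)
            if (logits[q * numClasses + c] > logits[q * numClasses + best]) best = c;
        result[q] = (best, Sigmoid(logits[q * numClasses + best]));
    }
    return result;
}

// Made-up logits: 2 queries, 3 classes.
float[] logits = { -4f,  2f, -1f,    // query 0: class 1 wins
                   -3f, -5f, -2f };  // query 1: no confident class
var scored = ScoreQueries(logits, numQueries: 2, numClasses: 3);
for (int q = 0; q < scored.Length; q++)
    Console.WriteLine($"query {q}: class={scored[q].ClassId} keep={scored[q].Confidence >= 0.4f}");
```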

3.1 Solo Single Image CPU

using System.IO;
using NT2I.ONNX.Abstractions;
using NT2I.ONNX.Abstractions.Configuration;
using NT2I.ONNX.Architectures.RFDetr;
using NT2I.ONNX.DataHandling.Cpu.RFDetr.Detection;
using NT2I.ONNX.Engine;

// --- Load the image (see §1.3) ---
byte[] bgr = LoadBgrPacked("frame.png", out int width, out int height);

// --- ONNX configuration ---
var options = new OnnxSessionOptions(
    OnnxConfigEnum.CPU,
    File.ReadAllBytes("rfdetr_base.onnx"),
    modelName: "RFDetr_CPU");

// --- Initialisation + WarmUp ---
using var rfdetr = new RFDetrDetection(
    options,
    new RFDetrDetectionPreprocessor(),
    new RFDetrDetectionPostprocessor());

rfdetr.WarmUp(batchSize: 1);

// --- Inference ---
rfdetr.SetInputImageBgr(bgr, new ImageSize(width, height), inputIndex: 0);
await rfdetr.RunInferenceAsync(clearInputAfterRun: false);

// --- Read as spans ---
// Layout: [x, y, w, h, confidence, classId]
var spans = rfdetr.GetOutputDetectionAsSpan(0.4f, boxesOutputIndex: 0, logitsOutputIndex: 1);
var slot0 = spans[0];
for (int i = 0; i < slot0.Count; i++)
{
    var det = slot0[i];
    Console.WriteLine($"class={(int)det[5]} conf={det[4]:F2} " +
                      $"box=({det[0]:F0},{det[1]:F0} {det[2]:F0}x{det[3]:F0})");
}

3.2 Solo Image Batch GPU

using System.IO;
using NT2I.ONNX.Abstractions;
using NT2I.ONNX.Abstractions.Configuration;
using NT2I.ONNX.Architectures.RFDetr;
using NT2I.ONNX.DataHandling.Cpu.RFDetr.Detection;        // postprocessor CPU
using NT2I.ONNX.DataHandling.Gpu.RFDetr;                  // preprocessor CUDA
using NT2I.ONNX.Engine;

// --- Load 2 images (see §1.3) ---
byte[] img0 = LoadBgrPacked("zidane.jpg", out int w0, out int h0);
byte[] img1 = LoadBgrPacked("horses.jpg", out int w1, out int h1);

byte[][]    bgrs  = { img0, img1 };
ImageSize[] sizes = { new(w0, h0), new(w1, h1) };

// --- ONNX configuration (CUDA provider) ---
var options = new OnnxSessionOptions(
    OnnxConfigEnum.GPU_CUDA_FP32,
    File.ReadAllBytes("rfdetr_base.onnx"),
    modelName: "RFDetr_GPU_B2");

// --- Initialisation + WarmUp ---
using var rfdetr = new RFDetrDetection(
    options,
    new NT2I.ONNX.DataHandling.Gpu.RFDetr.RFDetrDetectionPreprocessor(),
    new NT2I.ONNX.DataHandling.Cpu.RFDetr.Detection.RFDetrDetectionPostprocessor());

rfdetr.WarmUp(batchSize: 2);

// --- Batch inference ---
rfdetr.SetInputBatchImageBgr(bgrs, sizes, inputIndex: 0);
await rfdetr.RunInferenceAsync(clearInputAfterRun: false);

// --- Read per slot ---
var spans = rfdetr.GetOutputDetectionAsSpan(0.4f);
Console.WriteLine($"Batch={spans.BatchSize} (slot0={spans[0].Count}, slot1={spans[1].Count})");

3.3 Hub Single Image CPU

using System.IO;
using NT2I.ONNX.Abstractions.Configuration;
using NT2I.ONNX.Architectures.RFDetr;
using NT2I.ONNX.DataHandling.Cpu.RFDetr.Detection;
using NT2I.ONNX.Engine;
using NT2I.ONNX.Hub;

// --- Load the image (see §1.3) ---
byte[] bgr = LoadBgrPacked("frame.png", out int width, out int height);

// --- ONNX configuration ---
var options = new OnnxSessionOptions(
    OnnxConfigEnum.CPU,
    File.ReadAllBytes("rfdetr_base.onnx"),
    modelName: "RFDetr_CPU_Hub");

// --- Initialisation ---
using var rfdetr = new RFDetrDetection(
    options,
    new RFDetrDetectionPreprocessor(),
    new RFDetrDetectionPostprocessor());

// --- Hub + centralized WarmUp ---
var coord = new SharedImageCoordinator();
coord.RegisterModel(rfdetr);
coord.WarmUp(batchSize: 1);

// --- Inference through the shared context ---
using var ctx = coord.CreateContext(bgr, width, height);
await rfdetr.RunInferenceAsync(clearInputAfterRun: false);

// --- Read ---
var spans = rfdetr.GetOutputDetectionAsSpan(0.4f);
var slot0 = spans[0];
Console.WriteLine($"{slot0.Count} détections");

3.4 Hub Image Batch GPU

using System.IO;
using NT2I.ONNX.Abstractions.Configuration;
using NT2I.ONNX.Architectures.RFDetr;
using NT2I.ONNX.DataHandling.Cpu.RFDetr.Detection;        // postprocessor CPU
using NT2I.ONNX.DataHandling.Gpu.RFDetr;                  // preprocessor CUDA
using NT2I.ONNX.Engine;
using NT2I.ONNX.Hub;
using NT2I.ONNX.Hub.Cuda;

// --- Load 2 images (see §1.3) ---
byte[] zidane = LoadBgrPacked("zidane.jpg", out int wZ, out int hZ);
byte[] horses = LoadBgrPacked("horses.jpg", out int wH, out int hH);

// --- ONNX configuration (CUDA provider) ---
var options = new OnnxSessionOptions(
    OnnxConfigEnum.GPU_CUDA_FP32,
    File.ReadAllBytes("rfdetr_base.onnx"),
    modelName: "RFDetr_GPU_Hub_B2");

// --- Initialisation ---
using var rfdetr = new RFDetrDetection(
    options,
    new NT2I.ONNX.DataHandling.Gpu.RFDetr.RFDetrDetectionPreprocessor(),
    new NT2I.ONNX.DataHandling.Cpu.RFDetr.Detection.RFDetrDetectionPostprocessor());

// --- GPU Hub + WarmUp batch=2 ---
var coord = new SharedImageCoordinator(new CudaSharedImageContextFactory());
coord.RegisterModel(rfdetr);
coord.WarmUp(batchSize: 2);

// --- VRAM batch context ---
using var ctx = coord.CreateContext(
    new[] { zidane, horses },
    new[] { wZ,     wH     },
    new[] { hZ,     hH     });

await rfdetr.RunInferenceAsync(clearInputAfterRun: false);

// --- Read per slot ---
var spans = rfdetr.GetOutputDetectionAsSpan(0.4f);
var slot0Persons = 0;
for (int i = 0; i < spans[0].Count; i++)
    if ((int)spans[0][i][5] == 1) slot0Persons++;   // classId 1 = person (COCO-91)

Console.WriteLine($"Zidane : {slot0Persons} personnes détectées");

4. RF-DETR Segmentation

Inherits from RFDetrDetection and adds a 3rd output, masks [B, N, H', W'], upsampled to the original image size. Returns IInstanceSegmentation instances = box + float[] mask with values in [0, 1] (threshold at 0.5 to obtain a binary mask).
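
The thresholding step mentioned above can be sketched as follows. `Binarize` is a hypothetical helper operating on a made-up probability mask; it produces the same 0/255 convention as the comment in the first example below.

```csharp
using System;

// Hypothetical helper: turns a probability mask in [0, 1] into a 0/255 byte mask.
static byte[] Binarize(float[] mask, float threshold = 0.5f)
{
    var binary = new byte[mask.Length];
    for (int i = 0; i < mask.Length; i++)
        binary[i] = mask[i] >= threshold ? (byte)255 : (byte)0;
    return binary;
}

float[] probabilities = { 0.02f, 0.49f, 0.50f, 0.97f };
byte[] binaryMask = Binarize(probabilities);
Console.WriteLine(string.Join(",", binaryMask));   // prints: 0,0,255,255
```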

4.1 Solo Single Image CPU

using System.IO;
using NT2I.ONNX.Abstractions;
using NT2I.ONNX.Abstractions.Configuration;
using NT2I.ONNX.Architectures.RFDetr;
using NT2I.ONNX.DataHandling.Cpu.RFDetr.Detection;
using NT2I.ONNX.DataHandling.Cpu.RFDetr.Segmentation;
using NT2I.ONNX.Engine;

// --- Load the image (see §1.3) ---
byte[] bgr = LoadBgrPacked("frame.png", out int width, out int height);

// --- ONNX configuration ---
var options = new OnnxSessionOptions(
    OnnxConfigEnum.CPU,
    File.ReadAllBytes("rfdetr_base_seg.onnx"),
    modelName: "RFDetrSeg_CPU");

// --- Initialisation + WarmUp (4 arguments: detection + segmentation postprocessors) ---
using var seg = new RFDetrSegmentation(
    options,
    new RFDetrDetectionPreprocessor(),
    new RFDetrDetectionPostprocessor(),
    new RFDetrSegmentationPostprocessor());

seg.WarmUp(batchSize: 1);

// --- Inference ---
seg.SetInputImageBgr(bgr, new ImageSize(width, height), inputIndex: 0);
await seg.RunInferenceAsync(clearInputAfterRun: false);

// --- Read: IEnumerable<IEnumerable<IInstanceSegmentation>> ---
var instances = seg.GetOutputSegmentation(0.5f).First().ToList();
foreach (var inst in instances)
{
    Console.WriteLine($"class={inst.ClassId} bbox=({inst.X:F0},{inst.Y:F0} " +
                      $"{inst.Width:F0}x{inst.Height:F0}) maskLen={inst.Mask.Length}");
    // inst.Mask: float[width*height] post-sigmoid, values in [0,1].
    // To binarize: (inst.Mask[i] >= 0.5f ? 255 : 0)
}

4.2 Solo Image Batch GPU

using System.IO;
using NT2I.ONNX.Abstractions;
using NT2I.ONNX.Abstractions.Configuration;
using NT2I.ONNX.Architectures.RFDetr;
using NT2I.ONNX.DataHandling.Cpu.RFDetr.Detection;        // postproc CPU
using NT2I.ONNX.DataHandling.Cpu.RFDetr.Segmentation;     // postproc segmentation CPU
using NT2I.ONNX.DataHandling.Gpu.RFDetr;                  // preproc CUDA
using NT2I.ONNX.Engine;

// --- Load 2 images (see §1.3) ---
byte[] img0 = LoadBgrPacked("zidane.jpg", out int w0, out int h0);
byte[] img1 = LoadBgrPacked("horses.jpg", out int w1, out int h1);

byte[][]    bgrs  = { img0, img1 };
ImageSize[] sizes = { new(w0, h0), new(w1, h1) };

// --- ONNX configuration (CUDA provider) ---
var options = new OnnxSessionOptions(
    OnnxConfigEnum.GPU_CUDA_FP32,
    File.ReadAllBytes("rfdetr_base_seg.onnx"),
    modelName: "RFDetrSeg_GPU_B2");

// --- Initialisation + WarmUp batch=2 ---
using var seg = new RFDetrSegmentation(
    options,
    new NT2I.ONNX.DataHandling.Gpu.RFDetr.RFDetrDetectionPreprocessor(),
    new NT2I.ONNX.DataHandling.Cpu.RFDetr.Detection.RFDetrDetectionPostprocessor(),
    new NT2I.ONNX.DataHandling.Cpu.RFDetr.Segmentation.RFDetrSegmentationPostprocessor());

seg.WarmUp(batchSize: 2);

// --- Batch inference ---
seg.SetInputBatchImageBgr(bgrs, sizes, inputIndex: 0);
await seg.RunInferenceAsync(clearInputAfterRun: false);

// --- Read the masks per slot ---
var batches = seg.GetOutputSegmentation(0.4f).ToList();
for (int slot = 0; slot < batches.Count; slot++)
{
    int n = batches[slot].Count();
    Console.WriteLine($"Slot {slot} : {n} instances segmentées");
}

4.3 Hub Single Image CPU

using System.IO;
using NT2I.ONNX.Abstractions.Configuration;
using NT2I.ONNX.Architectures.RFDetr;
using NT2I.ONNX.DataHandling.Cpu.RFDetr.Detection;
using NT2I.ONNX.DataHandling.Cpu.RFDetr.Segmentation;
using NT2I.ONNX.Engine;
using NT2I.ONNX.Hub;

// --- Load the image (see §1.3) ---
byte[] bgr = LoadBgrPacked("frame.png", out int width, out int height);

// --- ONNX configuration ---
var options = new OnnxSessionOptions(
    OnnxConfigEnum.CPU,
    File.ReadAllBytes("rfdetr_base_seg.onnx"),
    modelName: "RFDetrSeg_CPU_Hub");

// --- Initialisation ---
using var seg = new RFDetrSegmentation(
    options,
    new RFDetrDetectionPreprocessor(),
    new RFDetrDetectionPostprocessor(),
    new RFDetrSegmentationPostprocessor());

// --- Hub + centralized WarmUp ---
var coord = new SharedImageCoordinator();
coord.RegisterModel(seg);
coord.WarmUp(batchSize: 1);

// --- Inference through the shared context ---
using var ctx = coord.CreateContext(bgr, width, height);
await seg.RunInferenceAsync(clearInputAfterRun: false);

// --- Read ---
var instances = seg.GetOutputSegmentation(0.4f).First().ToList();
foreach (var inst in instances)
    Console.WriteLine($"class={inst.ClassId} bbox={inst.Width:F0}x{inst.Height:F0}");

4.4 Hub Image Batch GPU

using System.IO;
using NT2I.ONNX.Abstractions.Configuration;
using NT2I.ONNX.Architectures.RFDetr;
using NT2I.ONNX.DataHandling.Cpu.RFDetr.Detection;        // postproc CPU
using NT2I.ONNX.DataHandling.Cpu.RFDetr.Segmentation;     // postproc segmentation CPU
using NT2I.ONNX.DataHandling.Gpu.RFDetr;                  // preproc CUDA
using NT2I.ONNX.Engine;
using NT2I.ONNX.Hub;
using NT2I.ONNX.Hub.Cuda;

// --- Load 2 images (see §1.3) ---
byte[] zidane = LoadBgrPacked("zidane.jpg", out int wZ, out int hZ);
byte[] horses = LoadBgrPacked("horses.jpg", out int wH, out int hH);

// --- Configuration ONNX (provider CUDA) ---
var options = new OnnxSessionOptions(
    OnnxConfigEnum.GPU_CUDA_FP32,
    File.ReadAllBytes("rfdetr_base_seg.onnx"),
    modelName: "RFDetrSeg_GPU_Hub_B2");

// --- Initialisation ---
using var seg = new RFDetrSegmentation(
    options,
    new NT2I.ONNX.DataHandling.Gpu.RFDetr.RFDetrDetectionPreprocessor(),
    new NT2I.ONNX.DataHandling.Cpu.RFDetr.Detection.RFDetrDetectionPostprocessor(),
    new NT2I.ONNX.DataHandling.Cpu.RFDetr.Segmentation.RFDetrSegmentationPostprocessor());

// --- Hub GPU + WarmUp batch=2 ---
var coord = new SharedImageCoordinator(new CudaSharedImageContextFactory());
coord.RegisterModel(seg);
coord.WarmUp(batchSize: 2);

// --- Contexte batch VRAM ---
using var ctx = coord.CreateContext(
    new[] { zidane, horses },
    new[] { wZ,     wH     },
    new[] { hZ,     hH     });

await seg.RunInferenceAsync(clearInputAfterRun: false);

// --- Lecture des masques ---
var batches = seg.GetOutputSegmentation(0.35f).ToList();
foreach (var inst in batches[0])    // instances trouvées sur Zidane
{
    // inst.Mask : float[wZ*hZ] post-sigmoid
    // pixels >= 0.5 = avant-plan
    int fg = 0;
    for (int i = 0; i < inst.Mask.Length; i++)
        if (inst.Mask[i] >= 0.5f) fg++;
    Console.WriteLine($"class={inst.ClassId} pixels avant-plan={fg}");
}
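
The post-sigmoid float mask can also be turned into a 0/255 byte mask by applying the same 0.5 threshold to each value. The following is only a minimal sketch: `Binarize` is a hypothetical helper, not part of NT2I.ONNX, and the four-value array stands in for `inst.Mask`.

```csharp
using System;

// Hypothetical helper (not an NT2I.ONNX API): thresholds a post-sigmoid
// mask into a byte mask where 255 = foreground and 0 = background.
static byte[] Binarize(float[] mask, float threshold = 0.5f)
{
    var result = new byte[mask.Length];
    for (int i = 0; i < mask.Length; i++)
        result[i] = mask[i] >= threshold ? (byte)255 : (byte)0;
    return result;
}

// Synthetic values standing in for inst.Mask.
float[] probabilities = { 0.1f, 0.5f, 0.9f, 0.49f };
byte[] binary = Binarize(probabilities);
Console.WriteLine(string.Join(",", binary));   // 0,255,255,0
```

Keeping the float mask until the last moment leaves room to pick a different threshold per use case; binarizing early is cheaper to store.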

5. SAM2 (Segment Anything)

SAM2 is the only architecture that uses two ONNX engines:

Encoder: image → 3 embedding tensors (Image_embeddings, HighResFeatures1/2).
Predictor: embeddings + prompts (points / boxes / previous masks) → segmentation masks.

The encoder supports only batch=1 on the ONNX side. The predictor accepts several annotations per image (several points or rectangles per object).

The masks returned by GetDetections / GetAllDetections are binarized byte[] arrays (0 or 255) of size width * height, matching the source image.
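
Because the mask is a flat byte[] of width * height values, per-pixel work is a matter of index arithmetic. The sketch below assumes a row-major layout (index = y * width + x), which this guide does not state explicitly, and uses a hypothetical `BoundingBox` helper on a synthetic mask to recover the tight box around the foreground.

```csharp
using System;

// Hypothetical helper (not an NT2I.ONNX API): tight bounding box of the
// non-zero pixels, assuming a row-major mask (index = y * width + x).
static (int X, int Y, int Width, int Height)? BoundingBox(byte[] mask, int width, int height)
{
    int minX = width, minY = height, maxX = -1, maxY = -1;
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
        {
            if (mask[y * width + x] == 0) continue;   // background pixel
            if (x < minX) minX = x;
            if (x > maxX) maxX = x;
            if (y < minY) minY = y;
            if (y > maxY) maxY = y;
        }
    if (maxX < 0) return null;                        // empty mask
    return (minX, minY, maxX - minX + 1, maxY - minY + 1);
}

// Synthetic 4x3 mask standing in for a SAM2 result.
byte[] sample =
{
    0,   0,   0, 0,
    0, 255, 255, 0,
    0, 255,   0, 0,
};
var box = BoundingBox(sample, 4, 3);
if (box is { } b)
    Console.WriteLine($"x={b.X} y={b.Y} w={b.Width} h={b.Height}");   // x=1 y=1 w=2 h=2
```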

5.1 Solo Single Image CPU

using System.Collections.Generic;
using System.IO;
using NT2I.ONNX.Abstractions;
using NT2I.ONNX.Abstractions.Configuration;
using NT2I.ONNX.Abstractions.Enumerations;
using NT2I.ONNX.Architectures.SAM;
using NT2I.ONNX.DataHandling.Cpu.SAM;
using NT2I.ONNX.Engine;

// --- Load the image (see §1.3) ---
byte[] bgr = LoadBgrPacked("frame.png", out int width, out int height);

// --- ONNX configuration (2 sessions: encoder + predictor) ---
var optsEnc = new OnnxSessionOptions(
    OnnxConfigEnum.CPU,
    File.ReadAllBytes("sam2_encoder.onnx"),
    modelName: "SAM2_Enc_CPU");

var optsPrd = new OnnxSessionOptions(
    OnnxConfigEnum.CPU,
    File.ReadAllBytes("sam2_predictor.onnx"),
    modelName: "SAM2_Prd_CPU");

// --- Initialization + WarmUp (encoder + predictor) ---
using var sam = new SAM2Image(
    optsEnc, optsPrd,
    new SAM2SegmentationPreprocessor(),
    new SAM2SegmentationPostprocessor());

await sam.WarmUpAsync(batchSize: 1);
await sam.PredictWarmUpAsync(batchSize: 1);

// --- 1) Encode the image (the embedding is computed once and reused) ---
ISam2Embedding embedding = await sam.EncodeImageBgr(bgr, new ImageSize(width, height));

// --- 2) Add annotations for one or more objects ---
sam.AddAnnotation(objectId: 0, x: width * 0.5f, y: height * 0.5f,
                  label: AnnotationLabelEnum.Positive);          // foreground point
sam.AddAnnotation(objectId: 0, x: 10f, y: 10f,
                  label: AnnotationLabelEnum.Negative);          // background point

sam.AddAnnotation(objectId: 1, x: 200f, y: 300f,
                  width: 400f, height: 250f);                    // bounding rectangle

// --- 3) Predict the masks ---
byte[] mask0 = await sam.GetDetections(objectId: 0);    // byte[width*height], 0 or 255
byte[] mask1 = await sam.GetDetections(objectId: 1);

// Or all at once:
Dictionary<int, byte[]> allMasks = await sam.GetAllDetections();

embedding.Dispose();

5.2 Solo Single Image GPU

using System.IO;
using NT2I.ONNX.Abstractions;
using NT2I.ONNX.Abstractions.Configuration;
using NT2I.ONNX.Abstractions.Enumerations;
using NT2I.ONNX.Architectures.SAM;
using NT2I.ONNX.DataHandling.Cpu.SAM;                      // CPU postprocessor
using NT2I.ONNX.DataHandling.Gpu.SAM;                      // CUDA preprocessor
using NT2I.ONNX.Engine;

// --- Load the image (see §1.3) ---
byte[] bgr = LoadBgrPacked("frame.png", out int width, out int height);

// --- ONNX configuration (2 GPU CUDA sessions) ---
// Note: the SAM2 predictor does NOT support TensorRT (use CUDA for the predictor).
var optsEnc = new OnnxSessionOptions(
    OnnxConfigEnum.GPU_CUDA_FP32,
    File.ReadAllBytes("sam2_encoder.onnx"),
    modelName: "SAM2_Enc_GPU");

var optsPrd = new OnnxSessionOptions(
    OnnxConfigEnum.GPU_CUDA_FP32,
    File.ReadAllBytes("sam2_predictor.onnx"),
    modelName: "SAM2_Prd_GPU");

// --- Initialization + WarmUp ---
using var sam = new SAM2Image(
    optsEnc, optsPrd,
    new NT2I.ONNX.DataHandling.Gpu.SAM.SAM2SegmentationPreprocessor(),
    new NT2I.ONNX.DataHandling.Cpu.SAM.SAM2SegmentationPostprocessor());

await sam.WarmUpAsync(batchSize: 1);
await sam.PredictWarmUpAsync(batchSize: 1);

// --- Encode + segment ---
ISam2Embedding embedding = await sam.EncodeImageBgr(bgr, new ImageSize(width, height));

sam.AddAnnotation(objectId: 0, x: 500f, y: 300f, width: 200f, height: 400f);
byte[] mask = await sam.GetDetections(objectId: 0);

embedding.Dispose();

Zero-copy GPU optimization: in GPU mode, internal validation configures the encoder with OutputBindingTarget.ExecutionDevice by default, so the embeddings stay in VRAM between encode and predict, which avoids a device-to-host transfer of about 16 MB per image. Drawback: ISam2Embedding.Save() then throws EmbeddingNotSerializableException (see §5.5).
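
The figure of roughly 16 MB can be checked with a quick calculation. The tensor shapes below are an assumption (they are the shapes commonly produced by SAM2 ONNX exports with a 1024x1024 input in FP32) and are not taken from this guide; the mapping of each shape to HighResFeatures1/2 is likewise assumed.

```csharp
using System;

// Assumed FP32 tensor shapes for a typical SAM2 encoder export (1024x1024 input).
// These are illustrative; check the actual model outputs before relying on them.
long[][] shapes =
{
    new long[] { 1, 256, 64, 64 },    // Image_embeddings (assumed)
    new long[] { 1, 32, 256, 256 },   // HighResFeatures1 (assumed)
    new long[] { 1, 64, 128, 128 },   // HighResFeatures2 (assumed)
};

long totalBytes = 0;
foreach (var shape in shapes)
{
    long elements = 1;
    foreach (long dim in shape) elements *= dim;
    totalBytes += elements * sizeof(float);   // 4 bytes per FP32 value
}

Console.WriteLine($"{totalBytes / (1024 * 1024)} MiB");   // 16 MiB
```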

5.3 Hub Single Image CPU

using System.IO;
using NT2I.ONNX.Abstractions.Configuration;
using NT2I.ONNX.Abstractions.Enumerations;
using NT2I.ONNX.Architectures.SAM;
using NT2I.ONNX.DataHandling.Cpu.SAM;
using NT2I.ONNX.Engine;
using NT2I.ONNX.Hub;

// --- Load the image (see §1.3) ---
byte[] bgr = LoadBgrPacked("frame.png", out int width, out int height);

// --- ONNX configuration ---
var optsEnc = new OnnxSessionOptions(OnnxConfigEnum.CPU,
    File.ReadAllBytes("sam2_encoder.onnx"),  modelName: "SAM2_Enc_CPU_Hub");
var optsPrd = new OnnxSessionOptions(OnnxConfigEnum.CPU,
    File.ReadAllBytes("sam2_predictor.onnx"), modelName: "SAM2_Prd_CPU_Hub");

// --- Initialization ---
using var sam = new SAM2Image(
    optsEnc, optsPrd,
    new SAM2SegmentationPreprocessor(),
    new SAM2SegmentationPostprocessor());

// --- Hub + centralized WarmUp ---
var coord = new SharedImageCoordinator();
coord.RegisterModel(sam);
coord.WarmUp(batchSize: 1);
await sam.PredictWarmUpAsync(batchSize: 1);   // the predictor is not covered by coord.WarmUp

// --- Inference through the Hub context ---
using var ctx = coord.CreateContext(bgr, width, height);
// BindFromContext is called automatically on the encoder.

ISam2Embedding embedding = await sam.EncodeFromHubAsync();

sam.AddAnnotation(objectId: 0, x: width * 0.5f, y: height * 0.5f,
                  label: AnnotationLabelEnum.Positive);
byte[] mask = await sam.GetDetections(objectId: 0);

embedding.Dispose();

5.4 Hub Single Image GPU

using System.IO;
using NT2I.ONNX.Abstractions.Configuration;
using NT2I.ONNX.Architectures.SAM;
using NT2I.ONNX.DataHandling.Cpu.SAM;                      // CPU postprocessor
using NT2I.ONNX.DataHandling.Gpu.SAM;                      // CUDA preprocessor
using NT2I.ONNX.Engine;
using NT2I.ONNX.Hub;
using NT2I.ONNX.Hub.Cuda;

// --- Load the image (see §1.3) ---
byte[] bgr = LoadBgrPacked("frame.png", out int width, out int height);

// --- GPU ONNX configuration (predictor on CUDA, not TensorRT) ---
var optsEnc = new OnnxSessionOptions(OnnxConfigEnum.GPU_CUDA_FP32,
    File.ReadAllBytes("sam2_encoder.onnx"),  modelName: "SAM2_Enc_GPU_Hub");
var optsPrd = new OnnxSessionOptions(OnnxConfigEnum.GPU_CUDA_FP32,
    File.ReadAllBytes("sam2_predictor.onnx"), modelName: "SAM2_Prd_GPU_Hub");

// --- Initialization ---
using var sam = new SAM2Image(
    optsEnc, optsPrd,
    new NT2I.ONNX.DataHandling.Gpu.SAM.SAM2SegmentationPreprocessor(),
    new NT2I.ONNX.DataHandling.Cpu.SAM.SAM2SegmentationPostprocessor());

// --- GPU Hub + WarmUp ---
var coord = new SharedImageCoordinator(new CudaSharedImageContextFactory());
coord.RegisterModel(sam);
coord.WarmUp(batchSize: 1);
await sam.PredictWarmUpAsync(batchSize: 1);

// --- Encode through the Hub + segment ---
using var ctx = coord.CreateContext(bgr, width, height);

ISam2Embedding embedding = await sam.EncodeFromHubAsync();

sam.AddAnnotation(objectId: 0, x: 500f, y: 300f, width: 200f, height: 400f);
byte[] mask = await sam.GetDetections(objectId: 0);

embedding.Dispose();

5.5 Saving and loading embeddings

SAM2 encoding is expensive; the embeddings can be saved and reused later (another process, another analysis frame, etc.).

using System;
using System.IO;
using NT2I.ONNX.Abstractions;
using NT2I.ONNX.Abstractions.Configuration;
using NT2I.ONNX.Abstractions.Enumerations;
using NT2I.ONNX.Architectures.SAM;
using NT2I.ONNX.DataHandling.Cpu.SAM;
using NT2I.ONNX.DataHandling.Gpu.SAM;
using NT2I.ONNX.Engine;

// --- Load the image (see §1.3) ---
byte[] bgr = LoadBgrPacked("frame.png", out int width, out int height);

// =========================================================================
// WRITING: requires OutputBindingTarget.Host on the encoder (force it on GPU).
// =========================================================================
var optsEnc = new OnnxSessionOptions(
    config:       OnnxConfigEnum.GPU_CUDA_FP32,
    modelData:    File.ReadAllBytes("sam2_encoder.onnx"),
    modelName:    "SAM2_Enc_Save",
    outputDevice: OutputBindingTarget.Host);            // Host is required for Save
var optsPrd = new OnnxSessionOptions(OnnxConfigEnum.GPU_CUDA_FP32,
    File.ReadAllBytes("sam2_predictor.onnx"), modelName: "SAM2_Prd_Save");

using var samWriter = new SAM2Image(
    optsEnc, optsPrd,
    new NT2I.ONNX.DataHandling.Gpu.SAM.SAM2SegmentationPreprocessor(),
    new NT2I.ONNX.DataHandling.Cpu.SAM.SAM2SegmentationPostprocessor());

await samWriter.WarmUpAsync(1);

ISam2Embedding embedding = await samWriter.EncodeImageBgr(bgr, new ImageSize(width, height));

try
{
    embedding.Save("frame_42.embed");                   // ~16 MB binary file
}
catch (EmbeddingNotSerializableException ex)
{
    // Thrown if the encoder ran with OutputBindingTarget.ExecutionDevice:
    // VRAM cannot be read directly from the CPU side.
    Console.WriteLine($"Save unavailable: {ex.AllocatorName}");
}
finally
{
    embedding.Dispose();
}

// =========================================================================
// READING: works with any encoder configuration.
// =========================================================================
var samReader = new SAM2Image(
    new OnnxSessionOptions(OnnxConfigEnum.CPU,
        File.ReadAllBytes("sam2_encoder.onnx"),  modelName: "SAM2_Enc_Load"),
    new OnnxSessionOptions(OnnxConfigEnum.CPU,
        File.ReadAllBytes("sam2_predictor.onnx"), modelName: "SAM2_Prd_Load"),
    // Fully qualified: the Cpu.SAM and Gpu.SAM namespaces are both imported above,
    // so the unqualified type names would be ambiguous.
    new NT2I.ONNX.DataHandling.Cpu.SAM.SAM2SegmentationPreprocessor(),
    new NT2I.ONNX.DataHandling.Cpu.SAM.SAM2SegmentationPostprocessor());

await samReader.PredictWarmUpAsync(1);

ISam2Embedding loaded = Sam2Embedding.Load("frame_42.embed");

samReader.AddAnnotation(objectId: 0, x: 960, y: 540,
                        label: AnnotationLabelEnum.Positive);
byte[] mask = await samReader.GetDetections(objectId: 0, externalEmbedding: loaded);

loaded.Dispose();
samReader.Dispose();

6. Combined pipelines via the Hub

The real benefit of the Hub shows when several models consume the same image. The source image is uploaded/normalized only once; each model receives the tensor prepared according to its own Requirements.

6.1 YOLO detects, then SAM2 segments

using System.Collections.Generic;
using System.IO;
using System.Linq;
using NT2I.ONNX.Abstractions.Configuration;
using NT2I.ONNX.Architectures.SAM;
using NT2I.ONNX.Architectures.YOLO;
using NT2I.ONNX.DataHandling.Cpu.SAM;
using NT2I.ONNX.DataHandling.Cpu.Yolo.Detection;
using NT2I.ONNX.Engine;
using NT2I.ONNX.Hub;

// --- Load the image (see §1.3) ---
byte[] bgr = LoadBgrPacked("frame.png", out int width, out int height);

// --- Configure the 3 ONNX sessions (1 YOLO + 2 SAM2) ---
var yoloOpts = new OnnxSessionOptions(OnnxConfigEnum.CPU,
    File.ReadAllBytes("yolov7.onnx"), modelName: "Yolo");
var samEncOpts = new OnnxSessionOptions(OnnxConfigEnum.CPU,
    File.ReadAllBytes("sam2_encoder.onnx"),  modelName: "SAM2_Enc");
var samPrdOpts = new OnnxSessionOptions(OnnxConfigEnum.CPU,
    File.ReadAllBytes("sam2_predictor.onnx"), modelName: "SAM2_Prd");

using var yolo = new YoloDetection(
    yoloOpts,
    new YoloV7DetectionPreprocessor(),
    new YoloV7DetectionPostprocessor());

using var sam = new SAM2Image(
    samEncOpts, samPrdOpts,
    new SAM2SegmentationPreprocessor(),
    new SAM2SegmentationPostprocessor());

// --- Hub: register both models + centralized WarmUp ---
var coord = new SharedImageCoordinator();
coord.RegisterModel(yolo);
coord.RegisterModel(sam);
coord.WarmUp(batchSize: 1);
await sam.PredictWarmUpAsync(batchSize: 1);

// --- 1 context = 1 preprocessing pass for both models ---
using var ctx = coord.CreateContext(bgr, width, height);

// 1) YOLO detects
await yolo.RunInferenceAsync(clearInputAfterRun: false);
var boxes = yolo.GetOutputDetection(0.5f).First().ToList();

// 2) SAM2 encodes (reuses the Hub tensor already prepared for the encoder)
var embedding = await sam.EncodeFromHubAsync();

// 3) SAM2 segments, using each YOLO box as a prompt
for (int i = 0; i < boxes.Count; i++)
{
    sam.AddAnnotation(objectId: i,
                      x: boxes[i].X, y: boxes[i].Y,
                      width: boxes[i].Width, height: boxes[i].Height);
}
Dictionary<int, byte[]> allMasks = await sam.GetAllDetections();

embedding.Dispose();

6.2 YOLO and RF-DETR auto-labeling

using System;
using System.IO;
using System.Linq;
using System.Threading.Tasks;
using NT2I.ONNX.Abstractions.Configuration;
using NT2I.ONNX.Architectures.RFDetr;
using NT2I.ONNX.Architectures.YOLO;
using NT2I.ONNX.DataHandling.Cpu.RFDetr.Detection;
using NT2I.ONNX.DataHandling.Cpu.Yolo.Detection;
using NT2I.ONNX.Engine;
using NT2I.ONNX.Hub;

// --- Load the image (see §1.3) ---
byte[] bgr = LoadBgrPacked("frame.png", out int width, out int height);

// --- Configure the 2 ONNX sessions ---
var yoloOpts = new OnnxSessionOptions(OnnxConfigEnum.CPU,
    File.ReadAllBytes("yolov7.onnx"), modelName: "Yolo");
var rfOpts = new OnnxSessionOptions(OnnxConfigEnum.CPU,
    File.ReadAllBytes("rfdetr_base.onnx"), modelName: "RFDetr");

using var yolo = new YoloDetection(yoloOpts,
    new YoloV7DetectionPreprocessor(), new YoloV7DetectionPostprocessor());

using var rfdetr = new RFDetrDetection(rfOpts,
    new RFDetrDetectionPreprocessor(), new RFDetrDetectionPostprocessor());

// --- Hub + centralized WarmUp ---
var coord = new SharedImageCoordinator();
coord.RegisterModel(yolo);
coord.RegisterModel(rfdetr);
coord.WarmUp(batchSize: 1);

// --- 1 context = 1 preprocessing pass for both models ---
using var ctx = coord.CreateContext(bgr, width, height);

// Run both inferences in parallel (CPU: separate threads; GPU: single stream)
await Task.WhenAll(
    yolo.RunInferenceAsync(clearInputAfterRun: false),
    rfdetr.RunInferenceAsync(clearInputAfterRun: false));

var yoloDets   = yolo.GetOutputDetection(0.5f).First().ToList();
var rfdetrDets = rfdetr.GetOutputDetectionAsSpan(0.4f);

Console.WriteLine($"YOLO    : {yoloDets.Count} detections");
Console.WriteLine($"RF-DETR : {rfdetrDets[0].Count} detections");

// Compare / merge the detections to generate an auto-labeled dataset.

6.3 Semantic validation with SemanticAsserts

For multi-model integration tests, the NT2I.ONNX.Test.Ressources project exposes a set of high-level assertions on detections (bbox + class) and segmentation masks, shared between the CPU and GPU tests:

using static NT2I.ONNX.Test.Ressources.SemanticAsserts;

// Checks PRESENCE and POSITION (IoU against the expected ROI)
AssertContainsBoxOnTarget(detections, expectedClassId: 0, expectedRoi, ioUMin: 0.4f);

// Negative assertion: no detection of the forbidden class above the threshold
AssertNoBoxOfClass(detections, forbiddenClassId: 0);

// At least N detections of a given class
AssertContainsAtLeastNBoxesOfClass(detections, expectedClassId: 17, minCount: 1);

// Mask coverage over a ROI (RF-DETR seg, float[] mask, post-sigmoid)
AssertMaskCoversRegion(mask, w, h, expectedRegion, minCoverage: 0.5f, threshold: 0.5f);
AssertMaskNotEmpty(mask, minForegroundPixels: 1000);

These helpers avoid brittle checks ("at least 1 detection") and make it possible to verify that a model detects the right object in the right place. See Tests/NT2I.ONNX.Test.Hub.Cuda/CpuMultiModelSemanticTests.cs and CudaMultiModelSemanticTests.cs for complete examples on Zidane / Horses.
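
To make the IoU criterion concrete, here is a minimal sketch of intersection-over-union for two axis-aligned boxes. It assumes boxes expressed as (x, y, width, height) with (x, y) as the top-left corner; `Iou` is a hypothetical helper written for illustration, and the actual SemanticAsserts implementation may differ.

```csharp
using System;

// Hypothetical helper (not the SemanticAsserts implementation): IoU of two
// axis-aligned boxes given as (x, y, width, height), (x, y) = top-left corner.
static float Iou((float X, float Y, float W, float H) a, (float X, float Y, float W, float H) b)
{
    float left   = Math.Max(a.X, b.X);
    float top    = Math.Max(a.Y, b.Y);
    float right  = Math.Min(a.X + a.W, b.X + b.W);
    float bottom = Math.Min(a.Y + a.H, b.Y + b.H);

    // Clamp to zero when the boxes do not overlap.
    float intersection = Math.Max(0f, right - left) * Math.Max(0f, bottom - top);
    float union = a.W * a.H + b.W * b.H - intersection;
    return union > 0f ? intersection / union : 0f;
}

var detected = (X: 100f, Y: 100f, W: 200f, H: 100f);
var expected = (X: 150f, Y: 100f, W: 200f, H: 100f);
float iou = Iou(detected, expected);   // 15000 / 25000 = 0.6
Console.WriteLine(iou >= 0.4f ? "on target" : "off target");   // on target
```

A threshold around 0.4 tolerates small localization differences between models while still rejecting a box that lands on the wrong object.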

7. Support summary table

Architecture	Solo Single	Solo Batch	Hub Single	Hub Batch	MultiGrab (LoadFrame)	Notes
YOLO Detection	CPU / GPU	CPU / GPU	CPU / GPU	CPU / GPU	CPU / GPU	Built-in NMS (`[N,7]`)
RF-DETR Detection	CPU / GPU	CPU / GPU	CPU / GPU	CPU / GPU	CPU / GPU	No NMS, outputs `[B,N,4]` + `[B,N,C]`
RF-DETR Segmentation	CPU / GPU	CPU / GPU	CPU / GPU	CPU / GPU	CPU / GPU	Inherits Detection + `masks [B,N,H',W']`
SAM2 Encoder	CPU / GPU	batch=1 only	CPU / GPU	batch=1 only	CPU / GPU (single)	ONNX exported with fixed batch=1
SAM2 Predictor	N annotations / image	-	-	-	-	Several points/rectangles per image are supported

Supported GPU providers: CUDA FP32, CUDA FP16, TensorRT FP32/FP16/INT8 (except the SAM2 Predictor, which is not supported under TensorRT). DirectML: CPU path only, through the Hub.

Useful links

README.md: installation and build of the solution.
planRefacto.md: history of the Hub refactoring.
Unit tests covering each cell of the matrix:
- Tests/NT2I.ONNX.Test.Hub/Architectures/*.cs: CPU Hub tests per architecture
- Tests/NT2I.ONNX.Test.Hub/Hub/SharedImageCoordinatorTests.cs: unit tests for the coordinator
- Tests/NT2I.ONNX.Test.Hub/Hub/HubIntegrationTests.cs: multi-model integration tests with stubs
- Tests/NT2I.ONNX.Test.Hub/Hub/CoordinatorWarmUpTests.cs: WarmUp / WarmUpAsync propagation
- Tests/NT2I.ONNX.Test.Hub/Hub/MultiGrabTests.cs: LoadFrame / RebindAll invariants (CPU)
- Tests/NT2I.ONNX.Test.Hub.Cuda/CudaHubMultiModelTests.cs: multi-model GPU
- Tests/NT2I.ONNX.Test.Hub.Cuda/CudaMultiGrabTests.cs: GPU MultiGrab with semantic IoU validation
- Tests/NT2I.ONNX.Test.Hub.Cuda/CpuMultiModelSemanticTests.cs: multi-model CPU with IoU + masks
- Tests/NT2I.ONNX.Test.Hub.Cuda/CudaMultiModelSemanticTests.cs: multi-model GPU with IoU + masks
- Tests/NT2I.ONNX.Test.Nvidia/03_YoloDetection/*.cs: classic YOLO tests
- Tests/NT2I.ONNX.Test.Nvidia/04_SAM2/*.cs: classic SAM2 tests
- Tests/NT2I.ONNX.Test.Nvidia/06_RFDetr/*.cs: classic RF-DETR tests
Semantic assertion helpers: Tests/NT2I.ONNX.Test.Ressources/SemanticAsserts.cs
