LLM Task Examples

Structured answer templates for common questions. Each example shows the question type, the correct answer shape, and what to cite or warn about.

"How do I install Hybridizer?"

Answer shape: prerequisites → install command → verify → next step.

Prerequisites: .NET 8 SDK, NVIDIA CUDA Toolkit, NVIDIA GPU with compatible drivers, C++ toolchain (MSVC on Windows, GCC on Linux).

Install the free NuGet tool:
  dotnet tool install -g Hybridizer
  dotnet new install Hybridizer.App.Template
  dotnet new hybridizer-app -n MyProject
  cd MyProject

Verify:
  nvidia-smi       # GPU detected?
  nvcc --version   # CUDA toolkit installed?
  dotnet build     # Project builds?
  dotnet run       # Kernel runs?

Cite: quickstart/install, tutorials/setup Note: If the user needs OMP/AVX flavors or wants to read the generated .cu source, they need the paid standalone, not the NuGet tool.

"How do I port a C# method to GPU?"

Answer shape: annotate → build → verify → launch.

Add [EntryPoint] to the method. Ensure it's in a non-static class.
Replace MathF.* with (float)System.Math.X(double) in the method body.
Replace any foreach / throw / closures with Hybridizer-compatible constructs.
Run dotnet build with <CompileCUDA>enable</CompileCUDA> — this triggers the Hybridizer CLI and nvcc.
Call HybRunner.Cuda("MyProject_CUDA.dll").Wrap(new MyClass()).SetDistrib(grid, block).MyMethod(args) from the host.

Cite: guide/concepts, guide/compilation-pipeline, quickstart/hello-gpu Warn: MathF.* in kernel bodies aborts the transcoder with error 0X60AC (empty .cu files). Use (float)System.Math.* instead.

"Why is my kernel output wrong / all zeros / NaN?"

Diagnosis checklist (in order of likelihood):

Stale satellite DLL — bin/ has an old .dll. Clean-rebuild: dotnet clean && dotnet build.
Uninitialised FloatResidentArray — call RefreshDevice() before the first kernel write.
Missing [In]/[Out] — without them, arrays are copied both ways by default, but if you use FloatResidentArray and forget to push first, the device side is zeroed.
Out-of-bounds thread — grid-stride loop missing; last block overshoots n. Add if (i < n) guard.
AtomicExpr.apply usage — replace with [IntrinsicFunction] stubs.
Race condition — __syncthreads() (ThreadFunctions.SyncThreads()) missing between shared memory write and read.

Cite: quickstart/run-and-debug, quickstart/faq-troubleshooting

"My build fails — Hybridizer emits an empty `.cu` file"

Almost always MathF.* in a kernel body.

Steps:

Search the kernel file for MathF. — replace every occurrence with (float)System.Math.X((double)arg).
Check for error 0X60AC in the build output — this is the Hybridizer transcoder's MathF signal.
Clean and rebuild.

If the file is still empty after removing MathF, check for other unsupported constructs: foreach over non-array, try/catch, lock, string operations, or recursive calls through a non-[Kernel] method.

Cite: quickstart/faq-troubleshooting, llm/code-patterns (pattern 2)

"CUDA vs AVX — which flavor should I use?"

Decision table:

Criterion	CUDA	AVX/AVX2/AVX512
Hardware required	NVIDIA GPU	Any x86 CPU
Peak throughput (compute-bound)	Very high (thousands of threads)	Moderate (8–16 wide SIMD)
Data transfer overhead	Host↔device copies	None (shared memory)
Best for	Large parallel workloads, matrix ops, deep learning inference	Embarrassingly parallel CPU loops, legacy codebases without GPU
Flavor availability	Free NuGet tool + paid standalone	Paid standalone only

Rule of thumb: choose CUDA if the working set fits in GPU memory and the workload has ≥ 10k independent operations. Choose AVX if the workload is memory-bandwidth-bound on the CPU and you want zero-copy access to host data.

Cite: platforms/overview, platforms/cuda, platforms/vector-avx-neon

"How do I avoid copying data to/from the GPU on every kernel call?"

Answer: use FloatResidentArray and mark parameters [In]/[Out].

Replace float[] with FloatResidentArray for buffers reused across calls.
Mark read-only inputs [In] and write-only outputs [Out] to skip the unnecessary direction.
Call RefreshDevice() once before the first kernel, RefreshHost() once after the last.

Cite: guide/data-marshalling, howto/manage-memory See also: code pattern 4 in llm/code-patterns

"How do I debug a Hybridizer kernel?"

Steps:

Enable --generate-line-info in the Hybridizer CLI invocation — this embeds C# line numbers in the CUDA binary, enabling stepping in Nsight.
Build in Debug configuration with <CompileCUDA>enable</CompileCUDA>.
On Windows: use Nsight Visual Studio Edition (attach to process, switch to CUDA thread view).
On Linux/WSL: use cuda-gdb (cuda-gdb --args dotnet MyProject.dll).
For functional issues without a debugger: add an OMP flavor build — OMP kernels run on CPU and can be stepped through with a regular .NET debugger.

Warn: nsys/compute-sanitizer hang dotnet on WSL2. Profile/sanitize on native Windows, or use the in-process Hybridizer profiler hooks.

Cite: guide/line-info-and-debug, quickstart/run-and-debug

"How do I profile a Hybridizer kernel?"

Answer shape: quick profiling → Nsight → in-process.

Quick (Windows):
  nsys profile dotnet MyProject.dll
  ncu --target-processes all dotnet MyProject.dll

Quick (Linux, native — not WSL2):
  nsys profile dotnet MyProject.dll

In-process (cross-platform):
  Use CudaRuntimeHelpers.cudaEventRecord() / cudaEventSynchronize() from
  Hybridizer.Runtime.CUDAImports to time kernel execution from C#.

What to look for:

High memcpy time → add FloatResidentArray, use [In]/[Out]
Low SM occupancy → increase block size (try 256 or 512), check register pressure
Memory bandwidth not saturated → check for non-coalesced access (consecutive threads should access consecutive addresses)

Cite: howto/optimize-kernels, cuda/perf-metrics, cuda/memory-and-profiling

"How do I set up CI/CD for a Hybridizer project?"

Answer shape: what CI needs → Docker image → cache strategy.

CI needs: a Linux runner with NVIDIA drivers + CUDA toolkit + .NET 8 SDK. On GPU-less CI (GitHub Actions free tier), you can run the Hybridizer transcoder step (which doesn't need a GPU) but skip actual kernel execution by guarding tests with nvidia-smi detection.

# GitHub Actions example
- name: Check GPU availability
  id: gpu
  run: echo "has_gpu=$(nvidia-smi &>/dev/null && echo true || echo false)" >> $GITHUB_OUTPUT

- name: Build (always)
  run: dotnet build -c Release

- name: Run GPU tests (GPU runners only)
  if: steps.gpu.outputs.has_gpu == 'true'
  run: dotnet test --filter Category=GPU

Cite: howto/ci-cd-integration

"How do I install Hybridizer?"​

"How do I port a C# method to GPU?"​

"Why is my kernel output wrong / all zeros / NaN?"​

"My build fails — Hybridizer emits an empty .cu file"​

"CUDA vs AVX — which flavor should I use?"​

"How do I avoid copying data to/from the GPU on every kernel call?"​

"How do I debug a Hybridizer kernel?"​

"How do I profile a Hybridizer kernel?"​

"How do I set up CI/CD for a Hybridizer project?"​