Black-Scholes Option Pricing
Sample source: `4.Finance/BlackScholes`
This example implements the Black-Scholes formula for pricing European call and put options. It demonstrates:
- Intrinsic function mapping (`[IntrinsicFunction]`)
- `[Kernel]` helper functions
- `[In]`/`[Out]` marshalling optimization
- Numerical validation between CPU and GPU
Intrinsic Function Mapping
C#'s `Math` functions execute on the CPU. For GPU performance, we map them to CUDA fast-math intrinsics with `[IntrinsicFunction]`:
```csharp
[IntrinsicFunction("expf")]
public static float Expf(float f)
{
    return (float)Math.Exp((double)f);
}

[IntrinsicFunction("sqrtf")]
public static float Sqrtf(float f)
{
    return (float)Math.Sqrt((double)f);
}

[IntrinsicFunction("logf")]
public static float Logf(float f)
{
    return (float)Math.Log((double)f);
}

[IntrinsicFunction("fabsf")]
public static float fabsf(float f)
{
    return Math.Abs(f);
}
```
| C# Code | CPU | GPU (CUDA) |
|---|---|---|
| `Expf(x)` | `Math.Exp` | `expf` (hardware unit) |
| `Sqrtf(x)` | `Math.Sqrt` | `sqrtf` (hardware unit) |
| `Logf(x)` | `Math.Log` | `logf` (SFU) |
| `fabsf(x)` | `Math.Abs` | `fabsf` |
> **Tip:** The body of each intrinsic function serves as the CPU fallback. On the GPU, it is replaced by the native CUDA instruction. This keeps the code portable.
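The fallback-plus-substitution pattern can be pictured with a small language-agnostic sketch (Python here; the registry and decorator are purely illustrative, not part of any real API):

```python
import math

# Registry mapping intrinsic names to "native" fast implementations.
# On a real GPU these would be hardware instructions; a plain dict
# stands in for the substitution the compiler performs.
FAST_IMPLS = {
    "expf": math.exp,   # stand-in for CUDA's expf
    "sqrtf": math.sqrt, # stand-in for CUDA's sqrtf
}

def intrinsic(name):
    """Use the registered fast implementation if one exists;
    otherwise keep the decorated body as the portable fallback."""
    def wrap(fallback):
        return FAST_IMPLS.get(name, fallback)
    return wrap

@intrinsic("expf")
def expf(x):
    return math.exp(x)  # CPU fallback body

@intrinsic("not_registered")
def slow_path(x):
    return x * 2.0      # no fast mapping: the body is used as-is
```

When a mapping exists, the decorated name is rebound to the fast implementation; otherwise the function body runs unchanged, mirroring how the intrinsic bodies above double as CPU fallbacks.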
The Kernel
Cumulative Normal Distribution
```csharp
[Kernel]
static float CND(float f)
{
    // Abramowitz & Stegun style polynomial approximation
    // of the cumulative normal distribution.
    const float A1 = 0.31938153f;
    const float A2 = -0.356563782f;
    const float A3 = 1.781477937f;
    const float A4 = -1.821255978f;
    const float A5 = 1.330274429f;
    const float RSQRT2PI = 0.3989422804f; // 1 / sqrt(2 * pi)

    float K = 1.0f / (1.0f + 0.2316419f * fabsf(f));
    float cnd = RSQRT2PI * Expf(-0.5f * f * f) *
        (K * (A1 + K * (A2 + K * (A3 + K * (A4 + K * A5)))));
    if (f > 0) cnd = 1.0f - cnd;
    return cnd;
}
```
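As a sanity check, the same polynomial can be evaluated in double precision and compared against the exact normal CDF. This is a standalone sketch (Python), not part of the sample:

```python
import math

def cnd_approx(x):
    """The kernel's polynomial approximation, in double precision
    (the GPU runs it in float)."""
    A1, A2, A3 = 0.31938153, -0.356563782, 1.781477937
    A4, A5 = -1.821255978, 1.330274429
    RSQRT2PI = 0.3989422804
    K = 1.0 / (1.0 + 0.2316419 * abs(x))
    cnd = RSQRT2PI * math.exp(-0.5 * x * x) * \
        (K * (A1 + K * (A2 + K * (A3 + K * (A4 + K * A5)))))
    return 1.0 - cnd if x > 0 else cnd

def cnd_exact(x):
    # Normal CDF via the complementary error function
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

# Maximum absolute error over [-6, 6]
max_err = max(abs(cnd_approx(i / 100.0) - cnd_exact(i / 100.0))
              for i in range(-600, 601))
```

The approximation stays within roughly 1e-7 of the exact CDF, which is well below single-precision noise for this workload.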
Black-Scholes Entry Point
```csharp
const float RISKFREE = 0.02f;
const float VOLATILITY = 0.30f;

[EntryPoint]
public static void BlackScholes(
    [Out] float[] callResult,
    [Out] float[] putResult,
    [In] float[] stockPrice,
    [In] float[] optionStrike,
    [In] float[] optionYears,
    int lineFrom, int lineTo)
{
    for (int i = lineFrom + blockDim.x * blockIdx.x + threadIdx.x;
         i < lineTo;
         i += blockDim.x * gridDim.x)
    {
        float sqrtT = Sqrtf(optionYears[i]);
        float f1 = (Logf(stockPrice[i] / optionStrike[i])
            + (RISKFREE + 0.5f * VOLATILITY * VOLATILITY) * optionYears[i])
            / (VOLATILITY * sqrtT);
        float f2 = f1 - VOLATILITY * sqrtT;
        float CNDF1 = CND(f1);
        float CNDF2 = CND(f2);
        float expRT = Expf(-RISKFREE * optionYears[i]);
        callResult[i] = stockPrice[i] * CNDF1 - optionStrike[i] * expRT * CNDF2;
        putResult[i] = optionStrike[i] * expRT * (1.0f - CNDF2)
            - stockPrice[i] * (1.0f - CNDF1);
    }
}
```
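A double-precision reference of the same formulas, useful for validating either side, can be written in a few lines (Python sketch; parameter values are arbitrary examples). A quick internal consistency check is put-call parity, C − P = S − K·e^(−rT), which these formulas satisfy algebraically:

```python
import math

def norm_cdf(x):
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def black_scholes(S, K, T, r, sigma):
    """Reference Black-Scholes price of a European call and put,
    mirroring the kernel's formulas in double precision."""
    sqrtT = math.sqrt(T)
    d1 = (math.log(S / K) + (r + 0.5 * sigma * sigma) * T) / (sigma * sqrtT)
    d2 = d1 - sigma * sqrtT
    disc = math.exp(-r * T)
    call = S * norm_cdf(d1) - K * disc * norm_cdf(d2)
    put = K * disc * (1.0 - norm_cdf(d2)) - S * (1.0 - norm_cdf(d1))
    return call, put

call, put = black_scholes(S=100.0, K=95.0, T=1.0, r=0.02, sigma=0.30)
# Put-call parity: C - P should equal S - K * exp(-r * T)
parity_gap = abs((call - put) - (100.0 - 95.0 * math.exp(-0.02)))
```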
Marshalling Analysis
| Parameter | Attribute | Transfer | Why |
|---|---|---|---|
| `callResult` | `[Out]` | Device → Host | Written only |
| `putResult` | `[Out]` | Device → Host | Written only |
| `stockPrice` | `[In]` | Host → Device | Read only |
| `optionStrike` | `[In]` | Host → Device | Read only |
| `optionYears` | `[In]` | Host → Device | Read only |
With 1M options, each float array is 4 MB (1M × 4 bytes), so the five arrays total 20 MB. Correct `[In]`/`[Out]` annotations ensure each array crosses PCI-e in one direction only, saving 20 MB per call compared with marshalling every array in both directions.
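The arithmetic behind that figure, assuming the unannotated default would copy every array in both directions:

```python
# Back-of-the-envelope PCI-e traffic for 1M options.
n = 1024 * 1024
bytes_per_float = 4
arrays = 5  # callResult, putResult, stockPrice, optionStrike, optionYears

naive = arrays * 2 * n * bytes_per_float      # every array copied both ways
optimized = arrays * 1 * n * bytes_per_float  # 3 [In] up + 2 [Out] down
saved_mb = (naive - optimized) / (1024 * 1024)
```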
Launching and Validation
```csharp
int OPT_N = 1024 * 1024 * Environment.ProcessorCount;
cuda.GetDeviceProperties(out cudaDeviceProp prop, 0);
HybRunner runner = SatelliteLoader.Load()
    .SetDistrib(8 * prop.multiProcessorCount, 256);
dynamic wrapper = runner.Wrap(new Program());
wrapper.BlackScholes(callResult_cuda, putResult_cuda,
    stockPrice, optionStrike, optionYears,
    0, OPT_N);
```
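The entry point's grid-stride loop guarantees every option index is processed exactly once regardless of the launch configuration. A CPU simulation of the indexing (Python; the block/grid sizes below are illustrative, not the `SetDistrib` values) makes this easy to verify:

```python
# Grid-stride loop: the same indexing scheme the kernel uses,
# simulated sequentially on the CPU.
blockDim_x, gridDim_x = 256, 8
n = 10_000  # deliberately not a multiple of the thread count

visited = [0] * n
for blockIdx_x in range(gridDim_x):
    for threadIdx_x in range(blockDim_x):
        i = blockDim_x * blockIdx_x + threadIdx_x
        while i < n:
            visited[i] += 1  # each "thread" strides by the whole grid
            i += blockDim_x * gridDim_x
```

Because each thread starts at a unique offset and advances by the total number of threads, the threads tile the index range with no gaps and no overlaps, for any `n`.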
Numerical Parity Check
GPU floating-point results can differ slightly from the CPU reference, so the sample compares the two with a maximum-error (L∞) check:

```csharp
float maxCallError = 0.0f;
for (int i = 0; i < OPT_N; ++i)
{
    float error = Math.Abs(callResult_net[i] - callResult_cuda[i]);
    maxCallError = Math.Max(maxCallError, error);
}
// Typically: L-infinity < 1e-6 for single precision
```
> **Warning:** Floating-point arithmetic is not associative. The GPU may execute operations in a different order than the CPU, leading to small differences. Always validate with tolerances, never exact equality.
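Non-associativity is easy to demonstrate even in double precision: regrouping the same three terms yields two different results, which is why the parity check above uses a tolerance.

```python
import math

# The same three terms, grouped two ways.
a = (0.1 + 0.2) + 0.3   # 0.6000000000000001
b = 0.1 + (0.2 + 0.3)   # 0.6
differ = (a != b)       # exact comparison fails

# Tolerance-based comparison succeeds.
ok = math.isclose(a, b, abs_tol=1e-12)
```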
Next Steps
- Hello World — Simpler starting point
- Intrinsics & Builtins — Full intrinsic reference
- Data Marshalling — Transfer optimization