RUNNING_STATS

The RUNNING_STATS function computes the mean and unbiased sample variance of an array without overflow. The function can also combine previously computed values with new data to allow computing mean and variance on data sets that are too large to fit into memory.

RUNNING_STATS uses the Welford "online" algorithm to compute the running mean and variance in a single pass through the data. The routine is numerically more stable when computing the mean and variance, is significantly faster than the VARIANCE function, and, unlike VARIANCE, does not require any additional memory.
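
The single-pass update can be sketched as follows. This is a minimal illustration of the textbook Welford recurrence, not the routine's actual implementation. The function name welford_sketch and its variable names are hypothetical, and returning NaN for the variance of a single value is an assumption of the sketch.

```idl
; Hypothetical helper (not part of IDL): one-pass Welford update.
function welford_sketch, x
  compile_opt idl2
  avg = 0d   ; running mean
  m2 = 0d    ; running sum of squared deviations from the mean
  n = 0d     ; number of values seen so far
  foreach value, x do begin
    n += 1
    delta = value - avg           ; deviation from the old mean
    avg += delta / n              ; update the mean
    m2 += delta * (value - avg)   ; uses both the old and the new mean
  endforeach
  ; The unbiased sample variance divides by (n - 1).
  ; Assumption: return NaN when fewer than two values are available.
  variance = (n gt 1) ? m2 / (n - 1) : !values.d_nan
  return, [avg, variance, n]
end
```

With this file saved as welford_sketch.pro on the IDL path, `print, welford_sketch([1, 2, 3, 4, 5])` should give a mean of 3, a variance of 2.5, and a count of 5. The sketch never forms a sum of the raw values or of their squares, which is why this style of update avoids overflow.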

Examples

; Define a vector of sample data:

IDL> A = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

; Compute the [mean, variance, count]:

IDL> result = RUNNING_STATS(A)

IDL> result

IDL prints:

5.5000000000000000 9.1666666666666661 10.000000000000000

Syntax

Result = RUNNING_STATS( X [, /NAN] [, PREVIOUS=value] )

Return Value

Returns the statistics of the array X in the form [mean, variance, count] in double precision.
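
As a small illustration, the three elements of the result can be unpacked by index. The variable names below are arbitrary, and the standard deviation is derived from the returned variance.

```idl
result = RUNNING_STATS([1, 2, 3, 4, 5])
avg = result[0]        ; mean
variance = result[1]   ; unbiased sample variance
count = result[2]      ; number of values, stored as a double
stddev = sqrt(variance)
print, avg, variance, count, stddev
```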

Arguments

X

The array to be processed. This array can be any numeric type other than complex or double complex.

Keywords

NAN

Set this keyword to cause the routine to check for occurrences of the IEEE floating-point values NaN or Infinity in the input data. Elements with the value NaN or Infinity are treated as missing data.

Note: Since the value NaN is treated as missing data, if you set /NAN and X contains only NaN values, the routine will return NaN for the mean and variance, and zero for the count.
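
A short sketch of the NAN keyword in use. The input values are made up for illustration, and the expected values in the comments follow from the description above, not from a recorded IDL session.

```idl
; Two valid values mixed with one NaN and one Infinity.
x = [1.0, !values.f_nan, 3.0, !values.f_infinity]
result = RUNNING_STATS(x, /NAN)
; Only 1.0 and 3.0 should contribute: expect mean 2, variance 2, count 2.
print, result

; An input with no valid values: expect NaN, NaN, and a count of 0.
print, RUNNING_STATS([!values.f_nan, !values.f_nan], /NAN)
```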

PREVIOUS

Set this keyword to a three-element array containing the [mean, variance, count] from a previous calculation. These three values will be combined with the new statistics computed from the input array. If this keyword is omitted or is set to [0, 0, 0], then a new calculation is started.

Tip: See below for examples of chaining together multiple calls to RUNNING_STATS using the PREVIOUS keyword.

Note: If the count from a previous calculation is zero, then a new calculation is started, regardless of the mean or variance values.
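
To show how two [mean, variance, count] triples can be merged without revisiting the original data, here is a sketch based on the pairwise update of Chan, Golub, and LeVeque. Whether RUNNING_STATS uses exactly this formula internally is not stated here, and the function name combine_stats_sketch is hypothetical.

```idl
; Hypothetical helper (not part of IDL): merge two [mean, variance, count] triples.
function combine_stats_sketch, a, b
  compile_opt idl2
  ; A zero count means "no data", so the other triple is the answer.
  if (a[2] eq 0) then return, double(b)
  if (b[2] eq 0) then return, double(a)
  n = double(a[2]) + b[2]
  delta = double(b[0]) - a[0]
  avg = a[0] + delta * b[2] / n
  ; Recover each sum of squared deviations from its unbiased variance.
  ; A single value has no spread, so its contribution is zero.
  m2a = (a[2] gt 1) ? a[1] * (a[2] - 1d) : 0d
  m2b = (b[2] gt 1) ? b[1] * (b[2] - 1d) : 0d
  m2 = m2a + m2b + delta^2 * a[2] * b[2] / n
  variance = (n gt 1) ? m2 / (n - 1) : !values.d_nan
  return, [avg, variance, n]
end
```

For example, merging [3, 2.5, 5] (the statistics of 1 through 5) with [8, 2.5, 5] (the statistics of 6 through 10) gives a mean of 5.5, a variance of 82.5 / 9, which is about 9.1667, and a count of 10.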

Thread Pool Keywords

This routine is written to make use of IDL’s thread pool, which can increase execution speed on systems with multiple CPUs. The values stored in the !CPU system variable control whether IDL uses the thread pool for a given computation. In addition, you can use the thread pool keywords TPOOL_MAX_ELTS, TPOOL_MIN_ELTS, and TPOOL_NOTHREAD to override the defaults established by !CPU for a single invocation of this routine. See Thread Pool Keywords for details.

When computing the statistics for a large number of values, the results will depend upon the order in which the numbers are combined. Since the thread pool will combine values in a different order, you may obtain a different, but equally correct, result than that obtained using the standard non-threaded implementation. This effect occurs because RUNNING_STATS uses floating-point arithmetic, and the mantissa of a floating-point value has a fixed number of significant digits. For more information on floating-point numbers, see Accuracy and Floating Point Operations.
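
The order dependence can be illustrated with a short sketch. The first statement shows that floating-point addition is not associative. The second part compares a call with the thread pool disabled against a default call. Whether the default call actually uses multiple threads depends on the !CPU settings and the array size, so the difference may be exactly zero on some systems.

```idl
; Grouping the same three double-precision values differently changes the sum.
print, ((0.1d + 0.2d) + 0.3d) eq (0.1d + (0.2d + 0.3d))   ; prints 0 (false)

; Compare a non-threaded call with a default call on the same data.
x = randomu(seed, 1e6)
serial = RUNNING_STATS(x, /TPOOL_NOTHREAD)
threaded = RUNNING_STATS(x)
; Any difference should be limited to the last few significant digits.
print, abs(serial - threaded)
```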

Additional Examples

IDL> A = [1, 2, 3, 4, 5]

IDL> B = [6, 7, 8, 9, 10]

; First compute the stats for the combined array:

IDL> RUNNING_STATS([A, B])

; 5.5000000000000000 9.1666666666666661 10.000000000000000

; Now compute the stats of just A, then combine them with B using the PREVIOUS keyword:

IDL> Stats_of_A = RUNNING_STATS(A)

IDL> Stats_of_A

; 3.000000000000000 2.500000000000000 5.000000000000000

IDL> RUNNING_STATS(B, PREVIOUS = Stats_of_A)

; 5.5000000000000000 9.1666666666666661 10.000000000000000

; Use the PREVIOUS keyword to efficiently calculate stats on a huge array:

IDL> stats = [0, 0, 0]

IDL> for i=0,99 do stats = RUNNING_STATS(randomu(seed, 1e7), PREVIOUS=stats)

IDL> stats

IDL prints:

0.50000184809149439 0.083333037727096743 1000000000.0000000

Version History

8.8.3

Introduced