IMSL_CAT_GLM

The IMSL_CAT_GLM function analyzes categorical data using logistic, Probit, Poisson, and other generalized linear models.

This routine requires an IDL Advanced Math and Stats license. For more information, contact your sales or technical support representative.

The IMSL_CAT_GLM function uses iteratively reweighted least squares to compute (extended) maximum likelihood estimates in some generalized linear models involving categorized data. One of several models, including the probit, logistic, Poisson, logarithmic, and negative binomial models, may be fit.

Note that each row vector in the data matrix can represent a single observation; or, through the use of keyword IFREQ, each row can represent several observations. Also note that classification variables and their products are easily incorporated into the models via the usual regression-type specifications. The models available in IMSL_CAT_GLM are listed in the following table.

Model

PDF of the Response Variable

Parameterization

0

f(y) = (λyexp(-λ)) / y

λ = N x exp(ω + η)

1

2

3

4

θ = Φ (ω + η)

5

θ = 1 − exp (−exp (ω + η) )

Here, Φ denotes the cumulative normal distribution. N and S are known distribution parameters specified for each observation via the keyword IPAR. ω is an optional fixed parameter of the linear response, γi, specified for each observation. (If the keyword IFIX is not used, then ω is taken to be 0.) Since the log-log model (Model = 5) probabilities are not symmetric with respect to 0.5, quantitatively, as well as qualitatively, different models result when the definitions of "success" and "failure" are interchanged in this distribution. In this model and all other models involving θ, θ is taken to be the probability of a "success".

Computational Details

The computations proceed as follows:

  1. The input parameters are checked for consistency and validity.
  2. Estimates of the means of the "independent" or design variables are computed. The frequency or the observation in all but binomial distribution models is taken from vector frequencies. In binomial distribution models, the frequency is taken as the product of n = parameter (i) and frequencies (i). Means are computed as:

  3. By default, unless the keyword INIT_EST is used, initial estimates of the coefficients are obtained (based upon the observation intervals) as multiple regression estimates relating transformed observation probabilities to the observation design vector. For example, in the binomial distribution models, θ may be estimated as:

    and, when Model=3, the linear relationship is given by:

    while, if Model = 4, Φ-1 (θ) = Xβ. When computing initial estimates, standard modifications are made to prevent illegal operations such as division by zero. Regression estimates are obtained at this point, as well as later, by use of IMSL_MULTIREGRESS.

  4. Newton-Raphson iteration for the maximum likelihood estimates is implemented via iteratively reweighted least squares. Let:

    denote the log of the probability of the i-th observation for coefficients β. In the least-squares model, the weight of the i-th observation is taken as the absolute value of the second derivative of:

    with respect to:

    (times the frequency of the observation). The dependent variable is taken as the first derivative Ψ with respect to γi, divided by the square root of the weight times the frequency. The Newton step is given by:

    where all derivatives are evaluated at the current estimate of γ and βn+1 = β – ∆β. This step is computed as the estimated regression coefficients in the least-squares model. Step halving is used when necessary to ensure a decrease in the criterion.

  5. Convergence is assumed when the maximum relative change in any coefficient update from one iteration to the next is less than EPS or when the relative change in the log-likelihood from one iteration to the next is less than EPS/100. Convergence is also assumed after ITMAX iterations or when step halving leads to a step size of less than 0.0001 with no increase in the log-likelihood.
  6. Residuals are computed according to methods discussed by Pregibon (1981). Let li (γi) denote the log-likelihood of the i-th observation evaluated at γi. Then, the standardized residual is computed as:

    where:

    is the value of γi when evaluated at the optimal:

    The denominator of this expression is used as the "standard error of the residual" while the numerator is the "raw" residual. Following Cook and Weisberg (1982), the influence of the i-th observation is assumed to be:

    This is a one-step approximation to the change in estimates when the i-th observation is deleted. Here, the partial derivatives are with respect to β.

Programming Notes

Example

This example is from Prentice (1976) and involves mortality of beetles after five hours of exposure to eight different concentrations of carbon disulphide. The table below lists the number of beetles exposed (N) to each concentration level of carbon disulphide (x, given as log dosage) and the number of deaths which result (y). The following table shows the data:

Log Dosage

Number of Beetles Exposed

Number of Deaths

1.690

59

6

1.724

60

13

1.755

62

18

1.784

56

28

1.811

63

52

1.836

59

53

1.861

62

61

1.883

60

60

The number of deaths at each concentration level are fitted as a binomial response using the following models: logit (Model = 3), probit (Model = 4), and log-log (Model = 5). Note that the log-log model yields a smaller absolute log likelihood (14.81) than the logit model (18.78) or the probit model (18.23). This is to be expected since the response curve of the log-log model has an asymmetric appearance, but both the logit and probit models are symmetric about θ = 0.5.

.RUN

PRO print_results, cs, means, ca, crit, ls, cov

PRINT, '; Coefficient Statistics'

PRINT, ';                  Standard Asymptotic   ', $

';Asymptotic'

PRINT, ';  Coefficient        Error  Z-statistic      ', $

';P-value'

PM, cs, FORMAT = ';(4F13.2)'

PRINT

PRINT, ';Covariate Means = ', means, FORMAT = '(A18, F6.3)'

PRINT

PRINT, ';                           Case Analysis'

PRINT, ';                            Residual            ', $

';Standardized'

PRINT, ';   Predicted    Residual  Std. Error    Leverage', $

';    Residual'

PM, ca, FORMAT = ';(5F12.3)'

PRINT

PRINT, ';Log-Likelihood = ', crit, FORMAT = '(A18, F9.5)'

PRINT

PRINT, ';         Last Step'

PRINT, ls

PRINT

PRINT, ';Asymptotic Coefficient Covariance'

PM, cov, FORMAT = ';(2F12.4)'

END

 

model = 3

nobs = 8

x = ([[1.690, 1.724, 1.755, 1.784, 1.811, 1.836, 1.861, 1.883],$

  [6, 13, 18, 28, 52, 53, 61, 60], $

  [59, 60, 62, 56, 63, 59, 62, 60]])

ncoef = IMSL_CAT_GLM(0, 1, model, x, Ipar = 2, Eps = 1.0e-3, $

  Coef_Stat = cs, Covariances = cov, $

  Criterion = crit, Means = means, $

  Case_Analysis = ca, Last_Step = ls, Obs_Status = os)

print_results, cs, means, ca, crit, ls, cov

IDL prints:

Errors

Warning Errors

STAT_TOO_MANY_HALVINGS: Too many step halvings. Convergence is assumed.

STAT_TOO_MANY_ITERATIONS: Too many iterations. Convergence is assumed.

Fatal Errors

STAT_TOO_FEW_COEF: INIT_EST is used and N_coef_input = #. The model specified requires # coefficients.

STAT_MAX_CLASS_TOO_SMALL: The number of distinct values of the classification variables exceeds MAX_CLASS = #.

STAT_INVALID_DATA_8: N_CLASS_VALS(#) = #. The number of distinct values for each classification variable must be greater than one.

STAT_NMAX_EXCEEDED: The number of observations to be deleted has exceeded Lp_Max = #. Rerun with a different model or increase the workspace.

Syntax

Result = CAT_GLM(N_class, N_continuous, Model, X [, CASE_ANALYSIS=variable] [, CLASS_VALS=variable] [, COVARIANCES=variable] [, COEF_STAT=variable] [, CRITERION=variable] [, /DOUBLE] [, EPS=value] [, IFIX=value] [, IFREQ=value] [, INDICES_EFFECTS=array] [, INIT_EST=array] [, IPAR=value] [, ITMAX=value] [, LAST_STEP=variable] [, MAX_CLASS=value] [, MEANS=variable] [, N_CLASS_VALS=variable] [, /NO_INTERCEPT] [, OBS_STATUS=variable] [, VAR_EFFECTS=array])

Return Value

An integer value indicating the number of estimated coefficients in the model.

Arguments

Model

The model used to analyze the data. The following table contains the models. The Relationship column refers to the relationship between the parameter, θ or λ, and a linear model of the explanatory variables, Xβ.

Model

Relationship

PDF of Response Variable

0

Exponential

Poisson

1

Logistic

Negative Binomial

2

Logistic

Logarithmic

3

Logistic

Binomial

4

Probit

Binomial

5

Log-log

Binomial

The lower bound of the response variable is 1 for Model = 3, and it is 0 for all other models. See the description at the beginning of this topic for more information about these models.

N_class

Number of classification variables.

N_continuous

Number of continuous variables.

X

Two-dimensional array of size N_observations by (N_class + N_continuous) + m containing data for the independent variables, dependent variable, and optional parameters.

The columns must be ordered such that the first N_class columns contain data for the class variables, the next N_continuous columns contain data for the continuous variables, and the next column contains the response variable. The final (and optional) m – 1 columns contain optional parameters. See the keywords IFREQ, IFIX, and IPAR.

Keywords

CASE_ANALYSIS (optional)

Named variable containing a two-dimensional array of the case analysis. The array size is [N_observations, 5].

Case statistics are computed for all observations except where missing values prevent their computation.

CLASS_VALS (optional)

Named variable into which a one-dimensional array of length:

containing the distinct values of the classification variables in ascending order is stored. The first N_CLASS_VALS(0) elements of CLASS_VALS contain the values for the first classification variables. The next N_CLASS_VALS(1) elements contain the values for the second classification variable, etc.

COVARIANCES (optional)

Named variable into which a two-dimensional array of size [N_coefficients, N_coefficients] containing the estimated asymptotic covariance matrix of the coefficients is stored. For ITMAX = 0, this is the Hessian computed at the initial parameter estimates.

COEF_STAT (optional)

Named variable into which a two-dimensional array of size [N_coefficients, 4] containing the parameter estimates and associated statistics is stored.

CRITERION (optional)

Named variable into which the optimized criterion is stored. The criterion to be maximized is a constant plus the log-likelihood.

DOUBLE (optional)

If present and nonzero, double precision is used.

EPS (optional)

Convergence criterion. Convergence is assumed when the maximum relative change in any coefficient estimate is less than EPS from one iteration to the next or when the relative change in the log-likelihood, criterion, from one iteration to the next is less than EPS/100.0. The default value is 0.001.

IFIX (optional)

Column number Ifix in X containing a fixed parameter for each observation that is added to the linear response prior to computing the model parameter. The fixed parameter allows one to test a hypothesis about the parameters via the log-likelihoods.

IFREQ (optional)

Column number Ifreq in X containing the frequency of response for each observation.

INDICES_EFFECTS (optional)

One-dimensional index array with the following length:

VAR_EFFECTS(0) + VAR_EFFECTS(1) + ... + VAR_EFFECTS(N_effects - 1)

The first VAR_EFFECTS (0) elements give the column numbers of X for each variable in the first effect. The next elements give the column numbers for each variable in the second effect. The last elements give the column numbers for each variable in the last effect. The keywords INDICES_EFFECTS and VAR_EFFECTS must be used together.

INIT_EST (optional)

One-dimensional array of length N_coef_input containing initial estimates of parameters. N_coef_input can be completed by IMSL_REGRESSORS. By default, unweighted linear regression is used to obtain initial estimates.

IPAR (optional)

Column number IPAR in X containing the value of the known distribution parameter for each observation, where X(i, IPAR) is the known distribution parameter associated with the i-th observation. The meaning of the distributional parameter depends upon the model used, as shown in the following table:

Model

Parameter

Meaning of parameter (i, IPAR)
0 E

ln(E) is a fixed intercept to be included in the linear predictor (i.e., the offset).

1 S Number of successes required for the negative binomial distribution
2 - Not used for this model
3-5 N Number of trials required for the binomial distribution

The default behavior is that, when Model ≠ 2, each observation is assumed to have a parameter value of 1. When Model = 2, this parameter is not referenced.

ITMAX (optional)

Maximum number of iterations. Use a value of 0 to compute Hessian (stored in COVARIANCES), and the Newton step (stored in LAST_STEP) at the initial estimates. Use the keyword INIT_EST to get the initial estimates. The default value is 30.

LAST_STEP (optional)

Named variable in the form of a one-dimensional array of length N_coefficients containing the last parameter updates (excluding step halvings). When ITMAX is set to 0, LAST_STEP contains the inverse of the Hessian times the gradient vector, all computed at the initial parameter estimates.

MAX_CLASS (optional)

An upper bound on the sum of the number of distinct values taken on by each classification variable. The default value is N_observations by N_class.

MEANS (optional)

Named variable in the form of a one-dimensional array containing the means of the design variables. The array is of length N_coefficients if the keyword NO_INTERCEPT is used, and N_Coefficients − 1 otherwise.

N_CLASS_VALS (optional)

Named variable in the form of a one-dimensional array of length N_class containing the number of values taken by each classification variable. The i-th classification variable has N_CLASS_VALS (i) distinct values.

NO_INTERCEPT (optional)

If present and nonzero, there is no intercept in the model. By default, the intercept is automatically included in the model.

OBS_STATUS (optional)

Named variable into which an one-dimensional array of length N_observations indicating which observations are included in the extended likelihood is stored.

Notes

VAR_EFFECTS (optional)

One-dimensional array of length N_effects containing the number of variables associated with each effect in the model, where N_effects is the number of effects (source of variation) in the model. Keywords VAR_EFFECTS and INDICES_EFFECTS must be used together.

Version History

6.4

Introduced