NTCC REPORT
ON
LOGISTIC REGRESSION

SUBMITTED BY
CHANDNI MISHRA
TOWARDS PARTIAL COMPLETION OF M.SC. STATISTICS

UNDER THE SUPERVISION OF
DR. C.M. PANDEY
PROFESSOR AND HEAD
DEPARTMENT OF BIOSTATISTICS AND HEALTH INFORMATICS
SANJAY GANDHI POSTGRADUATE INSTITUTE OF MEDICAL SCIENCES, LUCKNOW

ACKNOWLEDGMENT
I would like to express my sincere gratitude to my NTCC guide, Dr. Neeraj Singh, and my training guide, Dr. C.M. Pandey, for providing their invaluable guidance towards the completion of this report.

ABSTRACT
The purpose of this report is to give researchers and readers a basic understanding of logistic regression with the help of some examples, and to show how to carry out the calculations and interpret the results using software such as SPSS. The report includes tables, graphs, and calculations produced with SPSS, together with their interpretation.
BRIEF INTRODUCTION TO SPSS
SPSS (Statistical Package for the Social Sciences) is a versatile and responsive program designed to undertake a range of statistical procedures. When SPSS was conceived in 1968, the name stood for Statistical Package for the Social Sciences. Since the company's purchase by IBM in 2009, IBM has used the name SPSS to describe its core product of predictive analytics. IBM describes predictive analytics as tools that help connect data to effective action by drawing reliable conclusions about current conditions and future events.
SPSS is an integrated system of computer programs designed for the analysis of social sciences data. It is one of the most popular of the many statistical packages currently available for statistical analysis. Its popularity stems from the fact that the program:
- allows a great deal of flexibility in the format of data;
- provides the user with a comprehensive set of procedures for data transformation and file manipulation;
- offers the researcher a large number of statistical analyses commonly used in the social sciences.

CONTENTS
1. Introduction to the Logistic Regression Model
   Introduction; Fitting the Logistic Regression Model; Testing for the Significance of the Coefficients; Confidence Interval Estimation
2. The Multiple Logistic Regression Model
   Introduction; The Multiple Logistic Regression Model; Fitting the Multiple Logistic Regression Model; Testing for the Significance of the Model; Confidence Interval Estimation
3. Interpretation of the Fitted Logistic Regression Model
   Introduction; Dichotomous Independent Variable; Polychotomous Independent Variable; Continuous Independent Variable; Multivariable Models; Presentation and Interpretation of the Fitted Values
4. Model-Building Strategies and Methods for Logistic Regression
   Introduction; Purposeful Selection of Covariates; Case Study
Appendix

REVIEW - Types of Data; Measurement Scales: Nominal, Ordinal, Interval, and Ratio
These are simply ways to categorize different types of variables.
Nominal - Nominal scales are used for labeling variables, without any quantitative value. A good way to remember this is that "nominal" sounds a lot like "name", and nominal scales are essentially "names" or labels. Examples of nominal variables include region, zip code, gender, or religious affiliation. The nominal scale can also be coded by the researcher to ease the analysis process, for example M = Male, F = Female.
Ordinal - This level of measurement involves ordering or ranking the variable being measured. The order of the values is what is important and significant, but the differences between them are not really known. For example, is the difference between "OK" and "Unhappy" the same as the difference between "Very Happy" and "Happy"? We can't say. Ordinal scales are typically measures of non-numeric concepts like satisfaction, happiness, or discomfort.
Interval - The interval level of measurement not only classifies and orders the measurements, but also specifies that the distances between intervals are equivalent along the scale, from low to high. For example, on a standardized intelligence measure, a 10-point difference in IQ scores has the same meaning anywhere along the scale: the difference between IQ scores of 80 and 90 is the same as the difference between 110 and 120. However, it would not be correct to say that a person with an IQ score of 100 is twice as intelligent as a person with a score of 50. This is because intelligence test scales (and other similar interval scales) do not have a true zero that represents a complete absence of intelligence.
Ratio - In this level of measurement the observations, in addition to having equal intervals, can take a value of zero. The true zero in the scale makes this type of measurement unlike the other types, although its other properties are the same as those of the interval level. In the ratio level of measurement, the divisions between the points on the scale have equivalent distances between them.
The four data types

Attribute              Nominal                            Ordinal                             Interval                  Ratio
Also called            Categorical                        Sequence                            Equal interval            Ratio
Structure              Set                                Fully ordered (rank ordered)        Unit size fixed           Zero or reference point fixed
Statistics             Count, mode, chi-squared           + median, rank-order correlation    + ANOVA, mean, SD         + logs, ratios
Example                Set of participants, makes of car  Order of finishing a race           Centigrade scale          Degrees Kelvin (absolute)
Relations (relative)   A ≠ B                              A > B                               |A − B| > |C − D|         A/B
Relations (absolute)   Identity of individual entities    Order, sequence                     Intervals, differences    Ratios, proportions

Note: odds can have a large magnitude even if the underlying probabilities are low.

Probability: P = (number of outcomes of interest) / (number of all possible outcomes)

Odds = P(occurring) / P(not occurring) = p / (1 − p); that is, the odds of an event are the number of events divided by the number of non-events.
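To make these definitions concrete, here is a minimal Python sketch (the probabilities used are illustrative, not taken from any dataset in this report) that converts probabilities to odds and forms the ratio of two odds, which is defined next:

```python
def odds(p):
    """Odds corresponding to a probability p: p / (1 - p)."""
    return p / (1 - p)

def odds_ratio(p1, p0):
    """Odds ratio comparing two groups with event probabilities p1 and p0."""
    return odds(p1) / odds(p0)

# Illustrative probabilities: the odds ratio can be large even when
# both underlying probabilities are small.
print(odds(0.2))                # 0.25
print(odds_ratio(0.04, 0.01))   # about 4.13
```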
Odds ratio - An odds ratio is a ratio of two odds:

OR = odds1 / odds0 = [p1 / (1 − p1)] / [p0 / (1 − p0)]

Introduction to the Logistic Regression Model

INTRODUCTION
Logistic regression is the appropriate regression analysis to conduct when the dependent variable (y) is dichotomous (binary), such as "yes"/"no", "0"/"1", or "A"/"B". Logistic regression allows one to predict a discrete outcome, such as group membership, from a set of variables that may be continuous, discrete, dichotomous, or a mix of any of these. Generally, the dependent variable is dichotomous, such as male/female, smoker/non-smoker, or success/failure. Like all regression analyses, logistic regression is a predictive analysis. It is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.
The logistic regression model is the most frequently used regression model for the analysis of such data. The independent variables are often called covariates. What distinguishes a logistic regression model from the linear regression model is that the outcome variable in logistic regression is categorical. This difference between logistic and linear regression is reflected both in the form of the model and in its assumptions. Once this difference is accounted for, the methods employed in an analysis using logistic regression follow, more or less, the same general principles used in linear regression.
Thus, the techniques used in linear regression analysis motivate our approach to logistic regression.
There are three primary uses of logistic regression:
1. Prediction of group membership and outcome. The goal is to correctly predict the category of the outcome for individual cases. Thus, the research question asked is whether an outcome can be predicted from a selected set of independent variables. For instance, in epidemiological studies, can the development of lung cancer be predicted from the incidence and duration of smoking as well as from demographic variables such as gender, age, and socioeconomic status (SES)?
2. Knowledge of the relationships and strengths among the variables. The goal is to identify which independent variables predict the outcome, that is, increase or decrease the probability of the outcome, or have no effect. For example, does inclusion of information about the incidence and duration of smoking improve prediction of lung cancer, and is a particular variable associated with an increase or decrease in the probability that a case has lung cancer? The parameter estimates (the coefficients of the predictors included in the model) can also be used to calculate and interpret odds ratios. For instance, what are the odds that a person has lung cancer at age 65, given that he has smoked 10 packs a day for the past 30 years?
3. Classification of cases. The goal is to understand how reliably the logistic regression model classifies cases for whom the outcome is known. For instance, how many people with or without lung cancer are diagnosed correctly? The researcher establishes a cut point of, say, 0.5, and then asks, for instance: how many people with lung cancer are correctly classified if everyone with a predicted probability greater than 0.5 is diagnosed as having lung cancer?

Why will other regression procedures not work?
- Simple linear regression is one quantitative variable predicting another.
- Multiple regression is simple linear regression with more independent variables.
- Nonlinear regression still relates two quantitative variables, but the data are curvilinear.
Running a typical linear regression on such data is problematic, because binary data do not have a normal distribution, which is a condition needed for most other types of regression analysis.
Example 1: Table 1.1 (see the Appendix) lists the age in years (AGE) and the presence or absence of evidence of significant coronary heart disease (CHD) for 100 subjects in a hypothetical study of risk factors for heart disease. The table also contains an identifier variable (ID) and an age-group variable (AGEGRP). The outcome variable is CHD, which is coded with a value of "0" to indicate that CHD is absent, or "1" to indicate that it is present in the individual. In general, any two values could be used, but we have found it most convenient to use zero and one. We refer to this dataset as the CHDAGE data.
Figure 1.1 Scatterplot of CHD (present = 1, absent = 0) by AGE in years.

A scatterplot of the data in Table 1.1 is given in Figure 1.1. In this scatterplot, all points fall on one of two parallel lines representing the absence of CHD (y = 0) or the presence of CHD (y = 1). There is some tendency for the individuals with no evidence of CHD to be younger than those with evidence of CHD. While this plot does depict the dichotomous nature of the outcome variable quite clearly, it does not provide a clear picture of the nature of the relationship between CHD and AGE.
The main problem with Figure 1.1 is that the variability in CHD at all ages is large. This makes it difficult to see any functional relationship between AGE and CHD. One common method of removing some variation, while still maintaining the structure of the relationship between the outcome and the independent variable, is to create intervals for the independent variable and compute the mean of the outcome variable within each group. We use this strategy by grouping age into the categories (AGEGRP) defined in Table 1.1.
Table 1.2 contains, for each age group, the frequency of occurrence of each outcome, as well as the proportion with CHD present.

Table 1.2 Frequency Table of Age Group by CHD
Age group    n     CHD Absent   CHD Present   Mean (proportion with CHD)
20–29        10        9             1              0.100
30–34        15       13             2              0.133
35–39        12        9             3              0.250
40–44        15       10             5              0.333
45–49        13        7             6              0.462
50–54         8        3             5              0.625
55–59        17        4            13              0.765
60–69        10        2             8              0.800
Total       100       57            43              0.430

Figure 1.2 Plot of the proportion of subjects with CHD in each age group versus the midpoint of each age interval (AGE in years).

By examining Table 1.2, a clearer picture of the relationship begins to emerge. It shows that as age increases, the proportion (mean) of individuals with evidence of CHD increases. Figure 1.2 presents a plot of the proportion of individuals with CHD versus the midpoint of each age interval. This plot provides considerable insight into the relationship between CHD and AGE in this study, but the functional form for this relationship needs to be described. The plot in this figure is similar to what one might obtain if this same process of grouping and averaging were performed in a linear regression.
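The grouping-and-averaging step behind Table 1.2 is straightforward to reproduce in code. A minimal sketch, assuming the CHDAGE data are available as columns named AGE and CHD (the ten records shown are illustrative stand-ins, not the actual data):

```python
import pandas as pd

# Illustrative stand-in for the CHDAGE data (age, CHD status 0/1).
df = pd.DataFrame({"AGE": [24, 28, 32, 37, 43, 48, 52, 57, 63, 67],
                   "CHD": [0,  0,  0,  1,  0,  1,  1,  1,  1,  1]})

# Group age into intervals and compute the mean of CHD (the proportion
# with CHD present) within each group, as in Table 1.2.
bins = [19, 29, 34, 39, 44, 49, 54, 59, 69]
df["AGEGRP"] = pd.cut(df["AGE"], bins=bins)
print(df.groupby("AGEGRP", observed=True)["CHD"].agg(["count", "sum", "mean"]))
```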
We note two important differences.

Some important facts:
- The dependent variable in logistic regression follows the Bernoulli distribution with an unknown success probability p.
- The Bernoulli distribution is a special case of the binomial distribution with n = 1 (a single trial); success is coded "1" and failure "0", with probability of success p and of failure q = 1 − p.
- In logistic regression, we estimate the unknown p for any given linear combination of the independent variables. We therefore need a link between the independent variables and the Bernoulli distribution; that link is called the logit.

The first difference concerns the nature of the relationship between the outcome and independent variables. In any regression problem, the key quantity is the mean value of the outcome variable, given the value of the independent variable. This quantity is called the conditional mean and is expressed as E(Y|x), where Y denotes the outcome variable and x denotes a specific value of the independent variable. The quantity E(Y|x) is read "the expected value of Y, given the value x".
In linear regression, we assume that this mean may be expressed as an equation linear in x, such as E(Y|x) = β0 + β1x. This expression implies that it is possible for E(Y|x) to take on any value as x ranges between −∞ and +∞. The column labeled "Mean" in Table 1.2 provides an estimate of E(Y|x). We assume, for purposes of exposition, that the estimated values plotted in Figure 1.2 are close enough to the true values of E(Y|x) to provide a reasonable assessment of the functional relationship between CHD and AGE.
With a dichotomous outcome variable, the conditional mean must be greater than or equal to zero and less than or equal to one (i.e., 0 ≤ E(Y|x) ≤ 1). This can be seen in Figure 1.2. In addition, the plot shows that this mean approaches zero and one "gradually": the change in E(Y|x) per unit change in x becomes progressively smaller as the conditional mean gets closer to zero or one. The curve is said to be S-shaped and resembles the plot of the cumulative distribution function of a continuous random variable. Thus, it should not seem surprising that some well-known cumulative distributions have been used to provide a model for E(Y|x) in the case when Y is dichotomous. The model we use is based on the logistic distribution.
In order to simplify notation, we use the quantity π(x) = E(Y|x) to represent the conditional mean of Y given x when the logistic distribution is used. The specific form of the logistic regression model we use is:

π(x) = e^(β0 + β1x) / (1 + e^(β0 + β1x))    (1.1)

A transformation of π(x) that is central to our study of logistic regression is the logit transformation. This transformation is defined, in terms of π(x), as:

g(x) = ln[π(x) / (1 − π(x))] = β0 + β1x    (1.1*)

The importance of this transformation is that g(x) has many of the desirable properties of a linear regression model. The logit, g(x), is linear in its parameters, may be continuous, and may range from −∞ to +∞, depending on the range of x.
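Equations (1.1) and (1.1*) are inverses of one another, which the following sketch makes explicit (the parameter values are arbitrary and purely illustrative):

```python
import numpy as np

def pi_of_x(x, b0, b1):
    """Logistic model of equation (1.1): E(Y|x), bounded between 0 and 1."""
    return np.exp(b0 + b1 * x) / (1 + np.exp(b0 + b1 * x))

def logit(p):
    """Logit transformation of equation (1.1*): ln[p / (1 - p)]."""
    return np.log(p / (1 - p))

b0, b1 = -5.0, 0.1            # arbitrary illustrative parameters
x = 40.0
p = pi_of_x(x, b0, b1)
print(p, logit(p), b0 + b1 * x)   # logit(pi(x)) recovers b0 + b1*x
```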
The second important difference between the linear and logistic regression models concerns the conditional distribution of the outcome variable. In the linear regression model we assume that an observation of the outcome variable may be expressed as y = E(Y|x) + ε. The quantity ε is called the error and expresses an observation's deviation from the conditional mean. The most common assumption is that ε follows a normal distribution with mean zero and some variance that is constant across levels of the independent variable. It follows that the conditional distribution of the outcome variable given x is normal with mean E(Y|x) and a variance that is constant. This is not the case with a dichotomous outcome variable. In this situation, we may express the value of the outcome variable given x as y = π(x) + ε. Here the quantity ε may assume one of two possible values. If y = 1 then ε = 1 − π(x) with probability π(x), and if y = 0 then ε = −π(x) with probability 1 − π(x). Thus, ε has a distribution with mean zero and variance equal to π(x)[1 − π(x)]. That is, the conditional distribution of the outcome variable follows a binomial distribution with probability given by the conditional mean, π(x).
In summary, we have shown that in a regression analysis when the outcome variable is dichotomous:
1. The model for the conditional mean of the regression equation must be bounded between zero and one. The logistic regression model, π(x), given in equation (1.1), satisfies this constraint.
2. The binomial, not the normal, distribution describes the distribution of the errors and is the statistical distribution on which the analysis is based.
3. The principles that guide an analysis using linear regression also guide us in logistic regression.

FITTING THE LOGISTIC REGRESSION MODEL
With a dichotomous outcome variable given x, expressed as y = π(x) + ε, the error ε may assume one of two possible values: when y = 1, ε = 1 − π(x) with probability π(x); when y = 0, ε = −π(x) with probability 1 − π(x). As noted above, ε has mean zero and variance π(x)[1 − π(x)].
Fitting the logistic regression model in equation (1.1) to a set of data requires that we estimate the values of β0 and β1, the unknown parameters. The general method of estimation that leads to the least squares function under the linear regression model (when the error terms are normally distributed) is maximum likelihood, and this method provides the foundation for our approach to estimation with the logistic regression model. The contribution of the pair (xi, yi) to the likelihood is:

f(xi, yi) = π(xi)^yi [1 − π(xi)]^(1−yi)    (1.2)

The likelihood function is given by:

l(β) = Π(i=1 to n) π(xi)^yi [1 − π(xi)]^(1−yi)    (1.3)

Taking the log of both sides, we get the log-likelihood:

L(β) = ln l(β) = Σ(i=1 to n) { yi ln[π(xi)] + (1 − yi) ln[1 − π(xi)] }    (1.4)

To find the value of β that maximizes L(β), we differentiate L(β) with respect to β0 and β1 and set the resulting expressions equal to zero. These equations, known as the likelihood equations, are:

Σ(i=1 to n) [yi − π(xi)] = 0    (1.5)

Σ(i=1 to n) xi [yi − π(xi)] = 0    (1.6)

The value of β given by the solution to equations (1.5) and (1.6) is called the maximum likelihood estimate and is denoted β̂. In general, the "hat" symbol denotes the maximum likelihood estimate of the respective quantity. For example, π̂(xi) is the maximum likelihood estimate of π(xi). This quantity provides an estimate of the conditional probability that Y is equal to 1, given that x is equal to xi. As such, it represents the fitted or predicted value for the logistic regression model. An interesting consequence of equation (1.5) is that:

Σ(i=1 to n) yi = Σ(i=1 to n) π̂(xi)
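Outside SPSS, the maximization described above can be sketched in a few lines of Python. This is only an illustration, not the report's SPSS workflow: the ten data points are invented stand-ins for data of the CHDAGE type, and statsmodels is used to solve the likelihood equations numerically:

```python
import numpy as np
import statsmodels.api as sm

# Illustrative stand-in for CHDAGE-type data: age and CHD status (0/1).
ages = np.array([25, 30, 34, 38, 42, 46, 50, 54, 58, 62], dtype=float)
chd  = np.array([ 0,  0,  0,  1,  0,  1,  0,  1,  1,  1], dtype=float)

# Maximum likelihood fit of equation (1.1); this solves the likelihood
# equations (1.5) and (1.6) numerically.
X = sm.add_constant(ages)            # columns: intercept, AGE
fit = sm.Logit(chd, X).fit(disp=0)   # disp=0 suppresses the iteration log
b0_hat, b1_hat = fit.params
print(f"beta0_hat = {b0_hat:.3f}, beta1_hat = {b1_hat:.3f}")
print("log-likelihood:", round(fit.llf, 4))

# Consequence of equation (1.5): the fitted values sum to the observed events.
print(chd.sum(), fit.predict(X).sum().round(3))
```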
As an example, consider the data given in Table 1.1. Use of a logistic regression software package, here SPSS, with the continuous variable AGE as the independent variable, produces the output in Table 1.3.

Table 1.3 Results of Fitting the Logistic Regression Model to the CHDAGE Data

Variables in the Equation
                    B        S.E.    Wald     df   Sig.   Exp(B)   95% C.I. for EXP(B)
                                                                   Lower    Upper
Step 1a  AGE        .111     .024    21.254   1    .000   1.117    1.066    1.171
         Constant   -5.309   1.134   21.935   1    .000   .005
Log-likelihood = −53.676546

The maximum likelihood estimates of β0 and β1 are β̂0 = −5.309 and β̂1 = 0.111. The fitted values are given by the equation:

π̂(x) = e^(−5.309 + 0.111·AGE) / (1 + e^(−5.309 + 0.111·AGE))    (1.7)

and the estimated logit is:

ĝ(x) = −5.309 + 0.111·AGE    (1.8)

The log-likelihood given in Table 1.3 is the value of equation (1.4) computed using β̂0 and β̂1.

TESTING FOR THE SIGNIFICANCE OF THE COEFFICIENTS
In logistic regression, comparison of observed to predicted values is based on the log-likelihood function defined in equation (1.4).
The comparison of observed to predicted values using the likelihood function is based on the following expression:

D = −2 ln[ (likelihood of the fitted model) / (likelihood of the saturated model) ]    (1.9)

The quantity inside the large brackets in the expression above is called the likelihood ratio. Using minus twice its log is necessary to obtain a quantity whose distribution is known and can, therefore, be used for hypothesis-testing purposes. Such a test is called the likelihood ratio test. Using equation (1.4), equation (1.9) becomes:

D = −2 Σ(i=1 to n) { yi ln(π̂i / yi) + (1 − yi) ln[(1 − π̂i) / (1 − yi)] }    (1.10)

The statistic D in equation (1.10) is called the deviance, and for logistic regression it plays the same role that the residual sum-of-squares plays in linear regression. In fact, the deviance shown in equation (1.10), when computed for linear regression, is identically equal to the SSE.
In particular, to assess the significance of an independent variable we compare the value of D with and without the independent variable in the equation. The change in D due to the inclusion of the independent variable in the model is:

G = D(model without the variable) − D(model with the variable)    (1.11)
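A minimal sketch of these two quantities for 0/1 outcomes (the vector names y and pi_hat are assumptions of this illustration; for binary data the saturated-model likelihood equals 1, which simplifies D):

```python
import numpy as np

def deviance(y, pi_hat):
    """Deviance D of equation (1.10). For 0/1 outcomes the saturated model
    has likelihood 1, so D = -2 * log-likelihood of the fitted model."""
    return -2.0 * np.sum(y * np.log(pi_hat) + (1 - y) * np.log(1 - pi_hat))

def g_statistic(y, pi_hat_with):
    """G of equation (1.11). With a single covariate, the model without the
    variable predicts the constant rate n1/n for every subject."""
    p0 = y.mean()                                            # n1/n
    d_without = deviance(y, np.full_like(y, p0, dtype=float))
    d_with = deviance(y, pi_hat_with)
    return d_without - d_with
```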
This statistic, G, plays the same role in logistic regression that the numerator of the partial F-test does in linear regression. Because the likelihood of the saturated model is common to both values of D being differenced, G can be expressed as:

G = −2 ln[ (likelihood without the variable) / (likelihood with the variable) ]    (1.12)

For the specific case of a single independent variable, it is easy to show that when the variable is not in the model, the maximum likelihood estimate of β0 is ln(n1/n0), where n1 = Σyi and n0 = Σ(1 − yi), and the predicted probability for all subjects is constant and equal to n1/n. In this setting, the value of G is:

G = −2 ln{ [(n1/n)^n1 (n0/n)^n0] / Π(i=1 to n) [π̂i^yi (1 − π̂i)^(1−yi)] }    (1.13)

CONFIDENCE INTERVAL ESTIMATION
The Wald test statistic is equal to the ratio of the maximum likelihood estimate of the slope parameter, β̂1, to an estimate of its standard error. Under the null hypothesis and the usual sample-size assumptions, this ratio follows a standard normal distribution:

W = β̂1 / SE(β̂1)

The basis for construction of the interval estimators is the same statistical theory we used to formulate the tests for the significance of the model. In particular, the confidence interval estimators for the slope and intercept are, most often, based on their respective Wald tests and are sometimes referred to as Wald-based confidence intervals. The endpoints of a 100(1 − α)% confidence interval for the slope and intercept are:

β̂1 ± z(1−α/2) SE(β̂1)    (1.14)

β̂0 ± z(1−α/2) SE(β̂0)    (1.15)

where z(1−α/2) is the upper 100(1 − α/2)% point of the standard normal distribution and SE(·) denotes a model-based estimator of the standard error of the respective parameter estimator. Since we are using software such as SPSS, we do not need to calculate these by hand: Table 1.3 reports the confidence interval for Exp(B) = e^β̂1, and taking the natural log of its endpoints gives a confidence interval for β1.

CHAPTER 2. The Multiple Logistic Regression Model

INTRODUCTION
In Chapter 1 we introduced the logistic regression model in the context of a model containing a single variable.
In this chapter, we generalize the model to one with more than one independent variable (i.e., the multivariable or multiple logistic regression model). Central to the consideration of the multiple logistic regression model are estimating its coefficients and testing for their significance.
The logit of the multiple logistic regression model is given by generalizing equations (1.1) and (1.1*):

g(x) = ln[π(x) / (1 − π(x))] = β0 + β1x1 + β2x2 + · · · + βp xp    (2.1)

where, for the multiple logistic regression model,

π(x) = e^g(x) / (1 + e^g(x))    (2.2)

FITTING THE MULTIPLE LOGISTIC REGRESSION MODEL
The method of estimation used in the multivariable case is the same as in the univariable situation: maximum likelihood. The likelihood function is nearly identical to that given in equation (1.3), with the only change being that π(x) is now defined as in equation (2.2). There are p + 1 likelihood equations, obtained by differentiating the log-likelihood function with respect to the p + 1 coefficients. The likelihood equations that result may be expressed as follows:

Σ(i=1 to n) [yi − π(xi)] = 0

and

Σ(i=1 to n) xij [yi − π(xi)] = 0,  for j = 1, 2, . . . , p

As in the univariable model, the solution of the likelihood equations requires software that is available in virtually every statistical software package. Let β̂ denote the solution to these equations. The fitted values for the multiple logistic regression model are then π̂(xi), the value of the expression in equation (2.2) computed using β̂ and xi.
Using the Global Longitudinal Study of Osteoporosis in Women (GLOW) dataset as an example, we consider five variables thought to be of importance: age at enrollment (AGE), weight at enrollment (WEIGHT), history of a previous fracture (PRIORFRAC), whether the woman experienced menopause before age 45 (PREMENO), and self-reported risk of fracture relative to women of the same age (RATERISK), coded at three levels: less, same, or more risk.
Table 2.2 Fitted Multiple Logistic Regression Model of Fracture in the First Year of Follow-Up (FRACTURE) on Age, Weight, Prior Fracture (PRIORFRAC), Early Menopause (PREMENO), and Self-Reported Risk of Fracture (RATERISK) from the GLOW Study, n = 500

Variables in the Equation
                      B        S.E.    Wald     df   Sig.   Exp(B)   95% C.I. for EXP(B)
                                                                     Lower    Upper
Step 1a  AGE          .050     .013    13.966   1    .000   1.051    1.024    1.079
         WEIGHT       .004     .007    .347     1    .556   1.004    .991     1.018
         PRIORFRAC    .679     .242    7.858    1    .005   1.973    1.227    3.173
         PREMENO      .187     .277    .456     1    .499   1.206    .701     2.074
         RATERISK                      9.181    2    .010
         RATERISK(1)  .534     .276    3.754    1    .053   1.707    .994     2.930
         RATERISK(2)  .874     .289    9.139    1    .003   2.397    1.360    4.224
         Constant     -5.606   1.221   21.090   1    .000   .004
a. Variable(s) entered on step 1: AGE, WEIGHT, PRIORFRAC, PREMENO, RATERISK.

Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      518.075a            .085                   .125
a. Estimation terminated at iteration number 5 because parameter estimates changed by less than .001.
Log-likelihood = −259.03768
In the example given above, the variable RATERISK is modeled using two design variables, since it is coded at three levels. In SPSS, this coding is specified through the Categorical option of the Logistic Regression procedure. In Table 2.2 the estimated coefficients for the two design variables for RATERISK are labeled RATERISK(1) and RATERISK(2). The estimated logit is given by the following equation:

ĝ(x) = −5.606 + 0.050·AGE + 0.004·WEIGHT + 0.679·PRIORFRAC + 0.187·PREMENO + 0.534·RATERISK1 + 0.874·RATERISK2

and the associated estimated logistic probabilities are found by using equation (2.2).
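As an illustration of how equation (2.2) turns this estimated logit into a probability, consider the following sketch; the coefficients are those of Table 2.2, but the covariate pattern is hypothetical:

```python
import numpy as np

# Estimated coefficients from Table 2.2, in the order:
# constant, AGE, WEIGHT, PRIORFRAC, PREMENO, RATERISK1, RATERISK2.
beta = np.array([-5.606, 0.050, 0.004, 0.679, 0.187, 0.534, 0.874])

# Hypothetical covariate pattern: a 65-year-old woman weighing 70 kg, with a
# prior fracture, no early menopause, self-reported risk "same" (RATERISK1 = 1).
x = np.array([1.0, 65.0, 70.0, 1.0, 0.0, 1.0, 0.0])

g = beta @ x                       # estimated logit g_hat(x)
pi = np.exp(g) / (1 + np.exp(g))   # equation (2.2)
print(f"logit = {g:.3f}, estimated probability of fracture = {pi:.3f}")
```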
TESTING FOR THE SIGNIFICANCE OF THE MODEL
Once we have fit a particular multiple (multivariable) logistic regression model, we begin the process of model assessment. The likelihood ratio test for the overall significance of the p coefficients of the independent variables in the model is performed in exactly the same manner as in the univariable case. The test is based on the statistic G given in equation (1.12).
Consider the fitted model whose estimated coefficients are given in Table 2.2. For that model, the value of the log-likelihood, shown at the bottom of the table, is L = −259.0377. The log-likelihood for the constant-only model may be obtained by evaluating the numerator of equation (1.13) or by fitting the constant-only model. Either method yields the log-likelihood L = −281.1676. Thus the value of the likelihood ratio test is, from equation (1.12),

G = −2[−281.1676 − (−259.0377)] = 44.2598

and the p-value for the test is P[χ²(6) > 44.2598] < 0.0001, which is significant well beyond the α = 0.05 level. We reject the null hypothesis and conclude that at least one of the p coefficients is different from zero, an interpretation analogous to that of the F-test in multiple linear regression.
The p-values for the individual coefficients are shown in the fifth column of Table 2.2. If we use a level of significance of 0.05, then we conclude that the variables AGE, history of prior fracture (PRIORFRAC), and self-reported rate of risk (RATERISK) are statistically significant, while WEIGHT and early menopause (PREMENO) are not.
As our goal is to obtain the best-fitting model while minimizing the number of parameters, the next logical step is to fit a reduced model containing only those variables thought to be significant and compare it to the full model containing all of the variables.
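The likelihood ratio tests in this section are easy to verify by hand; a sketch using scipy, shown for the overall test of the model in Table 2.2 (the same computation, with df = 2, applies to the full-versus-reduced comparison below):

```python
from scipy.stats import chi2

ll_constant_only = -281.1676   # log-likelihood of the constant-only model
ll_full = -259.0377            # log-likelihood of the model in Table 2.2

G = -2 * (ll_constant_only - ll_full)   # equation (1.12)
p_value = chi2.sf(G, df=6)              # 6 covariate coefficients in the model
print(f"G = {G:.4f}, p = {p_value:.2e}")
```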
The results of fitting the reduced model are given in Table 2.3. The difference between the two models is the exclusion of the variables WEIGHT and early menopause (PREMENO) from the full model. The likelihood ratio test comparing these two models is obtained using the definition of G given in equation (1.12). It has a distribution that is chi-square with 2 degrees of freedom under the hypothesis that the coefficients for the two excluded variables are equal to zero. The value of the test statistic comparing the model in Table 2.3 to the one in Table 2.2 is

G = −2[−259.4494 − (−259.0377)] = 0.8234

Table 2.3 Fitted Multiple Logistic Regression Model of Fracture in the First Year of Follow-Up (FRACTURE) on AGE, Prior Fracture (PRIORFRAC), and Self-Reported Risk of Fracture (RATERISK) from the GLOW Study, n = 500

Variables in the Equation
                      B        S.E.   Wald     df   Sig.   Exp(B)   95% C.I. for EXP(B)
                                                                    Lower    Upper
Step 1a  AGE          .046     .012   13.618   1    .000   1.047    1.022    1.073
         PRIORFRAC    .700     .241   8.431    1    .004   2.014    1.256    3.231
         RATERISK                     9.223    2    .010
         RATERISK(1)  .549     .275   3.979    1    .046   1.731    1.010    2.967
         RATERISK(2)  .866     .286   9.150    1    .002   2.377    1.356    4.165
         Constant     -4.991   .903   30.565   1    .000   .007
a. Variable(s) entered on step 1: AGE, PRIORFRAC, RATERISK.

Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      518.899a            .083                   .123
a. Estimation terminated at iteration number 5 because parameter estimates changed by less than .001.
Log-likelihood = −259.4494

which, with 2 degrees of freedom, has a p-value of P[χ²(2) > 0.8234] = 0.663.
As the p-value is large, exceeding 0.05, we conclude that the full model is no better than the reduced model; that is, there is little statistical justification for including WEIGHT and PREMENO in the model. However, we must not base our models entirely on tests of statistical significance.

CONFIDENCE INTERVAL ESTIMATION
The methods used to obtain confidence interval estimators for a multivariable model are essentially the same as in the univariable case.
In Table 2.3, the 95 percent confidence intervals for the exponentiated coefficients, Exp(B), are given; taking the natural log of their endpoints yields confidence intervals for the coefficients themselves. In Table 2.4 we take the natural log of the confidence limits for Exp(B) to obtain confidence limits for B.

Table 2.4 Confidence Limits for the Coefficients, Obtained by Taking the Natural Log of the Confidence Limits for Exp(B) in Table 2.3

Variables in the Equation
                      B        S.E.   Wald     df   Sig.   Exp(B)   95% C.I. for B
                                                                    Lower    Upper
Step 1a  AGE          .046     .012   13.618   1    .000   1.047    0.0218   0.0705
         PRIORFRAC    .700     .241   8.431    1    .004   2.014    0.2279   1.1728
         RATERISK                     9.223    2    .010
         RATERISK(1)  .549     .275   3.979    1    .046   1.731    0.0100   1.0876
         RATERISK(2)  .866     .286   9.150    1    .002   2.377    0.3045   1.4267
         Constant     -4.991   .903   30.565   1    .000   .007
a. Variable(s) entered on step 1: AGE, PRIORFRAC, RATERISK.
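This log transformation is easily checked; a sketch using the Exp(B) limits of Table 2.3:

```python
import numpy as np

# 95% confidence limits for Exp(B) from Table 2.3.
ci_exp_b = {"AGE": (1.022, 1.073), "PRIORFRAC": (1.256, 3.231),
            "RATERISK(1)": (1.010, 2.967), "RATERISK(2)": (1.356, 4.165)}

# The natural log of the limits gives the limits for B itself (Table 2.4).
for name, (lo, hi) in ci_exp_b.items():
    print(f"{name}: ({np.log(lo):.4f}, {np.log(hi):.4f})")
```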
CHAPTER 3. Interpretation of the Fitted Logistic Regression Model

INTRODUCTION
We begin this chapter assuming that a logistic regression model has been fit, that the variables in the model are significant in either a clinical or statistical sense, and that the model fits according to some statistical measure of fit. The interpretation of any fitted model requires that we be able to draw practical inferences from its estimated coefficients. This involves two issues: determining the functional relationship between the dependent variable and the independent variable, and appropriately defining the unit of change for the independent variable.

DICHOTOMOUS INDEPENDENT VARIABLE
This case provides the conceptual foundation for all the other situations. We assume that the independent variable, x, is coded as either 0 or 1. The difference in the logit for a subject with x = 1 and x = 0 is

g(1) − g(0) = (β0 + β1 × 1) − (β0 + β1 × 0) = β1

The practical problem is that a change on the scale of the log-odds is hard to explain, and it may not be especially meaningful to a subject-matter audience.
In order to provide a more meaningful interpretation, we introduce the odds ratio as a measure of association. The odds of the outcome being present among individuals with x = 1 is π(1)/[1 − π(1)]. Similarly, the odds of the outcome being present among individuals with x = 0 is π(0)/[1 − π(0)]. The odds ratio, denoted OR, is the ratio of the odds for x = 1 to the odds for x = 0:

OR = { π(1)/[1 − π(1)] } / { π(0)/[1 − π(0)] }    (3.1)

Substituting the expressions for the logistic regression model probabilities in Table 3.1 into equation (3.1), the factors 1/(1 + e^(β0+β1)) and 1/(1 + e^β0) cancel between numerator and denominator, and we obtain

OR = e^(β0+β1) / e^β0 = e^(β0+β1−β0) = e^β1

Hence, for a logistic regression model with a dichotomous independent variable coded 0 and 1, the relationship between the odds ratio and the regression coefficient is:

OR = e^β1    (3.2)

Table 3.1 Values of the Logistic Regression Model when the Independent Variable Is Dichotomous

Outcome (y)           x = 1                                     x = 0
y = 1        π(1) = e^(β0+β1) / (1 + e^(β0+β1))        π(0) = e^β0 / (1 + e^β0)
y = 0        1 − π(1) = 1 / (1 + e^(β0+β1))            1 − π(0) = 1 / (1 + e^β0)
Total        1.0                                        1.0

The odds ratio is widely used as a measure of association, as it approximates how much more likely or unlikely (in terms of odds) it is for the outcome to be present among those subjects with x = 1 as compared to those subjects with x = 0.
To review, the outcome variable is having a fracture (FRACTURE) in the first year of follow-up. Here we use having had a fracture between the age of 45 and enrollment in the study (PRIORFRAC) as the dichotomous independent variable.
The result of cross-classifying fracture during follow-up by prior fracture is presented in Table 3.2.

Table 3.2 Cross-Classification of Prior Fracture and Fracture During Follow-Up in the GLOW Study, n = 500

FRACTURE * PRIORFRAC Crosstabulation (Count)
                 PRIORFRAC = 0   PRIORFRAC = 1   Total
FRACTURE = 0         301              74          375
FRACTURE = 1          73              52          125
Total                374             126          500

The frequencies in Table 3.2 tell us that there were 52 subjects with values (x = 1, y = 1), 73 with (x = 0, y = 1), 74 with (x = 1, y = 0), and 301 with (x = 0, y = 0).
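These four frequencies are all that is needed to reproduce the logistic regression results that follow; a sketch of the cross-product ratio, its log, and the usual large-sample Wald interval (the standard error of the log odds ratio for a 2×2 table is the square root of the summed reciprocal cell counts, which reproduces the 0.2231 used later in this section):

```python
import numpy as np

# Cell counts from Table 3.2: a = (x=1, y=1), b = (x=0, y=1),
# c = (x=1, y=0), d = (x=0, y=0).
a, b, c, d = 52, 73, 74, 301

or_hat = (a * d) / (c * b)     # cross-product ratio
beta1_hat = np.log(or_hat)     # slope coefficient, ln(OR)

# Large-sample SE of ln(OR) for a 2x2 table.
se = np.sqrt(1/a + 1/b + 1/c + 1/d)
ci = np.exp(beta1_hat + np.array([-1.96, 1.96]) * se)
print(f"OR = {or_hat:.3f}, beta1 = {beta1_hat:.4f}, "
      f"95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```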
The results of fitting a logistic regression model containing the dichotomous covariate PRIORFRAC are shown in Table 3.3.

Table 3.3 Results of Fitting the Logistic Regression Model of Fracture (FRACTURE) on Prior Fracture (PRIORFRAC) Using the Data in Table 3.2

Variables in the Equation
                     B        S.E.   Wald      df   Sig.   Exp(B)   95% C.I. for EXP(B)
                                                                    Lower    Upper
Step 1a  PRIORFRAC   1.064    .223   22.741    1    .000   2.897    1.871    4.486
         Constant    -1.417   .130   117.908   1    .000   .243
a. Variable(s) entered on step 1: PRIORFRAC.

Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      540.068a            .044                   .065
Log-likelihood = −270.03397

The estimate of the odds ratio, using equation (3.2) and the estimated coefficient for PRIORFRAC in Table 3.3, is OR̂ = e^1.064 = 2.9. Readers who have had some previous experience with the odds ratio may wonder why we used a logistic regression package to estimate the odds ratio when we could easily have computed it directly as the cross-product ratio from the frequencies in Table 3.2, namely:

OR̂ = (52 × 301) / (74 × 73) = 2.897

Thus, the slope coefficient from the fitted logistic regression model is exactly the log of this cross-product ratio: β̂1 = ln[(52 × 301)/(74 × 73)] = 1.0638.
We obtain a 100(1 − α)% confidence interval estimator for the odds ratio by first calculating the endpoints of a confidence interval estimator for the log-odds ratio (i.e., β1) and then exponentiating these endpoints. In general, the endpoints are given by the expression:

exp[β̂1 ± z(1−α/2) SE(β̂1)]

As an example, consider the estimation of the odds ratio for the dichotomous variable PRIORFRAC. Using the results in Table 3.3, the point estimate is OR̂ = 2.9 and the 95% confidence interval is exp(1.064 ± 1.96 × 0.2231) = (1.87, 4.49). This interval is typical of many confidence intervals for odds ratios when the point estimate exceeds 1, in that it is skewed to the right of the point estimate. This confidence interval suggests that the odds of a fracture during follow-up among women with a prior fracture could be as little as 1.9 times or as much as 4.5 times the odds for women without a prior fracture, at the 95% level of confidence.

POLYCHOTOMOUS INDEPENDENT VARIABLE
Suppose that instead of two categories the independent variable has k > 2 distinct values. In the GLOW study, the covariate self-reported risk (RATERISK) is coded at three levels (less, same, and more). Its cross-tabulation with fracture during follow-up (FRACTURE) is shown in Table 3.5, together with the estimated odds ratios, their 95% confidence intervals, and the log-odds ratios for "same" and "more" versus "less" risk. The extension to a variable with more than three levels is not conceptually different, so all the examples in this section use k = 3. Using SPSS we obtain Tables 3.5 and 3.7.

Table 3.5 Cross-Classification of Fracture During Follow-Up (FRACTURE) by Self-Reported Rate of Risk (RATERISK) from the GLOW Study, n = 500

FRACTURE * RATERISK Crosstabulation (Count)
               RATERISK = 1 (Less)   RATERISK = 2 (Same)            RATERISK = 3 (More)            Total
FRACTURE = 0         139                  138                             98                        375
FRACTURE = 1          28                   48                             49                        125
Total                167                  186                            147                        500
Odds Ratio             1        (48 × 139)/(28 × 138) = 1.73    (49 × 139)/(28 × 98) = 2.482
95% CI                          (1.02, 2.91)                    (1.46, 4.22)
ln(OR)               0.0        0.55                            0.91

Table 3.6 Specification of the Design Variables for RATERISK Using Reference Cell Coding with "Less" as the Reference Group

RATERISK (Code)   RATERISK1   RATERISK2
Less (1)              0           0
Same (2)              1           0
More (3)              0           1

Table 3.7 Results of Fitting the Logistic Regression Model to the Data in Table 3.5 Using the Design Variables in Table 3.6

Variables in the Equation
                      B        S.E.   Wald     df   Sig.   Exp(B)   95% C.I. for EXP(B)
                                                                    Lower    Upper
Step 1a  RATERISK                     11.247   2    .004
         RATERISK(1)  .546     .266   4.203    1    .040   1.727    1.024    2.911
         RATERISK(2)  .909     .271   11.242   1    .001   2.482    1.459    4.223
         Constant     -1.602   .207   59.831   1    .000   .201

Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      550.578a            .023                   .034
a. Estimation terminated at iteration number 4 because parameter estimates changed by less than .001.
Log-likelihood = −275.28917
Table 3.7 gives the confidence intervals for the odds ratios, together with the estimated coefficients, their standard errors, and p-values.

CONTINUOUS INDEPENDENT VARIABLE
When a logistic regression model contains a continuous independent variable, interpretation of the estimated coefficient depends on how the variable is entered into the model and on its particular units. For purposes of developing the method to interpret the coefficient for a continuous variable, we assume that the logit is linear in the variable.
As an example, consider the results in Table 1.3 of the logistic regression of CHD status on AGE using the data in Table 1.1. The estimated logit is ĝ(AGE) = −5.309 + 0.111 × AGE. The estimated odds ratio for an increase of 10 years in age is OR̂(10) = exp(10 × 0.111) = 3.03. Thus, for every increase of 10 years in age, the odds of CHD being present are estimated to increase 3.03-fold. The validity of such a statement is questionable in general, because the increase in the odds of CHD for a 40-year-old compared to a 30-year-old may be quite different from that for a 60-year-old compared to a 50-year-old. This is the unavoidable dilemma when a continuous covariate is modeled linearly in the logit, and it motivates the importance of examining the linearity assumption for continuous covariates. The endpoints of a 95% confidence interval for this odds ratio are exp(10 × 0.111 ± 1.96 × 10 × 0.024) = (1.90, 4.86).
The interpretation of the estimated odds ratio for a continuous variable is similar to that of nominal scale variables. The main difference is that a meaningful change must be defined for the continuous variable.
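A sketch of this odds-ratio-for-a-c-unit-change calculation, using the AGE estimates of Table 1.3:

```python
import numpy as np

beta1, se1 = 0.111, 0.024   # slope and SE for AGE from Table 1.3
c = 10                      # change of interest: 10 years of age

or_c = np.exp(c * beta1)    # odds ratio for a c-unit increase
ci = np.exp(c * beta1 + np.array([-1.96, 1.96]) * c * se1)
print(f"OR(10) = {or_c:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```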
MULTIVARIABLE MODELS
Fitting a series of univariable models, although useful for a preliminary analysis, rarely provides an adequate or complete analysis of the data in a study, because the independent variables are usually associated with one another and may have different distributions within levels of the outcome variable. Thus, one generally uses a multivariable analysis for a more comprehensive modeling of the data. One goal of such an analysis is to statistically adjust the estimated effect of each variable in the model for differences in the distributions of, and associations among, the other independent variables in the model. Applying this concept to a multivariable logistic regression model, we may surmise that each estimated coefficient provides an estimate of the log-odds adjusting for the other variables in the model.
Another important aspect of multivariable modeling is to assess to what extent, if at all, the estimated log-odds for one independent variable changes depending on the value of another independent variable. When the odds ratio for one variable is not constant over the levels of another variable, the two variables are said to have a statistical interaction. In some applied disciplines statistical interaction is referred to as effect modification, terminology that describes the fact that the log-odds of one variable is modified or changed by values of the other variable. We begin with an example where there is neither statistical adjustment nor statistical interaction.
The data we use come from the GLOW study described in Dataset 2 of the Appendix. The outcome variable is having a fracture during the first year of follow-up (FRACTURE). For the dichotomous covariate we use history of prior fracture (PRIORFRAC), and for the continuous covariate we use height in centimeters (HEIGHT). The results from the three fitted models are presented in Table 3.10. In discussing the results we use significance levels from the Wald statistics; in all cases, the same conclusions would be reached had we used likelihood ratio tests.

Table 3.10 Estimated Logistic Regression Coefficients, Standard Errors, Wald Statistics, p-Values, and 95% CIs from Three Models Showing No Statistical Adjustment and No Statistical Interaction, GLOW Study, n = 500

Model 1
                        B        S.E.    Wald      df   Sig.   Exp(B)     95% C.I. for EXP(B)
                                                                          Lower    Upper
PRIORFRAC               1.064    .223    22.741    1    .000   2.897      1.871    4.486
Constant                -1.417   .130    117.908   1    .000   .243
a. Variable(s) entered on step 1: PRIORFRAC.

Model 2
                        B        S.E.    Wald      df   Sig.   Exp(B)     95% C.I. for EXP(B)
                                                                          Lower    Upper
PRIORFRAC               1.013    .225    20.199    1    .000   2.754      1.770    4.284
HEIGHT                  -.045    .017    6.811     1    .009   .956       .924     .989
Constant                5.895    2.796   4.445     1    .035   363.095
a. Variable(s) entered on step 1: PRIORFRAC, HEIGHT.

Model 3
                        B        S.E.    Wald      df   Sig.   Exp(B)     95% C.I. for EXP(B)
                                                                          Lower    Upper
PRIORFRAC               -3.055   5.790   .278      1    .598   .047       .000     3999.295
HEIGHT                  -.054    .022    6.216     1    .013   .947       .907     .988
HEIGHT by PRIORFRAC     .025     .036    .494      1    .482   1.026      .956     1.101
Constant                7.361    3.510   4.398     1    .036   1573.846
a. Variable(s) entered on step 1: PRIORFRAC, HEIGHT, HEIGHT * PRIORFRAC.

The Wald statistic for the coefficient of PRIORFRAC in Model 1 is significant, with p < 0.001. When we add HEIGHT to the model, the Wald statistics are significant at the 1% level for both covariates. Note that there is little change in the estimate of the coefficient for PRIORFRAC:

Δβ̂% = 100 × (1.064 − 1.012) / 1.012 = 5.1

indicating that the inclusion of HEIGHT does not substantially adjust the coefficient of PRIORFRAC. Thus we conclude that, in these data, height is not a confounder of prior fracture. The statistical interaction of prior fracture (PRIORFRAC) and height (HEIGHT) is added to Model 2 to obtain Model 3. The Wald statistic for the added product term has p = 0.482 and thus is not significant: in these data, height is not an effect modifier of prior fracture. Hence, the choice is between Model 1 and Model 2. Even though the estimate of the effect of prior fracture is basically the same for the two models, we would choose Model 2, as height (HEIGHT) is not only statistically significant in Model 2 but is an important clinical covariate as well.
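The Δβ̂% confounding check used above is a one-line computation; the coefficients are those of Models 1 and 2 in Table 3.10 (Chapter 4 uses 20% as the criterion for a meaningful change):

```python
# Percent change in the PRIORFRAC coefficient when HEIGHT is added to the model.
theta_hat = 1.064   # Model 1: coefficient of PRIORFRAC without HEIGHT
beta_hat = 1.013    # Model 2: coefficient of PRIORFRAC with HEIGHT

delta_pct = 100 * (theta_hat - beta_hat) / beta_hat
print(f"delta-beta-hat% = {delta_pct:.1f}%")   # about 5%, well under 20%
```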
CHAPTER 4. Model-Building Strategies and Methods for Logistic Regression

INTRODUCTION
The goal of any method is to select those variables that result in a "best" model within the scientific context of the problem. In order to achieve this goal we must have: (i) a basic plan for selecting the variables for the model, and (ii) a set of methods for assessing the adequacy of the model, both in terms of its individual variables and its overall performance. In this chapter, we discuss methods that address both of these areas.

PURPOSEFUL SELECTION OF COVARIATES
CASE STUDY 1 (The GLOW Study)
STATEMENT: For purposeful selection, we use the GLOW500 data.
OBJECTIVE: The study provides a good example of an analysis designed to identify risk factors for a specified binary outcome.
METHOD/TOOL: We use the software tool SPSS.

Steps.
Step 1: The first step in purposeful selection is to fit a univariable logistic regression model for each covariate. The results of this analysis are shown in Table 4.7. Note that each block of rows in this table presents the results for the estimated regression coefficient(s) from a model containing only that covariate.

Table 4.7 Results of Fitting Univariable Logistic Regression Models in the GLOW Data, n = 500 (each block is from a separate model containing only that covariate)

Variable       B        S.E.    Wald      df   Sig.   Exp(B)     95% C.I. for EXP(B)
                                                                 Lower    Upper
AGE            .053     .012    20.684    1    .000   1.054      1.031    1.079
  Constant     -4.779   .827    33.374    1    .000   .008
WEIGHT         -.005    .006    .656      1    .418   .995       .982     1.007
  Constant     -.727    .468    2.417     1    .120   .483
HEIGHT         -.052    .017    9.134     1    .003   .950       .918     .982
  Constant     7.212    2.744   6.910     1    .009   1356.000
BMI            .006     .017    .112      1    .738   1.006      .972     1.040
  Constant     -1.258   .486    6.686     1    .010   .284
PRIORFRAC      1.064    .223    22.741    1    .000   2.897      1.871    4.486
  Constant     -1.417   .130    117.908   1    .000   .243
PREMENO        .051     .259    .038      1    .845   1.052      .633     1.749
  Constant     -1.109   .115    92.397    1    .000   .330
MOMFRAC        .661     .281    5.526     1    .019   1.936      1.116    3.358
  Constant     -1.196   .114    110.932   1    .000   .302
ARMASSIST      .709     .210    11.429    1    .001   2.032      1.347    3.066
  Constant     -1.394   .142    96.584    1    .000   .248
SMOKE          -.308    .436    .498      1    .480   .735       .313     1.727
  Constant     -1.079   .107    102.450   1    .000   .340
RATERISK                        11.247    2    .004
RATERISK(1)    .546     .266    4.203     1    .040   1.727      1.024    2.911
RATERISK(2)    .909     .271    11.242    1    .001   2.482      1.459    4.223
  Constant     -1.602   .207    59.831    1    .000   .201

Step 2: We now fit our first multivariable model, containing all covariates that are significant in the univariable analysis at the 25% level. The results of this fit are shown in Table 4.8. Once this model is fit, we examine each covariate to ascertain its continued significance, at traditional levels, in the model. We see that the covariate with the largest p-value exceeding 0.05 is RATERISK_2 (labeled RATERISK(1) in the SPSS output), the design variable that compares women with RATERISK = 2 to women with RATERISK = 1.
The likelihood ratio test for the exclusion of self-reported risk of fracture (i.e., deleting both of its design variables, RATERISK_2 and RATERISK_3, from the model) is G = 5.96, which, with two degrees of freedom, yields p = 0.051, nearly significant at the 0.05 level.

Table 4.8 Results of Fitting the Multivariable Model with All Covariates Significant at the 0.25 Level in the Univariable Analysis in the GLOW Data, n = 500

Variables in the Equation
                      B        S.E.    Wald     df   Sig.   Exp(B)   95% C.I. for EXP(B)
                                                                     Lower    Upper
Step 1a  AGE          .034     .013    6.930    1    .008   1.035    1.009    1.062
         HEIGHT       -.044    .018    5.759    1    .016   .957     .923     .992
         PRIORFRAC    .645     .246    6.877    1    .009   1.906    1.177    3.088
         MOMFRAC      .621     .307    4.095    1    .043   1.861    1.020    3.397
         ARMASSIST    .446     .233    3.667    1    .056   1.562    .990     2.465
         RATERISK                      5.820    2    .054
         RATERISK(1)  .422     .279    2.284    1    .131   1.525    .882     2.636
         RATERISK(2)  .707     .293    5.804    1    .016   2.028    1.141    3.604
         Constant     2.709    3.230   .704     1    .402   15.019
a. Variable(s) entered on step 1: AGE, HEIGHT, PRIORFRAC, MOMFRAC, ARMASSIST, RATERISK.

Step 3: Next we check whether covariate(s) removed from the model in Step 2 confound, or are needed to adjust, the effects of covariates remaining in the model. In results not shown, we find that the largest percent change is 17%, for the coefficient of ARMASSIST. This does not exceed our criterion of 20%. Thus, while the self-reported rate of risk is not a confounder, it is an important covariate. No other covariates are candidates for exclusion, and so we continue using the model in Table 4.8.

Step 4: On univariable analysis, the covariates weight (WEIGHT), body mass index (BMI), early menopause (PREMENO), and smoking (SMOKE) were not significant. When each of these covariates is added, one at a time, to the model in Table 4.8, its coefficient does not become significant. The only change of note is that the significance of BMI moves from 0.752 to 0.334. The next step will therefore be to check the assumption of linearity in the logit for the continuous covariates age and height. Before moving to Step 5, we consider another possible model. Since the coefficient for RATERISK_2 is not significant, one possibility is to combine levels 1 and 2 (self-reported risk less than or the same as other women) into a new reference category; combining these two categories was thought to be reasonable. Hence we fit this model, and its results are shown in Table 4.9. In this model, the coefficient for the covariate RATERISK_3 provides the estimate of the log-odds ratio comparing the odds of fracture for individuals in level 3 to that of the combined group consisting of levels 1 and 2.

Table 4.9 Results of Fitting the Multivariable Model with Self-Reported Risk Dichotomized (Level 3 versus Combined Levels 1 and 2), n = 500

Variables in the Equation
                    B        S.E.    Wald    df   Sig.   Exp(B)   95% C.I. for EXP(B)
                                                                  Lower    Upper
Step 1a  AGE        .033     .013    6.567   1    .010   1.034    1.008    1.060
         HEIGHT     -.046    .018    6.526   1    .011   .955     .921     .989
         PRIORFRAC  .664     .245    7.336   1    .007   1.943    1.201    3.142
         MOMFRAC    .664     .306    4.722   1    .030   1.943    1.067    3.536
         ARMASSIST  .473     .231    4.176   1    .041   1.604    1.020    2.525
         RATERISK   .458     .238    3.700   1    .054   1.581    .991     2.521
         Constant   2.491    3.237   .592    1    .442   12.070
a. Variable(s) entered on step 1: AGE, HEIGHT, PRIORFRAC, MOMFRAC, ARMASSIST, and RATERISK.

Step 5: At this point we have our preliminary main-effects model and must now check the scale of the logit for the continuous covariates age and height.

Step 6: The next step in the purposeful selection procedure is to explore possible interactions between the main effects. The subject-matter investigators felt that each pair of main effects represents a plausible interaction.
Hence, we fit models that individually added each of the 15 possible interactions to the main effects model. The results are summarized in Table 4.14. Three interactions are significant at the 10 percent level: age by prior fracture (PRIORFRAC), prior fracture by mother had a fracture (MOMFRAC), and mother had a fracture by arms needed to rise from a chair (ARMASSIST). We note that prior fracture and mother having had a fracture are involved in two of the three significant interactions.

Table 4.14 Log-Likelihood, Likelihood Ratio Test (G, df = 1), and p-Value for the Addition of the Interactions to the Main Effects Model

Interaction            Log-Likelihood   G/Wald   p
Main effects model     -254.9090
Age*Height             -254.8420        0.13     0.716
Age*Priorfrac          -252.3920        5.701    0.025
Age*Momfrac            -254.8395        0.140    0.708
Age*Armassist          -254.8360        0.146    0.702
Age*Raterisk           -254.3855        1.50     0.305
Height*Priorfrac       -254.8025        0.213    0.644
Height*Momfrac         -253.7045        2.438    0.118
Height*Armassist       -254.1115        1.588    0.208
Height*Raterisk        -254.4220        0.990    0.320
Priorfrac*Momfrac      -253.5095        2.793    0.095
Priorfrac*Armassist    -254.7960        0.224    0.636
Priorfrac*Raterisk     -254.8475        0.122    0.726
Momfrac*Armassist      -252.5180        4.699    0.030
Momfrac*Raterisk       -254.6425        0.533    0.465
Armassist*Raterisk     -254.7925        2.230    0.135

The next step is to fit a model containing the main effects and the three significant interactions. The results of this fit are shown in Table 4.15. The three-degree-of-freedom likelihood ratio test of the interactions model in Table 4.15 versus the main effects model in Table 4.9 is G = 11.03 with p = 0.012. Thus, in aggregate, the interactions contribute to the model. However, one interaction, prior fracture by mother's fracture, is not significant, with a Wald statistic p = 0.191. Next, we fit the model excluding this interaction; the results are shown in Table 4.16.

Table 4.15 Results of Fitting the Multivariable Model with the Addition of Three Interactions, n = 500

Variables in the Equation
                                B        S.E.    Wald     df   Sig.   Exp(B)    95% C.I. for EXP(B)
                                                                                Lower    Upper
Step 1a  AGE                    .058     .017    12.172   1    .000   1.060     1.026    1.095
         HEIGHT                 -.049    .018    7.038    1    .008   .952      .919     .987
         PRIORFRAC              4.598    1.878   5.993    1    .014   99.240    2.501    3937.490
         MOMFRAC                1.472    .423    12.124   1    .000   4.360     1.903    9.986
         ARMASSIST              .626     .254    6.075    1    .014   1.869     1.137    3.074
         RATERISK               .474     .241    3.869    1    .049   1.607     1.002    2.577
         AGE by PRIORFRAC       -.053    .026    4.223    1    .040   .948      .901     .998
         MOMFRAC by PRIORFRAC   -.847    .648    1.711    1    .191   .429      .121     1.525
         ARMASSIST by MOMFRAC   -1.167   .617    3.580    1    .058   .311      .093     1.043
         Constant               1.011    3.385   .089     1    .765   2.749
a. Variable(s) entered on step 1: AGE, HEIGHT, PRIORFRAC, MOMFRAC, ARMASSIST, RATERISK, AGE * PRIORFRAC, MOMFRAC * PRIORFRAC, ARMASSIST * MOMFRAC.

Table 4.16 Results of Fitting the Multivariable Model with the Significant Interactions, n = 500

Variables in the Equation
                                B        S.E.    Wald     df   Sig.   Exp(B)    95% C.I. for EXP(B)
                                                                                Lower    Upper
Step 1a  AGE                    .057     .017    12.060   1    .001   1.059     1.025    1.094
         HEIGHT                 -.047    .018    6.501    1    .011   .954      .921     .989
         PRIORFRAC              4.612    1.880   6.018    1    .014   100.715   2.527    4013.438
         MOMFRAC                1.247    .393    10.064   1    .002   3.479     1.610    7.514
         ARMASSIST              .644     .252    6.538    1    .011   1.904     1.162    3.120
         RATERISK               .469     .241    3.794    1    .051   1.598     .997     2.562
         AGE by PRIORFRAC       -.055    .026    4.543    1    .033   .946      .899     .996
         ARMASSIST by MOMFRAC   -1.281   .623    4.225    1    .040   .278      .082     .942
         Constant               .779     3.381   .053     1    .818   2.180
a. Variable(s) entered on step 1: AGE, HEIGHT, PRIORFRAC, MOMFRAC, ARMASSIST, RATERISK, AGE * PRIORFRAC, ARMASSIST * MOMFRAC.
Interpretation: The estimated coefficients in the interactions model in Table 4.16 are, with one exception, significant at the five percent level. The exception is the estimated coefficient for the dichotomized self-reported risk of fracture, RATERISK3 (1 = more, 0 = same or less), with p = 0.051. We elect to retain this covariate in the model, since it is clinically important and its significance is close to five percent. Hence the model in Table 4.16 is our preliminary final model.

APPENDIX

DATASET 1
TABLE 1.1 Age, Age Group, and Coronary Heart Disease (CHD) Status of 100 Subjects

ID AGE AGEGRP CHD    ID AGE AGEGRP CHD    ID AGE AGEGRP CHD    ID AGE AGEGRP CHD
 1  20   1    0      26  35   3    0      51  44   4    1       76  55   7    1
 2  23   1    0      27  35   3    0      52  44   4    1       77  56   7    1
 3  24   1    0      28  36   3    0      53  45   5    0       78  56   7    1
 4  25   1    0      29  36   3    1      54  45   5    1       79  56   7    1
 5  25   1    1      30  36   3    0      55  46   5    0       80  57   7    0
 6  26   1    0      31  37   3    0      56  46   5    1       81  57   7    0
 7  26   1    0      32  37   3    1      57  47   5    0       82  57   7    1
 8  28   1    0      33  37   3    0      58  47   5    0       83  57   7    1
 9  28   1    0      34  38   3    0      59  47   5    1       84  57   7    1
10  29   1    0      35  38   3    0      60  48   5    0       85  57   7    1
11  30   2    0      36  39   3    0      61  48   5    1       86  58   7    0
12  30   2    0      37  39   3    1      62  48   5    1       87  58   7    1
13  30   2    0      38  40   4    0      63  49   5    0       88  58   7    1
14  30   2    0      39  40   4    1      64  49   5    0       89  59   7    1
15  30   2    0      40  41   4    0      65  49   5    1       90  59   7    1
16  30   2    1      41  41   4    0      66  50   6    0       91  60   8    0
17  32   2    0      42  42   4    0      67  50   6    1       92  60   8    1
18  32   2    0      43  42   4    0      68  51   6    0       93  61   8    1
19  33   2    0      44  42   4    0      69  52   6    0       94  62   8    1
20  33   2    0      45  42   4    1      70  52   6    1       95  62   8    1
21  34   2    0      46  43   4    0      71  53   6    1       96  63   8    1
22  34   2    0      47  43   4    0      72  53   6    1       97  64   8    0
23  34   2    1      48  43   4    1      73  54   6    1       98  64   8    1
24  34   2    0      49  44   4    0      74  55   7    0       99  65   8    1
25  34   2    0      50  44   4    0      75  55   7    1      100  69   8    1

DATASET 2 - The Global Longitudinal Study of Osteoporosis in Women
The Global Longitudinal Study of Osteoporosis in Women (GLOW) is an international study of osteoporosis in women over 55 years of age.

Code Sheet for Variables in the GLOW Study
No.  Variable Description                      Codes/Values                         Name
1    Identification code                       1–n                                  SUB_ID
2    Study site                                1–6                                  SITE_ID
3    Physician ID code                         128 unique codes                     PHY_ID
4    History of prior fracture                 1 = yes, 0 = no                      PRIORFRAC
5    Age at enrollment                         Years                                AGE
6    Weight at enrollment                      Kilograms                            WEIGHT
7    Height at enrollment                      Centimeters                          HEIGHT
8    Body mass index                           kg/m^2                               BMI
9    Menopause before age 45                   1 = yes, 0 = no                      PREMENO
10   Mother had a hip fracture                 1 = yes, 0 = no                      MOMFRAC
11   Arms are needed to stand from a chair     1 = yes, 0 = no                      ARMASSIST
12   Former or current smoker                  1 = yes, 0 = no                      SMOKE
13   Self-reported risk of fracture            1 = less than others of the same     RATERISK
                                               age; 2 = same as others of the
                                               same age; 3 = greater than others
                                               of the same age
14   Fracture risk score                       Composite risk score                 FRACSCORE
15   Any fracture in the first year            1 = yes, 0 = no                      FRACTURE

IBM SPSS Statistics 20 Command Syntax Reference

TABLE 1.3
LOGISTIC REGRESSION VARIABLES CHD
  /METHOD=ENTER AGE
  /CRITERIA=PIN(.05) POUT(.10) ITERATE(20) CUT(.5).

TABLE 2.2
LOGISTIC REGRESSION VARIABLES FRACTURE
  /METHOD=ENTER AGE WEIGHT PRIORFRAC PREMENO RATERISK
  /CONTRAST (RATERISK)=Indicator(1)
  /CRITERIA=PIN(.05) POUT(.10) ITERATE(20) CUT(.5).

TABLE 2.3
LOGISTIC REGRESSION VARIABLES FRACTURE
  /METHOD=ENTER AGE PRIORFRAC RATERISK
  /CONTRAST (RATERISK)=Indicator(1)
  /PRINT=CI(95)
  /CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).

TABLE 3.2
CROSSTABS
  /TABLES=FRACTURE BY PRIORFRAC
  /FORMAT=AVALUE TABLES
  /CELLS=COUNT
  /COUNT ROUND CELL.

TABLE 3.3
LOGISTIC REGRESSION VARIABLES FRACTURE
  /METHOD=ENTER PRIORFRAC
  /PRINT=CI(95)
  /CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).

TABLE 3.7
LOGISTIC REGRESSION VARIABLES FRACTURE
  /METHOD=ENTER RATERISK
  /CONTRAST (RATERISK)=Indicator(1)
  /PRINT=CI(95)
  /CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).

TABLE 3.10
MODEL 1:
LOGISTIC REGRESSION VARIABLES FRACTURE
  /METHOD=ENTER PRIORFRAC
  /PRINT=CI(95)
  /CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).
MODEL 2:
LOGISTIC REGRESSION VARIABLES FRACTURE
  /METHOD=ENTER PRIORFRAC HEIGHT
  /PRINT=CI(95)
  /CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).
MODEL 3:
LOGISTIC REGRESSION VARIABLES FRACTURE
  /METHOD=ENTER PRIORFRAC HEIGHT HEIGHT*PRIORFRAC
  /PRINT=CI(95)
  /CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).

TABLE 4.7 (one command per covariate; the PRIORFRAC and PREMENO commands, interleaved in the original, are given here in full)
LOGISTIC REGRESSION VARIABLES FRACTURE
  /METHOD=ENTER AGE
  /PRINT=CI(95)
  /CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).
LOGISTIC REGRESSION VARIABLES FRACTURE
  /METHOD=ENTER WEIGHT
  /PRINT=CI(95)
  /CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).
LOGISTIC REGRESSION VARIABLES FRACTURE
  /METHOD=ENTER HEIGHT
  /PRINT=CI(95)
  /CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).
LOGISTIC REGRESSION VARIABLES FRACTURE
  /METHOD=ENTER BMI
  /PRINT=CI(95)
  /CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).
LOGISTIC REGRESSION VARIABLES FRACTURE
  /METHOD=ENTER PRIORFRAC
  /PRINT=CI(95)
  /CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).
LOGISTIC REGRESSION VARIABLES FRACTURE
  /METHOD=ENTER PREMENO
  /PRINT=CI(95)
  /CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).
LOGISTIC REGRESSION VARIABLES FRACTURE
  /METHOD=ENTER MOMFRAC
  /PRINT=CI(95)
  /CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).
LOGISTIC REGRESSION VARIABLES FRACTURE
  /METHOD=ENTER ARMASSIST
  /PRINT=CI(95)
  /CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).
LOGISTIC REGRESSION VARIABLES FRACTURE
  /METHOD=ENTER SMOKE
  /PRINT=CI(95)
  /CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).
LOGISTIC REGRESSION VARIABLES FRACTURE
  /METHOD=ENTER RATERISK
  /CONTRAST (RATERISK)=Indicator(1)
  /PRINT=CI(95)
  /CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).

TABLE 4.8
LOGISTIC REGRESSION VARIABLES FRACTURE
  /METHOD=ENTER AGE HEIGHT PRIORFRAC MOMFRAC ARMASSIST RATERISK
  /CONTRAST (RATERISK)=Indicator(1)
  /PRINT=CI(95)
  /CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).

TABLE 4.9
LOGISTIC REGRESSION VARIABLES FRACTURE
  /METHOD=ENTER AGE HEIGHT PRIORFRAC MOMFRAC ARMASSIST RATERISK
  /CONTRAST (RATERISK)=Indicator(1)
  /PRINT=CI(95)
  /CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).

TABLE 4.15
LOGISTIC REGRESSION VARIABLES FRACTURE
  /METHOD=ENTER AGE HEIGHT PRIORFRAC MOMFRAC ARMASSIST RATERISK AGE*PRIORFRAC MOMFRAC*PRIORFRAC ARMASSIST*MOMFRAC
  /CONTRAST (RATERISK)=Indicator(1)
  /PRINT=CI(95)
  /CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).

TABLE 4.16
LOGISTIC REGRESSION VARIABLES FRACTURE
  /METHOD=ENTER AGE HEIGHT PRIORFRAC MOMFRAC ARMASSIST RATERISK AGE*PRIORFRAC ARMASSIST*MOMFRAC
  /CONTRAST (RATERISK)=Indicator(1)
  /PRINT=CI(95)
  /CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).