In statistical modelling, the choice of model usually depends on the type of response variable. For example, if the outcome variable is continuous, ordinary (normal) linear regression may be used. In this data set, however, the dependent variable is binary: it takes one of two outcomes, smoker or non-smoker. For the analysis in STATA, smoker was coded as 0 and non-smoker as 1. When the response variable is binary, the logistic regression model is the one most commonly employed.
1. Consider a binary dependent regression model with 1 independent variable
Binary dependent regression model with one independent variable
a. Select an appropriate independent variable. Justify your selection.
There are four independent variables under consideration: the age of the individual, years of education, income and the price of cigarettes in 1979. For the binary regression model with one independent variable, I chose years of education, since this variable has the smallest standard deviation of the four. The standard deviations of age, years of education, income and the price of cigarettes in 1979 are 17.05694, 3.275847, 9083.511 and 4.848667 respectively, as shown in the table below:
Summary statistics for the independent variables
Variable |   Obs       Mean   Std. Dev.    Min     Max
---------+---------------------------------------------
educ     | 1,196   12.22115    3.275847      0      18
income   | 1,196   19304.77    9083.511    500   30000
pcigs79  | 1,196   60.98495    4.848667   46.3    69.8
age      | 1,196   41.80686    17.05694     17      88
b. Use three independent different models (in one regressor) to estimate the probability of smoking and formulate the population regression function for each of them. Report and interpret the results.
i) Logistic regression model
The first model to consider is the Logistic regression model. After modelling using STATA, these were the results:
Number of iterations in the model
Number of iterations Log likelihood statistic
Smoker   | Coefficient   Std. error       z   p-value   95% confidence interval
---------+----------------------------------------------------------------------
Educ     |   -.0591058     .0183098   -3.23     0.001   [-.0949924, -.0232191]
Constant |    .2304882     .2292564    1.01     0.315   [-.218846, .6798224]
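The population regression function is:
ln[P / (1 - P)] = β0 + β1 educ, where P is the probability of being a smoker given years of education.
Using the estimates above, the fitted function is:
ln[P / (1 - P)] = 0.2304882 - 0.0591058 educ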
The likelihood ratio chi-square statistic is 10.53, with a p-value of 0.0012. The log likelihood is -789.20747.
In this model, the constant is the logarithm of the odds of being a smoker when years of education equal zero, which is 0.2304882; with a p-value of 0.315, it is not statistically different from zero at the 5% level. The coefficient of the education variable shows a negative relationship between being a smoker and years of education: for each additional year of education, the logarithm of the odds of being a smoker decreases by 0.0591058.
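Because log odds are hard to read directly, it can help to convert the fitted function into probabilities. The following is a minimal sketch, not part of the STATA analysis: it assumes the reported constant and educ coefficient and applies the inverse logit. The function name and the education levels shown are illustrative choices, and the result is the probability of whichever outcome the fitted model treats as 1, which the text interprets as being a smoker.

```python
import math

# Estimates taken from the logit output reported in the text
B0 = 0.2304882    # constant
B1 = -0.0591058   # coefficient on educ

def smoking_probability(educ_years):
    """Fitted probability of the modelled outcome at a given number of years of education."""
    log_odds = B0 + B1 * educ_years
    # Inverse logit: converts log odds into a probability between 0 and 1
    return 1.0 / (1.0 + math.exp(-log_odds))

# Illustrative education levels (0 and 18 are the sample minimum and maximum)
for years in (0, 12, 18):
    print(years, round(smoking_probability(years), 3))
```

Under these assumptions the fitted probability falls from roughly 0.56 at zero years of education to about 0.38 at 12 years and about 0.30 at 18 years, which is the negative relationship described above expressed on the probability scale.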
The p-value of the years of education variable is 0.001, which is less than 0.05; hence years of education is statistically significant at the 5% level. The p-value of the likelihood ratio chi-square test is 0.0012, also less than 0.05; therefore the overall model is statistically significant.
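The test statistics quoted here can be checked by hand from the reported coefficient, standard error and likelihood ratio statistic. The sketch below is an illustration only: it assumes a Wald z test based on the standard normal distribution and a likelihood ratio test with 1 degree of freedom (one regressor), and the helper names are hypothetical.

```python
import math

coef, se = -0.0591058, 0.0183098   # educ row of the logit output
lr_chi2 = 10.53                    # likelihood ratio statistic (assumed 1 degree of freedom)
Z_95 = 1.959964                    # standard normal critical value for a 95% interval

def two_sided_normal_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def chi2_1df_p(stat):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))

z = coef / se                                # Wald z statistic
ci = (coef - Z_95 * se, coef + Z_95 * se)    # 95% interval for the coefficient

print("z =", round(z, 2), "p =", round(two_sided_normal_p(z), 3))
print("LR p =", round(chi2_1df_p(lr_chi2), 4))
print("95% CI =", tuple(round(bound, 4) for bound in ci))
```

With these inputs the z statistic rounds to -3.23 and both p-values fall well below 0.05, in line with the conclusions drawn in the text. The whole 95% interval for the educ coefficient lies below zero, which is another way of seeing that the effect is significant at the 5% level.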
Odds ratio of the coefficients of the logistic model
smoker   | Odds Ratio   Std. Err.       z    p>z    95% confidence interval
---------+-------------------------------------------------------------------
educ     |   .9426071     .017259   -3.23   0.001   [.9093799, .9770484]
constant |   1.259215     .288683    1.01   0.315   [.8034454, 1.973527]
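The odds ratio table is a transformation of the coefficient table: each odds ratio is the exponential of the corresponding logit coefficient. The sketch below shows that arithmetic using the reported coefficients and standard errors. It assumes the odds-ratio standard error comes from the delta method (odds ratio times coefficient standard error) and that the interval is obtained by exponentiating the coefficient interval; the function and dictionary names are hypothetical.

```python
import math

# (coefficient, standard error) pairs from the logit output
estimates = {
    "educ": (-0.0591058, 0.0183098),
    "constant": (0.2304882, 0.2292564),
}
Z_95 = 1.959964  # standard normal critical value for a 95% interval

def to_odds_ratio(coef, se):
    """Return the odds ratio, its delta-method standard error and its 95% interval."""
    odds_ratio = math.exp(coef)
    return {
        "odds_ratio": odds_ratio,
        "std_err": odds_ratio * se,  # delta-method approximation (assumption)
        "ci": (math.exp(coef - Z_95 * se), math.exp(coef + Z_95 * se)),
    }

for name, (coef, se) in estimates.items():
    print(name, to_odds_ratio(coef, se))
```

Read this way, the educ odds ratio of about 0.943 means that each additional year of education multiplies the odds of being a smoker by roughly 0.943, a reduction of about 5.7%. Because the interval for educ lies entirely below 1, the conclusion matches the significance test on the coefficient.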