In statistical modelling, the choice of model depends on the type of response variable. For example, if the outcome variable is continuous, linear (normal) regression may be used. In this data set, however, the dependent variable is binary: it takes two values, smoker and non-smoker. For the analysis in STATA, smoker was coded as 0 and non-smoker as 1. When the response variable is binary, a logistic regression model is usually employed.
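For reference, a logistic model for a 0/1 outcome with a single regressor can be written as below. The symbols $Y$ (the binary outcome), $X$ (the regressor), and $\beta_0$, $\beta_1$ (the intercept and slope) are notation assumed here for illustration; they are not taken from the STATA output.

```latex
\[
\Pr(Y = 1 \mid X) = \frac{\exp(\beta_0 + \beta_1 X)}{1 + \exp(\beta_0 + \beta_1 X)},
\qquad
\ln\!\left(\frac{\Pr(Y = 1 \mid X)}{1 - \Pr(Y = 1 \mid X)}\right) = \beta_0 + \beta_1 X .
\]
```

The second expression shows that the log-odds of the outcome are linear in $X$, which is why the fitted probability always stays between 0 and 1.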
1. Consider a binary dependent regression model with 1 independent variable
Binary dependent regression model with one independent variable
a. Select an appropriate independent variable. Justify your selection.
There are four independent variables under consideration: the age of the individual, years of education, income, and the price of cigarettes in 1979. For the binary dependent regression model with one independent variable, I chose years of education, since it has the smallest standard deviation of the four variables. The standard deviations of age, years of education, income and the price of cigarettes in 1979 are 17.05694, 3.275847, 9083.511 and 4.848667 respectively, as shown in the table below:
Summary statistics for the independent variables
Variable |   Obs       Mean   Std. Dev.    Min     Max
---------+--------------------------------------------
educ     | 1,196   12.22115    3.275847      0      18
income   | 1,196   19304.77    9083.511    500   30000
pcigs79  | 1,196   60.98495    4.848667   46.3    69.8
age      | 1,196   41.80686    17.05694     17      88
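The selection rule described above (take the variable with the smallest standard deviation) can be checked with a short sketch. This is only an illustration in Python, not part of the STATA analysis; the dictionary name `std_devs` and the variable `selected` are assumptions introduced here, and the values are copied from the summary table.

```python
# Standard deviations copied from the summary statistics table.
std_devs = {
    "educ": 3.275847,
    "income": 9083.511,
    "pcigs79": 4.848667,
    "age": 17.05694,
}

# Selection rule used in the text: the variable with the smallest
# standard deviation is chosen as the single regressor.
selected = min(std_devs, key=std_devs.get)
print(selected)
```

Under this rule the sketch selects `educ`, which matches the choice of years of education made in the text.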
b. Use three different models (each with one regressor) to estimate the probability of smoking and formulate the population regression function for each of them. Report and interpret the results.
i) Logistic regression model
The first model considered is the logistic regression model. Estimating it in STATA gave the following results: