Binary Logistic Regression in STATA | 2025

Learn Binary Logistic Regression in STATA with step-by-step instructions. Discover how to model binary outcomes, interpret results, and apply statistical techniques effectively.

Binary logistic regression is a statistical technique used to model the relationship between a binary dependent variable and one or more independent variables. It is commonly used in fields such as social sciences, economics, medicine, and marketing, where the outcome variable is dichotomous (i.e., it takes one of two possible values, such as success/failure, yes/no, or 0/1). In STATA, a popular statistical software package, logistic regression analysis is straightforward to perform. This paper explores binary logistic regression in STATA, focusing on its implementation, the interpretation of its results, and example use cases. It also covers how categorical variables are handled and how the model extends to multivariable logistic regression.

Binary Logistic Regression in STATA

Logistic regression models are used when the dependent variable is categorical, specifically binary. The binary outcome variable is modeled as a function of predictor variables, which may be continuous or categorical. The key idea behind binary logistic regression is to estimate the probability of an event occurring, given certain predictor variables. In STATA, performing binary logistic regression is simple and involves the use of the logit or logistic commands.

To perform binary logistic regression in STATA, the basic syntax is:
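In its general form, using the placeholder names explained just below (they stand in for real variable names and are not literal keywords), the command can be sketched as:

```stata
* General form: outcome variable first, then the predictors
logit dependent_variable independent_variables
```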

where dependent_variable is the binary outcome variable, and independent_variables are the predictor variables. STATA will estimate the parameters of the model using maximum likelihood estimation.

Alternatively, the logistic command provides odds ratios instead of coefficients:
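With the same placeholder names as above, a sketch of the general form is:

```stata
* Same model as logit, but the results table shows odds ratios
logistic dependent_variable independent_variables
```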

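Both commands fit the same model by maximum likelihood and differ only in how the results are reported: logit displays coefficients on the log-odds scale, while logistic displays odds ratios (the exponentiated coefficients), which are often easier to interpret. Adding the or option to logit produces the same odds-ratio table.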
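The link between the two scales can be written out explicitly. The notation here is introduced only for illustration: p is the probability that the outcome equals 1, x_1 through x_k are the predictors, and the betas are the coefficients that logit reports.

```latex
\[
\log\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k,
\qquad
\mathrm{OR}_j = e^{\beta_j}
\]
```

In this notation, logit reports each \beta_j and logistic reports the corresponding e^{\beta_j}.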

Example: Binary Logistic Regression in STATA

Let’s consider a practical example to demonstrate binary logistic regression in STATA. Suppose we have a dataset containing information about customers of a bank, and we are interested in predicting whether a customer will default on a loan based on variables such as income, age, and credit score.

The dataset might look like this:

customer_id	income	age	credit_score	default
1	50000	30	700	0
2	60000	45	650	0
3	30000	25	550	1
4	40000	40	620	0
5	20000	28	480	1
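To experiment with these rows, one option is to type them in directly with the input command. This is only a sketch for getting the excerpt into memory; a real analysis would load the full dataset from a file instead.

```stata
* Enter the five example rows by hand (illustrative excerpt only)
clear
input customer_id income age credit_score default
1 50000 30 700 0
2 60000 45 650 0
3 30000 25 550 1
4 40000 40 620 0
5 20000 28 480 1
end

* Check that the data were read as intended
list
```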

In this example, default is the binary dependent variable (0 = no default, 1 = default), and the independent variables are income, age, and credit_score. To perform a binary logistic regression in STATA, we would run the following command:
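Using the variable names from the table above, the command would look like the sketch below. It assumes the full dataset is in memory: in the five rows shown, the two defaulters have the lowest incomes and credit scores, so the outcome is perfectly separated and the model could not be estimated from that excerpt alone.

```stata
* Log-odds coefficients for loan default
logit default income age credit_score
```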

STATA would output the coefficients for each predictor variable. These coefficients can be interpreted in terms of the log-odds of the event occurring. To obtain odds ratios, which are easier to interpret, use the logistic command:
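With the same variables, either of the following lines should report odds ratios; the second uses the or option of logit and is shown as an equivalent alternative.

```stata
* Odds ratios for loan default
logistic default income age credit_score

* Equivalent: logit with the or option
logit default income age credit_score, or
```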

The output would include odds ratios for each predictor, which indicate the change in the odds of the outcome variable (loan default) occurring for a one-unit change in the predictor variable.

Logistic Regression Interpretation in STATA

When interpreting the output of a logistic regression in STATA, it is crucial to understand the meaning of the coefficients and their associated p-values. The coefficients represent the change in the log-odds of the dependent variable for a one-unit change in the independent variable, holding other variables constant.

Coefficients: In the logit output, each coefficient is the change in the log-odds of the outcome for a one-unit increase in the predictor. A positive coefficient indicates that as the predictor variable increases, the probability of the event occurring also increases; a negative coefficient indicates that it decreases.
Odds Ratios: In the logistic model, the coefficients are converted into odds ratios. An odds ratio greater than 1 suggests that as the predictor increases, the odds of the event occurring increase. Conversely, an odds ratio less than 1 suggests that as the predictor increases, the odds of the event occurring decrease.
P-values: The p-value associated with each predictor helps determine if the predictor is statistically significant. A p-value less than 0.05 is typically considered evidence that the predictor is statistically significant.

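For example, if the odds ratio for income is 1.05, it suggests that for each additional unit of income (in whatever unit income is recorded), the odds of defaulting on the loan are multiplied by 1.05, a 5% increase, holding the other predictors constant. If the odds ratio for credit_score is 0.98, each one-point increase in the credit score multiplies the odds of default by 0.98, a 2% decrease. These percentages refer to odds, not probabilities.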
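Because odds ratios are multiplicative, the effect of a change of several units is the odds ratio raised to that power. The arithmetic can be checked with display, using the illustrative odds ratios from the paragraph above (they are not estimates from real data).

```stata
* Ten additional units of income: 1.05^10, roughly 1.63 times the odds
display 1.05^10

* Fifty additional credit-score points: 0.98^50, roughly 0.36 times the odds
display 0.98^50

* Moving between scales: the log-odds coefficient behind an odds ratio of 1.05
display ln(1.05)
```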

Logistic Regression with Categorical Variables in STATA

In real-world datasets, predictor variables are often categorical (e.g., gender, race, or education level). STATA handles categorical predictors through indicator (dummy) variables, which it generates when a variable is marked as categorical with factor-variable notation. The variable must be stored as numeric, and by default the category with the lowest value serves as the reference group. For example, if gender is coded 0 for females and 1 for males, STATA will create an indicator that equals 1 for males and treat females as the reference category.

To include categorical variables in your logistic regression, you can use the i. prefix. For instance, if you have a variable gender in the dataset, the following command will perform logistic regression with gender as a categorical variable:
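A minimal sketch, assuming a hypothetical numeric variable gender has been added to the loan dataset (it is not among the columns shown earlier). If gender were stored as text, it would first need converting, for example with encode gender, generate(gender_num).

```stata
* i. marks gender as categorical; the lowest-valued category is the reference
logit default i.gender
```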

Here, STATA creates the necessary indicator variables for gender on the fly and omits the reference category, so no manual dummy coding is needed.

Multivariable Logistic Regression in STATA

In many real-world scenarios, researchers are interested in examining the joint effect of multiple predictors on the outcome variable. This is where multivariable logistic regression comes in. Multivariable logistic regression models the relationship between a binary outcome and more than one predictor variable.

The syntax for performing a multivariable logistic regression in STATA is the same as for a single-variable logistic regression, but you include multiple independent variables. For example, to predict loan default using income, age, credit score, and gender as predictors, the command would be:
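With the variable names used so far, and gender again assumed to be a numerically coded variable added to the dataset, the command could be sketched as:

```stata
* Several predictors at once; i. marks gender as categorical
logit default income age credit_score i.gender

* The same model reported as odds ratios
logistic default income age credit_score i.gender
```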

Multivariable logistic regression allows you to account for the simultaneous effects of several predictor variables. STATA will provide you with the coefficients (or odds ratios) for each predictor, each adjusted for the others in the model. Because coefficients depend on the units in which each predictor is measured, their raw sizes are not directly comparable as a measure of relative importance.

Binary Logistic Regression: Advanced Topics

Interaction Terms: Sometimes, the effect of one variable on the outcome may depend on the level of another variable. This is called an interaction. To include interaction terms in your logistic regression model, you can use the # operator. For example, to test the interaction between income and age, both of which are continuous, each variable needs the c. prefix (c.income#c.age), because STATA treats variables joined by # as categorical unless told otherwise.
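A sketch of such a model with the loan variables is shown below. The c. prefix marks income and age as continuous. The ## form is a shorthand that includes both main effects along with their product.

```stata
* Main effects plus the income-by-age interaction, written out
logit default income age c.income#c.age

* Equivalent shorthand: ## adds the main effects automatically
logit default c.income##c.age
```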

Model Fit and Diagnostics: After fitting a logistic regression model, it is important to evaluate its fit and assess how well it explains the data. STATA provides several methods for evaluating model fit, including:
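- Pseudo R-squared: McFadden's pseudo R-squared, shown in the header of the logit output, compares the log-likelihood of the fitted model with that of an intercept-only model. It is not the proportion of variance explained, as R-squared is in linear regression, and is most useful for comparing models fitted to the same data.
- Hosmer-Lemeshow Test: A goodness-of-fit test that compares observed and predicted frequencies within groups of observations ordered by predicted probability. It is available as estat gof with the group() option.
- Likelihood Ratio Test: Compares the fit of two nested models fitted to the same observations. It is available as lrtest.

Checking for Multicollinearity: In a multivariable logistic regression model, it is essential to check for multicollinearity among the independent variables. High multicollinearity can lead to unreliable estimates of coefficients. The Variance Inflation Factor (VIF) is a common check, but in STATA the estat vif command is available only after regress, not after logit or logistic. Because VIFs depend only on the predictors, a common workaround is to fit a linear regression with the same predictors and run estat vif afterwards.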
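The checks above could be run along the following lines. This is a sketch using the loan variables from the earlier example. The stored-result names full and reduced are arbitrary labels chosen here, and the reduced model is only one possible nested comparison.

```stata
* Full model: the output header reports the pseudo R-squared
logit default income age credit_score

* Hosmer-Lemeshow test with 10 groups
estat gof, group(10)
estimates store full

* Nested model without credit_score, then a likelihood ratio test
logit default income age
estimates store reduced
lrtest full reduced

* VIFs: estat vif requires regress, so fit a linear model
* with the same predictors purely to obtain them
regress default income age credit_score
estat vif
```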

Logistic Regression in SPSS vs. STATA

While STATA is a powerful tool for logistic regression, many researchers use SPSS for statistical analysis as well. Both STATA and SPSS allow users to perform logistic regression, but there are differences in the user interface and syntax.

In SPSS, logistic regression is typically performed through a point-and-click interface, making it more user-friendly for non-programmers. However, for advanced users, the syntax in SPSS is also available.
In STATA, analyses are usually written as short commands (a logistic regression is a single line), which makes them easy to rerun and to save in do-files. STATA also offers menus, but it is often preferred by users who are comfortable with a command-line style of working.

Conclusion

Binary logistic regression is a versatile and powerful statistical technique used to model binary outcomes. STATA provides a robust platform for performing logistic regression analysis, with a variety of commands to handle both simple and complex models. By understanding the syntax and output of logistic regression in STATA, researchers can gain valuable insights into the relationships between predictor variables and binary outcomes. Whether dealing with continuous or categorical predictors, STATA offers comprehensive tools to conduct and interpret binary logistic regression analyses.