All around us in Higher Education are tantalizing reports of “results” or gains in student outcomes achieved by colleges and universities through new programming, products, or initiatives.
“11% increase in first-year retention”
“8% increase in four-year graduation rate”
“4.5% increase in persistence for new transfers”
Can they be believed? What is true and what is misleading? Unfortunately, most reported results, even in Higher Ed, are simple before-and-after measures. Typically these analyses compare persistence or completion numbers from before the new program, product, or initiative was launched with the same numbers afterward. The problem is that they do not control for student differences or for changes over time that have nothing to do with the initiative. Looking at all these results, how do we know what to believe? What is real impact, and what is attributable to selection bias or something else?
Number 1: Ask yourself… Was it a pre-post comparison?
Pre-post comparisons are widespread and should be viewed with a healthy dose of skepticism. A common problem is the failure to account for student success trends that were already under way before the program or initiative was implemented. Let’s say that an institution’s persistence rate went from 70% to 75% over 5 years. At the end of year 5, the institution implements a new student success program and sees persistence move up to 76.3% one year later. What should be the persistence lift credit for the new program? 1.3%? 0.3%? 3.8%? Would the lift credit and confidence level differ if the path from 70% to 75% had been monotonically increasing rather than wildly fluctuating in between? Figure 1 shows this dilemma. In this scenario, a simple pre-post comparison cannot separate the impact of the initiative or program from the ongoing trend.
Figure 1: What should be the graduation lift credit for a program that started in 2013? In this figure, the triangle represents the program commencement in 2013.
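To make the dilemma concrete, here is a minimal sketch of how each of the three candidate lift numbers could arise, depending on the baseline chosen. The text gives only the endpoints (70% and 75%) and the post-launch value (76.3%); the steady one-point-per-year path and the function name linear_forecast are our assumptions for illustration, not data from a real institution.

```python
# Hypothetical pre-program persistence rates (%). Only the endpoints come from
# the text; the steady one-point-per-year path in between is an assumption.
pre_program = [70.0, 71.0, 72.0, 73.0, 74.0, 75.0]
post_program = 76.3  # persistence one year after launch (from the text)


def linear_forecast(history):
    """Project the next value by extending the least-squares trend line."""
    n = len(history)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(history) / n
    slope = (
        sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, history))
        / sum((x - x_mean) ** 2 for x in xs)
    )
    return y_mean + slope * (n - x_mean)


# Baseline 1: the year just before launch.
lift_vs_last_year = post_program - pre_program[-1]
# Baseline 2: the average of the whole pre-program period.
lift_vs_pre_average = post_program - sum(pre_program) / len(pre_program)
# Baseline 3: where the existing trend would have landed anyway.
lift_vs_trend = post_program - linear_forecast(pre_program)

print(round(lift_vs_last_year, 1))    # 1.3
print(round(lift_vs_pre_average, 1))  # 3.8
print(round(lift_vs_trend, 1))        # 0.3
```

Under this assumed path, the same observed number supports a claim of 1.3, 3.8, or 0.3 points. None of the three baselines is a control group, and a fluctuating path would make the trend forecast itself far less certain.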
In addition, pre-post comparisons may suffer from outside factors influencing student outcomes. One common example is the unemployment rate. We know that unemployment often correlates with student enrollment and persistence. When unemployment is high, people often go back to school and are more likely to stay enrolled, whereas when unemployment is low the trends reverse. Other examples include governmental policy changes and natural disasters. Each of these can affect results, some positively and some negatively, and each can make it impossible to tell which factors are truly influencing student outcomes.
Finally, pre-post pilots can sometimes suffer from regression to the mean. If a pilot’s inclusion criteria are built around a significant recent event or trigger, the natural change in the outcome metric that tends to follow such an event can be mistaken for program impact. For instance, a program director running an intervention for students with a sudden large drop in credit hours attempted or in midterm grades must be careful in reporting pre-post impact numbers. The drop may have been preceded by an accompanying event, such as family loss of income, a satisfactory academic progress (SAP) flag, or mental health issues, and students selected at such a low point will often show a change afterward with or without the program.
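Regression to the mean is easy to reproduce in a small simulation. The sketch below is illustrative only: the grade-point scale, the noise levels, and the one-point drop threshold are all hypothetical choices, and no intervention of any kind is applied to the selected students.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

N_STUDENTS = 20000
DROP_THRESHOLD = -1.0  # hypothetical trigger: grade points fell by 1.0 or more

selected_before = []  # trigger-term grades of students who would enter the pilot
selected_after = []   # their grades one term later, with NO intervention

for _ in range(N_STUDENTS):
    typical = random.gauss(3.0, 0.4)               # stable underlying level
    prior_term = typical + random.gauss(0, 0.5)    # noisy observation
    trigger_term = typical + random.gauss(0, 0.5)  # noisy observation
    next_term = typical + random.gauss(0, 0.5)     # noisy observation
    if trigger_term - prior_term <= DROP_THRESHOLD:
        selected_before.append(trigger_term)
        selected_after.append(next_term)

mean_before = sum(selected_before) / len(selected_before)
mean_after = sum(selected_after) / len(selected_after)
apparent_lift = mean_after - mean_before

print(f"students selected by the trigger: {len(selected_before)}")
print(f"mean at trigger term: {mean_before:.2f}")
print(f"mean one term later:  {mean_after:.2f}")
print(f"apparent 'lift' with no program at all: {apparent_lift:.2f}")
```

Because students are selected at an unusually low point, their next-term average drifts back toward their typical level on its own. A pre-post report on this group would credit that rebound to whatever program happened to be running.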
The moral of the story?
Always look for a control group and treat pre-post trend comparison results with a healthy dose of skepticism.
Number 2: Ask yourself… Are they cherry picking?
Unfortunately, cherry picking, or selectively reporting results, is common. Researchers and vendors like to report positive outcomes. However, we know that experiments with negative results teach us as much as those with positive ones. In clinical trials and higher education, there are now requirements that any IRB trial be pre-registered with quasi-government entities before it begins, in order to prevent cherry picking, which can lead to publication bias.
While we often believe student success programs can do no harm, the reality is that good intentions alone may not produce positive impact, for myriad reasons. We have seen our share of programs that fail to deliver outcomes. What has been consistent is that whenever we ran more detailed impact analyses to understand why a program did not work, and then made the necessary adjustments to improve its efficacy, we saw the needle on patient health or student success outcomes move in the right direction over time.
In drill-down impact analyses, we have seen some student success initiatives fail to produce statistically significant outcomes, or even produce negative outcomes, for a few student segments. We work with our partners to better understand why the initiatives did not work and where there is potential for iteration and improvement.
A key lesson?
If a vendor is reporting only positive numbers from a small subset of customers, ask for examples of failed programs or initiatives and learnings from those programs that led to improved future outcomes.
Number 3: Ask yourself… Was there a valid comparison group?
Another key question we should ask centers around the existence of a comparable control group. If there is no control group, any reported number should be treated with extreme caution.
For example, we often hear results such as “students who went to tutoring persisted at rates 10% higher than students who didn’t go to tutoring!” That sounds great! However, are the students who went to tutoring highly similar to students who did not go to tutoring? Isn’t there likely selection bias in that the students who went to tutoring are in some ways different from the students who did not go to tutoring? Is it possible that they were more likely to persist anyway?
In our analyses we see that the answer is “yes”! There is usually significant selection bias in students who take advantage of services vs. students who do not. Shown below in Figure 2 is an example from “Illume Impact” plotting the individual likelihood to persist of each student who went to tutoring (blue line) and each student who did not go to tutoring (red line). What we see is that students who went to tutoring had much higher likelihoods to persist even before going to tutoring. There is significant selection bias, with a greater percentage of higher scores in the blue line.
Figure 2: A severe case of selection bias before matching. Students who go to tutoring already had a much greater probability of persisting to the next term than those who do not. After matching, the impact number was still positive and statistically significant, but nowhere close to the impact number without matching.
Therefore, a direct comparison of students who went to tutoring vs. students who did not is not valid. Instead, a comparable control group must be found by “matching” the students who went to tutoring to highly similar students who did not. This can be done by using prediction and propensity scores (likelihood to have participated) to find a matching student for each student who participated. The matched students form a valid comparison or control group, and once each participant has a match, the lift can be measured.
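The matching idea can be sketched on simulated data. This is an illustration under stated assumptions, not the method used in Illume Impact: the data are synthetic, a single hypothetical score stands in for both the prediction and the propensity score, each participant is paired with the nearest non-participant (controls may be reused), and the function names, the caliper of 0.01, and the built-in true effect of 3 points are all our choices.

```python
import bisect
import random

random.seed(42)  # fixed seed so the sketch is repeatable


def simulate_students(n=5000, true_effect=0.03):
    """Synthetic students as (score, tutored, persisted); all values hypothetical."""
    students = []
    for _ in range(n):
        score = random.uniform(0.30, 0.95)      # prior likelihood to persist
        tutored = random.random() < score ** 2  # stronger students opt in more often
        p_persist = score + (true_effect if tutored else 0.0)
        persisted = random.random() < p_persist
        students.append((score, tutored, persisted))
    return students


def rate(outcomes):
    """Share of True values in a non-empty list of outcomes."""
    return sum(outcomes) / len(outcomes)


def lift_after_matching(students, caliper=0.01):
    """Pair each tutored student with the nearest-score untutored student."""
    controls = sorted((s, y) for s, t, y in students if not t)
    control_scores = [s for s, _ in controls]
    treated_outcomes, control_outcomes = [], []
    for score, tutored, persisted in students:
        if not tutored:
            continue
        i = bisect.bisect_left(control_scores, score)
        neighbors = [j for j in (i - 1, i) if 0 <= j < len(controls)]
        if not neighbors:
            continue
        j = min(neighbors, key=lambda k: abs(control_scores[k] - score))
        if abs(control_scores[j] - score) <= caliper:  # drop poor matches
            treated_outcomes.append(persisted)
            control_outcomes.append(controls[j][1])
    return rate(treated_outcomes) - rate(control_outcomes), len(treated_outcomes)


students = simulate_students()
naive_lift = (
    rate([y for _, t, y in students if t])
    - rate([y for _, t, y in students if not t])
)
matched_lift, n_pairs = lift_after_matching(students)

print(f"naive lift (tutored vs. not tutored): {naive_lift:.3f}")
print(f"lift after matching on score:         {matched_lift:.3f} ({n_pairs} pairs)")
```

In this simulation the naive comparison overstates the lift because tutored students started with higher scores, while the matched comparison lands much closer to the effect that was built into the data. A real analysis would match on scores estimated from many student attributes and would also check the statistical significance of the matched result.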
Tutoring is just one of many examples where comparing students who participated vs. students who did not is not a valid comparison. We must consider the appropriate comparison or control group in order to measure impact.
Unless we control for student differences (selection bias) we don’t know how much of the difference in persistence or completion is attributable to the service or program and how much is attributable to student differences.
Next, we will explore more nuanced conceptual and technical issues in causal impact analyses where control groups do exist. Understanding these nuanced issues will help us become better decision makers and allocate our scarce student success resources more judiciously.
Dave has more than 20 years of experience in building various analytics apps and solutions spanning nonlinear time-series analysis to predictive analytics, outcomes research, and user experience optimization. He and his team are working on (1) improving predictive algorithms to provide much more actionable insights, (2) adding new capabilities to automate ROI and outcomes analyses as part of action analytics, and (3) making the Civitas Learning analytics platform self-learning and more intelligent over time. He holds 14 U.S. patents, is the author of a book on pattern recognition and predictions, and has published a number of articles in journals. He currently serves as chief data scientist with Civitas Learning.
Laura Malcolm has nearly 20 years of education and product design experience in building technology products to serve partner institutions, faculty, and students in the attainment of their educational goals. She began her career as a high school teacher in Austin and through that experience discovered her deep interest in designing tools to help people learn. Prior to joining Civitas Learning, she spent 10 years in executive leadership roles directing the design and development of innovative educational technology products. She is a two-time CODiE Award recipient for product design. She currently serves as senior vice president, outcomes and strategy for Civitas Learning.