In our prior post, we addressed the most common issues with outcomes reporting through Three Big Ideas. Now we are going to dig deeper into eight nuanced topics for inspecting impact results based on matched control groups, to help you become the Sherlock Holmes of impact.
Idea Number 1: Ask yourself… What is treatment and how does it differ from BAU?
Typically, students are exposed to multiple dynamic treatments as part of the student experience. Oftentimes, it can be difficult to differentiate treatment from business as usual (BAU). The more specific the differentiation between treatment and BAU, the more trustworthy the results.
In our context, treatment is something done to a student with the intent of affecting a student success outcome. Treatment cannot be BAU, since BAU is what everyone, or a large group of students, receives or experiences. There needs to be clear intentionality between treatment and its intended outcomes. Treatment falls into academic, non-academic, or financial aid categories. In short, treatment is what differentiates pilot from control.
Treatment and trigger can sometimes be confused. A “trigger” is an event that may warrant immediate treatment: three consecutive missed classes, or a poor midterm grade just received. A trigger is not a treatment. Instead, the treatment for a poor-midterm trigger could be a mindset nudge delivered within a few days after the midterm with an offer of a session with a peer tutor. Treatments can also be sequentially linked. Following the same midterm example, the second treatment could be the academic tutoring itself, measured in duration, frequency, and time of visit for those who attended the tutoring center.
A key component of investigation? Make sure that we have a good understanding of what treatment is and how it differs from BAU.
Idea Number 2: Ask yourself… Is the impact metric related to treatment?
Always ask if there is a strong and intentional association between treatment and reported outcomes. There are two factors to consider — (1) alignment between the intervention objectives and the impact metrics and (2) how long we must wait to measure impact in relation to treatment duration.
If an intervention program is designed to improve student engagement for the current term, it is highly likely to impact student persistence as well, because our data analysis shows that highly engaged students are more likely to persist. On the other hand, if it was a one-term program for new students, its impact on graduation will be much more muted, with attributional ambiguity due to a number of unknown mediating factors between the first term and graduation.
Conversely, if the linkage between treatment and impact metric is not obvious, we recommend exploring surrogate, shorter-term impact metrics. Using the mindset-nudging and tutoring example for students who did poorly on the midterm, an early signal of impact can be attendance at the tutoring session. The next impact metric can be tutoring session frequency and duration, followed by final exam and course grades.
Always ask how interventions are designed to influence impact metrics, and through which pathways.
Idea Number 3: Ask yourself… How are multiple concurrent interventions addressed?
Let’s say that a student is exposed to three treatment events — a writing center visit to improve writing, an office visit in general chemistry 101, and a regular check-in with an academic advisor. In this case, since the student is exposed to the three treatment events, how do we assign impact metrics to an individual treatment? On persistence, all three events can contribute. However, if the goal is to measure the impact on persistence of the writing center, we can create variables for office visit and advising session, investigate their signal-to-noise ratio (SNR) in predicting the impact metric, and incorporate them into models/matching processes if SNR is high.
If the goal of analysis is to measure the relative impact of each of these three events, there are multiple approaches to resolving attributional ambiguity through time-series event-based dynamic matching, which is beyond the scope of this article.
Always ask how multiple, concurrent intervention programs are accounted for.
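As a rough sketch of the first approach above, the toy example below (synthetic data; all feature names are hypothetical) adds indicator variables for the two concurrent interventions to a propensity model for writing-center use, so that matched controls have similar exposure to office visits and advising:

```python
# Sketch: fold concurrent interventions into the propensity model so that
# matching on the writing-center treatment also balances exposure to the
# other two treatment events. Data and column names are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000

# Baseline covariate plus indicators for the two concurrent interventions.
gpa = rng.normal(3.0, 0.5, n)
office_visits = rng.binomial(1, 0.3, n)   # chem 101 office visits
advising = rng.binomial(1, 0.4, n)        # regular advisor check-ins
X = np.column_stack([gpa, office_visits, advising])

# Writing-center use correlates with the other supports (selection effect).
logit = -2 + 0.5 * gpa + 0.8 * office_visits + 0.6 * advising
treated = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# Propensity scores now account for the concurrent interventions, so a
# matched control pool balances them alongside the baseline covariates.
ps = LogisticRegression().fit(X, treated).predict_proba(X)[:, 1]
```

Matching pilot to control on `ps` then compares writing-center users against non-users with similar office-visit and advising exposure, isolating the writing-center contribution.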
Idea Number 4: Ask yourself… What were the inclusion/exclusion criteria?
Quite frequently we see inclusion/exclusion criteria (IEC) not being included (no pun intended) in the program description. Knowing the IEC is important in observational studies, since it is crucial for determining how to create a control pool and for understanding the presence of potential confounders in matching.
For example, let’s say that we are matching students based on SIS-derived features. An institution sends a nudge to all first-term, full-time (FTFT) students who have recently skipped three classes in a row. If all we know is that they sent a nudge to FTFT students, we have a problem, since skipping three consecutive classes has a material impact on student success that is not accounted for in the matching process. The best approach?
Understand the predictive and impact signal power of skipping three consecutive classes.
If material, include this feature in the modeling and matching process.
Otherwise, the estimated impact number will be too pessimistic.
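One way to gauge whether such a trigger feature carries material signal is to compare model performance with and without it. The sketch below uses synthetic data and hypothetical feature names; a material AUC lift would argue for including the feature in matching:

```python
# Sketch: test whether "skipped 3 consecutive classes" materially improves
# prediction of persistence before deciding to include it in matching.
# Synthetic data; features and outcome are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n = 2000
gpa = rng.normal(2.8, 0.6, n)
skipped3 = rng.binomial(1, 0.15, n)

# Persistence depends on GPA and, materially, on the skipping trigger.
p_persist = 1 / (1 + np.exp(-(-1 + 1.0 * gpa - 1.5 * skipped3)))
persisted = rng.binomial(1, p_persist)

base = LogisticRegression().fit(gpa.reshape(-1, 1), persisted)
full = LogisticRegression().fit(np.column_stack([gpa, skipped3]), persisted)

auc_base = roc_auc_score(persisted, base.predict_proba(gpa.reshape(-1, 1))[:, 1])
auc_full = roc_auc_score(persisted, full.predict_proba(np.column_stack([gpa, skipped3]))[:, 1])
# If auc_full clearly exceeds auc_base, the trigger belongs in the
# modeling and matching process.
```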
In another case, we worked with a vendor trying to negotiate a performance fee based on their pre-post comparison results. They were reporting a huge savings number. Upon closer inspection, we found that one of their inclusion criteria was an inpatient episode before admission to a disease management (DM) program. For patients with inpatient episodes in year 1, there is a natural regression toward the mean in health costs in year 2. When we conducted a properly matched impact analysis using the same inclusion criteria, but allocated patients to pilot or control based on actual DM program participation, there was no material impact on patient health outcomes or cost savings.
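The regression-to-the-mean trap is easy to demonstrate with a simulation. In the sketch below (all numbers illustrative), no treatment is applied at all, yet patients selected for high year-1 costs show lower costs in year 2:

```python
# Sketch: why selecting on a year-1 cost spike inflates pre-post "savings".
# Costs fluctuate around a stable per-patient mean, so patients chosen for
# an unusually high year 1 tend to fall back toward their mean in year 2
# with no intervention whatsoever. All numbers are illustrative.
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
true_mean = rng.gamma(shape=2.0, scale=2500.0, size=n)  # stable patient-level cost
year1 = true_mean + rng.normal(0, 2000, n)
year2 = true_mean + rng.normal(0, 2000, n)              # no treatment applied

# "Inclusion criterion": top-decile year-1 cost (stands in for an
# inpatient episode before program admission).
high = year1 > np.quantile(year1, 0.9)
pre, post = year1[high].mean(), year2[high].mean()
# post < pre purely from regression to the mean; a naive pre-post
# comparison would book the gap as program "savings".
```

A matched control group drawn with the same inclusion criterion absorbs this artifact, which is exactly why the properly matched analysis above showed no material impact.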
In higher ed, we were conducting research to understand the causal relationship between nudges, codified via natural language processing (NLP) features, and short-term effects on student engagement derived from LMS factors. During the analysis, we encountered an unexpected situation: well-crafted nudges written using salient concepts from behavioral science resulted in lowered student engagement. Upon closer inspection, we realized that these nudges were sent immediately after midterms. Of course, students who studied hard for midterm exams tend to relax and take it easy afterward.
If triggers were used to define inclusion/exclusion criteria, we need to be extra careful, since triggers can signal material changes in impact metrics, and most static models do not accommodate dynamic triggers. Models with SIS and LMS features can account for such signals, since trigger-related changes in student engagement can be inferred through LMS features.
Always ask for inclusion and exclusion criteria for an intervention program to ensure that there are no special situations that could swing impact results one way or the other even in the absence of intervention.
Idea Number 5: Ask yourself… What is the time horizon for impact measurement?
In situations where student success metrics are measured over a long time horizon, such as graduation, we need to understand not only the inclusion/exclusion criteria but also the timeframe, in terms of treatment duration and when the initial matching occurred. We must define matched pilot and control populations at the start of a long-term intervention and track them over time, all the way out to the terminal endpoint.
The tracking is not trivial when the intervention is applied at a section or course level. In this situation, we can work with treatment dosage. Without this solid baseline foundation, we can confuse correlation with causation, leading to erroneous claims based on comparing apples with oranges.
Always ask how long treatment lasts, when and how populations were matched, and what potential confounders during treatment could have affected pilot and control differently.
Idea Number 6: Ask yourself… Are there pitfalls with randomized controlled trials?
Randomization is not as easy as it sounds, especially when there are multiple nested units, such as instructors, sections, advisors, schools, and districts.
When RCT results are reported, always ask the following questions:
What was the randomization strategy?
What were the randomization units? Was there a nested structure?
What were the N’s in the randomization units?
If students are assigned to the randomization units, how was the assignment made? Could there be biases in the randomization units based on the assignment strategy?
Can you prove that the randomization was truly random via covariates, propensity scores, Mahalanobis distance metrics, or prediction scores?
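The last question above, proving that randomization actually balanced the arms, can be probed with a simple covariate balance check. The sketch below computes a standardized mean difference on synthetic data; a common rule of thumb flags absolute values above roughly 0.1:

```python
# Sketch: covariate balance check for a (claimed) randomized assignment.
# The standardized mean difference (SMD) scales the between-arm mean gap
# by the pooled standard deviation; large |SMD| suggests imbalance.
import numpy as np

def standardized_mean_diff(x_pilot, x_control):
    """Difference in means scaled by the pooled standard deviation."""
    pooled_sd = np.sqrt((x_pilot.var(ddof=1) + x_control.var(ddof=1)) / 2)
    return (x_pilot.mean() - x_control.mean()) / pooled_sd

rng = np.random.default_rng(7)
# Well-randomized arms: both drawn from the same underlying distribution.
gpa_pilot = rng.normal(3.0, 0.5, 2000)
gpa_control = rng.normal(3.0, 0.5, 2000)
smd = standardized_mean_diff(gpa_pilot, gpa_control)
# |smd| near zero is what proper randomization should produce; repeat
# across all covariates (or on propensity/prediction scores).
```

With nested units (sections, advisors, schools), the same check should be run within and across clusters, since cluster-level assignment can balance students overall while leaving clusters badly imbalanced.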
Idea Number 7: Ask yourself… Which covariates were used in matching?
Many research papers show the covariates used in matching. However, not many show the predictive power of these covariates. It is of paramount importance that best machine learning practices be followed in matching. These practices revolve around:
L1, L2, or hybrid regularization to minimize model complexity and determine the predictive power of covariates
Resulting predictive model performance, encompassing accuracy and calibration
A QQ calibration plot with Brier score and Hosmer-Lemeshow p-value to detect model calibration failure
Since we are not focused on maximizing the predictive power of a propensity-score model, accuracy of the propensity-score model is not an important metric.
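The calibration checks named above can be sketched in a few lines. The example below (synthetic predicted probabilities and outcomes) computes a Brier score and a decile-based Hosmer-Lemeshow p-value; the binning and test statistic follow the standard decile construction, not any particular vendor's implementation:

```python
# Sketch: Brier score plus a decile-based Hosmer-Lemeshow test to flag
# calibration failure in a propensity or prediction model. A small
# Hosmer-Lemeshow p-value signals miscalibration.
import numpy as np
from scipy.stats import chi2

def brier_score(y, p):
    """Mean squared error between predicted probabilities and outcomes."""
    return np.mean((p - y) ** 2)

def hosmer_lemeshow(y, p, bins=10):
    """Chi-square over probability deciles; returns the p-value."""
    order = np.argsort(p)
    stat = 0.0
    for grp in np.array_split(order, bins):
        obs, exp, n_g = y[grp].sum(), p[grp].sum(), len(grp)
        stat += (obs - exp) ** 2 / (exp * (1 - exp / n_g) + 1e-12)
    return chi2.sf(stat, bins - 2)

rng = np.random.default_rng(3)
p = rng.uniform(0.05, 0.95, 5000)
y = rng.binomial(1, p)     # outcomes drawn from p => well calibrated
bs = brier_score(y, p)
hl_p = hosmer_lemeshow(y, p)
```

For a well-calibrated model the Hosmer-Lemeshow p-value should not be small, and the Brier score should hover near the outcome's irreducible noise floor rather than near zero.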
Idea Number 8: Ask yourself… Can you show how effective matching was in minimizing selection bias?
It is of paramount importance that we see evidence of matching. There exist a number of theories and opinions on matching. At a minimum, we would like to see that prediction and propensity scores are well matched, as shown in Figure 1.
Figure 1: Propensity (left) and prediction (right) score probability density functions before (top) and after (bottom) matching. As you can see, there were significant differences between pilot and control, which disappeared after matching.
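The kind of evidence Figure 1 summarizes can be produced with even a simple matching scheme. The sketch below (synthetic scores; greedy 1:1 nearest-neighbor matching is one of many possible schemes, not necessarily the one used in the figure) shows the pilot-vs-control score gap collapsing after matching:

```python
# Sketch: greedy 1:1 nearest-neighbor matching on propensity score without
# replacement, then a before/after comparison of pilot vs control scores.
# Score distributions are synthetic and illustrative.
import numpy as np

rng = np.random.default_rng(5)
ps_pilot = rng.beta(3, 3, 300)          # pilot scores center higher
ps_control_pool = rng.beta(2, 4, 3000)  # control pool skews lower

# Greedy nearest-neighbor match: each pilot takes the closest unused control.
used = np.zeros(len(ps_control_pool), dtype=bool)
matched = []
for s in ps_pilot:
    dist = np.where(used, np.inf, np.abs(ps_control_pool - s))
    idx = np.argmin(dist)
    used[idx] = True
    matched.append(ps_control_pool[idx])
matched = np.array(matched)

gap_before = abs(ps_pilot.mean() - ps_control_pool.mean())
gap_after = abs(ps_pilot.mean() - matched.mean())
# gap_after should be far smaller than gap_before if matching worked;
# plotting the two density pairs reproduces a Figure-1-style picture.
```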
Causal impact analysis is serious business. The more we demand causal impact numbers from scientifically rigorous analyses with well-designed experiments, the more we can help students become more successful. Civitas Learning is in the middle of building an evidence-based efficacy network to help our partners get to student success faster and more effectively.
Dave Kil has more than 20 years of experience in building various analytics apps and solutions spanning nonlinear time-series analysis to predictive analytics, outcomes research, and user experience optimization. He and his team are working on (1) improving predictive algorithms to provide much more actionable insights, (2) adding new capabilities to automate ROI and outcomes analyses as part of action analytics, and (3) making the Civitas Learning analytics platform self-learning and more intelligent over time. He holds 14 U.S. patents, is the author of a book on pattern recognition and predictions, and has published a number of articles in journals. He currently serves as chief data scientist with Civitas Learning.
Laura Malcolm has nearly 20 years of education and product design experience in building technology products to serve partner institutions, faculty, and students in the attainment of their educational goals. She began her career as a high school teacher in Austin and through that experience discovered her deep interest in designing tools to help people learn. Prior to joining Civitas Learning, she spent 10 years in executive leadership roles directing the design and development of innovative educational technology products. She is a two-time CODiE Award recipient for product design. She currently serves as senior vice president, outcomes and strategy for Civitas Learning.