Executable Counterfactuals: Improving LLMs’ Causal Reasoning through Code

Aniket Vashishtha: Master's Student @ University of Illinois Urbana-Champaign

Statistics Seminars: Spring 2026

Department of Mathematical Sciences, IU Indianapolis

Organizer: Honglang Wang (hlwang at iu dot edu)

Talk time: 12:15-1:15pm (EST), 2/10/2026, Tuesday

Zoom Meetings: We host our seminars via Zoom. Join from a computer or mobile device by clicking Zoom to Join, or use Meeting ID: 845 0989 4694 with Password: 113959.

Title: Executable Counterfactuals: Improving LLMs’ Causal Reasoning through Code

Abstract: Counterfactual reasoning, a hallmark of intelligence, consists of three steps: inferring latent variables from observations (abduction), constructing alternatives (interventions), and predicting their outcomes (prediction). This skill is essential for advancing LLMs' causal understanding and expanding their applications in high-stakes domains such as scientific research. However, existing efforts to assess LLMs' counterfactual reasoning capabilities tend to skip the abduction step, effectively reducing the task to interventional reasoning and leading to overestimation of LLM performance. To address this, we introduce executable counterfactuals, a novel framework that operationalizes causal reasoning through code and math problems. Our framework explicitly requires all three steps of counterfactual reasoning and enables scalable creation of synthetic data at varying difficulty, creating a frontier for evaluating and improving LLMs' reasoning. Our results reveal a substantial drop in accuracy (25-40%) from interventional to counterfactual reasoning for SOTA models such as o4-mini and Claude-4-Sonnet. To address this gap, we construct a training set of counterfactual code problems with if-else conditions and test on out-of-domain code structures (e.g., those with while-loops); we also test whether a model trained on code generalizes to counterfactual math word problems. While supervised fine-tuning on stronger models' reasoning traces improves the in-domain performance of Qwen models, it decreases accuracy on OOD tasks such as counterfactual math problems. In contrast, reinforcement learning induces the core cognitive behaviors and generalizes to new domains, yielding gains over the base model on both code (1.5x-2x improvement) and math problems. Analysis of the reasoning traces reinforces these findings and highlights the promise of RL for improving LLMs' counterfactual reasoning.
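To give a flavor of the three steps the abstract describes, here is a minimal toy sketch (our own illustration, not an example from the talk or its benchmark) of a counterfactual code problem built on an if-else program. The program, its parameters, and the search procedure are all hypothetical; they simply show why abduction (recovering the hidden input from an observed output) must precede intervention and prediction.

```python
def program(x, threshold):
    """Toy if-else program; x is a latent input the solver must infer."""
    if x >= threshold:
        return x * 2
    else:
        return x + 10

def abduce(observed, threshold):
    """Step 1 (abduction): recover all hidden inputs consistent with the
    observed output. Note there may be several: 14 = 7*2 (if-branch) and
    14 = 4+10 (else-branch) are both consistent when threshold=5."""
    return [x for x in range(-100, 101)  # bounded integer search domain
            if program(x, threshold) == observed]

def counterfactual(observed, threshold, new_threshold):
    """Step 2 (intervention): change the threshold while keeping each
    abduced input fixed. Step 3 (prediction): rerun the program under
    the intervention and report the counterfactual outputs."""
    return {x: program(x, new_threshold)
            for x in abduce(observed, threshold)}

print(counterfactual(14, 5, 8))  # {4: 14, 7: 17}
```

Skipping abduction here (e.g., handing the solver the input directly) reduces the problem to plain interventional reasoning, which is the gap the abstract highlights.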

Bio: Aniket Vashishtha is currently a Master's student in Computer Science at the University of Illinois Urbana-Champaign (UIUC), where he is advised by Prof. Hao Peng. Prior to UIUC, he worked as a Research Fellow at Microsoft Research India. He is broadly interested in causality and Large Language Models (LLMs), with a focus on developing methods that strengthen LLMs' causal reasoning and make their use more reliable in high-stakes domains such as healthcare and scientific discovery. His work has been published in top ML conferences and recognized through spotlight talks at different venues.

Please join us on Zoom to learn more about Mr. Vashishtha's research!