About
Hi! I am Lin Shi from Beijing, China. I am currently a Master's student in Computer Science at Cornell Tech (2025–2026). I graduated early from Dartmouth College with High Honors (2022–2025), where I double majored in Computer Science and Mathematics and was recognized as a Presidential Scholar, Neukom Scholar, and URAD Scholar.
I have led or participated in 10+ research projects with Stanford University, CMU, Harvard, Cornell, Dartmouth, Microsoft Research Asia, Shanghai AI Lab, Zhipu AI, and others. I am honored to work with Professor Ludwig Schmidt (Stanford) on Terminal Bench (Stanford x Laude Institute), Professor Soroush Vosoughi (Dartmouth) on LLM evaluation, Professor Paulo Carvalho (CMU) on AI Education, and Professor Faan Chen (Harvard) on transportation and environment research.
Recently, I have been studying how AI systems can become stronger programmers, focusing on their ability to write, debug, and reason about code that achieves complex tasks and aligns with human preferences. I approach the problem through the lenses of datasets and benchmarks, evaluation methodology, and architectural advances. I am also interested in how AI-augmented programming reshapes human learning and developer workflows, that is, what we should teach and learn about coding in this age of AI, and how.
As a former alpine ski athlete and current AI researcher, I trust my brain power more than my physical body, but the lessons and spirit of competitive sports and diverse hobbies have largely shaped who I am. I am always open to exploring new fields of research, just as I always welcome trying new sports and playful activities.
Feel free to reach out to me about academics, research, shared hobbies, or anything you want!
Research
Below is a selection of my work, tagged by theme.
(* equal contribution; ** core member and sub-team lead)

Evaluation Agent Dataset Benchmark
Stanford University & Laude Institute
Mike Merrill*, Alex Shaw*, Nicholas Carlini**, Boxuan Li**, Harsh Raj**, Ivan Bercovich**, Lin Shi**, et al.
Terminal Bench 1.5 under review; Adapters paper planned for submission to ICML 2026
I am a core contributor to Terminal Bench and Harbor. I lead the Registry and Adapter Team, which adapts other benchmarks into the Terminal-Bench / Harbor format to make agent evaluation uniform, convenient, and easy. I have managed 50+ benchmark adapters covering 5000+ tasks, coordinating 100+ contributors and directing adapter standardization, benchmark screening, and code quality control. We are still recruiting open-source community contributors to bring more adapters on board. Feel free to reach out!

Evaluation Dataset Benchmark
Shanghai AI Lab
Bowen Li*, Wenhan Wu*, Ziwei Tang*, Lin Shi*, John Yang, Jinyang Li, Shunyu Yao, Chen Qian, Binyuan Hui, Qicheng Zhang, Zhiyin Yu, He Du, Ping Yang, Dahua Lin, Chao Peng, Kai Chen
COLING 2025 (Oral) | ACL Anthology
DevEval, a dataset and evaluation framework for assessing LLMs across the software development lifecycle (software design, environment setup, implementation, unit testing, acceptance testing); released on OpenCompass.

Evaluation
Dartmouth College
Lin Shi, Chiyu Ma, Wenhua Liang, Xingjian Diao, Weicheng Ma, Soroush Vosoughi
AACL-IJCNLP 2025 (Oral) | arXiv
A systematic framework for evaluating position bias in LLM judges along three dimensions: repetition stability, position consistency, and preference fairness. We also investigated the key factors driving position bias.

Evaluation Agent
Dartmouth College
Chiyu Ma, Enpei Zhang, Yilun Zhao, Wenjun Liu, Yaning Jia, Peijun Qing, Lin Shi, Arman Cohan, Yujun Yan, Soroush Vosoughi
EMNLP 2025 | ACL Anthology
Multi-Agent-Debate exacerbates biases; LLM-as-meta-judges resist them.

Agent
Microsoft Research Asia
Informal Research Intern
In preparation for ICML 2026
Designed SAILRTS, a multimodal multi-agent system for automated real-world software-issue resolution through context isolation, enhanced codebase analysis, and bidirectional code–image localization.

AI Education Agent
CMU-OAK Lab
Visiting Researcher, supervised by Professor Paulo Carvalho
In progress
Developed a reinforcement-learning-inspired framework to identify learning segments and knowledge components to construct, evaluate, and optimize personalized learning paths.

AI Education
Dartmouth College
Senior Thesis, supervised by Professor Soroush Vosoughi
Deployed at Dartmouth
An AI-powered adaptive learning platform deployed for Dartmouth CS1 courses to personalize practice problems, learning progress, and feedback loops. Try it here (only available to Dartmouth students with a password)!

AI Education
Dartmouth College
Neukom Scholar Project
Deployed at Dartmouth
A full-stack multiplayer matching game. Learn by matching, and have fun! Use AI to translate materials into pairs, learn them by matching, and play against others. Try it here.

Mechanistic Interpretability
Dartmouth College
Chiyu Ma*, Lin Shi*, Ollie Liu, Wenhua Liang, Jiaqi Gan, Ming Cheng, Willie Neiswanger, Soroush Vosoughi
Under review at TMLR
Investigated how ablated inputs and fine-tuning methods (e.g., LoRA, BitFit) modify the circuits of LLMs; proposed an efficient framework for exploring the functionality of model components.

Transportation
Harvard University
Lin Shi*, Yiliang Jiang*, Faan Chen*, Kaiyi Zhu, Chris P Nielsen, Yuejiao Wang, Fang Tian, Jiaorong Wu, Xiaohong Chen
Journal of Transport Geography | DOI
Applied Structural Equation Modeling (SEM) to estimate the causal impact of the built environment on residents' travel behavior, providing insights for urban planning and transportation research.

Transportation
Harvard University
Faan Chen*, Lin Shi*, Yaxin Li, Qilin Wang, Haosen Sun, Xinyu Tang, Jiacheng Zu, Zhenwei Sun
Knowledge-Based Systems | DOI
Proposed multi-criteria decision-making (MCDM) frameworks for transportation and environmental policy analysis, providing road-safety policymaking support for APEC countries.
