About me

About

Hi! I am Lin Shi from Beijing, China. I am currently a Master's student in Computer Science at Cornell Tech (2025–2026). I graduated early from Dartmouth College with High Honors (2022–2025), where I double majored in Computer Science and Mathematics and was recognized as a Presidential Scholar, Neukom Scholar, and URAD Scholar.

I have led or participated in 10+ research projects with Stanford University, CMU, Harvard, Cornell, Dartmouth, Microsoft Research Asia, Shanghai AI Lab, Zhipu AI, and others. I am honored to work with Professor Ludwig Schmidt (Stanford) on Terminal-Bench (Stanford x Laude Institute), Professor Soroush Vosoughi (Dartmouth) on LLM evaluation, Professor Paulo Carvalho (CMU) on AI education, and Professor Faan Chen (Harvard) on transportation and environment research.

Recently, I have been studying how AI systems can become stronger programmers, focusing on their ability to write, debug, and reason about code that accomplishes complex tasks and aligns with human preferences. I approach the problem through the lenses of datasets and benchmarks, evaluation methodology, and architectural advances. I am also interested in how AI-augmented programming reshapes human learning and developer workflows, i.e., what we should teach and learn about coding in the age of AI, and how.

As a former alpine ski athlete and now an AI researcher, I trust my brain power more than my physical strength, but the lessons and spirit of competitive sports and diverse hobbies have largely shaped who I am. I am always open to exploring new fields of research, just as I always welcome trying new sports and playful activities.

Feel free to reach out to me about academics, research, shared hobbies, or anything you want!

Research

Below is some of my work, tagged by theme.

(* equal contribution; ** core member and sub-team lead)

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces
Evaluation, Agent, Dataset, Benchmark
Stanford University & Laude Institute
Mike Merrill*, Alex Shaw*, Nicholas Carlini**, Boxuan Li**, Harsh Raj**, Ivan Bercovich**, Lin Shi**, et al.
Terminal-Bench 1.5 under review; Adapters paper planned for submission to ICML 2026

I am a core contributor to Terminal-Bench and Harbor. I lead the Registry and Adapter Team, which adapts other benchmarks into the Terminal-Bench / Harbor format to make agent evaluation uniform, convenient, and easy. I managed 50+ benchmark adapters covering 5000+ tasks by coordinating 100+ contributors, and I directed adapter standardization, benchmark screening, and code quality control. We are still recruiting open-source community contributors to bring more adapters on board; feel free to reach out!

Prompting Large Language Models to Tackle the Full Software Development Lifecycle: A Case Study
Evaluation, Dataset, Benchmark
Shanghai AI Lab
Bowen Li*, Wenhan Wu*, Ziwei Tang*, Lin Shi*, John Yang, Jinyang Li, Shunyu Yao, Chen Qian, Binyuan Hui, Qicheng Zhang, Zhiyin Yu, He Du, Ping Yang, Dahua Lin, Chao Peng, Kai Chen
COLING 2025 (Oral) | ACL Anthology

DevEval, a dataset and evaluation framework for assessing LLMs across the software development lifecycle (software design, environment setup, implementation, unit testing, acceptance testing); released on OpenCompass.

Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge
Evaluation
Dartmouth College
Lin Shi, Chiyu Ma, Wenhua Liang, Xingjian Diao, Weicheng Ma, Soroush Vosoughi
AACL-IJCNLP 2025 (Oral) | arXiv

A systematic framework for evaluating the position bias of LLM judges in terms of repetition stability, position consistency, and preference fairness; also investigates the key factors driving position bias.

Judging with Many Minds: Do More Perspectives Mean Less Prejudice?
Evaluation, Agent
Dartmouth College
Chiyu Ma, Enpei Zhang, Yilun Zhao, Wenjun Liu, Yaning Jia, Peijun Qing, Lin Shi, Arman Cohan, Yujun Yan, Soroush Vosoughi
EMNLP 2025 | ACL Anthology

Multi-Agent-Debate exacerbates biases; LLM-as-meta-judges resist them.

SAILRTS: Multimodal Multi-Agent System for Automated Software Issue Resolution
Agent
Microsoft Research Asia
Informal Research Intern
In preparation for ICML 2026

Designed SAILRTS, a multimodal multi-agent system for automated real-world software-issue resolution through context isolation, enhanced codebase analysis, and bidirectional code–image localization.

Learning Path Optimization by Agent-simulated Students
AI Education, Agent
CMU-OAK Lab
Visiting Researcher, supervised by Professor Paulo Carvalho
In progress

Developed a reinforcement-learning-inspired framework that identifies learning segments and knowledge components in order to construct, evaluate, and optimize personalized learning paths.

Adaptive Learning Platform for CS Education
AI Education
Dartmouth College
Senior Thesis, supervised by Professor Soroush Vosoughi
Deployed at Dartmouth

AI-powered adaptive learning platform deployed for Dartmouth CS1 courses to personalize practice problems, learning progress, and feedback loops. Try it here (available only to Dartmouth students with a password)!

Universal Matching Game
AI Education
Dartmouth College
Neukom Scholar Project
Deployed at Dartmouth

A full-stack multiplayer matching game. Learn by matching, and have fun! Use AI to turn study materials into pairs, learn them by matching, and play against others. Try it here.

Mechanistic Insights: Circuit Transformations Across Input and Fine-Tuning Landscapes
Mechanistic Interpretability
Dartmouth College
Chiyu Ma*, Lin Shi*, Ollie Liu, Wenhua Liang, Jiaqi Gan, Ming Cheng, Willie Neiswanger, Soroush Vosoughi
Under review at TMLR

Investigated how ablated inputs and fine-tuning methods (e.g., LoRA, BitFit) modify the circuits of LLMs; proposed an efficient framework to explore the functionalities of model components.

Built Environment, Car Ownership and PM2.5: Stronger Causal Estimates from a Quasi-Experiment
Transportation
Harvard University
Lin Shi*, Yiliang Jiang*, Faan Chen*, Kaiyi Zhu, Chris P Nielsen, Yuejiao Wang, Fang Tian, Jiaorong Wu, Xiaohong Chen
Journal of Transport Geography | DOI

Applied Structural Equation Modeling (SEM) to estimate the causal impact of the built environment on residents' travel behavior, providing insights for urban planning and transportation research.

Measuring Road Safety Achievement Based on EWM-GRA-SVD: A Decision-Making Support System for APEC Countries
Transportation
Harvard University
Faan Chen*, Lin Shi*, Yaxin Li, Qilin Wang, Haosen Sun, Xinyu Tang, Jiacheng Zu, Zhenwei Sun
Knowledge-Based Systems | DOI

Proposed multi-criteria decision-making (MCDM) frameworks for transportation and environmental policy analysis, providing road-safety policymaking support for APEC countries.

Lin Shi (石林)
