Sitemap
A list of all the posts and pages found on the site. For you robots out there, there is an XML version available for digesting as well.
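For the robots mentioned above, an entry in the XML version typically follows the sitemaps.org protocol; the URLs below are illustrative placeholders, not this site's actual addresses:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- One <url> element per page or post on the site -->
  <url>
    <loc>https://example.com/publications/llmeval-med</loc>
    <lastmod>2025-01-01</lastmod>
  </url>
</urlset>
```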
Pages
Huayu Sha
Research homepage of Huayu Sha, a Fudan University student working on trustworthy language-model evaluation, medical benchmarks, and scientific intelligence.
Posts
Blog Post number 4
Published:
This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.
Blog Post number 3
Published:
This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.
Blog Post number 2
Published:
This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.
Blog Post number 1
Published:
This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.
Portfolio
Portfolio item number 1
Published:
Short description of portfolio item number 1
Portfolio item number 2
Published:
Short description of portfolio item number 2
Publications
LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation
Published in Findings of EMNLP 2025, 2025
LLMEval-Med is a physician-validated clinical benchmark built from real-world electronic health records and expert-designed scenarios. It targets the weaknesses of existing medical LLM evaluations by moving beyond exam-style questions toward realistic clinical reasoning and checklist-based expert assessment.
LLMEval-Fair: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models
Published in arXiv preprint, 2025
LLMEval-Fair proposes a dynamic evaluation framework that samples unseen test sets from a large question bank, combines contamination-resistant curation with anti-cheating design, and studies almost 50 frontier models longitudinally to produce a more reliable picture of progress than static leaderboards.
OpenNovelty: An Open-domain Benchmark for Evaluating the Open-ended Novelty of Language Models
Published in arXiv preprint, 2026
OpenNovelty studies whether language models can judge the novelty of open-ended ideas rather than only solve fixed-answer tasks. It introduces an open-domain benchmark for comparing LLM judgments of novelty across research-oriented scenarios, aiming to better understand how models support scientific creativity and evaluation.
Talks
Talk 1 on Relevant Topic in Your Field
Published:
This is a description of your talk, which is a markdown file that can be all markdown-ified like any other post. Yay markdown!
Conference Proceeding talk 3 on Relevant Topic in Your Field
Published:
This is a description of your conference proceedings talk; note the different value in the type field. You can put anything in this field.
Teaching
Teaching experience 1
Undergraduate course, University 1, Department, 2014
This is a description of a teaching experience. You can use markdown like any other post.
Teaching experience 2
Workshop, University 1, Department, 2015
This is a description of a teaching experience. You can use markdown like any other post.