Movable Feast
How 4 Frontier LLMs Handle Movable Holiday Dates
Key Finding: All four models scored far higher on fixed holidays (e.g., Christmas, Independence Day; 67.5%-100% accuracy) than on movable holidays (Easter, Chinese New Year, Eid al-Fitr; 3.3%-60.0%). Grok-4.1-Fast achieved 82.9% overall accuracy; GPT-5.1 achieved 40.0%.
Abstract
We asked four frontier large language models to identify holidays from calendar dates. All models performed well on fixed holidays. Performance dropped sharply for movable holidays. This case series documents what we observed. We make no claims about why these differences exist.
Method
Each model received 70 queries asking "What holiday falls on {date}?" with no examples or chain-of-thought prompting.
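A minimal sketch of how one such zero-shot query could be rendered. The template string is the one quoted above; the date format ("December 25, 2025") and the helper name `build_query` are assumptions, since the exact rendering used by the run script is not stated here.

```python
from datetime import date

# Template quoted in the Method section.
TEMPLATE = "What holiday falls on {date}?"

def build_query(d: date) -> str:
    """Render one zero-shot query (hypothetical helper; the date format is assumed)."""
    # %B gives the full month name; d.day avoids a zero-padded day number.
    pretty = f"{d:%B} {d.day}, {d.year}"
    return TEMPLATE.format(date=pretty)

print(build_query(date(2025, 12, 25)))
```

No examples or reasoning instructions are added to the string, matching the zero-shot setup described above.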
Models Tested
| Model | Provider |
|---|---|
| Grok-4.1-Fast | xAI |
| Gemini-3-Pro | Google |
| GPT-5.1 | OpenAI |
| Llama-4-Maverick | Meta |
Holidays
Movable (30 queries per model): Chinese New Year, Eid al-Fitr, Easter Sunday
Fixed (40 queries per model): New Year's Day, Labor Day, Independence Day, Christmas
Years tested: 2020-2029.
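To illustrate how a movable date shifts from year to year, the sketch below computes Easter Sunday for each tested year with the widely published anonymous Gregorian algorithm (Meeus/Jones/Butcher). It assumes the Western (Gregorian) Easter and is not taken from the study's run script, which may source its dates differently; `easter_sunday` is a name chosen here for illustration.

```python
from datetime import date

def easter_sunday(year: int) -> date:
    """Western Easter Sunday via the anonymous Gregorian algorithm."""
    a = year % 19                       # position in the 19-year Metonic cycle
    b, c = divmod(year, 100)            # century and year within century
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30  # offset of the paschal full moon
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7  # days from that full moon to Sunday
    m = (a + 11 * h + 22 * l) // 451
    month, day0 = divmod(h + l - 7 * m + 114, 31)
    return date(year, month, day0 + 1)

# Years covered by the study.
for year in range(2020, 2030):
    print(year, easter_sunday(year).isoformat())
```

Chinese New Year and Eid al-Fitr follow lunisolar and lunar calendars respectively and have no comparably short closed-form rule, so a lookup table would be the more practical choice for them.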
Results
Overall Accuracy
| Model | Accuracy | 95% CI |
|---|---|---|
| Grok-4.1-Fast | 82.9% (58/70) | [72.4%, 89.9%] |
| Llama-4-Maverick | 67.1% (47/70) | [55.5%, 77.0%] |
| Gemini-3-Pro | 52.9% (37/70) | [41.3%, 64.1%] |
| GPT-5.1 | 40.0% (28/70) | [29.3%, 51.7%] |
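The reported intervals are consistent with a 95% Wilson score interval at n = 70. That the study used this method is an inference from the numbers, not something stated above. A minimal sketch, with `wilson_interval` as a name chosen here:

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Correct-answer counts from the Overall Accuracy table.
for model, correct in [
    ("Grok-4.1-Fast", 58),
    ("Llama-4-Maverick", 47),
    ("Gemini-3-Pro", 37),
    ("GPT-5.1", 28),
]:
    lo, hi = wilson_interval(correct, 70)
    print(f"{model}: [{lo:.1%}, {hi:.1%}]")
```

Unlike the simpler normal-approximation interval, the Wilson interval stays inside [0, 1] and behaves better for small samples and proportions near the extremes.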
Fixed vs. Movable
| Model | Fixed | Movable | Gap |
|---|---|---|---|
| Grok-4.1-Fast | 100.0% | 60.0% | 40pp |
| Llama-4-Maverick | 97.5% | 26.7% | 71pp |
| Gemini-3-Pro | 80.0% | 16.7% | 63pp |
| GPT-5.1 | 67.5% | 3.3% | 64pp |
The Gap: The difference between fixed and movable performance ranged from 40 to 71 percentage points. GPT-5.1 correctly identified only 1 out of 30 movable holiday dates.
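The per-category percentages can be cross-checked against the overall totals. The counts below are back-calculated from the reported percentages (for example, 60.0% of 30 is 18) rather than taken from the raw data, so treat them as an assumption; `gap_pp` is a helper name chosen here.

```python
# (fixed_correct, movable_correct), back-calculated from the reported percentages.
counts = {
    "Grok-4.1-Fast": (40, 18),
    "Llama-4-Maverick": (39, 8),
    "Gemini-3-Pro": (32, 5),
    "GPT-5.1": (27, 1),
}
N_FIXED, N_MOVABLE = 40, 30  # queries per model in each category

def gap_pp(fixed_correct: int, movable_correct: int) -> float:
    """Fixed-minus-movable accuracy gap in percentage points."""
    return 100 * (fixed_correct / N_FIXED - movable_correct / N_MOVABLE)

for model, (fixed, movable) in counts.items():
    total = fixed + movable
    print(f"{model}: total {total}/70, gap {gap_pp(fixed, movable):.0f}pp")
```

Under these counts, each model's fixed and movable totals sum to the overall figure reported in the first table.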
Error Patterns
Two distinct patterns emerged:
- Grok and Llama: Always attempted an answer, sometimes incorrectly. Errors included cultural mismatch (e.g., "Ascension Day" for Eid) and minor observances (e.g., "Darwin Day").
- Gemini and GPT-5.1: Frequently declined to answer (32-41 empty responses). When they did answer, incorrect answers were rare.
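One way to separate the two patterns is to tally each response as correct, incorrect, or empty. The sketch below is hypothetical: the matching rule (case-insensitive substring) and the names `classify` and `tally` are assumptions, not the study's scoring procedure, and the sample responses are made up for illustration.

```python
from collections import Counter

def classify(response: str, expected: str) -> str:
    """Label one response as 'empty', 'correct', or 'incorrect' (assumed rule)."""
    text = response.strip().lower()
    if not text:
        return "empty"
    # Assumed matching rule: the expected holiday name appears in the response.
    return "correct" if expected.lower() in text else "incorrect"

def tally(pairs) -> Counter:
    """Count outcome labels over (response, expected) pairs."""
    return Counter(classify(response, expected) for response, expected in pairs)

# Illustrative responses only, not data from the study.
sample = [
    ("Eid al-Fitr", "Eid al-Fitr"),
    ("Ascension Day", "Eid al-Fitr"),
    ("", "Easter Sunday"),
]
print(tally(sample))
```

Keeping "empty" apart from "incorrect" matters here because two models with similar accuracy can differ sharply in how often they decline to answer.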
Discussion
All models knew Christmas is December 25. Most knew July 4 is Independence Day. Performance degraded substantially for holidays whose dates change each year.
What We Cannot Say
This study does not support claims about "temporal reasoning capability," whether models "struggle with" any task, or the mechanisms underlying the observed differences. We report what we observed on one afternoon in December 2025.
Limitations
- Four models, one day: Results may not replicate
- English only: No cross-lingual assessment
- Western bias in fixed holidays
- Unknown training cutoffs
- Small sample per category (n=30 movable and n=40 fixed queries per model, which yields wide CIs)
Conclusion
On December 3, 2025, we asked four frontier LLMs to identify holidays from dates. They knew fixed holidays (67.5%-100%). They frequently failed on movable holidays (3.3%-60%). Grok-4.1-Fast performed best; GPT-5.1 performed worst.
These results describe a single snapshot. We report what we observed.
Materials
All materials available at: github.com/credentum/vivarium-lab
- Full report: movable-feast/docs/REPORT_v2.7.md
- Run script: movable-feast/scripts/run_study.py
- Requirements: movable-feast/requirements.txt
This case series was designed, executed, and written with AI assistance (Claude Opus 4.5). Reviewed by DeepSeek-V3.2.