Vivarium Lab Case Series • December 3, 2025

Movable Feast

How 4 Frontier LLMs Handle Movable Holiday Dates

Key Finding: All models performed well on fixed holidays (Christmas, Independence Day). Performance dropped sharply for movable holidays (Easter, Chinese New Year, Eid al-Fitr). Grok-4.1-Fast achieved 82.9% overall accuracy; GPT-5.1 achieved 40.0%.

Abstract

We asked four frontier large language models to identify holidays from calendar dates. All models performed well on fixed holidays. Performance dropped sharply for movable holidays. This case series documents what we observed. We make no claims about why these differences exist.

Method

Each model received 70 queries asking "What holiday falls on {date}?" with no examples or chain-of-thought prompting.

Models Tested

Model	Provider
Grok-4.1-Fast	xAI
Gemini-3-Pro	Google
GPT-5.1	OpenAI
Llama-4-Maverick	Meta

Holidays

Movable (30 queries per model): Chinese New Year, Eid al-Fitr, Easter Sunday

Fixed (40 queries per model): New Year's Day, Labor Day, Independence Day, Christmas

Years tested: 2020-2029.
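
The fixed/movable distinction can be made concrete with a short sketch of how such a query grid might be built. This is not the study's run script. The ISO date format, the helper names, and the choice of one query per holiday per year are assumptions. Labor Day is left out because this report does not state which date was used. Chinese New Year and Eid al-Fitr are left out because their dates come from lunar calendars and would need a lookup table. Easter Sunday is computed with the well-known anonymous Gregorian algorithm.

```python
from datetime import date

# Prompt wording taken from the Method section; the date format is an assumption.
PROMPT = "What holiday falls on {date}?"

# Fixed holidays: same (month, day) every year.
# Labor Day is omitted: the report does not give the date that was queried.
FIXED = {
    "New Year's Day": (1, 1),
    "Independence Day": (7, 4),
    "Christmas": (12, 25),
}


def easter_sunday(year: int) -> date:
    """Gregorian Easter Sunday via the anonymous Gregorian algorithm."""
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    month, day0 = divmod(h + l - 7 * m + 114, 31)
    return date(year, month, day0 + 1)


def build_queries(years):
    """Return (holiday, date, prompt) tuples for the holidays covered here."""
    queries = []
    for year in years:
        for name, (month, day) in FIXED.items():
            d = date(year, month, day)
            queries.append((name, d, PROMPT.format(date=d.isoformat())))
        d = easter_sunday(year)
        queries.append(("Easter Sunday", d, PROMPT.format(date=d.isoformat())))
    return queries


if __name__ == "__main__":
    for name, d, prompt in build_queries(range(2020, 2030))[:4]:
        print(name, "|", prompt)
```

A fixed holiday needs only a constant (month, day) pair, while even the simplest movable holiday here needs a multi-step calculation per year.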

Results

Overall Accuracy

Model	Accuracy	95% CI
Grok-4.1-Fast	82.9% (58/70)	[72.4%, 89.9%]
Llama-4-Maverick	67.1% (47/70)	[55.5%, 77.0%]
Gemini-3-Pro	52.9% (37/70)	[41.3%, 64.1%]
GPT-5.1	40.0% (28/70)	[29.3%, 51.7%]
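
The report does not name the interval method. The sketch below uses the Wilson score interval with z = 1.96, which is an assumption on our part; rounded to one decimal place, it reproduces the bounds in the table above. The function name and the dictionary of counts are illustrative.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (assumed method)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Correct-answer counts out of 70, copied from the Overall Accuracy table.
CORRECT = {
    "Grok-4.1-Fast": 58,
    "Llama-4-Maverick": 47,
    "Gemini-3-Pro": 37,
    "GPT-5.1": 28,
}

if __name__ == "__main__":
    for model, k in CORRECT.items():
        lo, hi = wilson_interval(k, 70)
        print(f"{model}: {100 * k / 70:.1f}% [{100 * lo:.1f}%, {100 * hi:.1f}%]")
```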

Fixed vs. Movable

Model	Fixed	Movable	Gap
Grok-4.1-Fast	100.0%	60.0%	40pp
Llama-4-Maverick	97.5%	26.7%	71pp
Gemini-3-Pro	80.0%	16.7%	63pp
GPT-5.1	67.5%	3.3%	64pp

The Gap: The difference between fixed and movable performance ranged from 40 to 71 percentage points. GPT-5.1 correctly identified only 1 out of 30 movable holiday dates.
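
The gap figures can be checked with simple arithmetic. The per-category counts below are not reported directly. They are back-calculated from the percentages in the table (for example, 97.5% of 40 is 39) and should be read as an inference. The sketch also confirms that each pair of counts sums to the overall total reported earlier.

```python
N_FIXED, N_MOVABLE = 40, 30

# (fixed_correct, movable_correct, overall_correct); the first two values are
# inferred from the reported percentages, the third is from the overall table.
COUNTS = {
    "Grok-4.1-Fast": (40, 18, 58),
    "Llama-4-Maverick": (39, 8, 47),
    "Gemini-3-Pro": (32, 5, 37),
    "GPT-5.1": (27, 1, 28),
}


def gap_pp(fixed_correct: int, movable_correct: int) -> float:
    """Fixed-minus-movable accuracy gap in percentage points."""
    return 100 * (fixed_correct / N_FIXED - movable_correct / N_MOVABLE)


if __name__ == "__main__":
    for model, (fixed, movable, overall) in COUNTS.items():
        # The inferred category counts should add up to the reported total.
        assert fixed + movable == overall
        print(f"{model}: gap = {gap_pp(fixed, movable):.1f}pp")
```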

Error Patterns

Two distinct patterns emerged:

Grok and Llama: Always attempted an answer, sometimes incorrectly. Errors included cultural mismatch (e.g., "Ascension Day" for Eid) and minor observances (e.g., "Darwin Day").
Gemini and GPT-5.1: Frequently declined to answer (32-41 empty responses). When they did answer, incorrect answers were rare.
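
Separating these two patterns requires scoring each response as correct, wrong, or empty rather than simply right or wrong. The sketch below shows one way that could be done. The grading rule (case-insensitive substring match against a set of accepted names) and all identifiers are hypothetical, since the report does not describe how responses were graded.

```python
from collections import Counter


def classify(response: str, accepted: set[str]) -> str:
    """Label a response as 'empty', 'correct', or 'wrong' (hypothetical rule)."""
    text = response.strip().lower()
    if not text:
        return "empty"
    if any(name.lower() in text for name in accepted):
        return "correct"
    return "wrong"


if __name__ == "__main__":
    accepted = {"Eid al-Fitr", "Eid"}
    # Illustrative responses only; "Ascension Day" is the mismatch cited above.
    responses = ["Eid al-Fitr", "Ascension Day", "", "   "]
    tally = Counter(classify(r, accepted) for r in responses)
    print(dict(tally))
```

Tracking empty responses as their own category keeps a model that declines often from being confused with one that answers often but wrongly, even when their accuracy is similar.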

Discussion

All models knew Christmas is December 25. Most knew July 4 is Independence Day. Performance degraded substantially for holidays whose dates change each year.

What We Cannot Say

This study does not support claims about "temporal reasoning capability," whether models "struggle with" any task, or the mechanisms underlying the observed differences. We report what we observed on one afternoon in December 2025.

Limitations

Four models, one day: Results may not replicate
English only: No cross-lingual assessment
Western bias in fixed holidays
Unknown training cutoffs
Small sample per category (n=30 movable and n=40 fixed queries per model, which yields wide CIs)

Conclusion

On December 3, 2025, we asked four frontier LLMs to identify holidays from dates. Accuracy on fixed holidays ranged from 67.5% to 100%. Accuracy on movable holidays ranged from 3.3% to 60%. Grok-4.1-Fast had the highest overall accuracy; GPT-5.1 had the lowest.

These results describe a single snapshot. We report what we observed.

Materials

All materials available at: github.com/credentum/vivarium-lab

Full report: movable-feast/docs/REPORT_v2.7.md
Run script: movable-feast/scripts/run_study.py
Requirements: movable-feast/requirements.txt

This case series was designed, executed, and written with AI assistance (Claude Opus 4.5). Reviewed by DeepSeek-V3.2.