
Movable Feast

How 4 Frontier LLMs Handle Movable Holiday Dates

Key Finding: Every model scored higher on fixed holidays (Christmas, Independence Day) than on movable holidays (Easter, Chinese New Year, Eid al-Fitr). Fixed-holiday accuracy ranged from 67.5% to 100%; movable-holiday accuracy ranged from 3.3% to 60.0%. Overall, Grok-4.1-Fast achieved 82.9% accuracy; GPT-5.1 achieved 40.0%.

Abstract

We asked four frontier large language models to identify holidays from calendar dates. All models performed well on fixed holidays. Performance dropped sharply for movable holidays. This case series documents what we observed. We make no claims about why these differences exist.

Method

Each model received 70 queries asking "What holiday falls on {date}?" with no examples or chain-of-thought prompting.

Models Tested

Model             Provider
Grok-4.1-Fast     xAI
Gemini-3-Pro      Google
GPT-5.1           OpenAI
Llama-4-Maverick  Meta

Holidays

Movable (30 queries per model): Chinese New Year, Eid al-Fitr, Easter Sunday

Fixed (40 queries per model): New Year's Day, Labor Day, Independence Day, Christmas

Years tested: 2020-2029.
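To illustrate why movable holidays are harder to map from a date, the sketch below computes Easter Sunday for each tested year with the standard anonymous Gregorian (Meeus/Jones/Butcher) algorithm and fills the query template from the Method section. It is not the study's own harness: the function names and the date format ("April 12, 2020") are assumptions, and Chinese New Year and Eid al-Fitr are omitted because they depend on lunar calendars that need lookup tables or astronomical rules.

```python
from datetime import date


def easter_sunday(year: int) -> date:
    """Gregorian Easter Sunday via the anonymous Gregorian algorithm."""
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    month, day = divmod(h + l - 7 * m + 114, 31)
    return date(year, month, day + 1)


def build_query(d: date) -> str:
    # Assumed date format; the source only gives the template with {date}.
    return f"What holiday falls on {d:%B} {d.day}, {d.year}?"


# One Easter query per tested year (2020 through 2029 inclusive).
easter_queries = [build_query(easter_sunday(y)) for y in range(2020, 2030)]

if __name__ == "__main__":
    for q in easter_queries:
        print(q)
```

Across the ten tested years the computed date moves between late March and late April, whereas a fixed holiday such as Christmas always maps to the same month and day.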

Results

Overall Accuracy

Model             Accuracy       95% CI
Grok-4.1-Fast     82.9% (58/70)  [72.4%, 89.9%]
Llama-4-Maverick  67.1% (47/70)  [55.5%, 77.0%]
Gemini-3-Pro      52.9% (37/70)  [41.3%, 64.1%]
GPT-5.1           40.0% (28/70)  [29.3%, 51.7%]
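The text does not say how the 95% confidence intervals were computed. The reported bounds are numerically consistent with a Wilson score interval at z = 1.96, so the sketch below uses that method as an assumption; the function name is illustrative.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (assumed method)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Correct answers out of 70 queries, as reported in the table.
correct = {
    "Grok-4.1-Fast": 58,
    "Llama-4-Maverick": 47,
    "Gemini-3-Pro": 37,
    "GPT-5.1": 28,
}

if __name__ == "__main__":
    for model, k in correct.items():
        lo, hi = wilson_interval(k, 70)
        print(f"{model}: {k / 70:.1%} [{lo:.1%}, {hi:.1%}]")
```

With only 70 queries per model, each interval spans roughly 17 to 23 percentage points, so adjacent models in the ranking overlap.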

Fixed vs. Movable

Model             Fixed   Movable  Gap
Grok-4.1-Fast     100.0%  60.0%    40pp
Llama-4-Maverick  97.5%   26.7%    71pp
Gemini-3-Pro      80.0%   16.7%    63pp
GPT-5.1           67.5%   3.3%     64pp

The Gap: The difference between fixed and movable performance ranged from 40 to 71 percentage points. GPT-5.1 correctly identified only 1 out of 30 movable holiday dates.
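The gap arithmetic can be checked with a short sketch. The per-category counts below are not stated in the text; they are inferred from the reported percentages (40 fixed and 30 movable queries per model) and agree with the overall totals out of 70.

```python
FIXED_N, MOVABLE_N = 40, 30

# (fixed correct, movable correct), inferred from the reported percentages.
counts = {
    "Grok-4.1-Fast": (40, 18),
    "Llama-4-Maverick": (39, 8),
    "Gemini-3-Pro": (32, 5),
    "GPT-5.1": (27, 1),
}


def gap_pp(fixed_correct: int, movable_correct: int) -> int:
    """Fixed minus movable accuracy, in whole percentage points."""
    fixed_pct = 100 * fixed_correct / FIXED_N
    movable_pct = 100 * movable_correct / MOVABLE_N
    return round(fixed_pct - movable_pct)


gaps = {model: gap_pp(f, m) for model, (f, m) in counts.items()}

if __name__ == "__main__":
    for model, (f, m) in counts.items():
        print(f"{model}: fixed {f}/{FIXED_N}, movable {m}/{MOVABLE_N}, "
              f"gap {gaps[model]}pp, total {f + m}/70")
```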

Error Patterns

Two distinct patterns emerged:

Discussion

All models knew Christmas is December 25. Most knew July 4 is Independence Day. Performance degraded substantially for holidays whose dates change each year.

What We Cannot Say

This study does not support claims about "temporal reasoning capability," whether models "struggle with" any task, or the mechanisms underlying the observed differences. We report what we observed on one afternoon in December 2025.

Limitations

Conclusion

On December 3, 2025, we asked four frontier LLMs to identify holidays from dates. Accuracy on fixed holidays ranged from 67.5% to 100%. Accuracy on movable holidays ranged from 3.3% to 60%. Grok-4.1-Fast performed best; GPT-5.1 performed worst.

These results describe a single snapshot. We report what we observed.

Materials

All materials available at: github.com/credentum/vivarium-lab

This case series was designed, executed, and written with AI assistance (Claude Opus 4.5). Reviewed by DeepSeek-V3.2.