Needle in a haystack
Also known as: NIAH, needle-in-haystack test, long context recall test, context retrieval benchmark
The needle-in-a-haystack test takes a long, mostly irrelevant document (the haystack), inserts a specific fact (the needle) at various positions within it, and then asks the model a question that can only be answered by finding that fact. The test sweeps across both the length of the document and the position of the needle, producing a grid of accuracy scores. A model that advertises a 128K context window but scores poorly when the needle sits in the middle of that window has an effective context shorter than its advertised one.
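The sweep described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the filler sentence, the needle, the question, and the names build_haystack, run_sweep, and toy_model are all hypothetical. Sentence counts stand in for token counts, and toy_model is a stand-in that only "sees" the start and end of its prompt, so that the middle-of-window failure is visible without calling a real model.

```python
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical test data: any filler text and any unambiguous fact would do.
FILLER = "The quick brown fox jumps over the lazy dog."
NEEDLE = "The secret code for the vault is 7421."
QUESTION = "What is the secret code for the vault?"
EXPECTED = "7421"


def build_haystack(n_sentences: int, depth: float) -> str:
    """Return n_sentences of filler with the needle inserted at a relative
    depth, where 0.0 is the start of the document and 1.0 is the end."""
    sentences = [FILLER] * n_sentences
    sentences.insert(round(depth * n_sentences), NEEDLE)
    return " ".join(sentences)


def run_sweep(
    ask_model: Callable[[str], str],
    lengths: Iterable[int],
    depths: Iterable[float],
) -> Dict[Tuple[int, float], bool]:
    """Query the model once per (length, depth) cell and record whether the
    expected answer appears in its reply (a deliberately crude scorer)."""
    depths = list(depths)
    grid = {}
    for n in lengths:
        for d in depths:
            prompt = build_haystack(n, d) + "\n\n" + QUESTION
            grid[(n, d)] = EXPECTED in ask_model(prompt)
    return grid


def toy_model(prompt: str) -> str:
    """Stand-in for a real model call: it can only use the first and last
    2000 characters of the prompt, so mid-document facts are lost."""
    visible = prompt[:2000] + prompt[-2000:]
    return EXPECTED if NEEDLE in visible else "I don't know."
```

Calling run_sweep(toy_model, [10, 200], [0.0, 0.5, 1.0]) returns one pass/fail value per cell. A real harness would replace toy_model with an actual model call, measure length in tokens, and typically average several trials per cell before plotting the grid.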
This test became a standard way to expose the gap between advertised context length and effective context length. A model can technically 'process' 100,000 tokens and still fail to reliably retrieve a specific fact placed around token 50,000. The needle-in-a-haystack visualization, usually a color-coded grid of document length against needle position, became a common way for labs to communicate long-context performance.
For builders, the practical takeaway is: don't assume a long context window means reliable information access across the whole window. Test your specific use case. If you're building a document Q&A system, needle-in-a-haystack style testing on your own documents tells you much more than the advertised token count does. Use chunking and retrieval strategies to reduce how much context the model has to search through.
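As a rough illustration of the chunking-and-retrieval idea, the sketch below splits a document into overlapping word windows and keeps only the chunks that share the most words with the question. The function names (chunk_words, top_chunks) and the word-overlap scoring are assumptions chosen for brevity; production systems more commonly rank chunks with embeddings or BM25.

```python
import re
from typing import List, Set


def chunk_words(text: str, chunk_size: int = 50, overlap: int = 10) -> List[str]:
    """Split text into windows of chunk_size words, with consecutive windows
    sharing `overlap` words so a fact is less likely to be cut in half."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]


def _terms(text: str) -> Set[str]:
    """Lowercased alphanumeric terms, used for crude lexical matching."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_chunks(question: str, chunks: List[str], k: int = 2) -> List[str]:
    """Return the k chunks sharing the most distinct terms with the question."""
    q = _terms(question)
    return sorted(chunks, key=lambda c: len(q & _terms(c)), reverse=True)[:k]
```

Only the chunks returned by top_chunks would then be placed in the prompt, so the model searches a few dozen words rather than the full document. The trade-off is that retrieval quality now bounds answer quality: if the scorer misses the relevant chunk, the model never sees the fact.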