A test should run until the data is sufficient, not for a fixed number of days. “Long enough” is defined by two things: enough traffic or conversions for the result to be statistically meaningful, and enough elapsed time to cover normal variation, meaning the weekly cycle of weekday versus weekend behavior and the crawl cycles that affect anything SEO-related. What matters is data sufficiency rather than the calendar, which is why “a week is enough” is unreliable: a week can be plenty of data on a high-traffic page and far too little on a low-traffic one.

Duration alone fails because a result is only trustworthy when it is unlikely to be noise. If a page gets a handful of conversions, a difference between variants can look dramatic and mean nothing, because the sample is too small to distinguish a real effect from random fluctuation. Statistical significance is the check that tells you the observed difference is unlikely to be chance alone, and it depends on volume, not on how many days have passed. A low-traffic test can run for weeks and still not reach a sample size you can trust.
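To make the volume point concrete, here is a minimal sketch of one common significance check, a pooled two-proportion z-test. The function name and the example counts are illustrative assumptions, not figures from any real test, and the normal approximation it relies on is itself shaky when conversion counts are very small.

```python
import math


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates.

    Uses a pooled two-proportion z-test (normal approximation).
    conv_* are conversion counts, n_* are visitor counts per variant.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability of a standard normal: 2 * (1 - Phi(|z|))
    return math.erfc(abs(z) / math.sqrt(2))


# Same observed rates (7.5% vs 15%), very different volumes (hypothetical counts).
small = two_proportion_p_value(3, 40, 6, 40)
large = two_proportion_p_value(300, 4000, 600, 4000)
print(f"small sample p-value: {small:.3f}")  # well above 0.05
print(f"large sample p-value: {large:.3g}")  # far below 0.05
```

The variant appears to double the conversion rate in both cases, but only the larger sample gives a difference you can separate from noise.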

Time still matters, but as a coverage requirement rather than a stopwatch. You want the test window to span the natural rhythms of your traffic so the result is not skewed by a single unusual stretch: at least a full weekly cycle so weekday and weekend behavior are both represented, and for search-related tests, enough time for crawling and re-indexing to take effect. Ending a test the moment a number looks good, before either the sample or the cycle coverage is there, is how teams “confirm” effects that evaporate later. Treat any specific duration or sample-size figure as a working guide and let your own significance check decide.
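The two conditions can be written as a single stopping check. This is a sketch under stated assumptions: `ready_to_stop`, its parameters, and the default of one full week are hypothetical names and values chosen for illustration, and the required sample size is something you supply from your own power calculation. It does not model crawl or re-indexing delays, which would need a longer minimum window for search-related tests.

```python
from datetime import date


def ready_to_stop(start, today, visitors_per_variant,
                  required_per_variant, min_full_weeks=1):
    """Return True only when both sample size and cycle coverage are met.

    start, today: datetime.date values for the test window.
    visitors_per_variant: list of visitor counts, one per variant.
    required_per_variant: minimum visitors each variant needs (assumed
        to come from your own sample-size calculation).
    min_full_weeks: how many complete weekly cycles the window must span.
    """
    days_elapsed = (today - start).days
    full_weeks = days_elapsed // 7
    enough_sample = min(visitors_per_variant) >= required_per_variant
    enough_time = full_weeks >= min_full_weeks
    return enough_sample and enough_time


start = date(2024, 1, 1)
# Plenty of traffic after 5 days, but no full weekly cycle yet.
print(ready_to_stop(start, date(2024, 1, 6), [5000, 5100], 4000))
# Two full weeks and enough traffic in every variant.
print(ready_to_stop(start, date(2024, 1, 15), [5000, 5100], 4000))
```

Requiring both conditions is the point of the design: a good-looking number on day five fails the time check, and a long-running low-traffic test fails the sample check.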

So run the test until it has both enough volume to be statistically meaningful and enough time to cover at least a full cycle of normal variation, rather than stopping on a set date, and only then trust the result. If the traffic is low, accept that the test needs longer or that the result may never reach confidence, and resist calling a winner before the data is sufficient.
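A rough way to see how much longer a low-traffic test needs is to estimate the required sample size up front and convert it to days. The sketch below uses the standard normal-approximation formula for comparing two proportions. The function names, the 5% baseline, the 20% relative lift, and the traffic figures are all illustrative assumptions, and the defaults of 5% significance and 80% power are conventional choices rather than requirements.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, relative_lift,
                            alpha=0.05, power=0.8):
    """Approximate visitors needed per variant to detect a relative lift.

    Normal-approximation formula for a two-sided test of two proportions.
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def days_needed(n_per_variant, daily_visitors, variants=2):
    """Days to collect the sample, rounded up to whole weeks (minimum one)."""
    raw_days = math.ceil(n_per_variant * variants / daily_visitors)
    return max(7, math.ceil(raw_days / 7) * 7)


# Hypothetical inputs: 5% baseline conversion, hoping to detect a 20% lift.
n = sample_size_per_variant(0.05, 0.20)
print("visitors per variant:", n)
print("days at 20,000 visitors/day:", days_needed(n, 20_000))
print("days at 200 visitors/day:", days_needed(n, 200))
```

Rounding up to whole weeks keeps the estimate aligned with the weekly-cycle requirement, so a high-traffic page still runs a full week while a low-traffic page may need months.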