Interestingly, Timsort used to be broken [1]. They found that by trying to prove correctness. I recall Stijn de Gouw remarking that they wanted to have their proof artefact evaluated, but you'd need a seriously beefy machine to do so.
Timsort was not broken, the implementations were, which is a slightly (but crucially) different issue: the implementations don’t / didn’t uphold the algorithm’s invariants at the upper edges.
Not only that, but the cpython’s implementation’s misbehaviour effectively couldn’t be triggered because you couldn’t create an array large enough to reach it.
[1] http://www.envisage-project.eu/proving-android-java-and-pyth...