Systematically generating tests that would have caught Anthropic's top-K bug

Summary

Most testing strategies miss rare edge cases until customers find them in production. Our system automatically generates targeted unit tests for rare bugs, including one that would have caught Anthropic's recent approximate top-K bug. In this blog post, we'll provide a brief overview of how it works.

Figure 1: Unit-level PBTs (property-based tests) are fast but miss edge cases. Proofs offer exhaustive coverage but require extensive reasoning and code refactoring. End-to-end PBTs have coverage but are not compute efficient. Fractional proofs sit at the intersection, using proof decomposition to generate targeted unit tests that balance compute efficiency, developer accuracy, and speed.

Catching the rare bug in top-K sampling

A bug in the TPU implementation of approximate top-K resulted in the most likely token sometimes being excluded. Rare bugs like this frequently slip through to production because covering every behavior with testing is infeasible in practice. After discovery, Anthropic provided a simple reproducer of the bug, but it is the sort of test you only manage to write after a laborious bug minimization process.

We used fractional proof decomposition to automatically generate the unit test without relying on Anthropic's bug reproducer code. You can run the unit test on Colab. When the tests for a piece of code are generated via fractional proof decomposition, bugs can be found systematically, without the benefit of hindsight.

    import jax.numpy as jnp
    from hypothesis import given, strategies as st
    from jax import lax

    # TOP_K_RANGE, MIN_TOP_K, and arr_strategy are not defined in this excerpt.
    @given(k=st.integers(min_value=0, max_value=TOP_K_RANGE), arr=arr_strategy)
    def test_approx_max_k(k, arr):
        N = len(arr)
        # Fold the drawn k into a range that is valid for this array length.
        k = int(k % min(N - MIN_TOP_K, TOP_K_RANGE)) + MIN_TOP_K
        approx_values, _ = lax.approx_max_k(arr, k=k)
        assert jnp.max(approx_values) == jnp.max(arr)

Figure 2: Top-K sampling should always have some chance of picking the most likely token. We encode this property with a PBT for max(approximate_top_k(arr, k=k)) == max(arr).
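The modulo step in the test remaps the drawn k into a range the sampled array can satisfy. A minimal sketch of that arithmetic in plain Python is shown here. The values given to MIN_TOP_K and TOP_K_RANGE and the helper name remap_k are assumptions for illustration, since the excerpt does not show how the constants are defined.

```python
# Hypothetical values; the post's actual constants are not shown in this excerpt.
MIN_TOP_K = 1
TOP_K_RANGE = 100


def remap_k(k: int, n: int) -> int:
    """Fold a drawn k into a range within [MIN_TOP_K, n - 1], mirroring the test."""
    if n <= MIN_TOP_K:
        # The modulus below would be zero or negative for such short arrays.
        raise ValueError("array must have more than MIN_TOP_K elements")
    modulus = min(n - MIN_TOP_K, TOP_K_RANGE)
    return k % modulus + MIN_TOP_K


# Every drawn k in [0, TOP_K_RANGE] lands on a k the array can satisfy.
for n in (2, 5, 50, 1000):
    remapped = [remap_k(k, n) for k in range(TOP_K_RANGE + 1)]
    assert MIN_TOP_K <= min(remapped) and max(remapped) <= n - 1
```

One consequence worth noting: the remapping only makes sense when the array has more than MIN_TOP_K elements, so the array strategy presumably enforces a minimum length.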
If the implementation of lax.approx_max_k is correct, we should expect the test to pass because the approx...
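The property in Figure 2 does not depend on JAX, so its logic can be sketched with the standard library alone. The following is an illustration under stated assumptions, not the post's generated test: exact_top_k stands in for a correct implementation, buggy_top_k is a hypothetical faulty one invented here that ignores the final element, and a seeded random loop stands in for Hypothesis's input generation.

```python
import heapq
import random


def exact_top_k(arr, k):
    """Reference top-k: the k largest values, in descending order."""
    return heapq.nlargest(k, arr)


def buggy_top_k(arr, k):
    """Hypothetical faulty top-k that never looks at the last element."""
    return heapq.nlargest(k, arr[:-1])


def find_counterexample(top_k_fn, trials=1000, seed=0):
    """Search random inputs for a case where max(top_k(arr, k)) != max(arr)."""
    rng = random.Random(seed)
    for _ in range(trials):
        n = rng.randint(2, 64)
        arr = [rng.random() for _ in range(n)]
        k = rng.randint(1, n - 1)
        if max(top_k_fn(arr, k)) != max(arr):
            return arr, k  # the property is violated on this input
    return None


# A correct top-k never violates the property.
assert find_counterexample(exact_top_k) is None
# The faulty version is caught whenever the maximum sits in the last slot.
assert find_counterexample(buggy_top_k) is not None
```

Uniform random search finds this particular fault quickly because it triggers often. The bug described in the post was rare, which is the situation targeted test generation is meant to address.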

First seen: 2026-01-14 11:09

Last seen: 2026-01-14 13:09