Systematically generating tests that would have caught Anthropic's top‑K bug


Summary

Most testing strategies miss rare edge cases until customers find them in production. Our system automatically generates targeted unit tests for rare bugs, including one that would have caught Anthropic’s recent approximate top-K bug. In this blog post, we’ll provide a brief overview of how it works.

Figure 1: Unit-level PBTs are fast but miss edge cases. Proofs offer exhaustive coverage but require extensive reasoning and code refactoring. End-to-end PBTs have coverage but are not compute efficient. Fractional proofs sit at the intersection, using proof decomposition to generate targeted unit tests that balance compute efficiency, developer accuracy, and speed.

Catching the rare bug in top-K sampling

A bug in the TPU implementation of approximate top-K resulted in the most likely token sometimes being excluded. Rare bugs like this frequently slip through to production because covering every behavior with tests is infeasible in practice. After discovering the bug, Anthropic provided a simple reproducer, but it is the sort of test you only manage to write after a laborious bug minimization process. We used fractional proof decomposition to generate the unit test automatically, without relying on Anthropic’s reproducer code. You can run the unit test on Colab. For any code tested via fractional proof decomposition, bugs can be found systematically, without the benefit of hindsight.

    @given(k=st.integers(min_value=0, max_value=TOP_K_RANGE), arr=arr_strategy)
    def test_approx_max_k(k, arr):
        N = len(arr)
        # Map the drawn k into a range that is valid for this array length.
        k = int(k % min(N - MIN_TOP_K, TOP_K_RANGE)) + MIN_TOP_K
        approx_values, _ = lax.approx_max_k(arr, k=k)
        # The exact maximum must appear among the approximate top-k values.
        assert jnp.max(approx_values) == jnp.max(arr)

Figure 2: Top-K sampling should always have some chance of picking the most likely token. We encode this property with a PBT (property-based test) for max(approximate_top_k(arr, k=k)) == max(arr).
If the implementation of lax.approx_max_k is correct, we should expect the test to pass because the approx...
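The property in Figure 2 can be illustrated without JAX or Hypothesis. The following is a minimal sketch, assuming a toy bucketed approximate top-K written for this example; bucketed_approx_top_k, buggy_approx_top_k, and check_max_is_kept are hypothetical names, and the injected bug is invented for illustration. It is not the TPU bug and not the test the authors generated.

```python
import random

NUM_BUCKETS = 8  # hypothetical bucket count for the toy algorithm


def bucketed_approx_top_k(values, k, num_buckets=NUM_BUCKETS):
    """Toy approximate top-k: keep each bucket's maximum, then the k largest."""
    buckets = [values[i::num_buckets] for i in range(num_buckets)]
    winners = [max(b) for b in buckets if b]
    return sorted(winners, reverse=True)[:k]


def buggy_approx_top_k(values, k, num_buckets=NUM_BUCKETS):
    """Same toy algorithm with an injected bug: the last bucket is ignored."""
    buckets = [values[i::num_buckets] for i in range(num_buckets)]
    winners = [max(b) for b in buckets[:-1] if b]
    return sorted(winners, reverse=True)[:k]


def check_max_is_kept(top_k_fn, trials=500, seed=0):
    """Random search for an input where the true maximum is dropped.

    Returns a counterexample (values, k), or None if every trial passes.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        n = rng.randint(2, 64)
        values = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        k = rng.randint(1, n)
        if max(top_k_fn(values, k)) != max(values):
            return values, k
    return None


if __name__ == "__main__":
    print("correct:", check_max_is_kept(bucketed_approx_top_k))
    print("buggy counterexample found:",
          check_max_is_kept(buggy_approx_top_k) is not None)
```

In the correct toy version the global maximum is also the maximum of its own bucket, so it always survives the final selection. The injected bug breaks the property only when the maximum falls in the ignored bucket, which is the kind of input-dependent failure a property-based test is meant to surface.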

First seen: 2026-01-14 11:09

Last seen: 2026-01-14 13:09

