Inducing self-NSFW classification in image models to prevent deepfake edits

https://news.ycombinator.com/rss Hits: 1
Summary

Hey guys, I was playing around with adversarial perturbations on image generation to see how much distortion it actually takes to stop models from generating or to push them off-target. That mostly went nowhere, which wasn't surprising.

Then I tried something a bit weirder: instead of fighting the model, I tried pushing it to classify uploaded images itself as NSFW, so it ends up triggering its own guardrails. This turned out to be more interesting than expected. It's inconsistent and definitely not robust, but in some cases relatively mild transformations are enough to flip the model's internal safety classification on otherwise benign images.

This isn't about bypassing safeguards; if anything, it's the opposite. The idea is to intentionally stress the safety layer itself. I'm planning to open-source this as a small tool + UI once I can make the behavior more stable and reproducible, mainly as a way to probe and pre-filter moderation pipelines. If it works reliably, even partially, it could at least raise the cost for people who get their kicks from abusing these systems.
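For context, one way "pushing the model to classify an uploaded image as NSFW" could be attempted is a targeted PGD-style perturbation against a differentiable safety classifier standing in for the model's internal safety layer. The sketch below is illustrative only and not the author's actual tool; the classifier, class index, and perturbation budget are assumptions.

# Minimal targeted-PGD sketch (PyTorch): nudge a benign image toward the
# "NSFW" class of a hypothetical stand-in safety classifier.
import torch
import torch.nn.functional as F

def nsfw_flip_pgd(image, classifier, nsfw_class=1, eps=4/255, alpha=1/255, steps=40):
    # image:      float tensor in [0, 1], shape (1, 3, H, W)
    # classifier: callable returning class logits, shape (1, num_classes)
    # eps/alpha:  L-inf budget and step size (assumed values)
    perturbed = image.clone().detach()
    for _ in range(steps):
        perturbed.requires_grad_(True)
        logits = classifier(perturbed)
        # Minimizing cross-entropy to the NSFW class pushes the prediction toward it.
        loss = F.cross_entropy(logits, torch.tensor([nsfw_class]))
        grad, = torch.autograd.grad(loss, perturbed)
        with torch.no_grad():
            perturbed = perturbed - alpha * grad.sign()                 # step toward NSFW class
            perturbed = image + (perturbed - image).clamp(-eps, eps)    # stay within the eps-ball
            perturbed = perturbed.clamp(0.0, 1.0)                       # keep a valid image
    return perturbed.detach()

In practice the interesting part is transferability: a perturbation crafted against a local open classifier would have to carry over to whatever internal safety check the hosted image model runs, which is exactly where the post notes the behavior is inconsistent and not robust.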

First seen: 2026-01-05 09:23

Last seen: 2026-01-05 09:23