papers in adversarial machine learning — adversarial training

Adversarial training: attacking your own model as a defense

Posted by Dillon Niederhut on

A critical factor in AI safety is robustness to unusual inputs. Without it, models (like ChatGPT) can be tricked into producing dangerous outputs. One defense is adversarial training: generating adversarial attacks against the model inside its own training loop and then learning from the attacked examples. As a side benefit, this tends to align the model's learned features with human expectations.
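As a rough sketch of the idea (not the exact recipe from any particular paper), the loop below perturbs each batch with a fast gradient sign method (FGSM) attack against the current model and then takes a training step on the perturbed inputs. It assumes a PyTorch image classifier; `model`, `loader`, `optimizer`, and `epsilon` are placeholders you would supply.

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, x, y, epsilon):
    """Craft an FGSM adversarial example by stepping along the sign of the input gradient."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    # Move each input element by epsilon in the direction that increases the loss.
    return (x_adv + epsilon * x_adv.grad.sign()).detach()

def adversarial_training_epoch(model, loader, optimizer, epsilon=0.03):
    """One epoch of adversarial training: attack the current model, then train on the attacked batch."""
    model.train()
    for x, y in loader:
        x_adv = fgsm_perturb(model, x, y, epsilon)
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x_adv), y)
        loss.backward()
        optimizer.step()
```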

Read more →