Prompt-Driven LLM Safeguarding via Directed Representation Optimization

Chujie Zheng, Fan Yin, Hao Zhou, Fandong Meng, Jie Zhou, Kai-Wei Chang, Minlie Huang, and Nanyun Peng, in Proceedings of the Forty-first International Conference on Machine Learning (ICML), 2024.

Bib Entry

@inproceedings{zheng2024dro,
  title = {Prompt-Driven LLM Safeguarding via Directed Representation Optimization},
  author = {Zheng, Chujie and Yin, Fan and Zhou, Hao and Meng, Fandong and Zhou, Jie and Chang, Kai-Wei and Huang, Minlie and Peng, Nanyun},
  booktitle = {Proceedings of the Forty-first International Conference on Machine Learning (ICML)},
  year = {2024}
}

Related Publications

  • On the Safety of Conversational Models: Taxonomy, Dataset, and Benchmark

    Hao Sun, Guangxuan Xu, Jiawen Deng, Jiale Cheng, Chujie Zheng, Hao Zhou, Nanyun Peng, Xiaoyan Zhu, and Minlie Huang, in Findings of the Conference of the 60th Annual Meeting of the Association for Computational Linguistics (ACL-findings), 2022.
    Dialogue safety problems severely limit the real-world deployment of neural conversational models and have attracted great research interests recently. However, dialogue safety problems remain under-defined and the corresponding dataset is scarce. We propose a taxonomy for dialogue safety specifically designed to capture unsafe behaviors in human-bot dialogue settings, with focuses on context-sensitive unsafety, which is under-explored in prior works. To spur research in this direction, we compile DiaSafety, a dataset with rich context-sensitive unsafe examples. Experiments show that existing safety guarding tools fail severely on our dataset. As a remedy, we train a dialogue safety classifier to provide a strong baseline for context-sensitive dialogue unsafety detection. With our classifier, we perform safety evaluations on popular conversational models and show that existing dialogue systems still exhibit concerning context-sensitive safety problems.
    @inproceedings{sun2022safe,
      title = {On the Safety of Conversational Models: Taxonomy, Dataset, and Benchmark},
      author = {Sun, Hao and Xu, Guangxuan and Deng, Jiawen and Cheng, Jiale and Zheng, Chujie and Zhou, Hao and Peng, Nanyun and Zhu, Xiaoyan and Huang, Minlie},
      booktitle = {Findings of the Conference of the 60th Annual Meeting of the Association for Computational Linguistics (ACL-findings)},
      year = {2022}
    }
    