Plot-guided Adversarial Example Construction for Evaluating Open-domain Story Generation

Sarik Ghazarian, Zixi Liu, Akash S. M, Ralph Weischedel, Aram Galstyan, and Nanyun Peng, in The 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2021.

Abstract

With the recent advances of open-domain story generation models, the lack of reliable automatic evaluation metrics becomes an increasingly imperative issue that hinders the development of such models. A critical bottleneck in obtaining a trustworthy learnable evaluation metric is the lack of high-quality training data for learning classifiers to efficiently distinguish between plausible and implausible machine-generated stories. Previous works relied on heuristically manipulating plausible examples to mimic possible system drawbacks such as repetition, contradiction, or irrelevant content at the text level, which can be unnatural and oversimplify the characteristics of implausible machine-generated stories. We propose to tackle these issues by generating a more comprehensive set of implausible stories using plots, which are structured representations of controllable factors used to generate stories. Since these plots are compact and structured, it is easier to manipulate them to generate text with targeted undesirable properties, while at the same time maintaining the naturalness of the generation. To improve the quality of incoherent stories, we further apply the adversarial filtering procedure to select a more nuanced set of implausible texts. We find that the evaluation metrics trained on our generated data result in more reliable automatic assessments that correlate remarkably better with human judgments than other baselines.

In our first paper in the title "Plot-guided Adversarial Example Construction for Evaluating Open-domain Story Generation", we tried to achieve a more accurate story plausibility evaluator by proposing a more comprehensive set of incoherent stories based on plot manipulations.
— Sarik (@Sarikgha) March 19, 2021

Bib Entry

@inproceedings{ghazarian2021plot,
  title = {Plot-guided Adversarial Example Construction for Evaluating Open-domain Story Generation},
  author = {Ghazarian, Sarik and Liu, Zixi and M, Akash S and Weischedel, Ralph and Galstyan, Aram and Peng, Nanyun},
  booktitle = {The 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)},
  publisher = {Association for Computational Linguistics},
  pages = {4334--4344},
  year = {2021}
}

Related Publications

Open-Domain Text Evaluation via Contrastive Distribution Methods

Sidi Lu, Hongyi Liu, Asli Celikyilmaz, Tianlu Wang, and Nanyun Peng, in Proceedings of the Fortieth International Conference on Machine Learning (ICML), 2024.

@inproceedings{lu2024cdm,
  title = {Open-Domain Text Evaluation via Contrastive Distribution Methods},
  author = {Lu, Sidi and Liu, Hongyi and Celikyilmaz, Asli and Wang, Tianlu and Peng, Nanyun},
  booktitle = {Proceedings of the Fortieth International Conference on Machine Learning (ICML)},
  year = {2024}
}

ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models

Rohan Wadhawan, Hritik Bansal, Kai-Wei Chang, and Nanyun Peng, in Proceedings of the Fortieth International Conference on Machine Learning (ICML), 2024.

@inproceedings{wadhawan2024contextual,
  title = {ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models},
  author = {Wadhawan, Rohan and Bansal, Hritik and Chang, Kai-Wei and Peng, Nanyun},
  booktitle = {Proceedings of the Fortieth International Conference on Machine Learning (ICML)},
  year = {2024}
}

AMRFact: Enhancing Summarization Factuality Evaluation with AMR-Driven Negative Samples Generation

Haoyi Qiu, Kung-Hsiang Huang, Jingnong Qu, and Nanyun Peng, in Proceedings of the 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2024.

@inproceedings{qiu2024amrfact,
  title = {AMRFact: Enhancing Summarization Factuality Evaluation with AMR-Driven Negative Samples Generation},
  author = {Qiu, Haoyi and Huang, Kung-Hsiang and Qu, Jingnong and Peng, Nanyun},
  booktitle = {Proceedings of the 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)},
  year = {2024}
}

ACCENT: An Automatic Event Commonsense Evaluation Metric for Open-Domain Dialogue Systems

Sarik Ghazarian*, Yijia Shao*, Rujun Han, Aram Galstyan, and Nanyun Peng, in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), 2023.

@inproceedings{ghazarian2023accent,
  title = {ACCENT: An Automatic Event Commonsense Evaluation Metric for Open-Domain Dialogue Systems},
  author = {Ghazarian*, Sarik and Shao*, Yijia and Han, Rujun and Galstyan, Aram and Peng, Nanyun},
  booktitle = {Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL)},
  year = {2023}
}

EnDex: Evaluation of Dialogue Engagingness at Scale

Guangxuan Xu, Nischal Reddy Chandra, Ruibo Liu, Fabrice Harel-Canada, and Nanyun Peng, in Findings of the Association for Computational Linguistics: EMNLP (EMNLP-findings), 2022.

@inproceedings{xu2022endex,
  title = {EnDex: Evaluation of Dialogue Engagingness at Scale},
  author = {Xu, Guangxuan and Chandra, Nischal Reddy and Liu, Ruibo and Harel-Canada, Fabrice and Peng, Nanyun},
  booktitle = {Findings of the Association for Computational Linguistics: EMNLP (EMNLP-findings)},
  year = {2022}
}

DEAM: Dialogue Coherence Evaluation using AMR-based Semantic Manipulations

Sarik Ghazarian, Nuan Wen, Aram Galstyan, and Nanyun Peng, in Proceedings of the Conference of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), 2022.

Automatic evaluation metrics are essential for the rapid development of open-domain dialogue systems as they facilitate hyper-parameter tuning and comparison between models. Although recently proposed trainable conversation-level metrics have shown encouraging results, the quality of the metrics is strongly dependent on the quality of training data. Prior works mainly resort to heuristic text-level manipulations (e.g. utterance shuffling) to bootstrap incoherent conversations (negative examples) from coherent dialogues (positive examples). Such approaches are insufficient to appropriately reflect the incoherence that occurs in interactions between advanced dialogue models and humans. To tackle this problem, we propose DEAM, a Dialogue coherence Evaluation metric that relies on Abstract Meaning Representation (AMR) to apply semantic-level Manipulations for incoherent (negative) data generation. AMRs naturally facilitate the injection of various types of incoherence sources, such as coreference inconsistency, irrelevancy, contradictions, and decreased engagement, at the semantic level, thus resulting in more natural incoherent samples. Our experiments show that DEAM achieves higher correlations with human judgments compared to baseline methods on several dialog datasets by significant margins. We also show that DEAM can distinguish between coherent and incoherent dialogues generated by baseline manipulations, whereas those baseline models cannot detect incoherent examples generated by DEAM. Our results demonstrate the potential of AMR-based semantic manipulations for natural negative example generation.

@inproceedings{ghazarian2022deam,
  title = {DEAM: Dialogue Coherence Evaluation using AMR-based Semantic Manipulations},
  author = {Ghazarian, Sarik and Wen, Nuan and Galstyan, Aram and Peng, Nanyun},
  booktitle = {Proceedings of the Conference of the 60th Annual Meeting of the Association for Computational Linguistics (ACL)},
  year = {2022}
}

Predictive Engagement: An Efficient Metric For Automatic Evaluation of Open-Domain Dialogue Systems

Sarik Ghazarian, Ralph Weischedel, Aram Galstyan, and Nanyun Peng, in The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20), 2020.

User engagement is a critical metric for evaluating the quality of open-domain dialogue systems. Prior work has focused on conversation-level engagement by using heuristically constructed features such as the number of turns and the total time of the conversation. In this paper, we investigate the possibility and efficacy of estimating utterance-level engagement and define a novel metric, predictive engagement, for automatic evaluation of open-domain dialogue systems. Our experiments demonstrate that (1) human annotators have high agreement on assessing utterance-level engagement scores; and (2) conversation-level engagement scores can be predicted from properly aggregated utterance-level engagement scores. Furthermore, we show that the utterance-level engagement scores can be learned from data. These scores can be incorporated into automatic evaluation metrics for open-domain dialogue systems to improve the correlation with human judgements. This suggests that predictive engagement can be used as real-time feedback for training better dialogue models.

@inproceedings{ghazarian2020predictive,
  title = {Predictive Engagement: An Efficient Metric For Automatic Evaluation of Open-Domain Dialogue Systems},
  author = {Ghazarian, Sarik and Weischedel, Ralph and Galstyan, Aram and Peng, Nanyun},
  booktitle = {The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)},
  pages = {7789--7796},
  year = {2020}
}

Better Automatic Evaluation of Open-Domain Dialogue Systems with Contextualized Embeddings

Sarik Ghazarian, Johnny Tian-Zheng Wei, Aram Galstyan, and Nanyun Peng, in 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2019), NeuralGen Workshop, 2019.

@inproceedings{ghazarian2019better,
  title = {Better Automatic Evaluation of Open-Domain Dialogue Systems with Contextualized Embeddings},
  author = {Ghazarian, Sarik and Wei, Johnny Tian-Zheng and Galstyan, Aram and Peng, Nanyun},
  booktitle = {2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2019), NeuralGen Workshop},
  year = {2019}
}

Evaluating and Enhancing the Robustness of Retrieval-Based Dialogue Systems with Adversarial Examples

Jia Li, Chongyang Tao, Nanyun Peng, Wei Wu, Dongyan Zhao, and Rui Yan, in CCF International Conference on Natural Language Processing and Chinese Computing, 2019.

@inproceedings{li2019evaluating,
  title = {Evaluating and Enhancing the Robustness of Retrieval-Based Dialogue Systems with Adversarial Examples},
  author = {Li, Jia and Tao, Chongyang and Peng, Nanyun and Wu, Wei and Zhao, Dongyan and Yan, Rui},
  booktitle = {CCF International Conference on Natural Language Processing and Chinese Computing},
  pages = {142--154},
  year = {2019},
  organization = {Springer}
}
