
Plot-guided Adversarial Example Construction for Evaluating Open-domain Story Generation

Sarik Ghazarian, Zixi Liu, Akash S. M, Ralph Weischedel, Aram Galstyan, and Nanyun Peng, in The 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2021.



Abstract

With the recent advances of open-domain story generation models, the lack of reliable automatic evaluation metrics becomes an increasingly imperative issue that hinders the development of such models. A critical bottleneck of obtaining a trustworthy learnable evaluation metric is the lack of high-quality training data for learning classifiers to efficiently distinguish between plausible and implausible machine-generated stories. Previous works relied on heuristically manipulating plausible examples to mimic possible system drawbacks such as repetition, contradiction, or irrelevant content at the text level, which can be unnatural and oversimplify the characteristics of implausible machine-generated stories. We propose to tackle these issues by generating a more comprehensive set of implausible stories using plots, which are structured representations of controllable factors used to generate stories. Since these plots are compact and structured, it is easier to manipulate them to generate text with targeted undesirable properties, while at the same time maintaining the naturalness of the generation. To improve the quality of incoherent stories, we further apply the adversarial filtering procedure to select a more nuanced set of implausible texts. We find that the evaluation metrics trained on our generated data result in more reliable automatic assessments that correlate remarkably better with human judgments than other baselines.



Bib Entry

@inproceedings{ghazarian2021plot,
  title = {Plot-guided Adversarial Example Construction for Evaluating Open-domain Story Generation},
  author = {Ghazarian, Sarik and Liu, Zixi and M, Akash S and Weischedel, Ralph and Galstyan, Aram and Peng, Nanyun},
  booktitle = {The 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)},
  publisher = {Association for Computational Linguistics},
  pages = {4334--4344},
  year = {2021}
}

Related Publications

  • DEAM: Dialogue Coherence Evaluation using AMR-based Semantic Manipulations

    Sarik Ghazarian, Nuan Wen, Aram Galstyan, and Nanyun Peng, in Proceedings of the Conference of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), 2022.
    Automatic evaluation metrics are essential for the rapid development of open-domain dialogue systems as they facilitate hyper-parameter tuning and comparison between models. Although recently proposed trainable conversation-level metrics have shown encouraging results, the quality of the metrics is strongly dependent on the quality of training data. Prior works mainly resort to heuristic text-level manipulations (e.g. utterance shuffling) to bootstrap incoherent conversations (negative examples) from coherent dialogues (positive examples). Such approaches are insufficient to appropriately reflect the incoherence that occurs in interactions between advanced dialogue models and humans. To tackle this problem, we propose DEAM, a Dialogue coherence Evaluation metric that relies on Abstract Meaning Representation (AMR) to apply semantic-level Manipulations for incoherent (negative) data generation. AMRs naturally facilitate the injection of various types of incoherence sources, such as coreference inconsistency, irrelevancy, contradictions, and decreased engagement, at the semantic level, thus resulting in more natural incoherent samples. Our experiments show that DEAM achieves higher correlations with human judgments compared to baseline methods on several dialog datasets by significant margins. We also show that DEAM can distinguish between coherent and incoherent dialogues generated by baseline manipulations, whereas those baseline models cannot detect incoherent examples generated by DEAM. Our results demonstrate the potential of AMR-based semantic manipulations for natural negative example generation.
    @inproceedings{ghazarian2022deam,
      title = {DEAM: Dialogue Coherence Evaluation using AMR-based Semantic Manipulations},
      author = {Ghazarian, Sarik and Wen, Nuan and Galstyan, Aram and Peng, Nanyun},
      booktitle = {Proceedings of the Conference of the 60th Annual Meeting of the Association for Computational Linguistics (ACL)},
      year = {2022}
    }
    
  • Predictive Engagement: An Efficient Metric For Automatic Evaluation of Open-Domain Dialogue Systems

    Sarik Ghazarian, Ralph Weischedel, Aram Galstyan, and Nanyun Peng, in The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20), 2020.
    User engagement is a critical metric for evaluating the quality of open-domain dialogue systems. Prior work has focused on conversation-level engagement by using heuristically constructed features such as the number of turns and the total time of the conversation. In this paper, we investigate the possibility and efficacy of estimating utterance-level engagement and define a novel metric, predictive engagement, for automatic evaluation of open-domain dialogue systems. Our experiments demonstrate that (1) human annotators have high agreement on assessing utterance-level engagement scores; (2) conversation-level engagement scores can be predicted from properly aggregated utterance-level engagement scores. Furthermore, we show that the utterance-level engagement scores can be learned from data. These scores can be incorporated into automatic evaluation metrics for open-domain dialogue systems to improve the correlation with human judgements. This suggests that predictive engagement can be used as real-time feedback for training better dialogue models.
    @inproceedings{ghazarian2020predictive,
      title = {Predictive Engagement: An Efficient Metric For Automatic Evaluation of Open-Domain Dialogue Systems},
      author = {Ghazarian, Sarik and Weischedel, Ralph and Galstyan, Aram and Peng, Nanyun},
      booktitle = {The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)},
      pages = {7789--7796},
      year = {2020}
    }
    
  • Better Automatic Evaluation of Open-Domain Dialogue Systems with Contextualized Embeddings

    Sarik Ghazarian, Johnny Tian-Zheng Wei, Aram Galstyan, and Nanyun Peng, in 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2019), NeuralGen Workshop, 2019.
    @inproceedings{ghazarian2019better,
      title = {Better Automatic Evaluation of Open-Domain Dialogue Systems with Contextualized Embeddings},
      author = {Ghazarian, Sarik and Wei, Johnny Tian-Zheng and Galstyan, Aram and Peng, Nanyun},
      booktitle = {2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2019), NeuralGen Workshop},
      year = {2019}
    }
    
  • Evaluating and Enhancing the Robustness of Retrieval-Based Dialogue Systems with Adversarial Examples

    Jia Li, Chongyang Tao, Nanyun Peng, Wei Wu, Dongyan Zhao, and Rui Yan, in CCF International Conference on Natural Language Processing and Chinese Computing, 2019.
    @inproceedings{li2019evaluating,
      title = {Evaluating and Enhancing the Robustness of Retrieval-Based Dialogue Systems with Adversarial Examples},
      author = {Li, Jia and Tao, Chongyang and Peng, Nanyun and Wu, Wei and Zhao, Dongyan and Yan, Rui},
      booktitle = {CCF International Conference on Natural Language Processing and Chinese Computing},
      pages = {142--154},
      year = {2019},
      organization = {Springer}
    }
    