Using Generative AI to Reconstruct Healthcare Simulation Models


Related literature#

Generative AI, LLMs, and Chatbot AI#

Before reviewing relevant generative AI research for simulation, we briefly define generative AI and describe popular LLMs and human interaction with them via Chatbot AI tools. Table 1 summarises the key concepts.

Table 1 Key Concepts in Generative AI#

  • Generative AI: AI models designed to create novel digital content such as text, images, music, or code.
  • Large Language Models (LLMs): A subset of generative AI specializing in processing and generating human-like text.
  • Transformer Architecture: Neural network design using self-attention mechanisms to process and generate text.
  • Zero-Shot Learning: The ability of a model to perform tasks or make predictions on categories it hasn’t explicitly seen during training.
  • Model Scaling: The process of increasing model size (number of parameters) to improve performance and capabilities.
  • Hallucination: The tendency of LLMs to generate plausible-sounding but factually incorrect or logically flawed content.
  • Data Contamination: The overlap of training data with test data, potentially leading to overestimated model performance.
  • Temperature: A parameter controlling the randomness and creativity in LLM outputs.
  • Prompt Engineering: The process of crafting effective inputs to elicit desired outputs from LLMs.
  • Chatbot AI: AI-powered conversational interfaces that use LLMs to understand and generate human-like responses in real-time interactions.
  • Context Window: The amount of previous conversation an LLM can consider when generating responses.
  • RLHF (Reinforcement Learning from Human Feedback): A technique used to fine-tune LLMs based on human ratings of model outputs.
  • Alignment Problem: The challenge of ensuring AI outputs align with human values and intentions.

Generating novel content using LLMs#

Traditional Machine Learning (ML) paradigms, such as classification, train a model to learn patterns within historical labelled data in order to classify new, unseen instances. For example, classifying whether a brain scan indicates Parkinson’s Disease or is healthy. Generative AI models are trained on unlabelled data and, rather than predict or classify, their aim is to create novel digital content such as text, images, music, or code. For example, the generation of a simple simulation model in Python code [Jackson et al., 2024]. LLMs are a subset of generative AI that specialize in natural language communication between humans and computers. The Generative Pre-trained Transformer (GPT) architecture, which underpins AI Chatbot tools like ChatGPT, is perhaps the best-known example of an LLM. GPT models are built on transformer-based neural network architectures, which use self-attention mechanisms to process and generate text [Brown et al., 2020, Vaswani et al., 2023]. In simple terms, GPT models are sequence predictors, trained to predict the next token (e.g. a word) in a sequence based on the context of previous tokens.
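
To make the idea of next-token prediction concrete, the toy sketch below is our own illustration (it is not drawn from any of the cited studies): a hand-coded probability table stands in for the distribution a trained transformer would produce, and text is generated autoregressively, one token at a time.

```python
import random

# Toy next-token probabilities. Hand-coded for illustration only; a real LLM
# learns these from vast text corpora using a transformer network and
# conditions on the whole preceding context, not just the last token.
NEXT_TOKEN_PROBS = {
    "patients": {"arrive": 0.7, "wait": 0.3},
    "arrive": {"at": 0.8, "randomly": 0.2},
    "at": {"the": 1.0},
    "the": {"clinic": 0.6, "ward": 0.4},
}

def generate(prompt_token, max_tokens=4, seed=42):
    """Sample tokens one at a time, each conditioned on the token before it."""
    rng = random.Random(seed)
    sequence = [prompt_token]
    for _ in range(max_tokens):
        probs = NEXT_TOKEN_PROBS.get(sequence[-1])
        if probs is None:
            break  # no known continuation for this token
        tokens, weights = zip(*probs.items())
        sequence.append(rng.choices(tokens, weights=weights)[0])
    return " ".join(sequence)

print(generate("patients"))  # e.g. "patients arrive at the clinic"
```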

Zero-shot learning and model scaling#

A key advancement that distinguishes LLMs from traditional ML approaches is their capacity for zero-shot learning - the ability to perform tasks on previously unseen categories without explicit training [Brown et al., 2020]. This capability enables LLMs to adapt to novel contexts and tasks, such as generating code based on user specifications, without additional training. The evolution of zero-shot learning has been closely tied to the increasing scale of language models. When GPT-1 was introduced in 2018, it contained 117 million parameters [Radford and Narasimhan, 2018]. Subsequent iterations have seen substantial growth in model size, with GPT-3 including 175 billion parameters [Brown et al., 2020]. The exact specifications of GPT-4 have not been officially confirmed by OpenAI, but it is speculated to contain up to a trillion parameters [Giabbanelli, 2024].

Challenges and limitations: data contamination and hallucination#

Evaluating the zero-shot capabilities of LLMs is challenging due to the potential contamination of test data [Xu et al., 2024]. The concept of contamination is analogous to leakage in traditional supervised machine learning [Kaufman et al., 2012]: the training data overlaps with the test data, accuracy measures are overstated, and the model is simply outputting data it has memorised during training. In the case of LLMs, it is difficult to determine whether the training data overlaps with the test data, and careful evaluations must be designed.

A key challenge in the use of LLMs is mitigating the risk of hallucination. LLMs are sequence prediction models that prioritize generating the most probable next word in a sequence, even if it is inaccurate. Simply put, given an input, a model will always produce an output, whether it is correct or not. As a result, an LLM may “hallucinate”: confidently present content that is factually incorrect, logically flawed, or at odds with the provided training data [Huang et al., 2023, Ji et al., 2023].

For example, an LLM might generate plausible-sounding but fabricated references in an academic essay, or produce code that appears functional but contains logical errors. These errors may go unnoticed by users and can have consequences that range from minor (e.g. time wasted debugging nonsensical code) to severe (e.g. incorrect decisions based on the results of a flawed simulation model). The causes of hallucination are complex and varied. In coding, for instance, it might stem from pre-training the LLM on code that contains both obvious and subtle bugs.

Hallucination is a major limitation of generative AI and hence is an active area of research [Ji et al., 2023]. Promising approaches include variations on the theme of iterative retrieval of information [Khot et al., 2023, Yao et al., 2023], which can involve refining outputs over multiple iterations, each providing more context or fact-checking. Another approach is to estimate model uncertainty statistics that can highlight LLM knowledge deficiencies [Farquhar et al., 2024]. For the immediate future it seems likely that hallucination will remain a major challenge for the safe and productive use of generative AI, with some arguing it cannot be fully eliminated [Xu et al., 2024]. As such, it is crucial to incorporate some form of fact-checking or testing mechanism in any work that relies on content generated by an LLM.
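
As a simple illustration of what such a testing mechanism might look like in a simulation context, the sketch below is a hypothetical example of our own (the function, its parameters, and the tolerances are assumptions, not taken from the cited studies): it treats an LLM-generated sampling routine as untrusted and checks its output against the properties that were actually requested.

```python
import numpy as np

def sample_length_of_stay(mean=5.0, std=2.0, size=10_000, seed=0):
    """Hypothetical LLM-generated function: lognormal length-of-stay samples
    with a requested mean and standard deviation on the natural scale."""
    rng = np.random.default_rng(seed)
    # convert the requested mean/std to the underlying normal parameters
    phi = np.sqrt(std ** 2 + mean ** 2)
    mu = np.log(mean ** 2 / phi)
    sigma = np.sqrt(np.log(phi ** 2 / mean ** 2))
    return rng.lognormal(mu, sigma, size)

def test_sampling_matches_specification():
    """A simple automated check: do the samples have the properties requested?"""
    samples = sample_length_of_stay(mean=5.0, std=2.0)
    assert np.all(samples > 0), "lengths of stay must be positive"
    assert abs(samples.mean() - 5.0) < 0.1, "sample mean far from specification"
    assert abs(samples.std() - 2.0) < 0.1, "sample std far from specification"

test_sampling_matches_specification()
print("All checks passed.")
```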

Randomness and prompt engineering#

LLMs include an element of randomness in the generation of responses. This randomness is typically controlled by a “temperature” parameter, where higher values increase variability in outputs (and can increase the risk of hallucination), while lower values produce more deterministic results. The use of randomness allows LLMs to generate diverse and creative solutions, but it also means that, given the same prompt, an LLM may produce different code outputs across multiple runs. This variability poses challenges for reproducibility in contexts such as code generation for simulation models, where consistent and replicable results are important. By default, Chatbot AI tools may not offer direct user control over temperature.
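
The effect of temperature is easy to demonstrate numerically. The sketch below is illustrative only (the candidate tokens and logits are made up, not taken from a real model): logits are divided by the temperature before the softmax, so low temperatures concentrate probability on the most likely token while high temperatures flatten the distribution.

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Convert raw model scores (logits) to token probabilities at a given temperature."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()          # subtract max for numerical stability
    exps = np.exp(scaled)
    return exps / exps.sum()

# Hypothetical logits for four candidate next tokens
tokens = ["simpy", "numpy", "random", "pandas"]
logits = [3.0, 1.5, 1.0, 0.2]

for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}:", dict(zip(tokens, probs.round(3))))
```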

Given the randomness used in generative AI, and an LLM’s tendency to hallucinate, another important concept to define is the formation of prompts. This has given rise to the discipline of prompt engineering: the process of writing a prompt that results in the most effective LLM performance [Liu et al., 2021]. This is a very recent area of research and there is not yet a consensus on the most effective approaches, although various patterns are available [White et al., 2023]. For example, one-shot or few-shot learning, where the prompt includes one or more simple examples of the task to clarify the context for the LLM.
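
To illustrate the one-shot pattern, the prompt below is a hypothetical example written for this review (it is not one of the prompts used in this study): a single worked example precedes the actual request, signalling the expected style and format of the answer.

```python
# A hypothetical one-shot prompt: one worked example precedes the real request.
one_shot_prompt = """
You are a simulation modeller writing Python with simpy.

Example task: "Model patient arrivals as a Poisson process with a mean
inter-arrival time of 10 minutes."
Example answer:
    import simpy, random
    def arrivals(env, mean_iat=10):
        while True:
            yield env.timeout(random.expovariate(1 / mean_iat))
            print(f"arrival at {env.now:.1f}")

Now complete this task: "Model elective patient arrivals that occur at a
fixed rate of one every 60 minutes."
"""
print(one_shot_prompt)
```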

AI Chatbots and alignment#

Since 2022, and at the time of writing, wide-scale public access to LLMs has been made possible by general purpose Chatbot AI tools such as ChatGPT, Perplexity.AI, and Google’s Gemini. The underlying LLMs are trained on large amounts of curated web data (including code from sources such as StackOverFlow and GitHub) and fine-tuned for chat-based human interaction. In general, the tools have been shown to understand and generate human-like text (and code) across a wide range of tasks. The overall architecture and training of these models is complex and is not fully known given the commercial nature of the companies that create and operate them (at huge cost). As a general rule, however, LLMs such as GPT-3.5 or GPT-4 are not used as-is; instead, the models are combined with reinforcement learning from human feedback (RLHF), where a workforce reviews and rates responses output by the model [Casper et al., 2023]. RLHF aims to help Chatbot AI tools align responses with human values and the intentions of users’ prompts (the so-called alignment problem). This process attempts to filter out inappropriate or offensive content while enhancing the models’ ability to provide a relevant response.

Human interaction with these models is via a user-friendly chat interface. The underpinning LLM in use varies by free and paid tiers (e.g. at the time of writing ChatGPT offers a free GPT-3.5 tier or a paid GPT-4 tier). While the LLM architectures have no memory of prior prompts, a chatbot AI tool maintains a context window, allowing a user to interact iteratively with an LLM within a larger history/context of prompts and responses. There are size restrictions on these context windows that vary with each chatbot AI tool and underlying model.

Generative AI and computer simulation#

Automated code generation#

Recent research has begun to investigate hybrid modelling where generative AI is combined with computer simulation. Several pioneering studies have examined small-scale applications and conceptual frameworks [Giabbanelli, 2024, Akhavan and Jalali, 2024, Jackson et al., 2024, Plooy and Oosthuizen, 2023, Shrestha et al., 2022]. These studies have spanned discrete-event simulation, system dynamics, conceptual modelling, and model documentation, and demonstrate the broad potential of generative AI for computer simulation.

Jackson et al. [2024] explored the potential of using GPT-based models to produce simulation models for inventory and process control in logistics systems. Their research focused on the concept of an “NLP shortcut,” which aims to bypass traditional conceptual modelling and coding steps for discrete-event simulation. The study used the OpenAI Davinci Codex (a code-based API to the GPT-3 model) to successfully generate simple Python-based simulations of logistics systems (e.g. a single-product inventory-control system). The LLM outputs consist of 20-30 lines of Python code implementing simple DES model logic and code to plot model output. Use of the Codex was incorporated into a framework that included dynamic execution of the generated code and review by a human expert.
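
To give a sense of the scale of model involved, the sketch below is roughly the size reported in the study: a minimal SimPy single-server queue in around 25 lines of Python. It is our own illustrative code, not the code generated in Jackson et al. [2024], and the parameter values are arbitrary.

```python
import simpy
import random

MEAN_IAT = 5.0       # mean inter-arrival time (minutes); arbitrary illustrative value
MEAN_SERVICE = 4.0   # mean service time (minutes); arbitrary illustrative value
waiting_times = []

def customer(env, server):
    """A customer queues for the single server and is then served."""
    arrival_time = env.now
    with server.request() as request:
        yield request
        waiting_times.append(env.now - arrival_time)
        yield env.timeout(random.expovariate(1 / MEAN_SERVICE))

def arrivals(env, server):
    """Generate customers with exponential inter-arrival times."""
    while True:
        yield env.timeout(random.expovariate(1 / MEAN_IAT))
        env.process(customer(env, server))

random.seed(42)
env = simpy.Environment()
server = simpy.Resource(env, capacity=1)
env.process(arrivals(env, server))
env.run(until=8 * 60)   # simulate an eight-hour day

print(f"customers served: {len(waiting_times)}")
print(f"mean wait: {sum(waiting_times) / len(waiting_times):.2f} minutes")
```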

Akhavan and Jalali [2024] and Plooy and Oosthuizen [2023] investigated the application of ChatGPT in System Dynamics modelling. Both studies take the position that generative AI should not replace a modeller but rather serve as a tool to facilitate the research process, review content, and enhance idea implementation in simulation modelling. Akhavan and Jalali [2024] develop a simple System Dynamics model of Covid-19’s impact on economic growth. Their approach first prompts ChatGPT (GPT-4) in an iterative manner to support conceptual modelling and decisions about methods. The authors manually code a small Python model (40 lines of code) and provide this, along with prompts, to ChatGPT to generate suggestions for code optimisations, additional plotting code, and improvements to model documentation.

Plooy and Oosthuizen [2023] focussed on using ChatGPT (GPT-4) to generate Python code implementing a simple System Dynamics model of a resource-bound population in equilibrium. They outline a six-step approach to iteratively generate a model with ChatGPT’s help. Early steps focus on textual information describing equations for stocks and flows that are first manually implemented in the commercial simulation package iSee Stella Architect. The final step converts the generated equations into 32 lines of Python code, with outputs verified by comparing the manually created and generated models.
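
For readers unfamiliar with how such a model looks in code, the following sketch is our own illustration (not the 32 lines generated in the study, and the parameters are assumptions): a single stock, a resource-bound population, is integrated with simple Euler stepping until it settles at its carrying capacity, which is the essence of a small System Dynamics model in Python.

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative parameters for a resource-bound (logistic) population
GROWTH_RATE = 0.08        # fractional growth per time step
CARRYING_CAPACITY = 1000  # resource limit on the population
DT = 1.0                  # time step
STEPS = 200

population = np.empty(STEPS)
population[0] = 50.0

for t in range(1, STEPS):
    # net flow: growth limited by how close the stock is to the carrying capacity
    net_growth = GROWTH_RATE * population[t - 1] * (1 - population[t - 1] / CARRYING_CAPACITY)
    population[t] = population[t - 1] + net_growth * DT

plt.plot(np.arange(STEPS) * DT, population)
plt.xlabel("time")
plt.ylabel("population")
plt.title("Resource-bound population approaching equilibrium")
plt.show()
```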

Conceptual modelling#

Giabbanelli [2024] is a conceptual study that hypothesised about the potential of LLM application across common simulation tasks. The study focused on four key areas: structuring conceptual models, summarizing simulation outputs, improving accessibility to simulation platforms, and explaining simulation errors with guidance for resolution. For example, the emerging capability of LLMs to convert images to text could be used to provide automated explanations of charts of simulation output, benefiting both non-experts and people with visual impairments.

Shrestha et al. [2022] proposed a process to automatically explain simulation models, using generative AI to create a simplified textual version of a conceptual model from more complex causal maps. Their approach involved decomposing large conceptual models into smaller parts and then performing Natural Language Generation (NLG) using a fine-tuned GPT-3 model.

Conclusions#

Despite the limited body of research, these initial investigations suggest a potential role for generative AI in the future of computer simulation. Several common themes and gaps emerge:

  • The role of the modeller remains vital, particularly in planning the model and verifying the generated code.

  • Model generation is iterative, with code built up and refined over a sequence of prompts.

  • The studies have not explored issues arising from hallucination.

  • The studies have not explored more complex models.

Further research is needed to explore the integration of generative AI across a wider range of simulation paradigms and to develop robust frameworks for human-AI collaboration in the simulation development process.

References#

[1]

Ali Akhavan and Mohammad S. Jalali. Generative ai and simulation modeling: how should you (not) use large language models like chatgpt. System Dynamics Review, 2024. URL: https://onlinelibrary.wiley.com/doi/abs/10.1002/sdr.1773, doi:10.1002/sdr.1773.

[2]

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. 2020. URL: https://arxiv.org/abs/2005.14165, arXiv:2005.14165.

[3]

Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Raphaël Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, Lauro Langosco, Peter Hase, Erdem Bıyık, Anca Dragan, David Krueger, Dorsa Sadigh, and Dylan Hadfield-Menell. Open problems and fundamental limitations of reinforcement learning from human feedback. 2023. URL: https://arxiv.org/abs/2307.15217, arXiv:2307.15217.

[4]

Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn, and Yarin Gal. Detecting hallucinations in large language models using semantic entropy. Nature, 630(8017):625–630, 2024. URL: https://doi.org/10.1038/s41586-024-07421-0, doi:10.1038/s41586-024-07421-0.

[5]

Philippe J. Giabbanelli. Gpt-based models meet simulation: how to efficiently use large-scale pre-trained language models across simulation tasks. In Proceedings of the Winter Simulation Conference, WSC '23, 2920–2931. IEEE Press, 2024.

[6]

Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. A survey on hallucination in large language models: principles, taxonomy, challenges, and open questions. 2023. URL: https://arxiv.org/abs/2311.05232, arXiv:2311.05232.

[7]

Ilya Jackson, Maria Jesus Saenz, and Dmitry Ivanov. From natural language to simulations: applying ai to automate simulation modelling of logistics systems. International Journal of Production Research, 62(4):1434–1457, 2024. URL: https://doi.org/10.1080/00207543.2023.2276811, doi:10.1080/00207543.2023.2276811.

[8]

Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM Comput. Surv., mar 2023. URL: https://doi.org/10.1145/3571730, doi:10.1145/3571730.

[9]

Shachar Kaufman, Saharon Rosset, Claudia Perlich, and Ori Stitelman. Leakage in data mining: formulation, detection, and avoidance. ACM Transactions on Knowledge Discovery from Data, 6(4), 2012.

[10]

Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal. Decomposed prompting: a modular approach for solving complex tasks. 2023. URL: https://arxiv.org/abs/2210.02406, arXiv:2210.02406.

[11]

Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. Pre-train, prompt, and predict: a systematic survey of prompting methods in natural language processing. 2021. URL: https://arxiv.org/abs/2107.13586, arXiv:2107.13586.

[12]

Corne du Plooy and Rudolph Oosthuizen. AI usefulness in systems modelling and simulation: GPT-4 application. The South African Journal of Industrial Engineering, 34(3):286–303, November 2023. URL: https://sajie.journals.ac.za/pub/article/view/2944 (visited on 2023-12-26), doi:10.7166/34-3-2944.

[13]

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. OpenAI, 2018.

[14]

Anish Shrestha, Kyle Mielke, Tuong Anh Nguyen, and Philippe J. Giabbanelli. Automatically explaining a model: using deep neural networks to generate text from causal maps. In 2022 Winter Simulation Conference (WSC), 2629–2640. 2022. doi:10.1109/WSC57314.2022.10015446.

[15]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. 2023. URL: https://arxiv.org/abs/1706.03762, arXiv:1706.03762.

[16]

Jules White, Quchen Fu, Sam Hays, Michael Sandborn, Carlos Olea, Henry Gilbert, Ashraf Elnashar, Jesse Spencer-Smith, and Douglas C. Schmidt. A prompt pattern catalog to enhance prompt engineering with chatgpt. 2023. URL: https://arxiv.org/abs/2302.11382, arXiv:2302.11382.

[17]

Cheng Xu, Shuhao Guan, Derek Greene, and M-Tahar Kechadi. Benchmark data contamination of large language models: a survey. 2024. URL: https://arxiv.org/abs/2406.04244, arXiv:2406.04244.

[18]

Ziwei Xu, Sanjay Jain, and Mohan Kankanhalli. Hallucination is inevitable: an innate limitation of large language models. 2024. URL: https://arxiv.org/abs/2401.11817, arXiv:2401.11817.

[19]

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: synergizing reasoning and acting in language models. 2023. URL: https://arxiv.org/abs/2210.03629, arXiv:2210.03629.


By Thomas Monks

© Copyright 2023.