VoloLLM: Fully Automatic Dungeon Master for DnD

Updated 12/12/2024

Fall 2024 CSCI 5541 NLP: Class Project - University of Minnesota

Tired Tokenizers

Nicholas Padilla

Jack LeGeault

Jacob Cadavez


Abstract

VoloLLM is a Dungeons and Dragons dungeon master chatbot. It is designed to ingest Dungeons and Dragons adventure modules and guide a single player through the adventure.


Introduction

Dungeons & Dragons (D&D) is an open-ended tabletop role-playing game. Each player takes the role of a single crafted character, while the Dungeon Master (DM) creates a story, often following conventions from a campaign book, and leads the players through it. DM work can become extremely tedious and taxing: the DM must juggle complex intertwined storylines, keep the setting and characters consistent, and craft engaging gameplay. While enjoyable, the game can be hard to sustain: players must schedule sessions over multiple weeks and coordinate groups of 4 or more people, and DMs must prepare thoroughly for each session.

Our team aims to develop an application that takes on the role of a DM: it guides players through predefined stories (adventure modules) and responds to players' input and actions while building a cohesive narrative. Our hope is to address the problem of potential players being unable to find others with the time or energy to join them for a session.

Background

Previous research has explored models built for similar purposes. Most work focuses on a single component, but our findings suggest that a mixture of the following techniques can be beneficial.

Fine-tuned LLMs

Fine-tuning has been shown to improve the storytelling abilities of an LLM, enhancing its stylistic capabilities and demonstrating writing prowess when given extremely specific prompts and requirements (Gite et al., 2024). Additionally, when using parameter-efficient techniques such as QLoRA, the model does not require a high volume of stories in its training process. Although the quality of the generated stories increases, an ethical issue arises around plagiarism of authors' works. Studies have highlighted cases where models wrote eerily similar stories to preexisting ones, and unless safeguards are implemented, there is a high risk of real stories being stolen and published as fully original (Xie et al., 2023). Another downside is that fine-tuning has not been used to improve storytelling consistency and memory.

Retrieval-Augmented Generation (RAG)

RAG techniques retrieve relevant story elements and maintain contextual relevance based on the player's input. Other researchers have found that utilizing RAG can enhance plotline consistency and incorporate 'common-sense' constraints, meaning the LLM has a rationale for every line it produces (Wen et al., 2023). In the studies we found, stories from RAG-based LLMs were evaluated by human annotators as the most creative and complex compared to other prompting techniques. RAG also mitigates the plagiarism problem: the generated stories were found to have zero N-gram overlap with the original text. While effective for static or partially structured narratives, these methods struggle with the dynamic, player-driven nature of D&D campaigns.


Motivation

Our project provides a guided narrative experience for tabletop role-playing gamers. There are many new and classic adventure modules that many gamers would love to play.

Target Audience

D&D Players and DMs

Players who wish to engage in immersive D&D campaigns without requiring a human DM, or DMs who would like to use this LLM as an aid to their own campaign.

Game Developers

Game developers who would like to use this system as a baseline for training narrators within their own games.


Outside Impacts

Enhanced Accessibility

Removes the dependency on human DMs, making tabletop RPGs more accessible to new players and underrepresented communities.

Scalable Narratives

Allows larger-scale campaigns and group interactions to occur without extra resources needed.


Approach

For our project, we created an end-to-end DM system called VoloLLM. The system ingests a Dungeons and Dragons PDF adventure module, creates a database, and populates the contents of a progression system. After the system responds with a hard-coded introduction to the campaign, players type their response to the first prompt, and the system generates a reply that furthers the story.

More details about each of the implementations are provided below.


Campaign Preprocessing

Although we could build a system that creates narratives entirely improvisationally, there is a wealth of existing official and unofficial prewritten adventures that players may want to play. Because of this, we need a way for our system to ingest these stories and automatically process a Dungeons and Dragons adventure module.
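
A minimal sketch of this preprocessing step, assuming the PDF text has already been extracted and that scenario headings follow a "Chapter N: Title" convention (an illustrative assumption, not the exact layout of every module):

```python
import re

def split_into_scenarios(module_text: str) -> dict:
    """Split raw module text into scenario sections keyed by heading.

    Assumes headings follow a 'Chapter N: Title' convention; real
    modules may need per-module heading patterns.
    """
    pattern = re.compile(r"^(Chapter \d+: .+)$", re.MULTILINE)
    parts = pattern.split(module_text)
    # parts[0] is any preamble before the first heading; after that,
    # headings and bodies alternate.
    scenarios = {}
    for heading, body in zip(parts[1::2], parts[2::2]):
        scenarios[heading.strip()] = body.strip()
    return scenarios

sample = """Intro text.
Chapter 1: A Friend in Need
The party meets Volo in the tavern.
Chapter 2: Trollskull Alley
The party inherits a manor.
"""
sections = split_into_scenarios(sample)
```

Each extracted section can then be stored in the campaign database and fed to the progression system in order.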

Story Progress Tracking

After a certain window of responses has passed (currently set to 5), the LLM evaluates whether the current scenario has been resolved. If the criteria are satisfied, the system transitions players to the next scenario. This progress tracking mechanism minimizes the risk of narrative stagnation or divergence from the campaign's core storyline, ensuring players encounter all vital story beats.
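
The check can be sketched as follows; the function and variable names (`check_progress`, the YES/NO parsing convention) are hypothetical stand-ins for the mechanism described above, and `llm` is any callable mapping a prompt string to a reply string:

```python
RESPONSE_WINDOW = 5  # evaluate progress every 5 responses

def check_progress(response_count, recent_history, scenario_goal, llm):
    """Ask the LLM whether the current scenario's goal has been met.

    Returns True only when the window has elapsed AND the LLM judges
    the scenario resolved.
    """
    if response_count % RESPONSE_WINDOW != 0:
        return False  # not time to evaluate yet
    prompt = (
        f"Scenario goal: {scenario_goal}\n"
        f"Recent events:\n{recent_history}\n"
        "Has the goal been resolved? Answer YES or NO."
    )
    return llm(prompt).strip().upper().startswith("YES")

# Stub LLM for illustration: says YES when the goal text appears in the events.
stub_llm = lambda p: "YES" if "dragon defeated" in p else "NO"
advance = check_progress(5, "The dragon defeated, the town cheers.",
                         "Defeat the dragon", stub_llm)
```

When `check_progress` returns True, the system advances to the next scenario in the preprocessed module.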

Action Summarization & Database

In a Dungeons and Dragons campaign, the impact of player choices on the story is essential for immersion and enjoyment. To that end, an Action Database tracks player decisions using text summarization of a rolling chat history. Every three turns (each consisting of one human message and one AI response), the system summarizes the most recent five-turn window to extract key events. These events are stored in an in-memory vector store for future retrieval. This approach balances coherence and context: the five-turn window provides sufficient detail for smaller models while avoiding repetition or loss of flow, and the overlapping summaries ensure continuity and accurate representation of player actions.
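
The rolling-window logic can be sketched as below; `summarizer` and `store` are hypothetical stand-ins for the summarization LLM and the vector store (a plain list here, in place of a real `add_texts` call):

```python
def maybe_summarize(history, turn, summarizer, store):
    """Every 3 turns, summarize the last 5 turns and add it to the store.

    `history` is a list of (player, dm) message pairs.
    """
    SUMMARY_INTERVAL = 3  # summarize every 3 turns
    WINDOW = 5            # overlapping 5-turn window keeps continuity
    if turn % SUMMARY_INTERVAL != 0:
        return None
    window = history[-WINDOW:]
    text = " ".join(f"Player: {p} DM: {d}" for p, d in window)
    summary = summarizer(text)
    store.append(summary)  # stand-in for vector_store.add_texts([summary])
    return summary

# Simulate 6 turns with a stub summarizer that just echoes its input.
store, history = [], []
for turn in range(1, 7):
    history.append((f"action {turn}", f"reply {turn}"))
    maybe_summarize(history, turn, lambda t: f"Summary of: {t}", store)
```

Because turn 6's window covers turns 2 through 6, consecutive summaries overlap, which is what preserves continuity between them.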

Retrieval Augmented Generation

VoloLLM utilizes Retrieval-Augmented Generation (RAG) to improve DM responses by integrating contextual information from the campaign database (which holds information on the world and setting) and the event database (actions and scenarios that have already happened). Using LangChain's InMemoryVectorStore with BERT (bert-base-uncased) vector embeddings, the RAG pipeline injects relevant context into every prompt. This gives the model more contextual clues for its generated responses, aiding story consistency.
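
At its core the retrieval step is a similarity ranking over stored embeddings. A self-contained sketch, with toy 3-dimensional vectors standing in for the BERT embeddings:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, store, k=2):
    """Return the k stored texts most similar to the query embedding.

    `store` is a list of (embedding, text) pairs; in VoloLLM the
    embeddings come from bert-base-uncased, stubbed out here.
    """
    ranked = sorted(store, key=lambda e: cosine(query_vec, e[0]), reverse=True)
    return [text for _, text in ranked[:k]]

# Toy embeddings standing in for real BERT vectors.
store = [([1.0, 0.0, 0.0], "The tavern in Trollskull Alley"),
         ([0.0, 1.0, 0.0], "Volo's rescue request"),
         ([0.9, 0.1, 0.0], "The manor the party inherited")]
context = retrieve([1.0, 0.05, 0.0], store, k=2)
```

The retrieved texts are then concatenated into the DM prompt as context before generation.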


Anticipated Challenges

Our team anticipated issues with stylization and memory. Throughout our research, we saw little documentation on combining multiple methods to specialize a model for this type of storytelling. In a game as open-ended as D&D, there are limitless options to be taken, and the model has to keep track of them all to ensure that each event is accounted for and that every response makes thematic sense and strikes the right tone. There are many factors at play, and finding a way to manage them all was the point of this project.

Challenges Encountered

We encountered a number of challenges during the project. First, we intended to fine-tune the model to improve the style and format of its responses. We initially planned to do this using transcripts from a Dungeons and Dragons podcast, but we found the model seemed to overfit to the dataset: it would often respond with the mannerisms of the podcast's DM and include the podcast's characters in its responses. We explored using synthetic data generated with ChatGPT; however, because of the large response sizes needed to emulate the chat history and other contextual information, the cost was prohibitively expensive.

We ultimately found that improved prompt construction, using techniques learned over the course of the semester, greatly improved response quality and consistency instead. Additionally, for the local models, techniques such as automatic chain-of-thought prompting and few-shot prompting greatly improved the detail and output format of responses.
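
A hedged sketch of the kind of few-shot DM prompt this converged on; the example turns, headings, and function name are illustrative, not our exact template:

```python
# Illustrative few-shot examples: (player action, desired DM reply style).
FEW_SHOT_EXAMPLES = [
    ("I search the desk for hidden letters.",
     "Rifling through the drawers, you find a wax-sealed letter addressed to Volo."),
    ("I draw my sword and guard the door.",
     "You plant yourself in the doorway, blade ready, as footsteps echo closer."),
]

def build_prompt(context: str, history: str, player_input: str) -> str:
    """Assemble a few-shot DM prompt from retrieved context and chat history."""
    shots = "\n".join(f"Player: {p}\nDM: {d}" for p, d in FEW_SHOT_EXAMPLES)
    return (
        "You are the Dungeon Master. Respond in second person, in a vivid "
        "fantasy tone, and keep the story consistent with the context.\n\n"
        f"EXAMPLES:\n{shots}\n\n"
        f"CONTEXT:\n{context}\n\n"
        f"CHAT HISTORY:\n{history}\n\n"
        f"Player: {player_input}\nDM:"
    )
```

The few-shot pairs anchor both the tone and the expected output shape, which is what helped the smaller local models most.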

The most persistent issue experienced throughout the project was generating consistent output formats. Because so many subsystems rely on LLMs to generate story progress statuses or event summaries, the system is vulnerable to prompt structure leaking into the generated response when it is parsed from the full LLM output. For instance, if the "CHAT HISTORY" heading string leaks into the LLM response, it can end up in the chat history or action history. The string then appears even more frequently, causing all further responses to be ill-formatted and difficult for a human to read.
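
One mitigation is to sanitize replies before they are appended to any history. This is a hypothetical sketch (the heading list and function name are assumptions, not our exact code) that truncates a reply at the first leaked prompt heading:

```python
import re

# Prompt headings that have been observed leaking into generated replies.
LEAKED_HEADINGS = ("CHAT HISTORY", "CONTEXT", "EXAMPLES")

def sanitize_reply(reply: str) -> str:
    """Cut the reply off at the first leaked heading line, if any.

    Text after a leaked heading is usually regurgitated prompt, so it
    is safer to drop it than to store it in the chat/action history.
    """
    for heading in LEAKED_HEADINGS:
        match = re.search(rf"^\s*{heading}\s*:?\s*$", reply, re.MULTILINE)
        if match:
            reply = reply[:match.start()]
    return reply.strip()
```

Applying this before every history write breaks the feedback loop in which a leaked heading causes ever more leaks.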

Response formatting issues are most prevalent with local models, but they also occur, less frequently, with ChatGPT models. Even after adding output formatting and parsing with LangChain, the issue persisted.


Results

Player Satisfaction Survey Results

Category                 Average Score
Narratives               5.37
Player Engrossment       4.27
Enjoyment                5.57
Creative Freedom         5.63
Personal Gratification   4.8
Social Connectivity      5.37
Textual Aesthetics       5.17

In total, 15 players, including the 3 researchers, played the VoloLLM chatbot with the OpenAI LLM backend. From player responses, all categories of player satisfaction were rated neutral or above. The two highest-rated categories were Creative Freedom and Enjoyment. Players reported surprise at how capable the chatbot was at addressing their character choices. They felt free to choose their own path through the story, and the chatbot would smoothly accommodate almost any choice while incorporating its details.

The two lowest-rated categories were Player Engrossment and Personal Gratification. One of the most cited complaints was the lack of challenge in the story. Players cited that the bot was too compliant, to the point of taking away from story enjoyment. In one anecdote, a user was able to use sci-fi weapons to defeat their enemies; in another, the user abandoned the quest and instead went to get burgers. The lack of challenge and constraint left players feeling unchallenged and unsatisfied.

Additionally, the chatbot responds with relatively large blocks of text, reaching 256 characters in length. Users who do not report enjoying reading found that the text-based nature of the game made the experience less enjoyable. This may be why we see the greatest standard deviation in the questions regarding Player Engrossment and in Question 6, which asks about player boredom.


Model Satisfaction Survey Results

Category                Q    GPT 4o   Qwen 14B   Qwen 7B
Narrative               1    6.33     5.33       5
                        2    5.67     5.33       5.67
Player Engrossment      3    5        3          3
                        4    4.3      4          2.33
Enjoyment               5    5.33     5          5
                        6*   4.33     4          4
Creative Freedom        7    5        6.33       6
                        8    6        6.3        6
Personal Gratification  9    5        4          4.33
Social Connectivity     10   5.67     4.33       5
                        11   6.33     3.67       3.33
Textual Aesthetics      12   6        3          2.33
                        13   6.33     3.67       3.33

* Question 6 asks about player boredom.

When comparing the quality of play between the locally run models (Qwen 2.5 7B & 14B) and the OpenAI model, the clear winner is OpenAI. It is the only model that performed above neutral for all categories and satisfaction questions, except for Question 6, where it was rated as more boring. The one area where the Qwen models beat ChatGPT 4o in our evaluation is Creative Freedom. From our experience, these models tend to address the player's actions more directly, and with shorter responses.


Event Number   Cosine Similarity Score
1              0.7085
2              0.7827
3              0.8199
4              0.8671
5              0.7684
6              0.6830
7              0.7094
8              0.7696
9              0.8385
10             0.5850
11             0.7141
12             0.7140
13             0.6932
14             0.7117
15             0.5767
16             0.8566
17             0.7776
18             0.7686

Cosine similarity scores between the generated events and the corresponding passages in the module "Waterdeep: Dragon Heist", using the 'all-mpnet-base-v2' sentence transformer.

From our results, our team found that all of the event descriptions scored in the moderate-to-high range (0.57 to 0.86), indicating strong similarity to the original campaign module. This is what our team was aiming for: the closer the generated timeline events are to the original text, the more likely the responses generated from these event contexts will be accurate as well.



Conclusion

Replicability

Our research results focus largely on the psychometric analysis of 15 players who played VoloLLM for brief periods of time (15 to 60 minutes). Players were selected from among the family and friends of the authors of this report. Given the small sample size and flawed methodology for gathering impartial participants, future research based on this work may well find significantly different results. In short, we do not claim that our results are statistically significant or robust. However, we defend our sampling method, citing the scope of the course project, the financial cost of recruiting a significant number of participants from relevant communities (such as the tabletop gaming community), and the API costs associated with running these trials.

For anyone wishing to run trials to replicate our findings, all source code is publicly available on GitHub: https://github.com/padpy/VoloLLM. The only additional material needed to run the code is an OpenAI API key, for access to the paid API. A copy of the survey is included in our report.

Future Work

Stylization

There is room for improvement in the stylization of the LLM's responses, as they can read as bland and boring, which hurts the user's experience.

Survey More Players

Using our modified Guess-18 questionnaire, future work should interview many more players to identify areas of weakness within the gameplay loop.

Optimizing System Performance

Research should be done to evaluate the efficacy of RAG-based systems and whether the database should be stored in one structure vs. another.

Introducing New Campaigns

More campaigns should be introduced and run through the model to ensure our team has not simply been optimizing for a single campaign.


Ethical Concerns

The main concern with our application is the generation of sensitive or immoral content. We do not implement any user safety mechanisms that protect users from inappropriate generated content; we rely solely on the safety trained into the model or the mechanisms implemented through OpenAI's API. Our application is susceptible to injection techniques and jailbreaking. One common method for jailbreaking LLMs is to ask them to speak as another entity, bypassing their built-in safety rules. Our application already asks the AI model to roleplay by default, making it more prone to producing otherwise prohibited content, even through OpenAI's API.

In play testing, several incidents occurred that typically would not be allowed through the OpenAI API. First, we were able to generate suggestive content during romance attempts between player characters and non-player characters. Second, we were able to prompt the LLM in such a way that it described graphic acts of violence towards humans and animals. This is typically not allowed by OpenAI, but in the course of play testing the model freely generated this content. We can already see that the model is prone to bypassing the safety restrictions placed on it by OpenAI.

Currently, we as the developers have access to the chat logs and data of all chats sent through VoloLLM for analysis. This may raise privacy concerns for some users, especially those expecting a private session who want their chats to remain anonymous and inaccessible to other individuals. While we made this fact explicitly clear in this iteration through a user-informed-consent protocol, future iterations should keep this ethical concern in mind and take further precautions when accessing and using this data.