Designing a low-stakes experiment
Due to the experimental nature of the work, the professor designed a low-stakes assignment with two primary goals: (a) to help identify strengths and weaknesses in the dataset, and (b) to get a better handle on how and when students use GenAI tools.
To set students up for success, the professor:
- Organized a hands-on training session with the research data science specialist to demonstrate how to use a curated database and prompt GenAI tools within Dartmouth Chat.
- Focused the work on the provided dataset, asking students not to rely on external search engines.
- Instructed students on a specific workflow: upload articles one by one, prompt their large language model (LLM) of choice for a basic summary, and then ask more specific, course-related questions.
- Graded on completion, not perfection, giving students full credit for turning in the assignment on time to encourage honest experimentation.
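The per-article workflow above can be sketched in Python. This is a minimal illustration of how the prompt sequence might be assembled for an OpenAI-compatible chat API (the style of interface platforms like Dartmouth Chat typically expose); the system prompt, article text, and follow-up questions are all illustrative assumptions, not the course's actual materials.

```python
# Sketch of the assigned workflow: one article at a time, a summary request
# first, then course-specific follow-up questions in the same conversation.
# All content here (system prompt, questions) is illustrative.

def build_article_conversation(article_text: str, follow_ups: list[str]) -> list[dict]:
    """Assemble the message sequence for a single uploaded article."""
    messages = [
        {"role": "system",
         "content": "You are helping a student read a scholarly article closely."},
        {"role": "user",
         "content": "Summarize the main argument of this article:\n\n" + article_text},
    ]
    # Course-related follow-ups come after the summary request, so the model
    # answers them with the article already in its context window.
    messages += [{"role": "user", "content": q} for q in follow_ups]
    return messages

convo = build_article_conversation(
    "Full text of one uploaded article...",
    ["What evidence supports the author's claim?",
     "How does this relate to liberation theology?"],
)
```

With an OpenAI-compatible client, each user turn would then be sent via the chat-completions endpoint, appending the assistant's reply to the list before the next question.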
Finding 1: GenAI as a 'summarizing machine,' not a critical analyst
While the AI tools proved useful, their limitations became apparent in a humanities context that requires deep reading and contextual understanding. The professor noted that while the LLMs were effective for generating broad impressions of articles, they could not substitute for direct engagement with the texts. This observation was echoed by students:
"Students felt that the tools did not go deep enough to get into the details of arguments, of data/evidence presented, and did not capture the critical edge of the scholarly writing."
— Professor
The professor and the students observed specific analytical weaknesses in the LLMs, including:
- Trouble with conceptual associations (e.g., connecting "agriculture" to "environment").
- Limited knowledge in areas like liberation theology, prophetic thought, indigeneity, race, and queer/LGBTQ+ topics.
- A tendency to give short, simple replies when they could not make meaning from the inputs.
One clear benefit emerged for a student less comfortable with Spanish: the LLM, alongside tools like DeepL Translate, served as a powerful language-accessibility aid.
Finding 2: The students uncovered the strengths and limitations of GenAI, and gaps in the journal's focus
The professor noted that the assignment helped to uncover the limitations of GenAI in a course that required students to draw on background context, new concepts, and content they were still in the process of learning.
For general queries, the LLMs were helpful, giving students good guidance on how to dissect and read an article by foregrounding key terms. However, for abstract-style summaries, the tools offered little more value than the abstracts the journal itself provides.
"It was great at summarizing, and they called it a 'summarizing machine', but it did not have the same payoff as reading the article on one's own terms. . . The assignment, in all, was helpful in students identifying some key limitations of the ECA dataset: It does not engage indigeneity, race, LGBTQ+ identities and issues, and is largely centered on Salvadoran national concerns and debates." — Professor
Finding 3: Unexpected technical hurdles became learning moments
Beyond the analytical limitations, the class encountered practical, platform-specific challenges. The daily token limits on Dartmouth Chat's more powerful models meant that students who started right away had more "time" (days) to work with the pricier LLM tools, but they still had to be mindful of the tasks they were asking the tools to do. For example, uploading a PDF and asking for a summary left students with fewer tokens for specific queries, follow-ups, etc. In other cases, students were forced to use one of Dartmouth's free local models, Llama, which they soon discovered was limited and less than optimal for the assignment.
"They found, quite instantly, that Llama's model is incredibly limited as it gave them strange outputs and was being particularly snarky with them when they followed up with pointed requests." — Professor
These technical frustrations became an important lesson in how the underlying infrastructure and cost of AI can shape the research process and outcomes.
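The token-budget tradeoff students ran into can be made concrete with a back-of-the-envelope estimate. The daily limit and the roughly-four-characters-per-token rule of thumb below are illustrative assumptions, not Dartmouth Chat's actual accounting.

```python
# Rough illustration of why uploading a full PDF eats into the daily budget:
# estimate tokens with the common ~4-characters-per-token rule of thumb.
# The daily cap below is a made-up number for illustration only.

DAILY_TOKEN_LIMIT = 50_000  # hypothetical daily cap on a "pricier" model

def estimate_tokens(text: str) -> int:
    """Very rough token estimate (~4 characters per token for English text)."""
    return max(1, len(text) // 4)

def remaining_budget(article_text: str, limit: int = DAILY_TOKEN_LIMIT) -> int:
    """Tokens left for follow-up questions after one summary request."""
    return limit - estimate_tokens(article_text)

article = "x" * 120_000  # a long article is easily tens of thousands of tokens
left = remaining_budget(article)
```

Under these assumed numbers, a single long upload consumes more than half the day's budget before any follow-up question is asked, which matches the tradeoff students observed.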
Blueprint for the next iteration
Based on this experiment, the professor had suggestions for how he might approach a similar assignment in the future:
- Scaffold prompting skills: Practice prompting with students earlier in the term using in-class demos on a single article. The professor wants to develop a checklist to ensure students can confirm their LLM settings are configured for the desired result.
- Fine-tune the model: Adjust the LLM's "temperature" (a sampling setting; lower values produce more predictable output) for more advanced exercises to ensure more consistent and structured outputs.
- Focus the AI's attention: Experiment with "focused retrieval" (directing the LLM to a specific part of a document) rather than having it scan the entire text.
- Ensure equitable access: Use an English-language dataset to create a more level playing field for students.
- Systematize feedback: Create a structured format for students to report issues, concerns, and discoveries with the assignment and the AI tools.
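The "focused retrieval" idea in the list above (sending the model only the relevant slice of a document rather than the full text) can be sketched as a simple section extractor. The all-caps heading convention below is an illustrative assumption, not a property of the actual ECA articles.

```python
# Sketch of "focused retrieval": extract one section of an article so the
# LLM's attention (and the token budget) goes to the passage that matters.
# Assumes sections are marked by ALL-CAPS heading lines; real articles differ.

def extract_section(text: str, heading: str) -> str:
    """Return the lines from `heading` up to the next all-caps heading."""
    section: list[str] = []
    capturing = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == heading:
            capturing = True
            continue
        # A new all-caps line marks the start of the next section.
        if capturing and stripped and stripped.isupper():
            break
        if capturing:
            section.append(line)
    return "\n".join(section).strip()

article = """INTRODUCTION
Context for the study.
METHODS
Interviews with twelve participants.
RESULTS
Participation rose sharply."""

methods = extract_section(article, "METHODS")
```

The extracted passage, rather than the whole article, would then be sent as the prompt's context. For the temperature adjustment mentioned above, OpenAI-compatible chat APIs accept a `temperature` parameter on the completion call, with lower values making the sampling more deterministic.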
As the professor concluded, "What was imprecise about the assignment, despite my attempts to develop good prompting habits and strategies, was the lack of detail in some of the targeted inquiries about particular topics, keywords, and author arguments"—a challenge these future steps aim to address.