Alina von Davier is a keynote speaker at DEFI’s annual event, which this year is titled AI-Powered Pedagogy: Designing the Future of Education.

As Chief of Research at Duolingo English Test, she and her team are pioneering new assessment methods.

Here she tells us a little about her work. 

If you want to hear more from her, consider signing up to attend our in-person event on Tuesday 26th March.

Digital-first Assessments in the Age of Artificial Intelligence

By Dr. Alina von Davier

Recent advances in artificial intelligence (AI) and intelligent automation have revolutionized educational assessments, with a focus on scalable content generation. AI-powered content generation allows the development of diverse test items and personalized assessments, making high-quality assessments more affordable and accessible. However, a careful balance is needed between technological progress and human-centered design, which calls for interdisciplinary collaboration and a human-centered AI framework that integrates AI with human expertise to create assessments that adapt to test-takers’ needs. The Duolingo English Test (DET), a research-based, digital-first, high-stakes language test, is a trailblazer in the thoughtful adoption of AI throughout the test development and administration processes.

We believe that technology in general, and AI in particular, is the best way to scale education and level the playing field for all learners, regardless of their circumstances. It can help us create educational and testing content quickly, and curate and recommend resources at a difficulty level suited to each learner’s ability. In the context of assessment, AI and computational psychometrics create adaptive test experiences that home in on a test taker’s proficiency more efficiently than previously possible, making for a quicker, more streamlined testing experience while adhering to the high standards of reliability and validity for high-stakes tests.

We launched the DET as an extension of our mission to break down barriers to education. Our co-founders, Luis von Ahn and Severin Hacker, both immigrated to the US and experienced first-hand the arduous and expensive process of certifying their English proficiency. Duolingo was already using its technology expertise to make language learning accessible for people across the globe, and we sought to apply some of the same techniques to language assessment.

Interactive tests for interactive skills

As a research-based assessment, the DET’s first priority is the theoretical alignment of the test to frameworks such as the Common European Framework of Reference for Languages (CEFR, 2020). This is achieved by our language experts designing the task types using an extended evidence-centered design framework (Arieli-Attali et al., 2019; Mislevy et al., 2014).

AI comes into play next, helping us generate, review, and pilot new interactive items that can further improve the quality of the test and the DET’s Test Taker Experience, or TTX. As a digital-first assessment, we embrace new technology to enhance how we measure language ability. Our latest items use generative AI to simulate real-time conversations and writing tasks, allowing us to test language skills, including interactional competence (see Galaczi & Taylor, 2018), like never before in the assessment industry.

A digital environment allows for interactivity, which lets us build more authentic tasks, that is, tasks whose characteristics correspond to relevant activities in the real world (Bachman & Palmer, 2010). Last year, we introduced Interactive Listening, which puts test takers in a chat with an animated conversation partner. They then conduct a multi-turn conversation to achieve a certain communication goal, such as following up with a professor for more information on a topic from class, or asking a friend to review a paper. Such interactions align with the CEFR listening purposes of gist, specific details, and rhetorical purpose; test takers must use top-down and bottom-up listening processes to understand the individual utterances, as well as interactional competencies to manage the topic and turn-taking.

In addition to assessing the skill of listening in isolation, the Interactive Listening task, like many DET tasks, measures multiple language skills at once—which is how they’re used in real life! By listening and responding to a conversation partner, test takers demonstrate conversation skills (integrated listening and speaking), not just comprehension. In addition, the follow-up summary writing task further allows test takers to demonstrate comprehension of the conversation while simultaneously producing a written response. This type of integrated listening/writing skill aligns with ‘Mediation’ in the current CEFR.
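To make the task structure concrete, here is a rough sketch of how such an integrated item might be represented as data. The field names and structure below are illustrative assumptions for this article, not the DET’s actual item format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ConversationTurn:
    partner_utterance: str       # what the animated conversation partner says
    response_options: List[str]  # candidate replies the test taker chooses between
    target_index: int            # index of the most appropriate reply

@dataclass
class InteractiveListeningItem:
    communication_goal: str      # e.g. "ask a friend to review a paper"
    cefr_target: str             # targeted level, e.g. "B2"
    turns: List[ConversationTurn] = field(default_factory=list)
    # Follow-up integrated writing task ("Mediation" in CEFR terms)
    summary_prompt: str = "Summarize the conversation you just had."
```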

Optimizing test development with human-in-the-loop AI

After the items have been designed and pre-tested, we use the latest machine learning and software engineering technology to automatically generate test content, making room for more innovation in the test development process. While large language models (LLMs) can produce natural, coherent texts on a wide range of topics in virtually any genre, we ensure that each item is reviewed by human experts for quality, fairness and bias, as well as for factual accuracy.

For years, the only way to generate items according to the specifications and design was for teams of expert test developers to write them. This process involves people finding source material, researching, and of course writing, all of which takes time and money. And those costs are then passed on to test takers, which is why most high-stakes English proficiency tests cost several hundred dollars.

Instead of spending months writing items by hand, our test developers use a human-in-the-loop AI process to generate a far greater range of content, much faster! We then filter, edit, and review AI-generated content to produce test items that are indistinguishable from something written by actual humans. This ensures that we have a wide variety of content on the test—at a volume necessary to support our computer adaptive test format.
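In spirit, the workflow resembles the sketch below: a language model drafts many candidate items against a task specification, automated filters discard obvious failures, and human experts review everything that remains before it can enter the item pool. The function names and checks here are hypothetical placeholders, not Duolingo’s actual pipeline.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class DraftItem:
    prompt: str
    passage: str

def generate_drafts(task_spec: str, n: int) -> List[DraftItem]:
    """Draft n candidate items from a task specification.
    Placeholder: in practice this step would call a large language model."""
    return [DraftItem(prompt=task_spec, passage=f"Draft passage {i}") for i in range(n)]

def passes_automated_filters(item: DraftItem) -> bool:
    """Cheap automated screens, e.g. length limits or topic restrictions."""
    return 0 < len(item.passage) < 2000

def build_item_pool(task_spec: str, n_drafts: int,
                    human_approves: Callable[[DraftItem], bool]) -> List[DraftItem]:
    """Only items that pass both the automated filters and expert human review
    (quality, fairness and bias, factual accuracy) reach the pool."""
    drafts = generate_drafts(task_spec, n_drafts)
    return [d for d in drafts if passes_automated_filters(d) and human_approves(d)]
```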

CATs are nimble, CATs are quick

Computer adaptive tests (CATs) are efficient assessments that match item difficulty to each test taker’s performance. These tests require a flexible delivery platform, an item selection algorithm based on psychometric models, and a large item bank (see, for example, Magis, Yan & von Davier, 2017). Our approach to automatic item generation supplies appropriate content at the volume a CAT requires, and leveraging AI to create a large pool of items is what allows us to administer the test adaptively at scale. That is another innovation for the test-taker experience, because it means both that the exam can be completed in about an hour and that it can be taken at home.

In a traditional fixed-form exam, test takers are given questions at every proficiency level, regardless of their own proficiency. Items far from a test taker’s level are not only demoralizing, they are also not very “informative”: the response to such an item does little to sharpen the estimate of the test taker’s true proficiency.
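To make “informative” concrete: under a standard two-parameter logistic (2PL) item response model, used here purely as an illustration rather than as a description of the DET’s operational models, the information an item j contributes about an ability level θ is

$$
I_j(\theta) = a_j^2 \, P_j(\theta)\bigl(1 - P_j(\theta)\bigr),
\qquad
P_j(\theta) = \frac{1}{1 + e^{-a_j(\theta - b_j)}},
$$

where a_j is the item’s discrimination and b_j its difficulty. Information peaks when b_j is close to θ and drops off quickly for items far above or below the test taker’s level, which is exactly why very easy or very hard items tell us little.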

We designed the DET to quickly adapt to each test taker’s proficiency level, withholding content that is likely too difficult or too easy, not only because this is a better experience for the test taker but because it’s a more efficient way to assess their proficiency. Because they encounter fewer items far above or below their proficiency level, test takers may find that the test feels less stressful, and perhaps easier, than a longer, fixed-form exam.
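As a toy illustration of that selection logic, the sketch below always administers the not-yet-seen item that is most informative at the current ability estimate (maximum-information selection under the same 2PL model as above). The DET’s operational algorithms also handle content balancing, exposure control, and richer item types, so treat this as a textbook-style sketch under stated assumptions rather than the real system.

```python
import math

def prob_correct(theta: float, a: float, b: float) -> float:
    """2PL model: probability of a correct response at ability theta,
    given discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta: float, a: float, b: float) -> float:
    """Fisher information of a 2PL item at ability theta."""
    p = prob_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def select_next_item(theta_hat: float, item_bank: list, administered: set) -> int:
    """Maximum-information selection: pick the unadministered item that is
    most informative at the current ability estimate."""
    candidates = [i for i in range(len(item_bank)) if i not in administered]
    return max(candidates,
               key=lambda i: item_information(theta_hat,
                                              item_bank[i]["a"],
                                              item_bank[i]["b"]))

# Toy example: for an ability estimate near the middle of the scale,
# the moderately difficult item wins over a very easy or a very hard one.
bank = [{"a": 1.2, "b": -2.0}, {"a": 1.0, "b": 0.4}, {"a": 0.9, "b": 2.5}]
print(select_next_item(theta_hat=0.5, item_bank=bank, administered=set()))  # -> 1
```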

Adaptively administering test items also enhances the security of our test: because each test session is uniquely administered and draws from such a large pool of items, it’s extremely difficult for test takers to take advantage of leaked items. In addition, each test session records the webcam, microphone, keyboard, and cursor actions, and is first reviewed by our AI algorithm for potential signs of rule breaking or malicious behavior before it is reviewed by human proctors. These proctors are language experts who use the AI flags and established proctoring guidelines to determine whether any rules have been broken, before arriving at a certification decision.
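Conceptually, that review flow can be sketched as below: automated checks raise flags over the recorded session, and a human proctor makes the final call using those flags plus established guidelines. The event labels and thresholds are invented for illustration; they are not the DET’s actual detection rules.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class SessionEvent:
    kind: str        # illustrative labels, e.g. "window_switch", "second_voice"
    timestamp: float

def automated_flags(events: List[SessionEvent]) -> List[str]:
    """First pass: raise human-readable flags over recorded session events.
    Real systems use trained models; these rules are made up for illustration."""
    flags = []
    if any(e.kind == "window_switch" for e in events):
        flags.append("possible use of another application")
    if sum(e.kind == "second_voice" for e in events) > 2:
        flags.append("possible second speaker in the room")
    return flags

def certification_decision(events: List[SessionEvent],
                           proctor_decides: Callable[[List[str]], bool]) -> str:
    """AI flags inform, but never replace, the human proctor's decision."""
    return "certified" if proctor_decides(automated_flags(events)) else "not certified"
```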

AI must be used responsibly

The element that unites this entire ecosystem is the test-taker experience, or TTX. TTX is distinct from UX, or user experience; while UX is typically associated with design elements related to visuals and navigation in digital platforms, we use TTX to refer to the full test-taker experience, from test administration to score reporting.

We aim to provide a positive TTX at every stage of the test taker’s journey, including free test-readiness resources, a more affordable price point, intuitive UX design, shorter testing time, and fast score turnaround processes. Fairness plays a role, too: a test taker’s experience can vary across culture, first language, and computer setup. We consider these and other factors when developing test items, and when running quality control evaluations on the exam, to ensure that the test is fair for everyone.

As leaders in AI, we feel an obligation to set the standard. The most recent standards of the assessment community were published in 2014 and do not include guidelines for how AI may be used in assessments. That’s why we developed new Responsible AI standards (Burstein, 2023) to ensure accountability, transparency, fairness, privacy, and security in testing. By sharing them with the world, we hope to continue to lead the way in digital-first assessment.

Human expertise + AI’s power = a winning combination

From item generation to proctoring, we use human experts and machines where each performs best, always leaving final decisions to the human experts. And we’re not trying to minimize our need for human collaboration and innovation—rather, we’re using the technology to supplement all of the hard work our test creators do! By streamlining the test development process, we’re able to offer a faster, more innovative test, at a much more affordable price point, making it possible for more people to access high-stakes testing.