Talk - Form, Meaning, and Tokenization in Large Language Models
10:30 am - 12:00 noon
Room 2405 (Lifts 17-18), 2/F Academic Building OR Online via ZOOM

Patterns in sound and meaning affect word processing. For example, words in English and other Germanic languages that begin with /gl/ often refer to light (e.g., glow, glint, and glimmer), which influences how people retrieve and represent those words. Large language models, such as GPT-4, represent words in an analogous way, segmenting them into tokens. For example, GPT-4 tokenizes glint and glimmer as gl + int and gl + immer, and the Chinese characters 蛇 and 蚊 share both a semantic radical and a GPT-4 token (which corresponds to a shared byte in their UTF-8 encodings). Tokenization thereby gives large language models opportunities to use the semantic information available in word form. However, tokenization is determined by the frequency of character strings in the training data, independently of any semantic information, and previous research suggests that tokens rarely map onto meaningful constituents such as morphemes (e.g., -ing and pre-). In this talk, I present evidence that (1) GPT-4 tokens do capture more semantic information than expected by chance, but (2) the amount of information is greater in languages that are more closely related to English, and (3) it is unclear whether LLMs take advantage of this information. I discuss implications for the democratization and interpretability of artificial intelligence, namely the need to ameliorate rather than amplify disadvantages for speakers of low-resource languages and the goal of aligning lexical concepts in large language models with those in the human mind.


Dave Haslett is a PhD candidate at the Chinese University of Hong Kong, where he is a member of the Language Processing Laboratory in the Department of Linguistics and Modern Languages. His dissertation presents evidence that similar-sounding words influence how people retrieve and represent meanings, which has implications for psycholinguistics, language evolution, and natural language processing. Portions of that in-progress dissertation have been published in the Journal of Experimental Psychology: General and in Psychonomic Bulletin & Review.

Host: Prof Janet Hui-wen HSIAO, Professor, Division of Social Science, HKUST


Click to Join ZOOM Meeting

Meeting ID: 923 8923 3522

Passcode: 583475

Division of Social Science