Carolyn Jane Anderson

Wellesley College

Solver-based Gradual Type Migration

arXiv (Cornell University) (2021)

Luna Phipps-Costin Carolyn Jane Anderson Michael Greenberg Arjun Guha

Gradually typed languages allow programmers to mix statically and dynamically typed code, enabling them to incrementally reap the benefits of static typing as they add type annotations to their code. However, this type migration process is typically a manual effort with limited tool support. This paper examines the problem of automated type migration: given a dynamic program, infer additional or improved type annotations. Existing type migration algorithms prioritize different goals, such as maximizing type precision, maintaining compatibility with unmigrated code, and preserving the semantics of the original program. We argue that the type migration problem involves fundamental compromises: optimizing for a single goal often comes at the expense of others. Ideally, a type migration tool would flexibly accommodate a range of user priorities. We present TypeWhich, a new approach to automated type migration for the gradually-typed lambda calculus with some extensions. Unlike prior work, which relies on custom solvers, TypeWhich produces constraints for an off-the-shelf MaxSMT solver. This allows us to easily express objectives, such as minimizing the number of necessary syntactic coercions, and constraining the type of the migration to be compatible with unmigrated code. We present the first comprehensive evaluation of GTLC type migration algorithms, and compare TypeWhich to four other tools from the literature. Our evaluation uses prior benchmarks, and a new set of "challenge problems." Moreover, we design a new evaluation methodology that highlights the subtleties of gradual type migration. In addition, we apply TypeWhich to a suite of benchmarks for Grift, a programming language based on the GTLC. TypeWhich is able to reconstruct all human-written annotations on all but one program.

Type Inference

Solver

Type safety

Data type

10.48550/arxiv.2109.05049

Citations (0)

ProSPer: Probing Human and Neural Network Language Model Understanding of Spatial Perspective

Tessa Masis Carolyn Jane Anderson

Understanding perspectival language is important for applications like dialogue systems and human-robot interaction. We propose a probe task that explores how well language models understand spatial perspective. We present a dataset for evaluating perspective inference in English, ProSPer, and use it to explore how humans and Transformer-based language models infer perspective. Although the best bidirectional model performs similarly to humans, they display different strengths: humans outperform neural networks in conversational contexts, while RoBERTa excels at written genres.

Language Understanding

10.18653/v1/2021.blackboxnlp-1.8

Citations (1)

Online discussion forum help-seeking behaviors of students underrepresented in STEM

International Conference of Learning Sciences (2020)

Victoria Jay Genevieve M. Henricks Carolyn Jane Anderson Lawrence Angrave Nigel Bosch

Underrepresented Minority

Citations (0)

Exploring Social Biases of Large Language Models in a College Artificial Intelligence Course

Proceedings of the AAAI Conference on Artificial Intelligence (2023)

Skylar Kolisko Carolyn Jane Anderson

Large neural network-based language models play an increasingly important role in contemporary AI. Although these models demonstrate sophisticated text generation capabilities, they have also been shown to reproduce harmful social biases contained in their training data. This paper presents a project that guides students through an exploration of social biases in large language models. As a final project for an intermediate college course in Artificial Intelligence, students developed a bias probe task for a previously-unstudied aspect of sociolinguistic or sociocultural bias they were interested in exploring. Through the process of constructing a dataset and evaluation metric to measure bias, students mastered key technical concepts, including how to run contemporary neural networks for natural language processing tasks; construct datasets and evaluation metrics; and analyze experimental results. Students reported their findings in an in-class presentation and a final report, recounting patterns of predictions that surprised, unsettled, and sparked interest in advocating for technology that reflects a more diverse set of backgrounds and experiences. Through this project, students engage with and even contribute to a growing body of scholarly work on social biases in large language models.

Presentation (obstetrics)

10.1609/aaai.v37i13.26879

Citations (2)

StudentEval: A Benchmark of Student-Written Prompts for Large Language Models of Code

Findings of the Association for Computational Linguistics: ACL 2024 (2024)

Hannah McLean Babe Sydney Nguyen Yangtian Zi Arjun Guha Molly Q Feldman

Benchmark (surveying)

Code (set theory)

10.18653/v1/2024.findings-acl.501

Citations (3)

StarCoder: may the source be with you!

arXiv (Cornell University) (2023)

Raymond Li Loubna Ben Allal Yangtian Zi Niklas Muennighoff Denis Kocetkov

The BigCode community, an open-scientific collaboration working on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder and StarCoderBase: 15.5B parameter models with 8K context length, infilling capabilities and fast large-batch inference enabled by multi-query attention. StarCoderBase is trained on 1 trillion tokens sourced from The Stack, a large collection of permissively licensed GitHub repositories with inspection tools and an opt-out process. We fine-tuned StarCoderBase on 35B Python tokens, resulting in the creation of StarCoder. We perform the most comprehensive evaluation of Code LLMs to date and show that StarCoderBase outperforms every open Code LLM that supports multiple programming languages and matches or outperforms the OpenAI code-cushman-001 model. Furthermore, StarCoder outperforms every model that is fine-tuned on Python, can be prompted to achieve 40% pass@1 on HumanEval, and still retains its performance on other programming languages. We take several important steps towards a safe open-access model release, including an improved PII redaction pipeline and a novel attribution tracing tool, and make the StarCoder models publicly available under a more commercially viable version of the Open Responsible AI Model license.

Python

Tracing

MIT License

10.48550/arxiv.2305.06161

Citations (119)

PhD Knowledge Not Required: A Reasoning Challenge for Large Language Models

arXiv (Cornell University) (2025)

Carolyn Jane Anderson Joydeep Biswas Aleksander Boruch-Gruszecki Federico Cassano Molly Q Feldman

Existing benchmarks for frontier models often test specialized, "PhD-level" knowledge that is difficult for non-experts to grasp. In contrast, we present a benchmark based on the NPR Sunday Puzzle Challenge that requires only general knowledge. Our benchmark is challenging for both humans and models; however, correct solutions are easy to verify, and models' mistakes are easy to spot. Our work reveals capability gaps that are not evident in existing benchmarks: OpenAI o1 significantly outperforms other reasoning models that are on par on benchmarks that test specialized knowledge. Furthermore, our analysis of reasoning outputs uncovers new kinds of failures. DeepSeek R1, for instance, often concedes with "I give up" before providing an answer that it knows is wrong. R1 can also be remarkably "uncertain" in its output and in rare cases, it does not "finish thinking," which suggests the need for an inference-time technique to "wrap up" before the context window limit is reached. We also quantify the effectiveness of reasoning longer with R1 and Gemini Thinking to identify the point beyond which more reasoning is unlikely to improve accuracy on our benchmark.

10.48550/arxiv.2502.01584

Citations (0)

3. Log-Multiplicative Association Models as Latent Variable Models for Nominal and/or Ordinal Data

Sociological Methodology (2000)

Carolyn Jane Anderson Jeroen K. Vermunt

Associations between multiple discrete measures are often due to collapsing over other variables. When the variables collapsed over are unobserved and continuous, log-multiplicative association models, including log-linear models with linear-by-linear interactions for ordinal categorical data and extensions of Goodman's (1979, 1985) RC(M) association model for multiple nominal and/or ordinal categorical variables, can be used to study the relationship between the observed discrete variables and the unobserved continuous ones, and to study the unobserved variables. The derivation and use of log-multiplicative association models as latent variable models for discrete variables are presented in this paper. The models are based on graphical models for discrete and continuous variables where the variables follow a conditional Gaussian distribution. The models have many desirable properties, including having schematic or graphical representations of the system of observed and unobserved variables, the log-multiplicative models can be read from the graphs, and estimates of the means, variances, and covariances of the latent variables given values on the observed variables are a function of the log-multiplicative model parameters. To illustrate some of the advantageous aspects of these models, two examples are presented. In one example, responses to items from the General Social Survey (Davis and Smith 1996) are modeled, and in the other example, panel data from two groups (Coleman 1964) are analyzed.

Categorical variable

Log-linear model

Conditional probability distribution

Variables

Graphical model

10.1111/0081-1750.00076

Citations (45)

How Beginning Programmers and Code LLMs (Mis)read Each Other

arXiv (Cornell University) (2024)

Sydney Nguyen Hannah McLean Babe Yangtian Zi Arjun Guha Carolyn Jane Anderson

Generative AI models, specifically large language models (LLMs), have made strides towards the long-standing goal of text-to-code generation. This progress has invited numerous studies of user interaction. However, less is known about the struggles and strategies of non-experts, for whom each step of the text-to-code problem presents challenges: describing their intent in natural language, evaluating the correctness of generated code, and editing prompts when the generated code is incorrect. This paper presents a large-scale controlled study of how 120 beginning coders across three academic institutions approach writing and editing prompts. A novel experimental design allows us to target specific steps in the text-to-code process and reveals that beginners struggle with writing and editing prompts, even for problems at their skill level and when correctness is automatically determined. Our mixed-methods evaluation provides insight into student processes and perceptions with key implications for non-expert Code LLM use within and outside of education.

Code (set theory)

10.1145/3613904.3642706

Citations (7)

GlyphPattern: An Abstract Pattern Recognition for Vision-Language Models

arXiv (Cornell University) (2024)

Wu Zixuan Yoolim Kim Carolyn Jane Anderson

Vision-Language Models (VLMs) building upon the foundation of powerful large language models have made rapid progress in reasoning across visual and textual data. While VLMs perform well on vision tasks that they are trained on, our results highlight key challenges in abstract pattern recognition. We present GlyphPattern, a 954 item dataset that pairs 318 human-written descriptions of visual patterns from 40 writing systems with three visual presentation styles. GlyphPattern evaluates abstract pattern recognition in VLMs, requiring models to understand and judge natural language descriptions of visual patterns. GlyphPattern patterns are drawn from a large-scale cognitive science investigation of human writing systems; as a result, they are rich in spatial reference and compositionality. Our experiments show that GlyphPattern is challenging for state-of-the-art VLMs (GPT-4o achieves only 55% accuracy), with marginal gains from few-shot prompting. Our detailed error analysis reveals challenges at multiple levels, including visual processing, natural language understanding, and pattern generalization.

10.48550/arxiv.2408.05894

Citations (0)