Writing problematic code with AI’s help

 
Humans trust AI assistants too much and end up writing more insecure code. At the same time, they gain false confidence that they have written well-functioning and secure code.
 

In their study, Perry et al. from Stanford University engaged participants in security-related programming tasks to answer the following research questions:

  1. Does the distribution of security vulnerabilities users introduce differ based on [the] usage of an AI assistant?

  2. Do users trust AI assistants to write secure code?

  3. How do users’ language and behavior when interacting with an AI assistant affect the degree of security vulnerabilities in their code?

EXPERIMENTAL SETUP: The researchers chose tasks covering cryptographic libraries, handling and using user-controlled data, common web vulnerabilities, and lower-level problems such as memory management. Participants completed a self-contained assignment and were graded on whether their output met correctness and security criteria. The participants were primarily undergraduate students with fewer than five years of programming experience. This participant selection is a limitation of the study: if your organization has more experienced programmers, your mileage with these findings will vary.
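To give a flavor of the cryptography category of tasks, here is a minimal sketch of symmetric string encryption done with a well-vetted library. This is my own illustration, not code from the study; the choice of Python's cryptography package (which needs to be installed separately) and the function names are my assumptions.

```python
# Illustrative sketch only (not from the study): symmetric string
# encryption/decryption using the cryptography package's Fernet recipe,
# which handles key format, IVs, and authentication for you.
# Requires: pip install cryptography
from cryptography.fernet import Fernet

def encrypt_message(key: bytes, plaintext: str) -> bytes:
    # Fernet provides authenticated encryption (AES-CBC + HMAC).
    return Fernet(key).encrypt(plaintext.encode("utf-8"))

def decrypt_message(key: bytes, token: bytes) -> str:
    # Raises InvalidToken if the ciphertext was tampered with.
    return Fernet(key).decrypt(token).decode("utf-8")

if __name__ == "__main__":
    key = Fernet.generate_key()  # store this securely in practice
    token = encrypt_message(key, "hello, world")
    assert decrypt_message(key, token) == "hello, world"
```

Hand-rolling the same task with low-level primitives (hard-coded keys, missing IVs, no authentication) is exactly the kind of mistake the graders were looking for.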

WHAT THEY FOUND: Participants with access to AI assistants were likelier to believe they had written secure code. While AI assistants help non-programmers and novices produce more code in the short term, there is no guarantee that the code is high quality or secure. The researchers found significant differences on the string-encryption and SQL-injection tasks (two security-related tasks) between those who used AI assistants and those who didn't. Across all the assigned tasks, the macro-level finding was that participants who trusted AI assistants less tinkered more with their prompts (in both language and formatting) and got more secure code out of the assistant.
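To make the SQL-injection finding concrete, here is a minimal sketch (my own illustration, not the study's task code) contrasting the string-built query an over-trusting user might accept from an assistant with a parameterized version, using Python's built-in sqlite3 module.

```python
# Illustrative sketch only: injectable vs. parameterized queries,
# using Python's built-in sqlite3 module.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, is_admin INTEGER)")
conn.execute("INSERT INTO users VALUES ('alice', 0)")

user_input = "alice' OR '1'='1"  # attacker-controlled value

# Vulnerable: user input is interpolated directly into the SQL string,
# so the injected OR clause matches every row.
unsafe = conn.execute(
    f"SELECT * FROM users WHERE name = '{user_input}'"
).fetchall()

# Safer: a parameterized query treats the input as data, not SQL,
# so the malicious string matches nothing.
safe = conn.execute(
    "SELECT * FROM users WHERE name = ?", (user_input,)
).fetchall()

print(len(unsafe), len(safe))  # 1 0
```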


WHY IT MATTERS: The Discussion section of the paper opens with: “However, our results provide caution that inexperienced developers may be inclined to readily trust an AI assistant’s output, at the risk of introducing new security vulnerabilities.” That is a good summary of the current state of copilot tools (not the GitHub product specifically, but the general category) and the impact they will have on programming behavior in the near term. Companies are rushing to deploy code-generation tools to boost the productivity of programming teams, yet current deployments proceed on anecdotal evidence about productivity gains. Studies such as this one, alongside those from Google and GitHub, help highlight where these tools work well and where they don't. These results can serve as helpful guidance for escaping pilot purgatory in Generative AI.

THE NUANCES: The research paper details the prompting strategies participants used (prompt length, formatting, and language), how they tinkered with the AI system's parameters, and how these choices affected the correctness and security of the outputs. These choices also shaped how participants judged their own performance. I encourage readers to peruse the paper for the details.
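To make "tinkering with parameters" concrete, here is a rough, hypothetical sketch of the behavior the paper describes: varying prompt wording and sampling temperature instead of accepting the first suggestion. Nothing here comes from the study; query_assistant is a placeholder for whatever assistant API or interface is in use, and the prompts and temperatures are arbitrary examples.

```python
# Hypothetical sketch of the "tinkering" behavior the paper describes.
# query_assistant is a placeholder, not a real API: connect it to your
# own code-assistant or completion endpoint before using this.
def query_assistant(prompt: str, temperature: float) -> str:
    raise NotImplementedError("wire this up to an actual assistant API")

PROMPTS = [
    "Write a Python function that encrypts a string with a symmetric key.",
    "Write a Python function that encrypts a string with a symmetric key, "
    "using a well-maintained library and authenticated encryption.",
]
TEMPERATURES = [0.0, 0.2, 0.8]

def collect_candidates() -> list[str]:
    # Vary both the prompt wording and the sampling temperature, then
    # review the candidates instead of trusting the first output.
    candidates = []
    for prompt in PROMPTS:
        for temperature in TEMPERATURES:
            candidates.append(query_assistant(prompt, temperature))
    return candidates
```

The point is not the specific values but the habit: participants who compared and reviewed multiple outputs, rather than accepting the first one, produced more secure code.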


THE HUMAN-MACHINE FUTURE: This finding from the authors caught my eye: “[o]verall, our results suggest that several participants developed 'mental models' of the assistant over time, and those that were more likely to pro-actively adjust parameters and re-phrase prompts were more likely [to] provide correct and secure code.” These results point to an exciting future for how humans and machines might work together collectively and collaboratively in a shared environment, which is exactly what we’re exploring in my BCG Henderson Institute Fellowship work on Augmented Collective Intelligence.

GO DEEPER: Here are some papers that provide supplementary perspectives:

  1. Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., . . . Zaremba, W. (2021). Evaluating Large Language Models Trained on Code. ArXiv. https://doi.org/10.48550/arXiv.2107.03374 

  2. Koyuncu, A., Liu, K., Bissyandé, T. F., Kim, D., Klein, J., Monperrus, M., & Traon, Y. L. (2018). FixMiner: Mining Relevant Fix Patterns for Automated Program Repair. ArXiv. https://doi.org/10.1007/s10664-019-09780-z 

  3. Le, T. H., Chen, H., & Babar, M. A. (2020). Deep Learning for Source Code Modeling and Generation: Models, Applications and Challenges. ArXiv. https://doi.org/10.1145/3383458 

Abhishek Gupta

Founder and Principal Researcher, Montreal AI Ethics Institute

Director, Responsible AI, Boston Consulting Group (BCG)

Fellow, Augmented Collective Intelligence, BCG Henderson Institute

Chair, Standards Working Group, Green Software Foundation

Author, AI Ethics Brief and State of AI Ethics Report

https://www.linkedin.com/in/abhishekguptamcgill/