The Game-Changing Uses of ChatGPT and Other AI Large Language Models for Analytics

By Robert L. Simione II, Associate Professor of Professional Practice, Master of Science in Applied Analytics

Recent advances in artificial intelligence make this a uniquely exciting moment to work in analytics and education. For years, I have been working with large language models (LLMs), which give statistical measurements of what human language looks like based on large samples of text from the internet and researcher-compiled resources. The introduction of tools like ChatGPT and Bard has brought them to wider public awareness. In the analytics field, I think two applications will have more widespread use in the near future.

First, companies will continue to use LLMs to generate starter code. For example, suppose someone asks you to retrieve a dataset from Amazon Web Services’ S3 service, one of the most widely used file-hosting platforms in enterprise settings. Depending on your experience with the service, this can take several hours to figure out. Alternatively, you could “ask” an LLM to do it through a chatbot-like interface, and many of them would immediately provide working code that you can copy, paste, and run with only a bit of tweaking.
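To make the S3 example concrete, here is a minimal sketch of the kind of starter code such a request might yield. It is an illustration under stated assumptions, not code from any particular LLM: the bucket name, object key, and helper names are hypothetical, the file is assumed to be a UTF-8 CSV, and AWS credentials are assumed to be configured already. The only library call it relies on is the `get_object` method of the boto3 S3 client.

```python
import csv
import io


def read_csv_from_s3(client, bucket, key, encoding="utf-8"):
    """Download one CSV object from S3 and return its rows as dictionaries."""
    # get_object returns a dict whose "Body" entry is a readable byte stream.
    response = client.get_object(Bucket=bucket, Key=key)
    text = response["Body"].read().decode(encoding)
    return list(csv.DictReader(io.StringIO(text)))


def make_s3_client():
    """Create a real S3 client (assumes boto3 is installed and credentials are set)."""
    import boto3  # third-party; imported here so the reader above works with any client

    return boto3.client("s3")


# Hypothetical usage; the bucket and key below are placeholders:
# rows = read_csv_from_s3(make_s3_client(), "example-bucket", "data/sales.csv")
```

Passing the client in as an argument keeps the download logic easy to check with a stand-in object before it is pointed at a real bucket.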

LLMs can reduce a week’s coding to a few hours for work projects that follow common structures. 

While using LLMs is exciting, there is reason to exercise caution when using them at work: LLM output is not always correct. In the example above, an LLM can give you a confident reply that doesn’t work, sometimes by inventing functionality that doesn’t exist or doesn’t behave the way the LLM assumes it should. Errors like this can be difficult to identify without manually and thoroughly checking what has been produced. Furthermore, it can be hard to integrate the output into a project or other body of work, especially if the work needs careful documentation of citations, methodologies, and justifications to be reproducible.
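One inexpensive first check for invented functionality is to confirm that the names an LLM's code calls actually exist in the library it claims to use. The sketch below is only a partial safeguard, and the helper name is hypothetical: a name that exists can still behave differently from what the generated code assumes, so manual review is still needed.

```python
import importlib


def missing_attributes(module_name, attribute_names):
    """Return the names from attribute_names that the module does not define."""
    module = importlib.import_module(module_name)
    return [name for name in attribute_names if not hasattr(module, name)]


# The standard json module has loads and dumps but no parse_file,
# so only the invented name is reported.
print(missing_attributes("json", ["loads", "dumps", "parse_file"]))
```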

Another application that I think is currently underutilized but will grow soon is LLMs’ capacity for data cleaning. Several estimates indicate that analysts spend 30% to 80% of their time on data cleaning. Even on the low end, that is quite a bit of effort. For text, examples include correcting spelling errors, converting text encodings, extracting names of people or places, and other tasks that previously needed to be handled on a case-by-case basis with specialized tools. Now LLMs can be “asked” to do this. This automation leads to huge time savings for analysts who frequently work with “dirty” data.
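A rough sketch of what “asking” an LLM to clean text might look like in code follows. Everything here is an assumption made for illustration: `ask_llm` stands in for whatever function sends a prompt to a model and returns its reply as a string, the prompt wording is invented, and the task (fixing misspelled city names) is just one example. Because LLM output is not always correct, the sketch rejects replies that do not line up with the input and keeps each original value next to its suggested correction for review.

```python
import json


def build_cleaning_prompt(records):
    """Build a prompt asking for corrected values as a JSON list."""
    return (
        "Fix spelling errors in each city name below. "
        "Reply with only a JSON list of strings, one per input, in the same order.\n"
        + json.dumps(records, ensure_ascii=False)
    )


def clean_with_llm(records, ask_llm):
    """Return (original, suggestion) pairs; ask_llm is a hypothetical callable."""
    reply = ask_llm(build_cleaning_prompt(records))
    cleaned = json.loads(reply)
    # Reject replies that do not match the input one-to-one.
    if not isinstance(cleaned, list) or len(cleaned) != len(records):
        raise ValueError("LLM reply does not match the input length")
    # Pair each original with its suggestion so a person can review the changes.
    return list(zip(records, cleaned))
```

Keeping the pairs, rather than overwriting the data, also leaves a record of what was changed.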

No matter what problem, dataset, tool, or innovative approach someone brings to the field, I wish to leave students with the following advice: Document everything you do. Reproducibility is the foundation of all scientific work, and analytics is no different. By documenting the steps you have taken to solve a problem, you can share your solution with your colleagues, and even your future self, who can then evaluate the procedure, reproduce it, and learn from it.

About the Author

Robert L. Simione II is the newest associate professor of professional practice in Columbia University’s Master of Science in Applied Analytics program. He has taught the capstone course, Applied Analytics Frameworks and Methods 1 & 2, Applied Analytics in the Organizational Context, and Managing Data. These courses cover best practices with real-world context and examples and relevant cutting-edge technologies. 

Simione’s experience at Columbia goes back to 2019, when he started teaching part-time in the evenings while working during the day as the lead data scientist at CANVS.AI, a start-up in downtown Manhattan focused on market research tools for analyzing the actual text in companies’ social media engagement and customer response surveys. He has been in the industry since 2014, when he completed his Ph.D. in mathematics from Carnegie Mellon University.

About the Program

Columbia University’s Master of Science in Applied Analytics prepares students with the practical data and leadership skills to succeed. The program combines in-depth knowledge of data analytics with the leadership, management, and communication principles and tactics necessary to impact decision-making at all levels within organizations.