Welcome to my blog!
Musings, wanderings & learnings
Christopher M. Stewart, Ph.D.
📊 Is data "data"? 📊
March 25th, 2026
At a breakfast at Google's LAX campus about ten years ago, I was discussing my team's crowdsourcing work with some colleagues: two linguists and a quantitative psychologist working on the kinds of projects that companies like scale.ai would eventually make gobs of money doing for other companies. The data that we were collecting was all about ads, yet the three of us had no background in advertising whatsoever. The consensus was that "data is data". We didn't necessarily need extensive experience with advertising or with individual partner teams' use cases to do our work of getting high-quality data to train and/or validate their models.
Fast-forward a decade. For the past nine months, I have been working with many different collaborators on different kinds of projects. My collaborators have titles like AI scientist, faculty member at a research hospital, professor, graduate student, and startup founder. The projects address AI safety both from a methodological standpoint and in real-world contexts: cancer survivorship care, red-teaming, and language patterns in AI output vs. human writing, among others. I have found that the deeper the collaboration, the more important it is to understand the culture of the domain in which my collaborators work. For example, physicians have a very particular perspective on the safe and secure deployment of medical AI that has to be taken seriously for a collaboration to be fruitful. A focus on the patient rather than on the AI results in different priorities. "Data is data" only works up to a point.
With that being said, there are contexts in which the "data is data" mindset can be helpful. In the second week of XCS224N, Stanford's NLP with Deep Learning course, we did a deep dive into the 2013 paper that introduced Word2Vec, a computational implementation of the famous British linguist J.R. Firth's "you shall know a word by the company it keeps". In 2018, this insight showed up in chemistry. Mol2vec treats molecules the way Word2Vec treats sentences: break a molecule into its substructures, treat each one as a "word", and learn embeddings from a corpus of 19.9 million compounds. The result is a 300-dimensional vector for any molecule, a compressed representation of its chemical "meaning". In a recent interview, I used Mol2vec (and XGBoost) to predict the glass transition temperature (Tg), the temperature range where a polymer transitions from a hard, brittle glassy state to a soft, pliable one, from string representations of polymer structures known as SMILES. On a held-out test set, the trained model achieved an R² of 0.78. Not too shabby.
I have no background in chemistry or materials science. But the paradigm of "find the words, learn the grammar, let the geometry of the embedding space do the work" appears to have some juice in polymer informatics, just like in natural language processing. Sometimes data really is data.
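The embed-then-regress pipeline can be sketched with toy data. Everything below is illustrative: the random vectors stand in for Mol2vec embeddings, the target values stand in for measured Tg, and a ridge regression (solved via the normal equations) stands in for XGBoost so the sketch needs only numpy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for Mol2vec output: one 300-dimensional embedding per
# "molecule". Real embeddings come from a model trained on substructure
# corpora; these random vectors are purely illustrative.
n_molecules, dim = 2000, 300
X = rng.normal(size=(n_molecules, dim))

# Hypothetical glass transition temperatures generated from a hidden
# linear rule plus noise, so the regression has a signal to recover.
w_true = rng.normal(size=dim)
y = X @ w_true + rng.normal(scale=0.5, size=n_molecules)

# Simple train/test split.
X_train, X_test = X[:1500], X[1500:]
y_train, y_test = y[:1500], y[1500:]

# Ridge regression via the normal equations (a linear model keeps this
# sketch dependency-free; the real model was XGBoost).
lam = 1.0
w = np.linalg.solve(X_train.T @ X_train + lam * np.eye(dim),
                    X_train.T @ y_train)

# R^2 on the held-out set.
pred = X_test @ w
ss_res = float(np.sum((y_test - pred) ** 2))
ss_tot = float(np.sum((y_test - y_test.mean()) ** 2))
r2 = 1.0 - ss_res / ss_tot
print(f"held-out R^2: {r2:.3f}")
```

In the real pipeline the first step is the hard part: featurizing each SMILES string into substructure "words" before embedding them; the regression on top is standard supervised learning.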
Stanford AI Professional Program: NLP with Deep Learning
Spring 2026
I am currently working on completing Stanford Engineering’s Artificial Intelligence Professional Program and am taking XCS224N: Natural Language Processing with Deep Learning. The course covers 10 modules and 5 assignments, providing deep theoretical and practical grounding in modern NLP. Below is a summary of my notes from the first two modules, followed by the full reference document.
Notes: The embedded PDF below contains my complete reference notes from Modules 1 and 2 of XCS224N, including mathematical derivations, worked examples, and key takeaways for later modules. I'll keep it current as I progress through the course's modules. The math is intense for me, but I'm lucky to have time to work on digesting it right now.
RAG via Hierarchical Bayesian Language Modeling
LinkedIn Post · March 1, 2026
This post previews a forthcoming COLM submission about automatic prompt optimization with neural and non-neural RAG for financial question-answering. Is there an alternative to RAG? Maybe something Bayesian? Perhaps even better than neural embeddings and similarity in some cases? The discussion touches on AI, NLP, and practical considerations for building effective retrieval pipelines.