Training classifiers with natural language explanations

wcrichton · on Aug 25, 2018

Part of what this reveals is the intimate relationship between querying and labeling. A query (in the SQL sense) is the human attempting to express a domain concept through a program. Here the queries have an imperative flavor (being written in Python). From the article for example, identifying causal relationships in text by searching for “due to” is a domain concept encoded as a weak heuristic.

This suggests that our query tools need to be more deeply integrated into human-in-the-loop machine learning workflows. For example, in my use case of analyzing TV news videos, let’s say I want to identify a panel of guests. I’ll come up with a query like “3 to 5 people on screen, whose pose suggests they are sitting, and they’re looking at each other.” While this query isn’t a perfect filter (precision nor recall), it will likely find a few positive examples. Then I can query “show me more scenes like this one,” and slowly build up a training set from my queries. Then I train a classifier, and inspect its results. Rinse and repeat.

Edit: also, I’m sure there’s a billion startups that do some variant of this for some domain, but I think we really need a better open source ecosystem around visualization and labeling of data for this workflow to be truly accessible in most domains.