Part of what this reveals is the intimate relationship between querying and labeling. A query (in the SQL sense) is a human’s attempt to express a domain concept through a program. Here the queries have an imperative flavor, since they are written in Python. For example, the article identifies causal relationships in text by searching for “due to”: a domain concept encoded as a weak heuristic.
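To make that concrete, here is a minimal sketch of the “due to” heuristic written as a small labeling function. The names (`lf_due_to`, `CAUSAL`, `ABSTAIN`) and the example sentences are my own placeholders, not taken from the article or any particular library.

```python
import re

# Hypothetical label values, chosen for this sketch only.
CAUSAL, ABSTAIN = 1, -1

def lf_due_to(sentence: str) -> int:
    """Weak heuristic: vote CAUSAL if the sentence contains the phrase 'due to'."""
    return CAUSAL if re.search(r"\bdue to\b", sentence, re.IGNORECASE) else ABSTAIN

# Made-up sentences to show where the heuristic fires and where it does not.
sentences = [
    "The flight was delayed due to fog.",  # fires
    "The payment is due tomorrow.",        # does not fire ("due to" is not a whole phrase here)
    "Fog caused the delay.",               # causal, but missed: a recall error
]
labels = [lf_due_to(s) for s in sentences]
```

The third sentence is the interesting one: the concept is there, but the query doesn’t capture it, which is exactly why the heuristic is only “weak.”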
This suggests that our query tools need to be more deeply integrated into human-in-the-loop machine learning workflows. For example, in my use case of analyzing TV news videos, let’s say I want to identify a panel of guests. I’ll come up with a query like “3 to 5 people on screen, whose pose suggests they are sitting, and they’re looking at each other.” While this query isn’t a perfect filter (in either precision or recall), it will likely find a few positive examples. Then I can query “show me more scenes like this one,” and slowly build up a training set from my queries. Then I train a classifier and inspect its results. Rinse and repeat.
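Roughly, one pass of that loop could look like the sketch below. Everything in it is an assumption for illustration: the `Scene` fields, the thresholds, the made-up scenes, the nearest-neighbor “more like this,” and the nearest-centroid rule standing in for a real classifier. The human review step is simulated by simply accepting the suggested candidates.

```python
from dataclasses import dataclass
from math import dist

@dataclass(frozen=True)
class Scene:
    scene_id: str
    num_people: int
    seated_frac: float  # share of people whose pose looks seated
    mutual_gaze: float  # 0..1 score for "looking at each other"

    def vec(self):
        # Crude scaling so head count doesn't dominate the distance.
        return (self.num_people / 10, self.seated_frac, self.mutual_gaze)

# Made-up scenes, not real data.
scenes = [
    Scene("a", 4, 1.0, 0.9), Scene("b", 3, 1.0, 0.8), Scene("c", 1, 1.0, 0.0),
    Scene("d", 6, 0.9, 0.7), Scene("e", 2, 0.0, 0.1), Scene("f", 4, 0.2, 0.3),
]

def panel_query(s):
    # "3 to 5 people, sitting, looking at each other"
    return 3 <= s.num_people <= 5 and s.seated_frac >= 0.8 and s.mutual_gaze >= 0.5

def more_like(example, pool, k=1):
    # "Show me more scenes like this one": nearest neighbors in feature space.
    return sorted(pool, key=lambda s: dist(s.vec(), example.vec()))[:k]

def centroid(group):
    return tuple(sum(xs) / len(xs) for xs in zip(*(s.vec() for s in group)))

seeds = [s for s in scenes if panel_query(s)]         # query finds a and b, misses d
rest = [s for s in scenes if s not in seeds]
candidates = more_like(seeds[0], rest)                # surfaces d for review
positives = seeds + candidates                        # pretend a human confirmed them
negatives = [s for s in rest if s not in candidates]  # and rejected the others

# Stand-in "classifier": assign each scene to the closer class centroid.
pos_c, neg_c = centroid(positives), centroid(negatives)
predicted = [s.scene_id for s in scenes if dist(s.vec(), pos_c) < dist(s.vec(), neg_c)]
```

Scene `d` has six people, so the hand-written query misses it, but the similarity step pulls it in and the trained rule then picks it up. In practice the “inspect its results” step is where you’d spot the mistakes and go back to refine the query or the labels.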
Edit: also, I’m sure there are a billion startups that do some variant of this for some domain, but I think we really need a better open-source ecosystem around visualization and labeling of data for this workflow to be truly accessible in most domains.