Original article: Imagining the Evidence
I am pleased to inform both Mr. Ball and the world that the “next level” of TAR, meaning the use of whole documents and populations rather than selected seed sets, is already here and doing fine. Rules-Based approaches to TAR are not constrained by the need to create and perfect a seed set. Instead, they apply their algorithms across the entire population, all at once, on every iteration. There is no need for any exemplar document, because the rule itself is the exemplar: any document can be evaluated for its “exemplary-ness,” to what degree, and where and when.
Furthermore, Mr. Ball discusses the thorny issue of self-interested collection and seed set tagging. He suggests that the opposing party should be the one to set the seed set tags in motion. This is a step in the right direction. But the best approach would be to have the producing and opposing parties work together to determine relevance, an option easily afforded by a Rules-Based approach. Instead of one party sitting down to hand-craft a seed set, both sides can agree on the RULES of responsiveness rather than on whether this document or that one is the better exemplar. With agreed-upon rules in place, documents are easily assessed not just for yes/no relevance, but also for their degree of relevance.
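To make the idea concrete, here is a minimal sketch of what scoring a whole population against agreed rules might look like. It is not any vendor's implementation: the `Rule` class, the regular-expression patterns, the weights, and the sample documents are all hypothetical, and a real Rules-Based system would use far richer rules than keyword patterns.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A hypothetical responsiveness rule both parties have agreed on."""
    name: str
    pattern: str   # regular expression, matched case-insensitively
    weight: float  # relative importance assigned by the parties


def score_document(text, rules):
    """Return (score between 0 and 1, names of the rules that fired)."""
    total = sum(r.weight for r in rules)
    fired = [r for r in rules if re.search(r.pattern, text, re.IGNORECASE)]
    score = sum(r.weight for r in fired) / total if total else 0.0
    return score, [r.name for r in fired]


def score_population(docs, rules):
    """Apply every rule to every document; no seed set is involved."""
    return {doc_id: score_document(text, rules) for doc_id, text in docs.items()}


# Illustrative rules and documents (assumptions, not from the article).
RULES = [
    Rule("mentions_contract", r"\bcontract\b", 2.0),
    Rule("mentions_termination", r"\bterminat\w*", 1.0),
    Rule("mentions_penalty", r"\bpenalt\w*", 1.0),
]

DOCS = {
    "doc1": "The contract allows termination with a penalty.",
    "doc2": "Lunch menu for Friday.",
    "doc3": "Please review the contract draft.",
}

results = score_population(DOCS, RULES)
for doc_id, (score, fired) in results.items():
    print(doc_id, score, fired)
```

Each document gets a graded score plus the list of rules that fired, so the parties can argue about the rules and weights themselves rather than about individual exemplar documents.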
Finally, the notion of “imagining” the documents is very much alive and well in the field of Data Visualization. We often use this technique in a descriptive way (here’s what your data shows), but it can just as well be used in a prescriptive way (is there anything that looks like this? How close?). This concept is closely connected to the current practice of iterating for performance optimization (that is, trading off precision and recall). TAR systems that utilize the notion of DocType or Attribute templates already have the concept of a “generic” or “iconic” version, essentially an exemplar. It is trivial to create more templates and use them hierarchically, in order of relevance priority, to test how closely a candidate document matches the generic exemplars.
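As a rough illustration of the template idea, the sketch below walks a set of templates in priority order and reports the first one a document matches closely enough. Everything here is assumed for the example: the `Template` class, the attribute names, the overlap-based match measure, and the 0.5 threshold are hypothetical and do not describe how any particular TAR product defines DocType or Attribute templates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Template:
    """A hypothetical "iconic" document described by its attributes."""
    name: str
    priority: int          # lower number = higher relevance priority
    attributes: frozenset  # attributes the generic exemplar would have


def match_degree(doc_attrs, template):
    """Fraction of the template's attributes present in the document."""
    if not template.attributes:
        return 0.0
    return len(template.attributes & doc_attrs) / len(template.attributes)


def best_template(doc_attrs, templates, threshold=0.5):
    """Check templates in priority order; return the first close-enough match."""
    for template in sorted(templates, key=lambda t: t.priority):
        degree = match_degree(doc_attrs, template)
        if degree >= threshold:
            return template.name, degree
    return None, 0.0


# Illustrative templates (assumptions, not from the article).
TEMPLATES = [
    Template("signed_agreement", 1, frozenset(
        {"has_signature", "has_parties", "has_effective_date", "has_terms"})),
    Template("email_about_agreement", 2, frozenset(
        {"has_sender", "has_recipient"})),
]

partial = best_template({"has_parties", "has_terms", "has_sender"}, TEMPLATES)
email = best_template({"has_sender", "has_recipient"}, TEMPLATES)
nothing = best_template(set(), TEMPLATES)
print(partial, email, nothing)
```

The `threshold` parameter is where the precision and recall trade-off shows up in this sketch: raising it admits only close matches, while lowering it pulls in more documents that merely resemble the exemplar.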