The transcript, the articles, and the subsequent brouhaha are overblown. It is obvious that the parties are not really debating the efficacy of predictive coding, but rather the proper way to implement it, especially when there are changes to the population volume or to the relevancy specifications. What is interesting to all of us is this:
Why is there so much wrangling over the size of the seed set and the need to re-seed it with changes?
How should TAR systems and workflows handle such changes? What are the differences between changes to document volume (count) and changes to review specifications?
Why has this ignited the blogosphere, to the point of blatant fact distortion?
There is considerable wrangling over the seed set for two reasons: comfort and control. Judge Peck alludes to the first with his distinction between a statistical sample and a "comfort sample." (His words.) The client-service folks in the audience are laughing at this notion, as they know full well the mistrust their clients have in "the numbers," while the statisticians present are utterly confounded. Surely the statistical sample is in fact the most comfortable sample; anything else would be downright uncomfortable!
And so we get to the second issue: control. By registering vague, ill-defined discomfort with the numbers, the attorneys regain some control by playing on fear. What don't we see? What didn't we get? As Judge Peck points out, it's not about the minuscule "misses," but rather the overall "sure-fires" and the ability to build a case around those.
One of the big problems in this matter is the conflation of two separate types of change: changes in population size (adding documents) and changes in relevancy scope. These changes are completely independent of one another and should be managed separately. A size change affects the random sample, so the sample should be regenerated each time new documents are added. A scope change affects the seed-set coding, so the coding should be regenerated as well to reflect the current, up-to-date specifications. In fact, the moment the seed-set coding falls out of step with the current specs, it is obsolete.

A far better TAR workflow than drawing a random seed set and tagging it once is to assume that change is part of the litigation and build in mechanisms to manage it. For starters, the seed set should be stratified according to the attributes of the then-current document population, rather than purely random, and it should be regenerated every time documents are added or removed. Next, a transparent, easy-to-edit ruleset should be used to track spec changes and show how documents change their tags over time. Finally, TAR systems should be priced so that there is no disincentive to making such changes easily.
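To make the regenerate-rather-than-patch idea concrete, here is a minimal Python sketch. It is not any vendor's implementation: the document fields ("custodian", "text"), the keyword-style relevancy rules, and the function names are all hypothetical stand-ins. The point is simply that the seed set is re-drawn, stratified against the then-current population, and the coding is re-derived from a versioned ruleset whenever volume or scope changes.

```python
import random
from collections import defaultdict

def stratified_seed_set(documents, strata_key, seed_size, rng=None):
    """Draw a seed set whose strata mirror the current document population."""
    rng = rng or random.Random(0)  # fixed seed only for reproducibility of the example
    strata = defaultdict(list)
    for doc in documents:
        strata[doc[strata_key]].append(doc)
    total = len(documents)
    seed = []
    for docs in strata.values():
        # Each stratum gets a share of the seed set proportional to its
        # share of the population (at least one document per stratum).
        quota = max(1, round(seed_size * len(docs) / total))
        seed.extend(rng.sample(docs, min(quota, len(docs))))
    return seed

def apply_ruleset(documents, rules):
    """Re-tag every document against the current, versioned relevancy rules."""
    for doc in documents:
        doc["relevant"] = any(rule(doc) for rule in rules)
    return documents

# Hypothetical population; "custodian" and "text" are illustrative fields.
population = [
    {"id": 1, "custodian": "A", "text": "quarterly forecast"},
    {"id": 2, "custodian": "A", "text": "lunch order"},
    {"id": 3, "custodian": "B", "text": "forecast revision"},
]

# Version 1 of the spec: anything mentioning "forecast" is relevant.
rules_v1 = [lambda d: "forecast" in d["text"]]
seed = stratified_seed_set(apply_ruleset(population, rules_v1), "custodian", seed_size=2)

# New documents arrive and the spec changes: regenerate, don't patch.
population.append({"id": 4, "custodian": "C", "text": "revision memo"})
rules_v2 = rules_v1 + [lambda d: "revision" in d["text"]]
seed = stratified_seed_set(apply_ruleset(population, rules_v2), "custodian", seed_size=2)
```

Because the seed set and its coding are both derived from the current population and the current ruleset, a change to either input simply triggers a regeneration rather than an argument over the original sample.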
And finally, why all the hoopla? Because we all know that, deep down, this is where it is all going. In some ways we are rooting for Recommind (even though we consider ourselves a competitor) because TAR is the better solution: lower cost, faster, and more accurate. Now, if we could be smart about sampling, seed sets, and the inevitable changes to spec and volume, we'd really have something to shout about.