Monday, November 12, 2012

Redaction? There’s an App for That

Raise your hand if you've either participated in or managed a group of contract attorneys sitting in rows of cubicles (or "stations") redacting documents for production. Whether your tool of choice was a black marker or a cursor, I'm betting it was a thankless, tedious, painstaking job and you hated it. Wouldn’t it have been nice if you could've taught the computer to recognize the patterns of PII (personally identifiable information) and have it make the redactions itself?

Well, guess what: Valora heard your anguished cries, and we've built an AutoRedaction engine that rivals any group of manual redactors. With blazing speed, impressive accuracy and astounding savings, our PowerHouse™ system autoredacts paper and ESI documents in seconds.

Capitalizing on our extensive experience with pattern-matching technology[1], Valora has built a custom software program that automatically determines the presence of sensitive PII, confidential or privileged information, and then redacts that information on the image. Each redaction takes the form of a black block, with or without a representative stamp, such as "Redacted" or "Employee 123." Redactions can be made permanent, such as for production purposes, or kept temporary with a "lift and peek" technique, when desired. Redactions can also be applied to the underlying text, or to both text and image, if desired.

[1] For more on Probabilistic Hierarchical Context-Free Grammars, see this link on Google Scholar.
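For readers curious what pattern-based PII detection looks like in miniature, here is a toy sketch. The patterns and stamp format are invented for illustration; Valora's actual engine uses grammar-based recognizers, not simple regexes.

```python
import re

# Toy illustration of pattern-based PII redaction (not Valora's actual
# engine): simple regexes stand in for grammar-based recognizers.
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "Phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "Email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def redact(text, stamp="Redacted"):
    """Replace each PII match with a labeled redaction stamp."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{stamp}: {label}]", text)
    return text

print(redact("Call 555-867-5309 or mail jdoe@example.com; SSN 123-45-6789."))
```

In a production setting the "stamp" would be burned onto the image rather than the text, but the recognition step is the same idea.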

Thursday, November 8, 2012

Statistical Pattern Matching Accurately Predicts Presidential Winners and Electoral College Counts. Why Not Privilege and Responsiveness in Litigation?

The technology utilized by political statisticians is finally getting the attention it deserves.  Not because it is partisan, but because it is accurate.  The excellent article in today’s LA Times explains how mathematical models predicted the election outcome well before the first polls had opened. How? By taking the information from numerous sample sets and re-modeling over and over again with different assumptions and weightings. If this sounds a lot like statistical sampling and pattern-matching, then you have been paying attention! The techniques used by the Nate Silvers of the world to classify and label voting patterns are being used right now in litigation to “predict” (or diagnose, if you prefer) for privilege, responsiveness and issues.

At Valora, we call this technique Probabilistic Hierarchical Context-Free Grammars, but others have shortened it to Statistical Pattern Matching, which works just fine. The point is that information about documents (or voter behavior or music choices) has been available for a long time. The only missing piece is the human comfort level with statistics and probabilistic systems.
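To make the resample-and-reweight idea concrete, here is a toy simulation with entirely made-up polls and weights. It is not any real forecasting model (or Valora's grammar engine), just the core loop: re-model many times under perturbed assumptions and report how often each outcome wins.

```python
import random

# Toy illustration of statistical re-modeling: aggregate noisy "polls" by
# redrawing weights and jitter many times, then report the win frequency.
def simulate(polls, trials=10_000, noise=0.03, seed=42):
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        # Each trial draws a fresh weight per poll and jitters its margin.
        weighted = sum(rng.random() * (margin + rng.gauss(0, noise))
                       for margin in polls)
        if weighted > 0:          # positive total margin = candidate A wins
            wins += 1
    return wins / trials

# Hypothetical polls showing candidate A ahead by 1-4 points (as fractions).
print(simulate([0.02, 0.01, 0.04, 0.03]))
```

Swap "polls" for document features and "candidate A wins" for "privileged," and you have the shape of the litigation application.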

If the statisticians can call elections, baseball winners and consumer preferences, isn’t it time we let them loose onto document analysis and review? If you’d like a primer on or a demonstration of Probabilistic Hierarchical Context-Free Grammars in litigation, contact us at

Thursday, August 9, 2012

Valora Technologies CEO, Sandra Serkes, Responds to Craig Ball’s LTN Article on “Next Level” Technology Assisted Review

Original article: Imagining the Evidence

I am pleased to inform both Mr. Ball and the world that the “next level” of TAR, meaning the use of whole documents and populations, rather than selected seed sets, is already here and doing fine.  Rules-Based approaches to TAR are not constrained by the need to create and perfect the selection of a seed set.  Instead, they apply their algorithms and iterations across the entire population, at once, each time.  There is no need for any exemplar document, as the exemplar is the rule itself – thus any document can be evaluated for its “exemplary-ness” and to what degree, where and when.

Furthermore, Mr. Ball discusses the thorny issue of self-interested collection and seed set tagging.  He suggests the opposing party should be the one to set the seed set tags into motion.  This is a step in the right direction.  But the best approach would be to have the producing and opposing parties working together to determine relevance, an option easily afforded by a Rules-Based approach.  Rather than having any one party sit down and hand-craft a seed set, both sides can agree on the RULES of responsiveness, rather than on whether this document or that one is the better exemplar.  With agreed-upon rules in place, documents are easily assessed not just for yes/no relevance, but also to what degree.

Finally, the notion of “imagining” the documents is very much alive and well in the field of Data Visualization.  We often use this technique in a descriptive way (here’s what your data shows), but it can also very much be used in a prescriptive way (is there anything that looks like this?  How close?).  This concept is closely connected to the current practice of iterating for performance optimization (aka: trading off precision and recall).  TAR systems that utilize the notion of DocType or Attribute templates already have the concept of a “generic” or “iconic” version, essentially an exemplar.  It is trivial to create more templates and use them in a hierarchical manner to test how closely a potential document matches the generic exemplars, by relevance priority.
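As a rough sketch of the template idea (the labels, template text and threshold here are invented for illustration, not any vendor's implementation), a document can be scored against generic exemplars by bag-of-words similarity and assigned to the first template, in priority order, that it matches closely enough:

```python
from collections import Counter
from math import sqrt

# Hypothetical sketch of matching documents against "generic exemplar"
# templates; real TAR systems use far richer features than word counts.
def cosine(a, b):
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = (sqrt(sum(v * v for v in ca.values())) *
            sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0

# Templates ordered by relevance priority; first match above threshold wins.
TEMPLATES = [
    ("invoice", "invoice amount due payment total remit"),
    ("memo", "memo to from re subject date"),
]

def classify(doc, threshold=0.2):
    for label, template in TEMPLATES:
        score = cosine(doc, template)
        if score >= threshold:
            return label, round(score, 2)
    return "unmatched", 0.0

print(classify("Please remit payment for the total amount due on this invoice"))
```

The score, not just the label, is the point: it says *how much* a document resembles the exemplar, which supports the "to what degree" assessment discussed above.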

Thursday, July 19, 2012

Valora Technologies CEO, Sandra Serkes, Invited to Speak to ILTA South Pacific Region About Technology Assisted Review

ILTA Program to Cover "Exploring Predictive Coding and Technology Assisted Review: Valora Technologies' Approach"
July 25, 2012, 6:00pm EST

During her presentation, Ms. Serkes will lay out the Technology-Assisted Review (TAR) landscape and Valora Technologies' overall approach.  With Document Review the most costly phase of ediscovery, there are considerable savings to be achieved when some or most of that review process can be automated in a reliable and defensible manner.  This is the motivation behind the hype about predictive coding and TAR. Unlike most of the solutions on offer in this market space, Valora Technologies' approach is Rules-Based and transparent.  Come learn about the developing TAR landscape and one provider's unique vision for this space.

Remote & physical presentation sign-up now open.

Friday, May 4, 2012

3 Drawbacks To Predictive Coding

Valora’s Response to LTN article: Take Two: Reactions to 'Da Silva Moore' Predictive Coding Order

What is missing there, and elsewhere, is a discussion of the specific weaknesses of the overall Predictive Coding technique.  Here are just three drawbacks of the technique: 
  1. PC tagging algorithms are not transparent.  No one really knows why the PC engine "chose" the documents it did.  Typically, the “choosing” algorithm is hidden and not disclosed.  All we know is that the documents it selected are somehow a lot like the ones already tagged.
  2. PC has no checks or balances on the skill set, education, consistency or motivations of the seed set coder(s).  The entire Predictive Coding approach assumes that the seed set coder(s) know what they are doing, and that they are correct, consistent and honest.  Would you defend that position, particularly given that the “human being as gold standard" concept has been roundly deflated (see Blair & Maron, Grossman, TREC, etc.)?
  3. Typically, seed set creation and audit sampling for PC use a random sampling technique, the weakest of all types.  Other sampling techniques (stratified, cluster, panel, etc.) are aware of document attributes and utilize intelligent groupings to create a much stronger, more representative sample for seed set coding and auditing purposes.
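For the curious, here is a minimal sketch of proportional stratified sampling, using a made-up "doctype" attribute as the stratification key. The document counts and attribute are invented for illustration.

```python
import random
from collections import defaultdict

# Minimal sketch contrasting stratified with purely random sampling:
# documents are grouped by an attribute, and the sample draws from every
# stratum in proportion, so no document type is missed entirely.
def stratified_sample(docs, key, n, seed=0):
    rng = random.Random(seed)
    strata = defaultdict(list)
    for doc in docs:
        strata[doc[key]].append(doc)
    sample = []
    for group in strata.values():
        # Proportional allocation, with at least one draw per stratum.
        k = max(1, round(n * len(group) / len(docs)))
        sample.extend(rng.sample(group, min(k, len(group))))
    return sample

docs = ([{"id": i, "doctype": "email"} for i in range(90)] +
        [{"id": i, "doctype": "contract"} for i in range(90, 100)])
sample = stratified_sample(docs, "doctype", n=10)
print(sorted({d["doctype"] for d in sample}))
```

A purely random 10-document draw from this population could easily contain zero contracts; the stratified draw cannot.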

Since all Predictive Coding solutions at present are products, with limited functionality and flexibility for specific case matters, perhaps we should be thinking about the broader picture of Technology-Assisted Review (TAR) as a service: customizable, measurable and transparent.

Wednesday, April 4, 2012

What’s the Difference Between Automated Review and Predictive Coding?

Automated review and predictive coding are often mentioned in the same breath, as synonyms for each other. They are actually different concepts. Predictive coding, in which a topic expert manually codes a "seed set" of documents and the software follows suit, is one type of automated review. There are two other types.

A second approach to automated review is called Rules-Based Coding, in which a set of rules is created to direct how documents should be coded, very similar to a Coding Manual or a Review Memo that might be prepared for a group of on- or off-shore contract attorneys. The preparation of the Ruleset is typically done by some combination of topic experts, attorneys and technologists. The rules are run on the document population and it is evaluated, tweaked and run again until all parties are satisfied.
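A minimal sketch of what a ruleset might look like in code (the rule names, predicates and weights below are invented for illustration, not Valora's actual format): each rule is a named, inspectable test, every document in the population is scored on every run, and the output shows exactly which rules fired.

```python
# Hypothetical Rules-Based coding pass: named predicates with weights,
# applied to the whole population each run -- no seed set required.
RULES = [
    ("mentions_merger", lambda d: "merger" in d.lower(), 3),
    ("from_counsel", lambda d: "privileged" in d.lower(), 5),
    ("boilerplate", lambda d: "unsubscribe" in d.lower(), -4),
]

def code_document(doc):
    """Return (score, fired_rules) so the result is fully transparent."""
    fired = [(name, w) for name, test, w in RULES if test(doc)]
    return sum(w for _, w in fired), [name for name, _ in fired]

score, fired = code_document("Privileged and confidential: notes on the merger.")
print(score, fired)
```

Tweaking the ruleset and re-running over the full population is cheap, which is what makes the evaluate-tweak-run-again loop practical.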

The third approach to automated review is called Present & Direct, in which software takes a first, unprompted assessment of the documents and puts forth a graphical representation (pretty charts and diagrams) of what the data contains. This is sometimes called Early Case Assessment or Data Visualization. Once data analysis is presented, the reviewer "informs" the software what he/she wants by batch-tagging key document groupings.
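The present-then-direct flow can be sketched in a few lines (the grouping key and tags are invented for illustration): the software groups the population unprompted, "presents" a summary, and the reviewer batch-tags whole groupings at once.

```python
from collections import defaultdict

# Hypothetical Present & Direct sketch: group, summarize, batch-tag.
def group_by(docs, key):
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[key]].append(doc)
    return groups

def batch_tag(group, tag):
    for doc in group:
        doc["tag"] = tag

docs = [{"id": 1, "custodian": "CEO"}, {"id": 2, "custodian": "HR"},
        {"id": 3, "custodian": "CEO"}]
groups = group_by(docs, "custodian")
print({k: len(v) for k, v in groups.items()})   # the "presentation" step
batch_tag(groups["CEO"], "review-first")        # the reviewer "directs"
```

In a real tool the presentation step is the pretty charts and diagrams; the batch-tag step is the reviewer clicking a grouping rather than a document.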

All of these techniques are variations of one another and each has its strengths and weaknesses for use in different types of matters and circumstances. (A topic for a future blog post, clearly!) The point here is to recognize that Predictive Coding DOES NOT EQUAL Automated Review; it is simply one of several techniques to accomplish it.

Friday, February 17, 2012

Valora's position on Judge Peck's Commentary in the Da Silva Moore Transcript

The transcript, articles and subsequent brouhaha are overblown. It is obvious that the parties are not really debating the efficacy of predictive coding, but rather the proper way to implement it, particularly when there are changes to the population volume or to the relevancy specifications. What is interesting to all of us is this:

Why is there so much wrangling over the size of the seed set and the need to re-seed it with changes?

How should such TAR systems and workflow handle changes? What are the differences between changes to document volume (count) and changes to review specifications?

Why has this ignited the blogosphere, to the point of blatant fact distortion?

There is considerable wrangling over the seed set for two reasons: control and comfort. Judge Peck alludes to the second with his distinction between a statistical sample and a "comfort sample." (His words.) The client service folks in the audience are laughing at this notion, as they know full well the mistrust their clients have in "the numbers," while the statisticians present are utterly confounded. Surely the statistical sample is in fact the most comfortable sample; anything else would be downright uncomfortable!

And so we get to the second issue: control. By registering vague, ill-described discomfort with the numbers, the attorneys regain some control by playing on fear. What don't we see? What didn't we get? As Judge Peck points out, it's not about the minuscule "misses," but rather the overall "sure-fires" and the ability to build a case around those.

One of the big problems in this matter is the conflation of two separate types of change: changes in population size (adding documents) and changes in relevancy scope. These changes are completely independent of one another and should be managed separately. A size change affects the random sampling, so the sample should be regenerated each time new docs are added. A scope change affects the seed set coding, so the coding should be regenerated as well to reflect current, up-to-date specifications. In fact, the seed set coding should never be out of step with current specs, or it is obsolete.

A far better TAR workflow design than a randomly sampled, fixed-tagged seed set is to assume that change is part of the litigation and build in mechanisms to manage it. For starters, the seed set should be stratified per the attributes of the then-current document population, rather than random, and regenerated every time docs are added or removed. Next, a transparent, easy-to-edit ruleset should be used to track spec changes and show them over time as documents change their tags. Finally, TAR systems should be priced so that there is no disincentive to make such changes easily.
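That change-tolerant workflow can be sketched in a few lines (a hypothetical design for illustration, not any product): volume changes and spec changes each trigger a full re-tag, so the tags are never stale.

```python
# Hypothetical change-tolerant TAR workflow: document additions and rule
# (spec) changes are handled independently, and both trigger re-tagging
# over the full, current population so tags never go out of step.
class TarWorkflow:
    def __init__(self, rules):
        self.docs, self.rules, self.tags = [], rules, {}

    def add_documents(self, docs):
        self.docs.extend(docs)
        self.retag()                    # volume change: rebuild over full set

    def update_rules(self, rules):
        self.rules = rules
        self.retag()                    # spec change: tags never go stale

    def retag(self):
        self.tags = {d["id"]: any(r(d) for r in self.rules)
                     for d in self.docs}

wf = TarWorkflow(rules=[lambda d: "merger" in d["text"]])
wf.add_documents([{"id": 1, "text": "merger talks"}, {"id": 2, "text": "lunch"}])
wf.update_rules([lambda d: "lunch" in d["text"]])
print(wf.tags)
```

Because re-running is cheap, there is no structural reason to resist adding documents or revising the spec mid-matter, which is exactly the disincentive the pricing point above is about.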

And finally, why all the hoopla? Because we all know that deep down this is where it is all going. In some ways we are rooting for Recommind (even if we would consider ourselves a competitor) because TAR is the better solution: lower cost, faster and more accurate. Now, if we could be smart about sampling, seed sets and inevitable changes to spec and volume, we'd really have something to shout about.