It is startling to learn that a dataset that has been public for years contains over 7,500 instances of unredacted social security numbers, credit card numbers, dates of birth, home addresses and phone numbers. But that is precisely the claim of John Martin, the CEO and founder of BeyondRecognition.
The EDRM Enron Email Data Set v2 (EDRM Data) is a collection of documents originally gathered by the Federal Energy Regulatory Commission (FERC) as part of its investigation of Enron's energy trading practices and then made public by FERC. The EDRM Data is a reworked version of the original documents that was available for download from EDRM's website for an extended period of time. It has since been transferred to Amazon Web Services for downloading, though there is a link from EDRM to the download site.
Why have so many people and teams worked with the data for years without discovering all the personally identifiable information (PII)? EDRM teams worked with it. The NIST-sponsored Text Retrieval Conference (TREC) Legal Track for 2010 and 2011 used that data set. Teams from around the world used it.
BeyondRecognition acknowledges that it worked with the data set for several months without checking for PII - the discovery was accidental, made while testing its mass redaction tool.
Let me make it clear that I understand that publishing a post on this issue serves the business interests of BeyondRecognition - John Martin himself is quick to point that out. Whatever the motivation, there is a lot of sound advice in the post about identifying and removing PII.
Putting motivation to one side, it is a real issue that publication of this data set necessarily meant that a data breach had taken place and it is astonishing that no one ever checked for PII. EDRM, in an e-mail I have seen, acknowledges that it is aware of the PII content and is working with an EDRM partner to make "a PII clean" version of the data available via EDRM.
But if the data breach was known, why were the proper authorities not notified?
At this point, BeyondRecognition has notified EDRM, FERC, Amazon Web Services, the FTC and the Texas Attorney General.
Let me hasten to add that I have no personal knowledge of what has gone on here - but it is disturbing that no one looked for PII when the data was made available and that no one (other than John Martin) reported the breach.
So . . . if there is something I am missing - if other players on this stage wish to have a voice, I urge them to write a measured response, which I'll be happy to post.