privacy preserving data mining
Image by dekstop via Flickr

Late last month, The Economist set off a little thought bomb modestly titled “New rules for big data”. The article laid out all the various entrenched assumptions standing in the way of thoughtful, relevant information policy in the age of ever larger and more critical data sets. Put more simply, in a world where a Library of Congress is a fungible measure of data, we have to rethink how we protect and traffic information.

Backupify doesn’t yet handle multiple Libraries of Congress’ worth of data, but we want to get there, and that means being forward-looking. Hence the provisional crowdsourcing of our privacy policy, which begins to confront some of these areas of concern.

One of the more interesting points in the article was an argument against data retention:

“Current rules on digital records state that data should never be stored for longer than necessary because they might be misused or inadvertently released. But Viktor Mayer-Schönberger of the National University of Singapore worries that the increasing power and decreasing price of computers will make it too easy to hold on to everything. In his recent book ‘Delete’ he argues in favour of technical systems that ‘forget’: digital files that have expiry dates or slowly degrade over time.

“Yet regulation is pushing in the opposite direction. There is a social and political expectation that records will be kept, says Peter Allen of CSC, a technology provider: ‘The more we know, the more we are expected to know—for ever.’ American security officials have pressed companies to keep records because they may hold clues after a terrorist incident. In future it is more likely that companies will be required to retain all digital files, and ensure their accuracy, than to delete them.”

Sci-fi author and quasi-futurist Charles Stross has written repeatedly on the danger of reliance on databases for predictive analysis, if only because databases are routinely riddled with data errors. Bruce Schneier has written on the difficulty of purging your data from cloud-based systems, largely because the cloud companies don’t want to give up on all that delicious data-mining fodder.

This would seem to make the case that retaining data is more dangerous than deleting it. But is this an argument against databases, or an argument for better databases and better data policies? If government is going to require companies to maintain data indefinitely, shouldn’t those companies be required to maintain the accuracy and integrity of those databases? Shouldn’t the government require the same of itself? If we’re going to maintain a no-fly list, or e-mail blast address books, shouldn’t the agencies and organizations using them be under a legal obligation to make those databases at least 95 percent (or, to my mind, 99.999 percent) accurate?

More to the point, if an audit of a database shows the data to be less than 95 percent accurate — your mailing list produces a delivery error on more than one out of every twenty sends — you’re obligated to either upgrade or purge, period.
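To make that rule concrete, here is a minimal sketch of the audit check in Python. The names (AuditResult, audit_verdict) and the idea of counting delivery errors per send are my own illustration, not an existing Backupify tool or any regulator's method. Fractions keep the 95 percent comparison exact at the boundary.

```python
from dataclasses import dataclass
from fractions import Fraction

# The 95 percent floor proposed above; swap in Fraction(99999, 100000) for five nines.
ACCURACY_THRESHOLD = Fraction(95, 100)


@dataclass
class AuditResult:
    """Hypothetical audit summary: records checked and records found bad."""
    total: int
    errors: int

    @property
    def accuracy(self) -> Fraction:
        # An empty audit proves nothing, so treat it as a failing score.
        if self.total == 0:
            return Fraction(0)
        return Fraction(self.total - self.errors, self.total)


def audit_verdict(result: AuditResult,
                  threshold: Fraction = ACCURACY_THRESHOLD) -> str:
    """Return "keep" if accuracy meets the threshold, else "upgrade_or_purge"."""
    return "keep" if result.accuracy >= threshold else "upgrade_or_purge"
```

Under these assumptions, a mailing list with exactly one bad address per twenty sends still passes, and one more bounce tips it into upgrade-or-purge territory.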

That said, is such an edict enforceable? It would seem that storage capacity is expanding (or, rather, dropping in price) faster than computing power, so we’re going to be able to store more data than we’re able to parse and maintain effectively. This argues for a classic, analog data retention policy — any record that hasn’t been updated after a certain period (the IRS says seven years) should be purged.

Still more complicated: not all records are created equal. Financial records are the sort of data that might need to be kept indefinitely, especially for organizations of a certain size, or any outfit that’s publicly traded. The same goes for any government transactional records. There is a public accountability stake in those records.

Conversely, address records (of the physical, IP, URL or e-mail variety) that haven’t been updated in perhaps three years may not be worth keeping. Personal finances likely qualify here as well.
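A tiered retention policy like the one described above could be sketched roughly as follows. The record type names and the lookup table are hypothetical. The windows are just the figures floated in this post (keep financial and government records indefinitely, three years for addresses and personal finances, seven years otherwise), not legal guidance.

```python
from datetime import datetime, timedelta

# Hypothetical retention windows in years; None means "keep indefinitely".
RETENTION_YEARS = {
    "financial": None,
    "government_transaction": None,
    "address": 3,
    "personal_finance": 3,
    "default": 7,
}


def should_purge(record_type: str, last_updated: datetime, now: datetime) -> bool:
    """Return True if a record has sat untouched longer than its window."""
    years = RETENTION_YEARS.get(record_type, RETENTION_YEARS["default"])
    if years is None:
        return False  # public-accountability records are never auto-purged
    # Approximates a year as 365 days, which is close enough for a sketch.
    return now - last_updated > timedelta(days=365 * years)
```

The point of the table is that the argument over what to keep becomes an argument over a handful of numbers, which is at least something a legislator and an engineer can both read.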

At Backupify, we believe you own your data, not us. We’re the bank, but the money is yours. Withdraw it whenever you like, and it’s gone from our ledgers forever. But government may have something to say about those data policies (and everyone else’s) very soon.

Suffice it to say, I would not want to be the one crafting data compliance legislation right now. I welcome your insights into this issue in the comments section.


Very Private Privacy Policy
Image by Mot via Flickr

Last Saturday, the New York Times ran a piece on the inadequacy of modern Internet privacy practices. In short, the “opt-in to a byzantine privacy EULA” approach is universally reviled, and doesn’t begin to address the myriad levels of granular privacy control that many users expect these days, or the myriad privacy loopholes that few are aware even exist.

Backupify has a fairly brief and straightforward privacy policy, as noted in our FAQ:

We don’t do anything with your data once it is backed up. We don’t look at it, we don’t sell it, we don’t analyze it, we don’t modify it. Our privacy policy is that you own your data and you should be in control. We don’t own your data, we just provide software to give you more control over your stuff. We charge for our service, so we never have to resort to analyzing your data so that we can sell advertising against it or anything like that. You will never get email from us unless you opt-in for it.

Backupify was started on the premise that your data is yours and you should not leave it locked up in all of these online systems. We believe strongly in freedom and privacy.

Personally, I think that’s a pretty clear, comprehensible and reliable privacy policy.

But here’s where it gets complicated: Users have asked us for a search interface for their backups, so they can find certain items within their data archive without downloading and parsing the data themselves, very much like the Navel-Gazer Self-Search we joked about on Monday. The features in that post were snarkily named, but they all contain an element of truth in that they represent functionality somebody has asked for.

So how do we index your data archive when we promise not to look at it? How do we build the E-marketer Goggles we suggested without analyzing your data? Even something as simple as a data retention policy — in which users have asked to delete part of their data archives after a certain retention period — would require that we check the timestamps on certain data elements before purging them. This gets even more complicated when we’re backing up version histories, such that we don’t want to purge older stuff, like WordPress themes, that’s still current.
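To show why even the "simple" retention case forces us to look at something, here is a rough sketch of a purge that respects version histories. Everything in it (the Version record, the item_id and saved_at fields) is a hypothetical model, not how Backupify actually stores backups. Note that it reads only metadata, never the backed-up content itself, which is exactly the distinction a revised privacy policy would have to spell out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List


@dataclass(frozen=True)
class Version:
    """Hypothetical metadata for one saved version of a backed-up item."""
    item_id: str
    saved_at: datetime


def versions_to_purge(versions: List[Version],
                      retention: timedelta,
                      now: datetime) -> List[Version]:
    """Pick versions older than the retention window, but never an item's newest one."""
    # Find the most recent version of each item, e.g. the current WordPress theme.
    latest: Dict[str, Version] = {}
    for v in versions:
        if v.item_id not in latest or v.saved_at > latest[v.item_id].saved_at:
            latest[v.item_id] = v

    cutoff = now - retention
    return [
        v for v in versions
        if v.saved_at < cutoff and latest[v.item_id] is not v
    ]
```

With a one-year window, a theme last saved fourteen months ago survives because it is still the current version, while the copy it replaced is eligible for deletion.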

Add in that some users have asked for these features and also asked for complete encryption control over the data archive (I’m not sure how we’re supposed to build a search index for your data if we can’t decrypt it) and you see where even our own well-intentioned, direct privacy policy starts to look inadequate.

So I’m throwing this to the community at large: How would you write Backupify’s privacy policy? What clauses should it contain? Is it all or nothing, or do you opt-in by feature? Your feedback may well alter the very future of Backupify. Seriously.


Back to the Future DeLorean Time Machine
Image by AdamL212 via Flickr

What services will Backupify offer two years from now? Hard to say. We’ve got lots of cool ideas, some crazier than others. Here’s a listing of some of the wackier concepts that may or may not have come up during brainstorming sessions. Please feel free to weigh in on your favorites, and don’t forget to relate your requests to Backupify’s core business.

Navel-Gazer Self-Search: A meta-search of all the Backupify services you back up. Find keywords or links regardless of where you posted them. So if you went on a Facebook-Twitter-and-Google-Buzz tirade about the latest Apple product rumor (the iPad 4.0 has a 3D videochat cam, and for only $1699!) you can sort through all your posts to find out exactly where you put that link to the leaked prototype pics.

“The Party’s Over Here Now” Profile Migrator: The Facebook Killer has finally arrived (Orkut? Really?) and you don’t want to drag all the photos and videos and friend lists and birthdays over to the new hotness by hand? No worries; Backupify has a wizard that will let you pick and choose which bio (Twitter, Facebook, LinkedIn, or Google Profile) you like best, and which friends from each service you want to slide over, along with an easy way to import all the multimedia files you’ve got jammed all over the place. And in 18 months, when the next new social network fad pops up, you can do it all over again, only easier.

The “Eternal Sunshine” Machine: Got a relationship (or a weekend) you’d like to erase from your online memory? A few keyword and profile selections, and we’ll delete every connection to your ex that’s out in public, while retaining the unpurged records in our archives (just in case he apologizes and you decide to patch it up).

Twitter Court Reporter: Give us a list of Twitter usernames and a general time window, and we’ll narrow down your tweetstream to just the conversations you had with these individuals, then output the whole confab as a searchable PDF.

E-Marketer Goggles: We’ll run a general keyword analysis over your collective social profiles and determine which products and services marketers (who buy data from Twitter, Facebook and Google for exactly these purposes) think you’re apt to buy. This may explain some of those bizarre AdSense ads and spam messages you keep getting. Think of it as a next-generation tag cloud of your online footprint, only more useful (and creepy).

If any of these sound cool to you, or if you’ve got an even better idea, feel free to flame away about it in the comments section of this post.
