America’s Antiterrorism Network – Distributed Data Storage

Distributed data storage is a PROVEN way to protect data from accidental and deliberate destruction. Moreover, distributing data reduces the need for backbone bandwidth and makes meshed WiFi access networks like those I proposed last week more practical both for security purposes and as ONE (there should be more) alternative to the access duopoly which leaves Americans badly underserved and over-charged today.

Yale Law Professor Yochai Benkler writes:

“Imagine a data storage and retrieval system that stores millions of discrete files, in a way that can be accessed, searched and retrieved by millions of users, who can access the system wherever they are connected to the Internet. Imagine that this system is under a multi-pronged attack. Its enemies have used a variety of techniques, ranging from shutting down the main search server under the threat of armed seizure, to inserting malicious files to corrupt the system, to capturing and threatening the operators of storage devices. Imagine that even through all these assaults, the system continues to operate, and continues to provide high quality storage, search, and retrieval functionality to millions of users worldwide. That would be a system worth studying as a model for cybersecurity, would it not?

“That system has in fact been in existence for five years, and it has indeed been under the kinds of attacks described over this entire period. It is the peer-to-peer music file sharing system. It is the epitome of a survivable system…”

You can’t drop a bomb on the data storage vaults of Kazaa because they don’t exist. The data is replicated rather than protected. The music files exist in thousands of redundant copies on the hard drives of cooperating Kazaa users. Even a fiendish online attack would not get all the copies, Yochai points out, because at any given minute many of the computers that host them are offline (mine is in an airplane right now for example). Moreover, individual users back up their computers even if none of us do that as often as we should.

Data replication within a network also serves to make the network faster and reduce the demands on network bandwidth. Some systems for file sharing divide the file between multiple machines, each of which can then serve back pieces of it in parallel. Some systems automatically create new copies of files close to where there has been high demand for the file.
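To make the idea of dividing a file between machines concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not how Kazaa or any real file-sharing system works: the "hosts" are plain in-memory dictionaries, an offline host is represented by None, and names such as place_chunks and fetch_file are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 4  # bytes; deliberately tiny so the example is easy to follow


def split_into_chunks(data, size=CHUNK_SIZE):
    """Cut the file into fixed-size pieces (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def place_chunks(chunks, hosts, copies=2):
    """Store each chunk on `copies` different hosts, chosen round-robin.

    Assumes copies <= len(hosts). Each host is a dict: chunk index -> bytes.
    """
    for index, chunk in enumerate(chunks):
        for k in range(copies):
            hosts[(index + k) % len(hosts)][index] = chunk


def fetch_chunk(index, hosts):
    """Return the chunk from the first online host that holds a copy."""
    for host in hosts:
        if host is not None and index in host:
            return host[index]
    raise LookupError(f"no online host holds chunk {index}")


def fetch_file(chunk_count, hosts):
    """Request all chunks in parallel, then reassemble them in order."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        parts = pool.map(lambda i: fetch_chunk(i, hosts), range(chunk_count))
        return b"".join(parts)


if __name__ == "__main__":
    data = b"distributed storage"
    hosts = [{} for _ in range(4)]
    chunks = split_into_chunks(data)
    place_chunks(chunks, hosts)
    hosts[1] = None  # one machine goes offline (or gets on an airplane)
    print(fetch_file(len(chunks), hosts) == data)
```

Because every chunk lives on two hosts in this sketch, losing any single host still leaves a complete copy of the file retrievable.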

Not incidentally, file migration and other forms of caching are reasons why mesh networks CAN provide much better Internet access than some skeptics think (see TechDirt’s reasonable but skeptical response to my WiFi post, and especially the comments there, both pro and con). When data migrates closer and closer to those who are using it, it should often be possible to supply it directly over the mesh, without burdening the local mesh network’s connection to the broader Internet, once the initial replication has occurred.
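The caching argument can be sketched in a few lines of Python. This is a toy model, assuming a single mesh node with an unbounded cache; MeshNode, the upstream callable, and the upstream_fetches counter are all hypothetical names invented for this illustration.

```python
class MeshNode:
    """A mesh node that keeps a local copy of anything it has fetched."""

    def __init__(self, upstream):
        self.upstream = upstream   # callable: file name -> bytes (the broader Internet)
        self.cache = {}            # local copies, keyed by file name
        self.upstream_fetches = 0  # how often the Internet link was actually used

    def get(self, name):
        if name not in self.cache:
            # First request: use the scarce upstream link and keep a copy.
            self.cache[name] = self.upstream(name)
            self.upstream_fetches += 1
        # Every later request is served from the mesh itself.
        return self.cache[name]


if __name__ == "__main__":
    node = MeshNode(lambda name: ("contents of " + name).encode())
    for _ in range(100):
        node.get("popular-song.mp3")
    print(node.upstream_fetches)
```

In this model a hundred requests for the same file cost one trip over the Internet connection; a real cache would also need an eviction policy and some way to notice stale copies.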

Yochai also addresses privacy and secrecy issues which sound like they would be a problem if data is sleeping around on willing hosts. Part of the answer is to store only part of each file on any one host. Encryption is another part. No need to ever store a key and the data in the same place. No security is absolute but this doesn’t look like an intractable problem.
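Here is a rough Python sketch of those two ideas together: encrypt the file, scatter the encrypted bytes so that no host holds the whole thing, and keep the key somewhere else entirely. The assumptions are mine, not Yochai’s: the cipher is a simple XOR with a random one-time key of the same length as the data, chosen only because it needs nothing beyond the standard library, and scatter and gather are hypothetical helpers. A real system would use a vetted cipher from a maintained cryptography library.

```python
import secrets


def encrypt(data):
    """XOR the data with a fresh random key of equal length (one-time pad).

    Illustration only: the key must never be reused and is as big as the file.
    """
    key = secrets.token_bytes(len(data))
    cipher = bytes(a ^ b for a, b in zip(data, key))
    return cipher, key


def decrypt(cipher, key):
    """XOR with the same key undoes the encryption."""
    return bytes(a ^ b for a, b in zip(cipher, key))


def scatter(cipher, host_count):
    """Give host h every host_count-th byte, starting at offset h."""
    return [cipher[h::host_count] for h in range(host_count)]


def gather(pieces):
    """Interleave the pieces back into the original byte order."""
    out = bytearray(sum(len(p) for p in pieces))
    for h, piece in enumerate(pieces):
        out[h::len(pieces)] = piece
    return bytes(out)


if __name__ == "__main__":
    message = b"meet at noon"
    cipher, key = encrypt(message)   # the key is stored away from the hosts
    pieces = scatter(cipher, 3)      # each host sees only a third of the bytes
    print(decrypt(gather(pieces), key) == message)
```

A host in this sketch holds only a fragment of already-encrypted data, so compromising one host, or even all of them without the key, reveals nothing readable.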

Much informed by Yochai, here’s what I think will happen (I am responsible for the predictions so don’t blame him):

The growth of cooperative data storage and distribution systems like BitTorrent and Gnutella will continue, largely as a way to distribute both legal and illegal entertainment content. Yahoo! News ran a Reuters story last November which quoted the British Web analysis firm CacheLogic as saying that BitTorrent accounts for about 35% of all web traffic! I doubt this number but there’s little doubt about the growth of cooperative file sharing and distribution.
Smart corporations will find a way to cooperate in distributed data storage mainly for disaster recovery services. This trend will be slowed by well-meaning regulators who will prefer data protection methods they understand even if these methods are ultimately less secure than anonymous distribution.
Free data backup from Google, Microsoft, and others will include some degree of redundancy and distribution but will suffer from limited redundancy and from distrust of these companies as repositories: they will have no choice but to respond to subpoenas and will occasionally be hacked, usually by disgruntled employees.
Several free, cooperative, replication-based data backup systems for consumers will emerge. Some will be pure co-ops; others will have commercial models. At first only nerds will use them; eventually distributed data storage – like almost universal access (not here yet) – will be taken for granted as an obvious and necessary part of cyberspace.