Here’s a thought experiment: try to imagine what it would have been like to create Google before the era of the Internet and open standards. You would probably have had to pay millions of dollars to create the necessary software on a proprietary operating system. The effort would have required a huge team of people taking many years. Since Google is a search engine, it most likely would have been given to the phone company to design and run. If you were using X.25, the international networking standard (the Internet equivalent of its time), you would have been charged for each packet of information that you sent or received, in a network in which each network operator had a bilateral agreement with every other network operator. The whole project probably would have taken a decade, cost a billion dollars, and would not have worked very well.
In fact, the actual cost of building and launching the first Google server was probably only thousands of dollars using standard PC components, mostly open-source software as the base, and connecting to the Stanford University network, which immediately made the service available, at no additional cost, to everyone else on the Internet.
Moving from a Web of documents to a Web of data (or of Linked Open Data) is an oft-cited goal in the sciences. The Web of data would allow us to link together disparate information from unrelated disciplines, run powerful queries, and get precise answers to complex, data-driven questions. It’s an undoubtedly desirable extension of the way that the existing networks increase the value of documents and computers through connectivity - Metcalfe’s Law applied to more complex information and systems.
However, making the Web of data turns out to be a deeply complex endeavor. Data - here, a catchall word covering databases and datasets, and generally meaning information that is gathered in the sciences through either experimental work or environmental observation - requires a much more robust and complete set of standards to achieve the same “web” capabilities we take for granted in commerce and culture.
Unlike documents, the ultimate intended reader of most data is a machine: a search engine, analytic software, or a database back end, for example. There is simply too much data in production to place people on the front lines of analysis. When data scales easily into the petabytes, we just can’t keep up using the existing systems.
There are three interlocking dimensions to interoperability in data: legal, technical, and semantic. By legal, we mean the contractual and intellectual property rights associated with the data; by technical, the standard systems (especially the computer languages) in which the data is published; and by semantic, the actual meaning of the data itself - what it describes, and how it relates to the broader world.
Each of these dimensions is complex on its own. Taken together, the three represent unsolvable complexity. The semantic layer alone requires an almost miraculous level of agreement on “what things mean,” and anyone who has witnessed argument among scientists, be they economists or physicists, knows that even apparently simple topics turn contentious over matters as basic as definitions. Consensus on the technical layer is somewhat easier - the Web and the Semantic Web “stack” of standard technologies have begun to take a leadership position in data networking - but it is still difficult, slow, and open to argument. One of the few opportunities we have is in the legal layer, where we can look to a broad set of successes in legal interoperability through the use of a simple, flat standard: the public domain.
The public domain is a very simple concept - no rights are reserved to owners, and all rights are granted to users. The public domain exists as a counterweight to copyright in the creative space, but in some countries - especially the United States - it also serves as a first option for data that is not considered “creative.”
The public domain option currently underpins a wide variety of linked data that is already well on its way to achieving Web scale. From the International Virtual Observatory, whose members build an international data net on norms of “acknowledgment” rather than contracts of “attribution,” to the world of genomics, where entire genomes and related data are harmonized nightly across multiple countries, the public domain creates complete interoperability at the legal layer of the data network, and serves as a foundation for the next layer of technical interoperability.
Interestingly, we have yet to observe similar network effects emerging in cases where the underlying data is treated in a more conservative “intellectual property” context by using copyright licenses or database licenses inspired by copyright. Indeed, in the case of the international consortium mapping human genomic variation, the implementation of a “click through” license was found in practice to impede integration of that mapped variation with other public domain data, limiting the value of the map. The license was removed, the public domain option instated, and the database was immediately technically integrated with the rest of the international web of gene data.
We have seen the public domain option work, again and again, across the scientific disciplines. Implementing the public domain as the interoperability standard for the legal dimension of the web of data holds the greatest promise for scalability and long-term achievement of the network effect for data, as it permits the widest range of experimentation and development at the technical and semantic layers.