Open Data needs open standards and open research: an academic and
standardization point of view.
Speech by Fabien L. Gandon at the Open World Forum.
See the video from the Open World Forum 2011: Open Data: the big picture, panel discussion.
Open data needs open formats, like XML, to be stored and exchanged, and in that sense having a neutral
standardization body like the W3C is vital to design and publish standard formats.
In parallel, the scale of the datasets opened on the web, and their variety in
content, lifecycles and usages, call for research to develop efficient means of
communication, parsing, storage, access, transformation, security,
internationalization, and so on.
Open data also needs open data structures. For instance, the RDF standard is inherently open: not only is it a non-proprietary
model with non-proprietary syntaxes, it is also designed to make datasets
extensible and reusable. By design, the RDF model ensures that anyone can say
anything about anything; there is no way in the model to prevent that. As soon as you name something, I can reuse
that name and start to attach my own data to it.
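To make that openness concrete, here is a minimal sketch in Python using the rdflib library; the city URI and the small vocabulary are invented for illustration. Once one publisher names a resource with a URI, a second, independent publisher can attach its own statements to the same name, and the two graphs merge without any coordination.

```python
from rdflib import Graph, Literal, Namespace, URIRef

# A URI coined by one publisher; anyone else may reuse that name.
CITY = URIRef("http://example.org/city/Paris")
EX = Namespace("http://example.org/vocab/")

# Publisher A states some facts about the resource it named.
a = Graph()
a.add((CITY, EX.population, Literal(2250000)))

# Publisher B, independently, attaches its own data to the same name.
b = Graph()
b.add((CITY, EX.twinnedWith, URIRef("http://example.org/city/Rome")))

# RDF graphs merge by simple union: nothing in the model
# prevents B from saying things about A's resource.
merged = a + b
print(merged.serialize(format="turtle"))
```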
From a research
perspective, this creates very complex challenges. For instance, when computing
over those data we are under an open world assumption: I can never be sure I am not
missing an important piece of data somewhere. What kind of processing can I do in
those circumstances? How can I efficiently crawl, index, link and ultimately
find my way through this giant global graph of open data?
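As a small illustration of the open world assumption, the following sketch (again rdflib, with invented URIs) queries a local graph for a fact it does not contain: the empty answer only means the triple is absent from the data gathered so far, not that it is false somewhere else on the web.

```python
from rdflib import Graph

g = Graph()
g.parse(data="""
@prefix ex: <http://example.org/vocab/> .
<http://example.org/city/Paris> ex:population 2250000 .
""", format="turtle")

rows = list(g.query("""
PREFIX ex: <http://example.org/vocab/>
SELECT ?twin
WHERE { <http://example.org/city/Paris> ex:twinnedWith ?twin }
"""))

# Under the open world assumption an empty result means
# "unknown in the data gathered so far", not "false": a dataset
# we have not crawled yet may well hold the missing triple.
print("twin cities found so far:", rows)
```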
Open data also needs open protocols to be accessible to everyone from everywhere. Or maybe not quite: open
data is sometimes reduced to data with public read access, but things tend to be
more complex in reality.
We may need more than read
access: we may want the full C.R.U.D. operations, C.R.U.D. standing for Create new
open data, Read open data, Update open data and Delete open data, the last one,
for instance, to implement the right to be forgotten. SPARQL 1.1 is a standardization effort in that
direction. But many hard problems remain open, for instance temporality in
these accesses to data: versioning, revisions, and all the kinds of changes and
chain reactions they trigger in a linked open data world.
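To illustrate those C.R.U.D. operations, here is a hedged sketch that runs SPARQL 1.1 Update statements against a local rdflib graph; the person URI and the nickname property are invented. INSERT DATA creates, SELECT reads, DELETE/INSERT rewrites, and DELETE DATA removes, the last one being a crude stand-in for honoring the right to be forgotten.

```python
from rdflib import Graph

g = Graph()

# Create: publish a new piece of data.
g.update("""
PREFIX ex: <http://example.org/vocab/>
INSERT DATA { <http://example.org/person/42> ex:nickname "Alice" }
""")

# Read: query it back.
for row in g.query("SELECT ?o WHERE { ?s ?p ?o }"):
    print("read:", row.o)

# Update: replace the old value with a new one.
g.update("""
PREFIX ex: <http://example.org/vocab/>
DELETE { ?s ex:nickname ?old }
INSERT { ?s ex:nickname "Alicia" }
WHERE  { ?s ex:nickname ?old }
""")

# Delete: erase the data, e.g. to honor a removal request.
g.update("""
PREFIX ex: <http://example.org/vocab/>
DELETE DATA { <http://example.org/person/42> ex:nickname "Alicia" }
""")
print("triples left:", len(g))
```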
Yet open access also needs to
be secure; in particular, to have open data you might need precise means to
define what is open and what is closed. If I have data, I might want to open
only some of it, and I should be encouraged to open the part that can be
opened. This raises the questions of fine-grained access control and licenses for
data. As paradoxical as it may seem, the absence of a license may ultimately
restrict the use of data by making it difficult to identify what is actually
open. And if I get some data, I might want to make sure I can use it, which
raises the questions of provenance, traceability and authenticity. In that
context, many complex questions remain open: what happens when I mash up
data with different provenances and licenses? What should be the provenance and
license of the results of the inferences, aggregations and statistics I compute on
these data?
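One pragmatic way to keep such questions tractable, sketched below with rdflib and the Dublin Core and PROV vocabularies (the dataset URIs are invented), is to attach explicit dcterms:license and prov:wasDerivedFrom statements to every dataset, so that a mashup can at least enumerate the licenses and sources it combines.

```python
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import DCTERMS

PROV = Namespace("http://www.w3.org/ns/prov#")

g = Graph()
mashup = URIRef("http://example.org/dataset/mashup")
src_a = URIRef("http://example.org/dataset/a")
src_b = URIRef("http://example.org/dataset/b")

# Each source dataset declares its license explicitly ...
g.add((src_a, DCTERMS.license,
       URIRef("http://creativecommons.org/licenses/by/4.0/")))
g.add((src_b, DCTERMS.license,
       URIRef("http://opendatacommons.org/licenses/odbl/1.0/")))

# ... and the mashup records which datasets it was derived from.
g.add((mashup, PROV.wasDerivedFrom, src_a))
g.add((mashup, PROV.wasDerivedFrom, src_b))

# We can at least enumerate the licenses a result inherits;
# deciding what the combined license *should* be is the open question.
for src in g.objects(mashup, PROV.wasDerivedFrom):
    for lic in g.objects(src, DCTERMS.license):
        print(src, "->", lic)
```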
Open data also needs open schemas to capture its meaning, ensure its interoperability and foster
automated use and reuse. From the standardization point of view, the RDFS and OWL
languages are contributions in that direction, allowing us to publish our
schemas. The issue of fostering the emergence and stabilization of standard
schemas in the domains that need them remains complex. On top of that, the open
nature of the schemas creates new challenges, like scaling the processing of
these schemas to large datasets and allowing for the approximation, incomplete data
and incoherent data that we are bound to find in an open world.
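To ground the schema point, here is a minimal sketch with rdflib; the class names are invented and the hand-rolled fixpoint loop stands in for a full RDFS reasoner. It publishes a small class hierarchy with rdfs:subClassOf and computes the closure needed to answer queries over it; scaling exactly this kind of processing to web-sized, possibly incoherent datasets is the research challenge just mentioned.

```python
from rdflib import Graph, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/schema/")
g = Graph()

# A tiny published schema: Museum < PublicBuilding < Building.
g.add((EX.Museum, RDFS.subClassOf, EX.PublicBuilding))
g.add((EX.PublicBuilding, RDFS.subClassOf, EX.Building))
g.add((EX.Louvre, RDF.type, EX.Museum))

# Naive fixpoint closure of rdfs:subClassOf; a real RDFS reasoner
# does this (and much more) and must do it at web scale.
changed = True
while changed:
    changed = False
    for s, _, mid in list(g.triples((None, RDFS.subClassOf, None))):
        for _, _, o in list(g.triples((mid, RDFS.subClassOf, None))):
            if (s, RDFS.subClassOf, o) not in g:
                g.add((s, RDFS.subClassOf, o))
                changed = True

# An instance of a class is also an instance of its superclasses.
for _, _, cls in list(g.triples((EX.Louvre, RDF.type, None))):
    for _, _, sup in list(g.triples((cls, RDFS.subClassOf, None))):
        g.add((EX.Louvre, RDF.type, sup))

print(sorted(str(c) for c in g.objects(EX.Louvre, RDF.type)))
```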
From a standardization point
of view, it is clear that we need neutral places in which to build the open standards
supporting open data, including open architectures, open formats, open languages,
open protocols, open methodologies, and so on.
We also need subsequent standardization efforts in each application domain,
in particular to release compatible datasets and schemas.
Now I’d like to conclude with
two last points from the academic perspective on open data.
First, with the web, and beyond
computer science, many academic disciplines face new research and education
challenges, and the open data initiative itself uncovers several of them: from
legal issues to be solved to new economic models to be invented, from
sociological approaches to open data lifecycles to biological models that
can inspire new data structures and algorithms.
And finally, there is a
reciprocal perspective to be taken from the academic point of view, since science
and education themselves produce and consume a fair amount of data. Academia is
itself an application domain of open data. For instance, there is a need for more
open science data initiatives, opening the observations and results of scientific
activities for other scientists to analyze and reuse, making academic and
research material one click away from being re-useful.
In other words, and vice versa, an open academic world needs open data.