Friday, September 23, 2011

Open Data needs open standards and open research: an academic and standardization point of view at the Open World Forum.


Speech by Fabien L. Gandon at the Open World Forum.
See the video from the Open World Forum 2011: the "Open Data: the big picture" panel discussion.

Open data needs open formats like XML to be stored and exchanged, and in that sense having a neutral standardization body like the W3C is vital to design and publish standard formats. In parallel, the scale of the datasets opened on the web, and their variety in content, lifecycles, and usages, call for research to develop efficient means of communication, parsing, storage, access, transformation, security, internationalization, and so on.
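
As a minimal illustration of what an open format buys us, here is a sketch in Python using only the standard library: because the record below is plain, standard XML, any compliant parser in any language can read it. The dataset and its values are invented for the example.

```python
import xml.etree.ElementTree as ET

# A hypothetical open-data record in plain XML: no proprietary
# tool is needed to store, exchange, or parse it.
record = """
<dataset>
  <entry city="Exampleville" population="1000"/>
  <entry city="Sampletown"   population="2500"/>
</dataset>
"""

root = ET.fromstring(record)
for entry in root.findall("entry"):
    print(entry.get("city"), entry.get("population"))
```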

Open data also needs open data structures. For instance, the RDF standard is inherently open: not only is it a non-proprietary model with non-proprietary syntaxes, it is also designed to make datasets extensible and reusable. By design, the RDF model ensures anyone can say anything about anything; there is no way in the model to prevent that. As soon as you name something, I can reuse that name and start to attach my data to it.
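
To make that concrete, here is a minimal sketch using the rdflib Python library; the URIs and triples are invented for the illustration. One party names a resource, and anyone else can attach their own statements to that name:

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDFS

EX = Namespace("http://example.org/vocab#")

# A publisher mints a URI for a resource and says something about it.
monument = URIRef("http://example.org/id/monument42")
published = Graph()
published.add((monument, RDFS.label, Literal("A famous monument")))

# Because that name is a URI on the open web, I can reuse it and
# attach my own data to it; nothing in the model prevents this.
mine = Graph()
mine.add((monument, EX.visitedBy, Literal("me")))

# Merging the two datasets is just a union of triples.
merged = Graph()
for triple in published:
    merged.add(triple)
for triple in mine:
    merged.add(triple)
print(len(merged))  # 2 statements about the same resource
```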

Now, from a research perspective, this creates very complex challenges. For instance, when computing over those data we are under the open-world assumption: I can't be sure I am not missing an important piece of data somewhere. What kind of processing can I do in those circumstances? How can I efficiently crawl, index, link, and ultimately find my way through this giant global graph of open data?
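
Here is one way to picture the open-world assumption, again as a sketch with rdflib and invented datasets: an empty query answer means "unknown", not "no", and the answer can grow as soon as one more source is crawled.

```python
from rdflib import Graph

g = Graph()
g.parse(data="""
  @prefix ex: <http://example.org/vocab#> .
  <http://example.org/id/dataset1> ex:theme "transport" .
""", format="turtle")

q = 'SELECT ?d WHERE { ?d <http://example.org/vocab#theme> "energy" }'
print(len(list(g.query(q))))  # 0 answers: unknown, not false

# Crawling one more (hypothetical) source may supply the missing piece.
g.parse(data="""
  @prefix ex: <http://example.org/vocab#> .
  <http://example.org/id/dataset2> ex:theme "energy" .
""", format="turtle")
print(len(list(g.query(q))))  # now 1 answer
```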

Open data also needs open protocols to be accessible to everyone from everywhere. Or maybe not quite: open data is sometimes reduced to data with public read access, but things tend to be more complex in reality.
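
In practice, an "open protocol" often just means plain HTTP with content negotiation. A minimal sketch with Python's standard library; the URL is a placeholder for any linked-data server that can answer in an open format such as Turtle:

```python
import urllib.request

# Ask a (hypothetical) open-data server for an open representation
# of a resource, using nothing but standard HTTP.
req = urllib.request.Request(
    "http://example.org/id/dataset1",     # placeholder URL
    headers={"Accept": "text/turtle"},    # content negotiation
)
with urllib.request.urlopen(req) as response:
    print(response.headers.get("Content-Type"))
    print(response.read().decode("utf-8"))
```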

We may need more than read access: we may want the CRUD operations, CRUD standing for Create new open data, Read open data, Update open data, and Delete open data, the latter for instance to implement the right to oblivion (the right to be forgotten). SPARQL 1.1 is a standardization effort in that direction. But then many hard problems remain open, for instance temporality in these accesses to data: versioning, revisions, all kinds of changes, and the chain reactions they trigger in a linked open data world.
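
As a sketch of those four operations expressed with SPARQL 1.1 Update and Query, here run in-memory with the rdflib Python library (all names are invented):

```python
from rdflib import Graph

g = Graph()

# Create: SPARQL 1.1 Update.
g.update("""
  PREFIX ex: <http://example.org/vocab#>
  INSERT DATA { <http://example.org/id/rec1> ex:status "draft" }
""")

# Read: SPARQL 1.1 Query.
q = 'SELECT ?o WHERE { <http://example.org/id/rec1> <http://example.org/vocab#status> ?o }'
for row in g.query(q):
    print(row.o)

# Update: replace the old value with a new one.
g.update("""
  PREFIX ex: <http://example.org/vocab#>
  DELETE { ?s ex:status ?old }
  INSERT { ?s ex:status "published" }
  WHERE  { ?s ex:status ?old }
""")

# Delete: e.g. to implement the right to be forgotten.
g.update('DELETE WHERE { <http://example.org/id/rec1> ?p ?o }')
print(len(g))  # 0 triples left
```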

Yet open access also needs to be secure, and in particular, to have open data you might need precise means to define what is open and what is closed. If I have data, I might want to open only some of it, and I should be encouraged to open the part that can be opened. This raises the questions of fine-grained access control and licenses for data. As paradoxical as it may seem, the absence of a license may eventually restrict the use of data by making it difficult to identify which data is actually open. And then, if I get some data, I might want to make sure I can use it, and this raises the questions of provenance, traceability, and authenticity. In that context, many complex questions remain open, such as: what happens when I mash up data with different provenances and licenses? What should be the provenance and license of the results of the inferences, aggregations, and statistics I ran on these data?
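
A minimal sketch of making that openness explicit, with rdflib and well-known Dublin Core terms; the dataset URIs and the chosen license are placeholders:

```python
from rdflib import Graph, Namespace, URIRef

DCT = Namespace("http://purl.org/dc/terms/")

g = Graph()
mashup = URIRef("http://example.org/id/mashup1")
source = URIRef("http://example.org/id/dataset1")

# An explicit license: without one, reusers cannot tell whether
# the data is actually open.
g.add((mashup, DCT.license,
       URIRef("http://creativecommons.org/licenses/by/3.0/")))

# A provenance link back to the source, so that mashed-up results
# remain traceable.
g.add((mashup, DCT.source, source))

print(g.serialize(format="turtle"))  # rdflib >= 6 returns a string
```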

Open data also needs open schemas to capture its meaning, ensure its interoperability, and foster automated use and reuse. From the standardization point of view, the RDFS and OWL languages are contributions in that direction, allowing us to publish our schemas. The issue of fostering the emergence and stabilization of standard schemas in the domains that need them remains complex. On top of that, the open nature of the schemas creates new challenges, like scaling the processing of these schemas to large datasets and allowing for the approximation, incomplete data, and incoherent data that we are bound to find in an open world.
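
To illustrate one of those processing challenges, here is a deliberately naive sketch in Python with rdflib of a single RDFS entailment rule (rdfs9: instances inherit the types of superclasses); scaling such closures to web-sized, partly incoherent data is exactly where the open problems lie. The schema and instance are invented:

```python
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/vocab#")

g = Graph()
g.add((EX.Museum, RDFS.subClassOf, EX.Place))  # a tiny published schema
g.add((URIRef("http://example.org/id/m1"), RDF.type, EX.Museum))

# Naive forward chaining of rdfs9:
#   ?x rdf:type ?c  and  ?c rdfs:subClassOf ?d  =>  ?x rdf:type ?d
changed = True
while changed:
    changed = False
    for x, _, c in list(g.triples((None, RDF.type, None))):
        for _, _, d in g.triples((c, RDFS.subClassOf, None)):
            if (x, RDF.type, d) not in g:
                g.add((x, RDF.type, d))
                changed = True

print((URIRef("http://example.org/id/m1"), RDF.type, EX.Place) in g)  # True
```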

From a standardization point of view, it is clear that we need neutral places in which to build the open standards supporting open data, including open architectures, open formats, open languages, open protocols, open methodologies, and so on. We also need subsequent standardization efforts in each application domain, in particular to release compatible datasets and schemas.

Now I'd like to conclude with two final points from the academic perspective on open data.

First, with the web, and beyond computer science, many academic disciplines face new research and education challenges, and the open data initiative in itself uncovers several of them: from legal issues to be solved to new economic models to be invented, from sociological approaches to open data lifecycles to biological models that can inspire new data structures and algorithms.

And finally, there is a reciprocal perspective to be taken from the academic point of view, since science and education produce and consume a fair amount of data themselves. Academia is itself an application domain of open data. For instance, there is a need for more open science data initiatives, opening the observations and results of scientific activities for other scientists to analyze and reuse, making academic and research material one click away from being re-useful.

In other words, and vice versa: an open academic world needs open data.