Tuesday, March 15, 2016

Let’s open more datasets, because what could go wrong?

Note: This article is a translation of a post I wrote in Spanish for the International Open Data Conference 2016 blog. You can read the original post in Spanish, "Abramos más conjuntos de datos, ¿qué puede salir mal?", and in English, "Let’s open more datasets. What could go wrong?".
In conversations among members of the open data community, especially those responsible for providing data, one often overhears statements such as “it’s necessary to stimulate the demand for open data,” “we can’t reach the reusers,” or “it would be interesting if data providers and reusers talked more.” I am sure you have heard such statements on many occasions.
Most probably, this unease is familiar to the IODC organizers, who must be aware that previous editions of the event focused mostly on what is usually called the “supply side,” that is, the public organizations in charge of holding and publishing open datasets. In Spain, however, possibly because the Ministry of Industry is the body that promotes open data policies, reuse companies have always been encouraged to take a visible part in open data events. This will surely be noticeable in the program of the 4th IODC next October.

However, I would like to tell you a secret that could help explain why the long-awaited demand for open data apparently does not exist: for reuse companies, it is often more productive to obtain data from the web than to use open data portals. Unfortunately, technologies for extracting data from documents have advanced much faster in recent years than the datasets available in portals.

Even though it is quite inefficient and we may not like it, in many sectors this is currently the only way for companies to create value from data. In other sectors, where no data is published, either in documents or in datasets, there is no demand to stimulate. Companies, especially small ones, survive on the value they can create and sell today, not on future promises.

If you were a company, where would you put your resources? Into an open source library that improves a data-extraction algorithm for PDFs, or into circular arguments about the best way of opening data?
Speaking from the “demand side,” I would like IODC 2016 to be a turning point, not so much to define more standards, indexes, policies and laws, but to reach an agreement to publish more useful datasets.

If we really aim to encourage innovation and the creation of value from open data, I suggest we flood the portals with useful datasets. What could go wrong? After all, much of this data is already inside documents published on the web, and a great deal of effort goes into extracting and cleaning it that could instead go into creating value from it.