Wednesday, 31 August 2016

What if we could calculate our own real-time customized “official indicators”?

Note: This article is a translation of what I wrote in Spanish for the International Open Data Conference 2016 blog. You can read the original post in Spanish: ¿Y si todos pudiésemos calcular nuestros propios “indicadores oficiales” personalizados en tiempo real and in English: What if we could calculate our own real-time customized “official indicators”?.
Almost all governments worldwide and multilateral institutions such as the OECD, the UN, the World Bank or the European Commission began their open data policies by releasing the statistical datasets they produce. Because of that, we have a large number of indicators, in reasonably accessible formats, that we can use to study almost any issue, whether environmental, social, economic or a combination of all these aspects.

Besides providing the datasets, in some cases they have also created tools to access the data easily (APIs), and even applications that help us work with the indicators (visualizations).
These indicators are produced in periodic cycles, which can be monthly, yearly or even multiannual, due to the high cost of producing them. In general, the methodologies used to calculate the indicators are not available to citizens. In the best-case scenario, they are documented only superficially in their fact sheets.
Now let’s imagine for a moment that the national social security systems, the company registers, the customs registers, the environmental agencies, etc., release the data they hold as open datasets in real time. One of the effects we could easily imagine is that many indicators that are currently released periodically could be known and, even better, explored in real time.

Besides, this would remove the possibility for anyone to obtain privileged information, since we would all have the same ability to analyze the evolution of the indicators and make our own decisions. What is more, we could customize the calculations to our own particular situation by working on the methodologies.
The fact is that, in many cases, the production cycle of some indicators could be shortened until we get closer to real time, and thanks to open government data the cost of production could also be greatly reduced.
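To make the idea a bit more concrete, here is a minimal sketch in Python of how a reuser might compute a customized indicator, such as net company registrations for a given region over a sliding window, directly from a hypothetical raw real-time feed instead of waiting for the official periodic figure. The field names and sample records are invented for illustration only.

```python
# Minimal sketch: recomputing a customized "indicator" from raw open data
# records. The record structure and field names are hypothetical.
from datetime import date, timedelta

# Hypothetical raw records as a real-time open data feed might expose them
# (e.g. one record per company registration/deregistration event).
records = [
    {"event": "registration",   "date": "2016-08-29", "region": "ES-MD"},
    {"event": "deregistration", "date": "2016-08-30", "region": "ES-MD"},
    {"event": "registration",   "date": "2016-08-30", "region": "ES-CT"},
]

def net_registrations(records, as_of, region=None, window_days=30):
    """Custom indicator: net company registrations over a sliding window,
    optionally restricted to one region."""
    cutoff = as_of - timedelta(days=window_days)
    total = 0
    for r in records:
        if region and r["region"] != region:
            continue
        if date.fromisoformat(r["date"]) < cutoff:
            continue
        total += 1 if r["event"] == "registration" else -1
    return total

print(net_registrations(records, date(2016, 8, 31)))           # 1, national figure
print(net_registrations(records, date(2016, 8, 31), "ES-MD"))  # 0, customized to one region
```

With the raw events available, anyone could recompute this figure at any moment and tailor the window, the region or the definition itself to their own needs.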

Even though this would be a big step forward, I don’t think we should settle for having the indicators as open data; we should aspire to examine, and even customize, the underlying datasets and the methodologies used to calculate these indicators, because, if conveniently anonymized, there is no reason for those datasets not to be released as open data.

Saturday, 20 August 2016

Some very simple practices to help with the reuse of open datasets

Note: This article is a translation of what I wrote in Spanish for the International Open Data Conference 2016 blog. You can read the original post in Spanish: Algunas prácticas muy sencillas que facilitan la reutilización de conjuntos de datos abiertos and in English: "Some very simple practices to help with the reuse of open datasets".
In the past few years, a considerable number of materials have been published to help data holders with the release of open government data. In his article “The good, the bad…and the best practices”, Martin Alvarez gathered a total of 19 documents, including guides, manuals, good practices, toolkits, cookbooks, etc. Very different kinds of authors have invested resources in developing these materials: national governments, regional governments, foundations, standardization bodies, the European Commission, etc.; hence the variety of perspectives.
On the other hand, a great deal of effort is still being put into the development of standards for the release of open government datasets, either for general purposes or for specific domains.
Too often, however, some very simple rules that facilitate the sustainable reuse of open datasets are forgotten when they are published. These are just some of the obstacles we often find when we explore a new dataset and assess whether it is worth incorporating into our service:
  1. Records do not include a field with a unique identifier, which makes it very difficult to monitor changes when the dataset is updated (see the sketch after this list).
  2. Records do not contain a field with the date they were last updated, which also makes it harder to track which records have changed from one published version to the next.
  3. Records do not contain a field with their date of creation, which makes it difficult to know when each one was incorporated into the dataset.
  4. Fields do not use commonly agreed standards for the type of data they contain. This often happens with dates and times or monetary values, but it is also common in other fields.
  5. Inconsistencies between the content of the dataset and its equivalent published on HTML web pages. These inconsistencies can be of many types, from records that appear on the website but are not exported to the dataset, to differences in the fields published in one format or the other.
  6. Records are published in the dataset much later than on the website. This can make a dataset useless for reuse if the service requires immediacy.
  7. Service Level Agreements for the publication of datasets are not stated explicitly. Whether those agreements are good or bad is not the main point; what really matters is that they are known, as it is very hard to plan data reuse ahead when you do not know what to expect.
  8. No simple description is provided of the content of the fields and the structure of the dataset, nor of the criteria needed to interpret that content (lists of values for categorical fields, update criteria, meaning of the different states, etc.).
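To illustrate how little it takes, here is a minimal sketch in Python of the kind of change monitoring that practices 1, 2 and 4 make possible: with a unique identifier and an ISO 8601 last-updated timestamp on every record, detecting what changed between two published versions of a dataset becomes a few lines of code. The field names ("id", "last_updated") are hypothetical.

```python
# Minimal sketch of why practices 1, 2 and 4 matter: detecting what changed
# between two published versions of a dataset. ISO 8601 timestamps compare
# correctly as plain strings, which keeps the logic trivial.

def diff_versions(previous, current):
    """Return new, updated and deleted records between two versions,
    keyed on a unique identifier and a last-updated timestamp."""
    prev_by_id = {r["id"]: r for r in previous}
    curr_by_id = {r["id"]: r for r in current}

    new = [r for rid, r in curr_by_id.items() if rid not in prev_by_id]
    deleted = [r for rid, r in prev_by_id.items() if rid not in curr_by_id]
    updated = [
        r for rid, r in curr_by_id.items()
        if rid in prev_by_id
        and r["last_updated"] > prev_by_id[rid]["last_updated"]
    ]
    return new, updated, deleted

# Example: two successive publications of the same (tiny) dataset.
v1 = [
    {"id": "REC-001", "last_updated": "2016-08-01T10:00:00Z", "value": 10},
    {"id": "REC-002", "last_updated": "2016-08-01T10:00:00Z", "value": 20},
]
v2 = [
    {"id": "REC-001", "last_updated": "2016-08-15T09:30:00Z", "value": 12},
    {"id": "REC-003", "last_updated": "2016-08-15T09:30:00Z", "value": 30},
]

new, updated, deleted = diff_versions(v1, v2)
print(len(new), len(updated), len(deleted))  # 1 1 1
```

Without the identifier and the timestamp, the reuser is forced to compare every field of every record on every publication, or to fall back on fragile heuristics.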
As you can see, these practices are not specific to open data work; they come rather from experience in software development projects, or simply from common sense.

Even though most of them are very easy to implement, they are of great importance when it comes to convincing somebody to invest their time in an open dataset. As you may know, web scraping can sometimes be more convenient than reusing open datasets, and these few simple practices are what make the difference.

Saturday, 6 August 2016

How far should a public administration go with regard to the provision of value-added services based on open data?

Note: This article is a translation of what I wrote in Spanish for the International Open Data Conference 2016 blog. You can read the original post in Spanish: ¿Hasta dónde debe llegar la administración en la prestación de servicios sobre datos abiertos? and in English: How far should a public administration go with regard to the provision of value-added services based on open data?
Last Monday I took part in the panel “Reuse of EU open data: challenges and opportunities” during the Reuse of EU legal data Workshop, organized by the Publications Office of the European Union. One of the interesting issues that came up during the panel (you can watch it here) focused on the well-known question: How far should a public administration go with regard to the provision of services based on open government datasets?

The discussion, set in the context of fostering the reuse of open government data, arises from the difficulty of finding a balance between the services that every public administration must provide to citizens for free and the space that should be left for private initiative to create value from open government datasets. In many cases, that unstable balance creates tensions that do not contribute to innovation.

In the past few years, I have heard numerous arguments from both the supply and the demand side. These arguments range from one extreme, “public administrations should only provide raw data and no services”, to the opposite, “the public administration should move as far along the value chain as possible when providing services to citizens.”

My position on this matter, which I had the chance to defend during the debate, is that it is not useful to try to draw a red line between what should be considered a basic/core service and a premium/value-added service. Quite the contrary: we should work on defining the minimum incentives needed for open-data-driven innovation to flourish and deliver wealth creation.

For that reason, I used the panel to make the following statement, which could be a starting point for clearly defining the minimum conditions that a reuser needs to create value-added services:

“open government datasets should be released in such a condition that a reuser can build the same services that the data holder provides for free.”

This is basically because, in many cases, value creation starts from that baseline, that is, from simply improving a service that already exists. If an existing public service cannot be reproduced, for example due to a few hours’ delay in the release of the underlying dataset or because of the limited quality of the released data, then it will not be possible to innovate by launching an improved product or service to the market.

In my opinion, this approach to the issue can help us make some progress in this debate. I hope this first statement can be improved and better worded by contributions from the community, or otherwise proved wrong by evidence or better arguments than my own.