Wednesday, 31 August 2016

What if we could calculate our own real-time customized “official indicators”?

Note: This article is a translation of what I wrote in Spanish for the International Open Data Conference 2016 blog. You can read the original post in Spanish: ¿Y si todos pudiésemos calcular nuestros propios “indicadores oficiales” personalizados en tiempo real and in English: What if we could calculate our own real-time customized “official indicators”?.
Almost all governments worldwide and multilateral institutions such as the OECD, the UN, the World Bank or the European Commission began their open data policies by releasing the statistical datasets they produce. Because of that, we have a large number of indicators in reasonably accessible formats to study almost any issue, whether environmental, social, economic or a combination of all these aspects.

Besides providing us with the datasets, in some cases they have created tools to access the data easily (APIs), and even applications that help us work with the indicators (visualizations).
These indicators follow periodic cycles, which can be monthly, yearly or even multiannual, due to the high cost of their production. In general, the methodologies used to calculate the indicators are not available to citizens. In the best-case scenario, they are documented in a very superficial way on their fact sheets.
Photo by William Iven
Now let’s imagine for a moment that the national social security systems, the company registers, the customs registers, the environmental agencies, etc, release the data they hold as open datasets in real time. One of the effects we could easily imagine is that a lot of indicators that these days are being released periodically could be known and, even better, explored in real time.

Besides, this would remove the possibility of anyone obtaining privileged information, since we would all have the same ability to analyze the evolution of the indicators and make our own decisions. Moreover, we could customize calculations according to our own particular situation by working on the methodologies.
The fact is that in many cases, the production cycle of some indicators could be shortened until we get closer to ‘real time’, and the cost of production could be reduced greatly as well thanks to open government data.

Even though this is a big step forward, I don’t think we should settle for having the indicators as open data; we should aspire to examine the open datasets and methodologies used to calculate these indicators and even customize them, because, if properly anonymized, there is no reason for them not to be released as open data.
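
To make the idea of a customized indicator more concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the record fields (`region`, `registered`, `status`), the sample rows and the `employment_share` function are invented for illustration and do not reflect any official methodology. It only shows that, with record-level open data, the same calculation could be rerun for any date and any subgroup.

```python
from datetime import date

# Hypothetical, already anonymized record-level data (invented for illustration).
registrations = [
    {"region": "North", "registered": date(2016, 8, 1), "status": "employed"},
    {"region": "North", "registered": date(2016, 8, 2), "status": "unemployed"},
    {"region": "South", "registered": date(2016, 8, 3), "status": "employed"},
    {"region": "South", "registered": date(2016, 8, 30), "status": "employed"},
    {"region": "North", "registered": date(2016, 8, 31), "status": "employed"},
]


def employment_share(records, as_of, region=None):
    """Share of employed people among records registered up to `as_of`.

    `region` is an optional filter: this is the "customization" step,
    where a reuser adapts the calculation to a particular situation.
    Returns None when no record matches, to avoid dividing by zero.
    """
    selected = [
        rec for rec in records
        if rec["registered"] <= as_of
        and (region is None or rec["region"] == region)
    ]
    if not selected:
        return None
    employed = sum(1 for rec in selected if rec["status"] == "employed")
    return employed / len(selected)


# The same indicator, recomputed for different dates and subgroups.
overall_today = employment_share(registrations, date(2016, 8, 31))
north_today = employment_share(registrations, date(2016, 8, 31), region="North")
overall_mid_month = employment_share(registrations, date(2016, 8, 15))
print(overall_today, north_today, overall_mid_month)
```

The point of the sketch is the function signature rather than the formula: once the underlying records and the methodology are open, the date and the filters become parameters that anyone can change.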

Saturday, 20 August 2016

Some very simple practices to help with the reuse of open datasets

Note: This article is a translation of what I wrote in Spanish for the International Open Data Conference 2016 blog. You can read the original post in Spanish: Algunas prácticas muy sencillas que facilitan la reutilización de conjuntos de datos abiertos and in English: "Some very simple practices to help with the reuse of open datasets".
In the past few years, a significant number of materials have been published to help data holders with the release of open government data. In his article “The good, the bad…and the best practices”, Martin Alvarez gathered a total of 19 documents, including guides, manuals, good practices, toolkits, cookbooks, etc. Different kinds of authors have invested resources in the development of these materials: national governments, regional governments, foundations, standardization bodies, the European Commission, etc.; hence the number of different perspectives.
On the other hand, a large amount of effort is still being made in the development of standards for the release of open government datasets, either for general purposes or specific domains.
Photo by: Barn Images
Too often, however, very easy rules that facilitate the sustainable reuse of open datasets are forgotten when datasets are published. I will just mention some of the obstacles we often find when we explore a new dataset and assess whether it is worth incorporating into our service:
  1. Records do not include a field with a unique identifier, which makes it very difficult to monitor changes when the dataset is updated.
  2. Records do not contain a field with the date when they were last updated, which also complicates monitoring which records have changed from one publication version to the next one.
  3. Records do not contain a field with the date of creation, which makes it difficult to know the date each one was incorporated into the dataset.
  4. Fields do not use commonly agreed standards for the type of data they contain. This often occurs in fields with dates and times, or economic values, etc., but is also common in other fields.
  5. Inconsistencies between the content of the dataset and its equivalent published on HTML web pages. Inconsistencies can be of many types, from records published on the website and not exported to the dataset to differences in fields that are published in one format or the other.
  6. The record is published on the dataset much later than on the website. This can make a dataset useless for reuse if the service requires immediacy.
  7. Service Level Agreements on the publication of datasets are not specified overtly. It is not that important to merely judge those agreements as good or bad; what is really important is that they are known, as it is very hard to plan data reuse ahead when you do not know what to expect.
  8. These elements are not provided: a simple description about the content of the fields and structure of the dataset, as well as the relevant criteria used to analyze that content (lists of elements for factor variables, update criteria, meaning of different states, etc.).
As you can see, these practices are not necessarily linked to open-data-related work; rather, they draw on experience from software development projects, or simply on common sense.
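
As an illustration of why the first items in the list matter, here is a minimal sketch in Python of how a reuser might compare two published versions of a dataset. The field names (`id`, `last_updated`), the sample records and the `diff_versions` function are assumptions made for the example, and the dates are assumed to follow ISO 8601. Without a unique identifier and a last-updated date, this comparison would require guessing which records correspond to each other.

```python
from datetime import datetime


def index_by_id(records, id_field="id"):
    """Map each record's unique identifier to the record itself."""
    return {rec[id_field]: rec for rec in records}


def diff_versions(old_records, new_records, id_field="id",
                  updated_field="last_updated"):
    """Report which identifiers were added, removed or updated.

    A record counts as changed when its last-updated timestamp
    (an ISO 8601 string) is later in the new version than in the old one.
    """
    old = index_by_id(old_records, id_field)
    new = index_by_id(new_records, id_field)
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = sorted(
        key for key in new.keys() & old.keys()
        if datetime.fromisoformat(new[key][updated_field])
        > datetime.fromisoformat(old[key][updated_field])
    )
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical records from two consecutive publications of a dataset.
old_version = [
    {"id": "A1", "last_updated": "2016-08-01T10:00:00", "amount": "100.00"},
    {"id": "A2", "last_updated": "2016-08-01T10:00:00", "amount": "250.00"},
    {"id": "A3", "last_updated": "2016-08-01T10:00:00", "amount": "75.50"},
]
new_version = [
    {"id": "A1", "last_updated": "2016-08-01T10:00:00", "amount": "100.00"},
    {"id": "A2", "last_updated": "2016-08-15T09:30:00", "amount": "260.00"},
    {"id": "A4", "last_updated": "2016-08-16T12:00:00", "amount": "19.99"},
]

result = diff_versions(old_version, new_version)
print(result)
```

With those two fields in place, monitoring updates takes a handful of lines; without them, the reuser has to compare every field of every record and still cannot tell a corrected record from a new one.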

Even though most of them are very easy to implement, they are of great importance to convince somebody to invest their time in an open dataset. As you may know, dealing with web scraping can be more convenient than reusing open datasets. And these are a few simple practices that make the difference.

Saturday, 6 August 2016

How far should a public administration go with regard to the provision of value-added services based on open data?

Note: This article is a translation of what I wrote in Spanish for the International Open Data Conference 2016 blog. You can read the original post in Spanish: ¿Hasta dónde debe llegar la administración en la prestación de servicios sobre datos abiertos? and in English: How far should a public administration go with regard to the provision of value-added services based on open data?
Last Monday I took part in the panel “Reuse of EU open data: challenges and opportunities” during the Reuse of EU legal data Workshop, organized by the Publications Office of the European Union. One of the interesting issues that came up during the panel (you can watch it here) focused on the well-known question: How far should a public administration go with regard to the provision of services based on open government datasets?

The discussion, in the context of fostering open government data reuse, arises from the difficulty of finding a balance between the services that every public administration must provide to citizens for free and the space that should be left for private initiatives to create value from open government datasets. In many cases, that unstable balance creates certain tensions that do not contribute to innovation.

In the past few years, I have heard numerous arguments from both the supply and the demand side. These arguments range from one extreme, “public administrations should only provide raw data and no services,” to the opposite, “the public administration should go forward in the value chain as much as possible when providing services to citizens.”

My position on this matter, which I had the chance to defend during the debate, is that it is not useful to work on drawing a red line between what should be considered a basic/core service and a premium/value-added service. Quite the contrary, we should work on defining the minimum incentives that should be designed for open-data-driven innovation to flourish and deliver wealth creation.
Photo by: Rodion Kutsaev

For that reason, I used the panel to make the following statement, which could be a starting point to clearly define the minimum conditions that a reuser needs to create value-added services:

“open government datasets should be released in such condition that a reuser can build the same services that are provided for free by the data holder.”

This is basically because, in many cases, value creation starts from that baseline; that is, from just improving a service that already exists. If an existing public service cannot be reproduced, for example due to a few hours' delay in the release of the underlying dataset or because of the limited quality of the released data, then it will not be possible to innovate by launching an improved product or service to the market.

In my opinion, this approach to the issue can help us make some progress in this debate. I hope this first statement can be improved and better worded by contributions from the community, or otherwise proved wrong by evidence or better arguments than my own.

Tuesday, 15 March 2016

Let’s open more datasets, because what could go wrong?

Note: This article is a translation of what I wrote in Spanish for the International Open Data Conference 2016 blog. You can read the original post in Spanish: Abramos más conjuntos de datos, ¿qué puede salir mal? and in English: "Let’s open more datasets. What could go wrong?".
In conversations between members of the open data community, especially those responsible for providing data, one often overhears statements such as “it’s necessary to stimulate the demand for open data,” “we can’t reach the reusers,” “it would be interesting if data providers and reusers talked more.” I am sure that you have heard such statements on many occasions.
Most probably, this uneasiness is not unknown to the IODC organizers, who must be aware that previous editions of the event have mostly focused on what is usually called the “supply side,” that is, the public organizations in charge of holding and providing open datasets. What is true is that in Spain, possibly because the Ministry of Industry is the one promoting open data policies, reuse companies have always been encouraged to be very present at open data events. And this will surely be noticed in the program of the 4th IODC next October.

However, I would like to tell you a secret that could help explain why, apparently, the long-awaited demand for open data is not there: it turns out that for reuse companies, it is often more productive to obtain data from the web than to use open data portals. Unfortunately, technologies for extracting data from documents have advanced much faster in recent years than the datasets available in portals.

Even though it is quite inefficient and we may not like it, it is currently the only possible way in many sectors for companies to generate value from data. In other sectors, where no data is published, either in documents or in datasets, there is no demand to stimulate. Companies, especially small companies, survive on the value that they can create and sell today, not on future promises.

If you were a company, where would you put resources? Into an open source library to improve a data-extraction algorithm for PDFs, or into taking part in circular arguments about the best way of opening data?
In my opinion, as I am on the “demand side,” I would like IODC 2016 to be a turning point, not so much to define more standards, indexes, policies and laws, but to obtain an agreement to publish more useful datasets.

If we actually aim to encourage innovation and the creation of value from open data, I suggest we flood portals with useful datasets. What could go wrong? Actually, much of this data is already inside documents published on the web, and much effort is being put into extracting and cleaning it when it could rather be put into creating value from data.

Wednesday, 15 February 2012

The important thing about the EU Open Data License is not which License will be selected.

Note: This article is a translation of what I wrote in Spanish for my personal blog. You can see the original post in Spanish: "Lo importante de la Licencia Open Data europea es que exista, no cuál será la licencia elegida".

I have often written about how the open data community in the UK is working to become a leading global force. The Prime Minister, David Cameron, is leading an ambitious Open Data agenda aimed at boosting the digital economy in Britain. I have also been very critical of Spanish political leadership on open government issues and of the lack of ambition of the open data initiatives already launched in Spain.

However, something seems to be changing. Today, I have to say that I am very proud of the Spanish Open Data Community because of its leadership of and support for the EU Open Data license campaign. As you know, the European Council is now conducting the negotiations for the revision of the RISP Directive (the Directive on the re-use of public sector information, PSI), and a few days ago Andres Nin launched a campaign on his blog to request a single licensing model for open data in the European Union (#1OdataLicenseEU). To date, over 330 supporters have signed the campaign, some as relevant as Patxi Lopez, president of the Basque Government. And surely many more will join in the coming days.

As you know, I am supporting the campaign because I believe that a single EU license is very important for the development of Open Data companies such as Euroalert.

However, during these past days, while I have been following and supporting the campaign, some relevant people and organizations of the European open data community told me why they are not actively supporting it. The main reasons concern discussions about which license would be selected, or whether it would be better to include an Open Data Definition rather than just a license.

In my humble opinion, at this point, it is not important to agree on which is the most appropriate license as there are a number of licenses that would fit perfectly.
"What is truly important is that we have a single Open Data license for all European Union countries to strengthen the single market"
And I am very concerned that this discussion may be reducing the strength of the campaign. It would be really sad if interests in the choice of the license made us miss this opportunity. So, let's support the inclusion of a single Open Data license in the RISP Directive and then let's work so the license can be as simple as the one proposed by Alberto Ortiz on his blog. I wish it could be that easy.

Friday, 3 February 2012

A single Open Data licence is very important for companies

Note: This article is a translation of what I wrote in Spanish for my personal blog. You can see the original post in Spanish: "Una licencia open data única es muy importante para las empresas".

As you all know, at Euroalert we are working on the exciting challenge of building a pan-European platform to aggregate tendering and procurement data from all public authorities in the European Union Member States. A few months ago, following my presentation at the First Digital Agenda Assembly, I wrote about the importance of a single open data licence for the development of a pan-European data market.

The European Commission is currently reviewing the Directive on the Reuse of Public Sector Information, and a new draft was published in December. There is therefore a great opportunity to establish a single licensing model for open data in the European Union. However, the ideal would be a single global open data licence.

I will share a true example to elaborate on the subject. Euroalert aggregates data from many diverse sources with the most heterogeneous licenses, inspired by the laws of different countries and not always compatible with each other. Sometimes we've been asked, especially by NGOs, to release aggregated databases of procurement data for studies or other projects. Although we would have been glad to donate these datasets, we could not do it because of the restrictions of the licenses. As you know, the licences of some datasets forbid mixing their data with other databases, others set limitations on commercial re-use, and some even limit any treatment other than publishing the data as we receive it.

Just studying the legal implications of the redistribution of our aggregated raw databases is something that we could not afford. Our project that will publish a Linked Open Data node for procurement data is facing a similar problem that could be easily solved with a single EU license.

Andres Nin yesterday launched a petition "Say to @neeliekroesEU we want a single #opendata licence in the #EU" to raise awareness of the matter. This is truly a key issue in the development of companies that aim to create wealth through pan-European initiatives for the reuse of public data. And one more opportunity to help the development of a European single market in which companies powered by open data, like Euroalert, operate. I encourage you to sign the petition to the European Commissioner Neelie Kroes and to help us in its promotion in order to make the voice of the Open Data Community heard in the European Institutions.

Tuesday, 5 July 2011

About the Digital Agenda Assembly and Open Data licenses

Note: This article is a translation of what I wrote in Spanish for my personal blog. You can see the original post in Spanish: "La Asamblea de la Agenda Digital Europea y las licencias Open Data".

On June 16th I was lucky to be one of the 1300 participants in Brussels at the 1st Digital Agenda Assembly, the first high-level event organized by the European Commission about the Digital Agenda for Europe. The main objective was to evaluate progress on the objectives and actions of Europe's Digital Agenda, which as you know is one of the 7 strategic actions of the Europe 2020 strategy, aimed at putting Europe on the path of smart, sustainable and inclusive growth. No small thing.
by Jose M. Alonso

I was also honoured to be a speaker at the seminar held in the plenary room: "Open data and re-use of public sector information", where I talked about Euroalert and our experience building a business powered by open data that provides information services about public procurement for European SMEs. At Euroalert's blog you can find more information about the presentation, which was live streamed, and the pictures taken by Jose M. Alonso.

The two days were pretty intense and lots of discussions took place, many as a follow-up to the May Share PSI seminar, either in person or via Twitter (see hashtag #daa11psi and stats). You can find a great summary at the Open Data@CTIC blog. I'm going to focus on two important details and one announcement that I'd like to share with you.

The first one is the speech about the State of the Digital Union, as Vice-President Neelie Kroes called her speech at the first plenary session. I recommend that you have a look at it, because in these times of politicians who are illiterate when it comes to technology, this remarkable woman is an inspiration. I never imagined I would recommend a politician's speech here. I hope she will achieve all these ambitious goals.

The second one is a tweet from Michele Barbera, quoting Federico Morando, which did not have a big impact, though it represents an important topic I've been discussing with members of the open data community and which I find extremely important for the development of a pan-European market for data re-use.

Many of us believe that if the future revision of the PSI directive endorses a simple license, applicable by default to datasets released by governments, a critical roadblock would be removed, especially for companies that operate with a pan-European vision. From the point of view of a company like Euroalert, which creates value from data aggregated from multiple sources and countries, a single EU license would bring legal certainty to operations in the single digital market.

It seems to me that the pursuit of interoperability for the growing number of data licenses is becoming a holy grail that threatens to become one of the greatest barriers to data re-use. The idea implemented in the draft of the Spanish PSI Royal Decree, which includes as an Annex a very simple license to be applicable by default, would in my opinion be ideal to copy into the new directive. Maybe with an EU logo or seal recognizable to all operators... but I am not qualified to judge which one is the best license to be included and endorsed by the directive.

Moreover, thanks to the long networking sessions (great success) of the DAA, I was finally able to spend some time with Chris Taggart figuring out how Euroalert and Open Corporates can exchange data and information for the benefit of our users and the Open Data community at large. Soon we will be releasing more details of what we hope will be a small contribution to the European single market.

The truth is that I came back very happy to belong to such an active and motivated community, which is attracting more and more members every day and, step by step, is becoming mainstream. Too bad that in this June full of events I have missed the great OKCon2011 in Berlin. You cannot be everywhere.

Thursday, 19 May 2011

Adobe and the role of PDF in the Open Data context

Note: This article is a translation of what I wrote in Spanish for my personal blog. You can see the original post in Spanish: "El papel de Adobe y los PDF dentro del panorama Open Data".

Source: Wikipedia
Last week in Brussels a lot was discussed about Open Data in the Share PSI Workshop, but I also had a number of interesting conversations in the time between sessions. As I said, one issue that caught my attention was the attendance of large technology companies at an event about Open Data. Some, like IBM and Orange, made their presentations based on the position papers they had sent, and others, like Adobe, attended with relevant representatives. So far I had only identified the work done by Microsoft to enter the scene with its OGDI initiative, aimed at positioning Azure as an open-data-friendly technology, and the occasional, very low-profile attendance at relevant Open Data events of companies such as Telefonica or Google.

As I wrote, this step forward from large technology companies, which I think has very much to do with the work of ETSI and W3C, will be a good thing if we manage to focus their interests in the right direction, that is to say, Open Data, but done well. And that's what this post is about. One of the points that I stressed in my presentation about Euroalert, as did others like Chris Taggart or François Bancilhon, was that releasing public sector data in Adobe PDF format is not adequate for reuse. It is clearly the best format for distributing information (reports, documents, presentations, etc.) but, like HTML itself, it is not suited to publishing datasets or machine-readable information.

Mr Marc Straat spoke from the audience to tell us how Adobe is working on PDF technology so it can evolve into a more useful format within the Open Data context. I must admit that I did not know about the potential of PDF as a container for other types of information, and after reading the article My PDF Hammer, which Marc told me about in a very pleasant conversation over lunch, I think I have a clear idea of what he meant.

I find the idea very interesting that a PDF container may associate the usual PDF file with its editable original version, whatever format it comes from: a Microsoft Word document (.doc), OpenOffice or LibreOffice (.odt), or whatever. If Adobe works to ensure that all the tools that convert documents to PDF embed the source file, and helps to disseminate and encourage the use of the feature, I think it would be a great step forward. And it would be excellent news if governments adopted as common practice the distribution of their reports in PDF along with the original file and datasets, with the PDF file acting as a container.

However, after thoroughly reading the article, the idea of using PDF as a container for open data files seems to me an even worse idea than I first thought. I really see no advantage in using a PDF container instead of a simple ZIP file to distribute XML datasets along with XSD schemas and their documentation or manuals, the latter of course in PDF.

On the other hand, I do see a major drawback. No programming language has native support for processing PDF files, while there are many well-known options for dealing with ZIP and of course XML, XSD or plain text. This means that an almost trivial data processing task, for which many well-known open source tools exist, could be turned into a problem that requires licenses and very specific knowledge, with no additional benefit for developers in exchange.
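
To show how trivial the ZIP route can be, here is a minimal sketch using only the Python standard library (`zipfile` and `xml.etree.ElementTree`). The archive layout, the file names (`dataset.xml`, `dataset.xsd`) and the XML structure are invented for the example; the archive is built in memory so the snippet is self-contained.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# Hypothetical dataset content, invented for illustration.
SAMPLE_XML = (
    "<tenders>"
    '<tender id="T1"><title>Road works</title></tender>'
    '<tender id="T2"><title>School supplies</title></tender>'
    "</tenders>"
)


def build_sample_archive():
    """Create an in-memory ZIP that mimics a published dataset package."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("dataset.xml", SAMPLE_XML)
        # Placeholder only: a real package would ship the actual XSD schema.
        archive.writestr("dataset.xsd", "<!-- schema would go here -->")
    buffer.seek(0)
    return buffer


def read_titles(zip_file, member="dataset.xml"):
    """Open the archive, parse one XML member and return {id: title}."""
    with zipfile.ZipFile(zip_file) as archive:
        with archive.open(member) as handle:
            root = ET.parse(handle).getroot()
    return {
        tender.get("id"): tender.findtext("title")
        for tender in root.iter("tender")
    }


titles = read_titles(build_sample_archive())
print(titles)
```

No third-party package or license is involved at any step, which is the practical advantage argued for here.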

As a conclusion, I will say that I do not believe that solutions based on PDF as a container for open data should be promoted. Considering existing tools, it is much more practical for re-users to deal with information distributed in ZIP containers. Instead, it seems a great idea to start encouraging the practice of embedding the original files and even XML datasets within PDF reports or documents to facilitate reuse.

By the way, as a Linux user, I keep waiting for a version of Adobe Acrobat Reader for my platform (x86_64). At present I am not able to open most of the files that make use of advanced PDF features such as forms, published by public authorities.