Saturday, 20 August 2016

Some very simple practices to help with the reuse of open datasets

Note: This article is a translation of what I wrote in Spanish for the International Open Data Conference 2016 blog. You can read the original post in Spanish, "Algunas prácticas muy sencillas que facilitan la reutilización de conjuntos de datos abiertos", and in English, "Some very simple practices to help with the reuse of open datasets".
In the past few years, a significant number of materials have been published to help data holders with the release of open government data. In his article “The good, the bad…and the best practices”, Martin Alvarez gathered a total of 19 documents, including guides, manuals, good practices, toolkits, cookbooks, etc. Different kinds of authors have invested resources in the development of these materials: national governments, regional governments, foundations, standardization bodies, the European Commission, etc.; hence the number of different perspectives.
On the other hand, a large amount of effort is still being made in the development of standards for the release of open government datasets, either for general purposes or specific domains.
Photo by: Barn Images
Too often, however, very simple rules that facilitate the sustainable reuse of open datasets are forgotten when datasets are published. Here are just some of the obstacles we often find when we explore a new dataset and assess whether it is worth incorporating into our service:
  1. Records do not include a field with a unique identifier, which makes it very difficult to monitor changes when the dataset is updated.
  2. Records do not contain a field with the date when they were last updated, which also makes it harder to track which records have changed from one published version to the next.
  3. Records do not contain a field with the date of creation, which makes it difficult to know when each one was added to the dataset.
  4. Fields do not use commonly agreed standards for the type of data they contain. This often occurs in fields with dates and times, or economic values, but it is also common in other fields.
  5. Inconsistencies between the content of the dataset and its equivalent published on HTML web pages. Inconsistencies can be of many types, from records published on the website and not exported to the dataset to differences in fields that are published in one format or the other.
  6. The record is published on the dataset much later than on the website. This can make a dataset useless for reuse if the service requires immediacy.
  7. Service Level Agreements on the publication of datasets are not stated explicitly. Judging those agreements as good or bad is not that important; what really matters is that they are known, as it is very hard to plan data reuse ahead when you do not know what to expect.
  8. No simple description is provided of the content of the fields and the structure of the dataset, nor of the relevant criteria needed to analyze that content (lists of elements for factor variables, update criteria, meaning of different states, etc.).
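To illustrate why the first three points matter, here is a minimal sketch of how a reuser might detect changes between two published versions of a dataset once every record carries a unique identifier and a last-updated date. The field names `id` and `updated_at`, the ISO 8601 timestamps and the sample records are assumptions made up for this example; they do not come from any real dataset.

```python
from datetime import datetime


def diff_versions(old_records, new_records):
    """Compare two published versions of a dataset, keyed by unique id.

    Assumes (hypothetically) that each record is a dict with an "id"
    field and an ISO 8601 "updated_at" field.
    """
    old_by_id = {record["id"]: record for record in old_records}
    new_by_id = {record["id"]: record for record in new_records}

    # Ids present only in the new version were added; only in the old, removed.
    added = sorted(new_by_id.keys() - old_by_id.keys())
    removed = sorted(old_by_id.keys() - new_by_id.keys())

    # A record counts as updated when its last-updated date moved forward.
    updated = sorted(
        record_id
        for record_id in new_by_id.keys() & old_by_id.keys()
        if datetime.fromisoformat(new_by_id[record_id]["updated_at"])
        > datetime.fromisoformat(old_by_id[record_id]["updated_at"])
    )
    return added, removed, updated


# Made-up sample data for two consecutive publications of the same dataset.
old_version = [
    {"id": "A1", "updated_at": "2016-08-01T09:00:00+00:00"},
    {"id": "A2", "updated_at": "2016-08-01T09:00:00+00:00"},
    {"id": "A3", "updated_at": "2016-08-01T09:00:00+00:00"},
]
new_version = [
    {"id": "A1", "updated_at": "2016-08-01T09:00:00+00:00"},
    {"id": "A2", "updated_at": "2016-08-15T12:30:00+00:00"},
    {"id": "A4", "updated_at": "2016-08-15T12:30:00+00:00"},
]

added, removed, updated = diff_versions(old_version, new_version)
print("added:", added, "removed:", removed, "updated:", updated)
```

Without the identifier and the date, the same comparison would require matching records field by field and guessing which differences are real changes.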
As you can see, these practices are not necessarily linked to open-data-related work; rather, they stem from experience in software development projects, or simply from common sense.

Even though most of them are very easy to implement, they matter greatly when it comes to convincing somebody to invest their time in an open dataset. As you may know, resorting to web scraping can be more convenient than reusing open datasets, and these few simple practices make the difference.

Saturday, 6 August 2016

How far should a public administration go with regard to the provision of value-added services based on open data?

Note: This article is a translation of what I wrote in Spanish for the International Open Data Conference 2016 blog. You can read the original post in Spanish, "¿Hasta dónde debe llegar la administración en la prestación de servicios sobre datos abiertos?", and in English, "How far should a public administration go with regard to the provision of value-added services based on open data?".
Last Monday I took part in the panel “Reuse of EU open data: challenges and opportunities” during the Reuse of EU legal data Workshop, organized by the Publications Office of the European Union. One of the interesting issues that came up during the panel (you can watch it here) focused on the well-known question: How far should a public administration go with regard to the provision of services based on open government datasets?

The discussion, in the context of fostering open government data reuse, arises from the difficulty of finding a balance between the services that every public administration must provide to citizens for free and the space that should be left for private initiatives to create value from open government datasets. In many cases, that unstable balance creates tensions that do not contribute to innovation.

In the past few years, I have heard numerous arguments from both the supply and the demand side. They range from one extreme, “public administrations should only provide raw data and no services,” to the opposite, “the public administration should go as far forward in the value chain as possible when providing services to citizens.”

My position on this matter, which I had the chance to defend during the debate, is that it is not useful to work on drawing a red line between what should be considered a basic/core service and a premium/value-added service. On the contrary, we should work on defining the minimum incentives needed for open-data-driven innovation to flourish and deliver wealth creation.
Photo by: Rodion Kutsaev

For that reason, I used the panel to make the following statement, which could be a starting point for clearly defining the minimum conditions that a reuser needs to create value-added services:

“Open government datasets should be released in such a condition that a reuser can build the same services that are provided for free by the data holder.”

This is basically because, in many cases, value creation starts from that baseline; that is, from just improving a service that already exists. If an existing public service cannot be reproduced, for example due to a few hours' delay in the release of the underlying dataset or because of the limited quality of the released data, then it will not be possible to innovate by launching an improved product or service on the market.

In my opinion, this approach to the issue can help us make some progress in this debate. I hope this first statement can be improved and better worded by contributions from the community, or otherwise proved wrong by evidence or better arguments than my own.

Tuesday, 15 March 2016

Let’s open more datasets, because what could go wrong?

Note: This article is a translation of what I wrote in Spanish for the International Open Data Conference 2016 blog. You can read the original post in Spanish, "Abramos más conjuntos de datos, ¿qué puede salir mal?", and in English, "Let’s open more datasets. What could go wrong?".
In conversations between members of the open data community, especially those responsible for providing data, one often overhears statements such as “it’s necessary to stimulate the demand for open data,” “we can’t reach the reusers,” “it would be interesting if data providers and reusers talked more.” I am sure that you have heard such statements on many occasions.
Most probably, this uneasiness is not unknown to the IODC organizers, who must be aware that previous editions of the event have mostly focused on what is usually called the “supply side,” that is, the public organizations in charge of holding and providing open datasets. It is true that in Spain, possibly because it is the Ministry of Industry that promotes open data policies, reuse companies have always been encouraged to have a strong presence at events about open data. And this will surely be noticeable in the program of the 4th IODC next October.

However, I would like to tell you a secret that could help explain why the long-awaited demand for open data apparently does not exist: it turns out that for reuse companies, it is often more productive to obtain data from the web than to use open data portals. Unfortunately, technologies for extracting data from documents have advanced in recent years much faster than the datasets available in portals.

Even though it is quite inefficient and we may not like it, in many sectors it is currently the only possible way for companies to create value from data. In other sectors, where no data is published, either in documents or in datasets, there is no demand to stimulate. Companies, especially small companies, survive on the value that they can create and sell today, not on future promises.

If you were a company, where would you put resources? Into an open source library to improve a data-extraction algorithm for PDFs, or into taking part in circular arguments about the best way of opening data?
In my opinion, as I am on the “demand side,” I would like IODC 2016 to be a turning point, not so much to define more standards, indexes, policies and laws, but to reach an agreement to publish more useful datasets.

If we actually aim to encourage innovation and the creation of value from open data, I suggest we flood portals with useful datasets. What could go wrong? Actually, much of this data is already inside documents published on the web, and much effort is being put into extracting and cleaning it when it could instead be put into creating value from data.

Wednesday, 15 February 2012

The important thing about the EU Open Data License is not which License will be selected.

Note: This article is a translation of what I wrote in Spanish for my personal blog. You can see the original post in Spanish: "Lo importante de la Licencia Open Data europea es que exista, no cuál será la licencia elegida".

I have often written about how the open data community in the UK is working to become a leading global force. The Prime Minister, David Cameron, is leading an ambitious Open Data agenda aimed at boosting the digital economy in Britain. I have also been very critical of Spanish political leadership on open government issues and of the lack of ambition of the open data initiatives already launched in Spain.

However, something seems to be changing. Today, I have to say that I am very proud of the Spanish Open Data Community because of its leadership and support for the EU Open Data license campaign. As you know, the European Council is now carrying out the negotiations for the revision of the PSI Directive (RISP, in its Spanish acronym), and a few days ago Andres Nin launched, on his blog, a campaign to request a single licensing model for open data in the European Union (#1OdataLicenseEU). To date, over 330 supporters have signed up to the campaign, some as prominent as Patxi Lopez, president of the Basque Government. And surely many more will join in the coming days.

As you know, I am supporting the campaign because I believe that a single EU license is very important for the development of Open Data companies such as Euroalert.

However, during these past days, while I have been following and supporting the campaign, some relevant people and organizations in the European open data community have told me why they are not actively supporting it. The main reasons relate to discussions about which license would be selected, or whether it would be better to include an Open Data Definition rather than just a license.

In my humble opinion, at this point, it is not important to agree on which is the most appropriate license, as there are a number of licenses that would fit perfectly.
"What is truly important is that we have a single Open Data license for all European Union countries to strengthen the single market"
And I am very concerned that this discussion may be weakening the campaign. It would be really sad if disagreements over the choice of license made us miss this opportunity. So, let's support the inclusion of a single Open Data license in the PSI (RISP) Directive and then let's work so that the license can be as simple as the one proposed by Alberto Ortiz on his blog. I wish it could be that easy.

Friday, 3 February 2012

A single Open Data licence is very important for companies

Note: This article is a translation of what I wrote in Spanish for my personal blog. You can see the original post in Spanish: "Una licencia open data única es muy importante para las empresas".

As you all know, at Euroalert we are working on the exciting challenge of building a pan-European platform to aggregate tendering and procurement data from all public authorities in the European Union Member States. A few months ago, following my presentation at the First Digital Agenda Assembly, I wrote about the importance of a single open data licence for the development of a pan-European data market.

The European Commission is currently reviewing the Directive on the Reuse of Public Sector Information, and a new draft was published in December. There is therefore a great opportunity to establish a single licensing model for open data in the European Union. However, the ideal would be a single global open data licence.

I will share a real example to elaborate on the subject. Euroalert aggregates data from many diverse sources with the most heterogeneous licenses, inspired by the laws of different countries and not always compatible with each other. We have sometimes been asked, especially by NGOs, to release aggregated databases of procurement data for studies or other projects. Although we would have been glad to donate these datasets, we could not do it because of the restrictions of the licenses. As you know, the licences of some datasets forbid mixing their data with other databases, others limit commercial re-use, and in some cases they even restrict any treatment other than publishing the data as we receive it.

Just studying the legal implications of redistributing our aggregated raw databases is something that we could not afford. Our project to publish a Linked Open Data node for procurement data faces a similar problem, which could be easily solved with a single EU license.

Yesterday Andres Nin launched a petition, "Say to @neeliekroesEU we want a single #opendata licence in the #EU", to raise awareness of the matter. This is truly a key issue for the development of companies that aim to create wealth through pan-European initiatives for the reuse of public data. It is also one more opportunity to help the development of a European single market in which companies powered by open data, like Euroalert, operate. I encourage you to sign the petition to the European Commissioner Neelie Kroes and to help us promote it in order to make the voice of the Open Data Community heard in the European Institutions.

Tuesday, 5 July 2011

About the Digital Agenda Assembly and Open Data licenses

Note: This article is a translation of what I wrote in Spanish for my personal blog. You can see the original post in Spanish: "La Asamblea de la Agenda Digital Europea y las licencias Open Data".

On June 16th I was lucky to be one of the 1300 participants in Brussels at the 1st Digital Agenda Assembly, the first high-level event organized by the European Commission about the Digital Agenda for Europe. The main objective was to evaluate progress on the objectives and actions of Europe's Digital Agenda, which as you know is one of the 7 strategic actions of the Europe 2020 strategy, aimed at putting Europe on the path of smart, sustainable and inclusive growth. No small thing.
Photo by: Jose M. Alonso

I was also honoured to be a speaker at the seminar held in the plenary room, "Open data and re-use of public sector information", where I talked about Euroalert and our experience building a business powered by open data that provides information services about public procurement for European SMEs. On Euroalert's blog you can find more information about the presentation, which was live streamed, and the pictures taken by Jose M. Alonso.

The two days were pretty intense and lots of discussions took place, many as a follow-up to the May Share PSI seminar, either in person or via Twitter (see hashtag #daa11psi and stats). You can find a great summary at the Open Data@CTIC blog. I'm going to focus on two important details and one announcement that I'd like to share with you.

The first one is the speech about the State of the Digital Union, as Vice-President Neelie Kroes called her address at the first plenary session. I recommend that you have a look at it, because in these times of politicians who are illiterate when it comes to technology, this remarkable woman is an inspiration. I never imagined I would recommend a politician's speech here. I hope she will achieve all these ambitious goals.

The second one is a tweet from Michele Barbera, quoting Federico Morando, which did not have a big impact, though it touches on an important topic I've been discussing with members of the open data community and which I find extremely important for the development of a pan-European market for data re-use.

Many of us believe that if the future revision of the PSI directive endorses a simple license, applicable by default to datasets released by governments, a critical roadblock would be removed, especially for companies that operate with a pan-European vision. From the point of view of a company like Euroalert, which creates value from data aggregated from multiple sources and countries, a single EU license would bring legal certainty to operations in the single digital market.

It seems to me that the pursuit of interoperability among the growing number of data licenses is becoming a holy grail that threatens to turn into one of the greatest barriers to data re-use. The idea implemented in the draft of the Spanish PSI Royal Decree, which includes as an Annex a very simple license to be applied by default, would in my opinion be ideal to copy into the new directive. Maybe with an EU logo or seal recognizable to all operators... but I am not qualified to judge which license is the best one to be included and endorsed by the directive.

Moreover, thanks to the long networking sessions (a great success) of the DAA, I was finally able to spend some time with Chris Taggart figuring out how Euroalert and Open Corporates can exchange data and information for the benefit of our users and the Open Data community at large. Soon we will be releasing more details of what we hope will be a small contribution to the European single market.

The truth is that I came back very happy to belong to such an active and motivated community, which is attracting more and more members every day and is step by step becoming mainstream. Too bad that in this June full of events I missed the great OKCon2011 in Berlin. You cannot be everywhere.

Thursday, 19 May 2011

Adobe and the role of PDF in the Open Data context

Note: This article is a translation of what I wrote in Spanish for my personal blog. You can see the original post in Spanish: "El papel de Adobe y los PDF dentro del panorama Open Data".

Source: Wikipedia
Last week in Brussels a lot was discussed about Open Data in the Share PSI Workshop, but I also had a number of interesting conversations in the time between sessions. As I said, one issue that caught my attention was the attendance of large technology companies at an event about Open Data. Some, like IBM and Orange, made their presentations based on the position papers they had sent, and others, like Adobe, attended with relevant representatives. So far I had only identified the work done by Microsoft to enter the scene with its OGDI initiative, aimed at positioning Azure as an open-data-friendly technology, and the occasional, very low-profile attendance at relevant Open Data events of companies such as Telefonica or Google.

As I wrote, this step forward from large technology companies, which I think has very much to do with the work of ETSI and W3C, will be a good thing if we manage to focus their interests in the right direction, that is to say Open Data, but well done. And that's what this post is about. One of the points that I stressed in my presentation about Euroalert, as did others like Chris Taggart and François Bancilhon, was that releasing public sector data in Adobe PDF format is not adequate for reuse. It is clearly the best format for distributing information (reports, documents, presentations, etc.) but, like HTML itself, it is not suited to publishing datasets or machine-readable information.

Mr Marc Straat spoke from the audience to tell us how Adobe is working on PDF technology so that it can evolve into a more useful format within the Open Data context. I must admit that I did not know about the potential of PDF as a container for other types of information, and after reading the article My PDF Hammer, which Marc told me about in a very pleasant conversation over lunch, I think I have a clear idea of what he meant.

I find the idea that a PDF container may associate the usual PDF file with its editable original version very interesting, whatever format it comes from: a Microsoft Word document (.doc), an OpenOffice or LibreOffice document (.odt), or anything else. If Adobe works to ensure that all the tools that convert documents to PDF embed the source file, and helps to disseminate and encourage the use of the feature, I think it would be a great step forward. And it would be excellent news if governments made it common practice to distribute their reports in PDF along with the original file and datasets, using the PDF file as a container.

However, after thoroughly reading the article, the idea of using PDF as a container for open data files seems to me an even worse idea than I first thought. I really see no advantage in using a PDF container instead of a simple ZIP file to distribute XML datasets along with their XSD schemas and their documentation or manuals, the latter of course in PDF.

On the other hand, I do see a major drawback. No programming language has native support for processing PDF files, while there are many well-known options for dealing with ZIP and of course XML, XSD or plain text. This means that an almost trivial data processing task, for which many well-known open source tools exist, could be turned into a problem that requires licenses and very specific knowledge, with no additional benefit for developers in exchange.
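As a rough illustration of how little is needed on the ZIP side, the following sketch reads an XML dataset out of a ZIP archive using only Python's standard library (`zipfile` and `xml.etree.ElementTree`). The archive contents, the file name `dataset.xml` and the `notice`/`title` element names are hypothetical, invented only for this example.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Hypothetical XML dataset, invented for this example.
SAMPLE_XML = (
    "<notices>"
    '<notice id="N1"><title>Road maintenance</title></notice>'
    '<notice id="N2"><title>School supplies</title></notice>'
    "</notices>"
)


def build_sample_archive():
    """Create an in-memory ZIP holding a dataset and its documentation."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("dataset.xml", SAMPLE_XML)
        archive.writestr("README.txt", "Description of the fields goes here.")
    buffer.seek(0)
    return buffer


def read_notices(archive_file, member="dataset.xml"):
    """Return (id, title) pairs from the XML dataset stored in the archive."""
    with zipfile.ZipFile(archive_file) as archive:
        # Read the member as bytes and parse it with the standard XML parser.
        root = ET.fromstring(archive.read(member))
    return [
        (notice.get("id"), notice.findtext("title"))
        for notice in root.findall("notice")
    ]


notices = read_notices(build_sample_archive())
print(notices)
```

No third-party package or license is involved here; most other mainstream languages offer comparable ZIP and XML support in their standard libraries or in widely used open source ones.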

As a conclusion, I will say that I do not believe that solutions based on PDF as a container for open data should be promoted. Considering existing tools, it is much more practical for re-users to deal with information distributed in ZIP containers. Instead, it seems a great idea to start encouraging the practice of embedding the original files and even XML datasets within PDF reports or documents to facilitate reuse.

By the way, as a Linux user, I keep waiting for a version of Adobe Acrobat Reader for my platform (x86_64). At present I am not able to open most of the files that make use of advanced PDF features such as forms, published by public authorities.

Monday, 16 May 2011

Road Blocks to a Pan European Market for PSI Reuse, a long summary

Note: This article is a translation with a few add-ons of what I wrote in Spanish for my personal blog. You can see the original post in Spanish: "Obstáculos para el desarrollo de un mercado pan-europeo de reutilización, un largo resumen".

SharePSI #daa1psi
Source: ePSIplatform
On Tuesday 10th and Wednesday 11th I took part in the workshop "Removing Road Blocks to a Pan European Market for PSI Reuse", held in Brussels by the Share PSI initiative. It was superbly organized by W3C and ETSI for the European Commission, and gathered a good number of members of the European open data community: governments, businesses and civil society organizations.

The European Commission will use the output of the SharePSI workshop at the 1st Digital Agenda Assembly event: "Beyond raw data: public sector information, done well". Ultimately the contributions, like those obtained through the public consultation on Open Data held at the end of 2010, will help to make the reform of the PSI directive richer and more effective.

In my opinion, compared to other seminars, the level of the discussion was very high in most of the sessions, though a few topics are recurrent at this type of event (pricing, licensing, return on investment and privacy). Clearly, this shows that we have not been able to resolve these issues satisfactorily. On a few occasions it was also clear that not everyone is at the same level of discussion, but that is entirely normal because the open data community is growing at a rapid pace and many new people are joining the discussion.

I think the seminar was very intense and productive and this was largely due to the excellent work done by the program committee and especially by Margot Dor (ETSI) and Thomas Roessler (W3C) to create the workshop programme from the large number of position papers sent from all over Europe.

As you know, I usually attend this type of event, and this time one of the things that caught my attention was the presence of representatives of large companies in the discussion. Until now it was rare for Adobe, IBM or Orange to show interest in the Open Data movement. And I strongly believe that this is a good thing, because their software and their position in government IT services can provide solutions that will drive the development of a more effective Open Data.

I guess that their presence has much to do with W3C and ETSI. I hope they are here to stay and contribute much to the debate and the solutions, though for now they are still far from the more advanced group. However, I also believe that it is the responsibility of those of us who have long been part of the debate to bring them round to the vision of what the main objective of the Big Idea is: Open Data, but well done.

I will also highlight the number of national government representatives that I could identify in the room (at least from Spain, Denmark, the Netherlands and Finland), as well as New Zealand's representation in the person of Laurence Millar, who described to us the situation in his country, which is enviable in many respects, such as its very active community of developers.

I found the discussion on the pricing of meteorological datasets very interesting, along with the apparently long-running dispute that the Association of Private Meteorological Services (PRIMET) has now brought into the open data ring. I think it is a good thing that this happened and that these discussions come to enrich the open data debate. Several new use cases also caught my attention, like the very interesting FearSquare, presented by Andrew Garbett, or the impressive Arcticweb that Erin Lynch showed us.

On the other hand, it was a pleasure to hear entrepreneurs like François Bancilhon speaking about his work at Data Publica, or Chris Taggart about his excellent Open Corporates, which I have been following for a while. The risks that people like them are taking contribute greatly to pushing the boundaries of what can be done, although they will surely have to face problems, because they are disrupting the established situation. They have my most sincere admiration, respect and support to keep going.

On my side, I presented the work that Euroalert is doing to develop our 10ders Information Services platform, which aggregates data on procurement notices across the EU. You can find the slides and the summary of the talk at the Euroalert Blog. I was also the moderator of the second half of the session on Use Cases, where we heard the complaints of the Federation of European Publishers about the difficulties they face in competing with what they still call the culture of free. I was surprised by their approach in the context of Open Data, which I believe is, once again, completely misguided. I hope they will take a more positive position in the future. I was also lucky to have in my session one of the best quotes of the event, from Hervé Rannou of ITEMS International, who presented the lessons learned in the Open Data project of the City of Marseille: "The use of the data is infrastructure, like roads".

On June 16th, at the 1st Digital Agenda Assembly, we will see the most interesting outcomes and conclusions that the European Commission has harvested from this Workshop. I hope they will help us take firm steps towards a more favourable environment for market growth based on the development of new information services. In short, for companies powered by open data, as I like to call them. I also hope that among all of them Euroalert will be a remarkable Open Data company, both because of the success of its value proposition and because of our contribution to the development of this environment.

To finish this long post, though the occasion deserved it, I will leave some resources that you will find useful to dive into what was said in the workshop. I have used them to review what was said in the last two sessions, which I could not attend. I highly recommend reading the excellent collaborative notes, which faithfully reflect the discussions. You can also check out the tweet archive created by the University of Lincoln, the slides used by the speakers, the list of Twitter accounts of attendees, the position papers submitted or the photos of the event.