Thursday, 19 May 2011

Adobe and the role of PDF in the Open Data context

Note: This article is a translation of what I wrote in Spanish for my personal blog. You can see the original post in Spanish: "El papel de Adobe y los PDF dentro del panorama Open Data".

[Image: PDF. Source: Wikipedia]
Last week in Brussels a lot was discussed about Open Data at the Share PSI Workshop, but I also had a number of interesting conversations in the time between sessions. As I said, one issue that caught my attention was the attendance of large technology companies at an event about Open Data. Some, like IBM and Orange, gave their presentations based on the position papers they had submitted, and others, like Adobe, attended with relevant representatives. Until now I had only identified the work done by Microsoft to enter the scene with its OGDI initiative, aimed at positioning Azure as an open-data-friendly technology, and the occasional, very low-profile attendance of companies such as Telefonica or Google at relevant Open Data events.

As I wrote, this step forward by large technology companies, which I think has very much to do with the work of ETSI and W3C, will be a good thing if we manage to steer their interests in the right direction, that is to say: Open Data, but done well. And that is what this post is about. One of the points that I stressed in my presentation about Euroalert, and that others like Chris Taggart or François Bancilhon stressed as well, was that releasing public sector data in Adobe PDF format is not adequate for reuse. PDF is clearly the best format for distributing information (reports, documents, presentations, etc.) but, like HTML, it is not suited to publishing datasets or machine-readable information.

Mr Marc Straat spoke from the audience to tell us how Adobe is working on PDF technology so that it can evolve into a more useful format in the Open Data context. I must admit that I did not know about the potential of PDF as a container for other types of information, and after reading the article My PDF Hammer, which Marc told me about in a very pleasant conversation over lunch, I think I have a clear idea of what he meant.

I find very interesting the idea that a PDF container can bundle the usual PDF file with its editable original version, whatever format that comes in: a Microsoft Word document (.doc), an OpenOffice or LibreOffice file (.odt), or anything else. If Adobe works to ensure that the tools that convert documents to PDF also embed the source file, and helps to disseminate and encourage the use of this feature, I think it would be a great step forward. It would also be excellent news if governments made it common practice to distribute their reports as PDFs with the original file and the datasets embedded in the PDF container.
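To make the idea more concrete, here is a minimal sketch of how a publisher could attach the editable original and a dataset to an existing PDF using the open source pypdf library (not Adobe's own tooling); the file names report.pdf, report.odt and contracts.xml are just placeholders:

    # Sketch: embed the editable source and a dataset as PDF attachments.
    # Assumes the open source pypdf library; file names are placeholders.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader("report.pdf")
    writer = PdfWriter()

    # Copy the visible pages of the published report unchanged.
    for page in reader.pages:
        writer.add_page(page)

    # Attach the editable original and a machine-readable dataset
    # inside the PDF container.
    for name in ("report.odt", "contracts.xml"):
        with open(name, "rb") as f:
            writer.add_attachment(name, f.read())

    with open("report-with-sources.pdf", "wb") as out:
        writer.write(out)

Any PDF reader that understands attachments can then extract the originals from the single distributed file.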

However, after reading the article thoroughly, the idea of using PDF as a container for open data files seems to me even worse than I first thought. I really see no advantage in using a PDF container instead of a simple ZIP file to distribute XML datasets along with their XSD schemas and their documentation or manuals, which can of course be in PDF.

On the other hand, I do see a major drawback. No programming language has native support for processing PDF files, while there are many well-known options for dealing with ZIP and, of course, XML, XSD or plain text. This means that an almost trivial data processing task, for which many well-known open source tools exist, could be turned into a problem requiring licenses and very specific knowledge, with no additional benefit for developers in exchange.
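For comparison, consuming a dataset distributed as a plain ZIP container needs nothing beyond standard libraries. Here is a minimal sketch in Python, where the archive layout and element names (tenders.zip, tenders.xml, tender) are hypothetical:

    # Sketch: read an XML dataset straight out of a ZIP container using only
    # the Python standard library. File and element names are hypothetical.
    import zipfile
    import xml.etree.ElementTree as ET

    with zipfile.ZipFile("tenders.zip") as archive:
        with archive.open("tenders.xml") as dataset:
            root = ET.parse(dataset).getroot()

    # Iterate over the records in the dataset.
    for tender in root.iter("tender"):
        print(tender.findtext("title"), tender.findtext("deadline"))

Doing the same against data buried in a PDF container would require a third-party PDF library just to reach the attachment before any actual processing starts.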

In conclusion, I do not believe that solutions based on PDF as a container for open data should be promoted. Given the existing tools, it is much more practical for re-users to deal with information distributed in ZIP containers. On the other hand, it seems a great idea to start encouraging the practice of embedding the original files, and even XML datasets, within PDF reports or documents to facilitate reuse.

By the way, as a Linux user, I am still waiting for a version of Adobe Acrobat Reader for my platform (x86_64). At present I cannot open many of the files published by public authorities that make use of advanced PDF features such as forms.

Monday, 16 May 2011

Road Blocks to a Pan European Market for PSI Reuse, a long summary

Note: This article is a translation with a few add-ons of what I wrote in Spanish for my personal blog. You can see the original post in Spanish: "Obstáculos para el desarrollo de un mercado pan-europeo de reutilización, un largo resumen".

[Image: SharePSI #daa1psi. Source: ePSIplatform]
On Tuesday 10th and Wednesday 11th I took part in Brussels in the workshop "Removing Road Blocks to a Pan European Market for PSI Reuse" held by the Share PSI initiative. It was superbly organized by W3C and ETSI for the European Commission, and gathered a good number of members of the European open data community: governments, businesses and civil society organizations.

The European Commission will use the output of the SharePSI workshop at the 1st Digital Agenda Assembly event: "Beyond raw data: public sector information, done well". Ultimately the contributions, like those obtained through the public consultation on Open Data held at the end of 2010, will help to make the reform of the PSI directive richer and more effective.

In my opinion, compared to other seminars, the level of the discussion was very high in most of the sessions, though a few topics are recurrent at this type of event (pricing, licensing, return on investment and privacy). Clearly, this shows that we have not yet managed to resolve these issues satisfactorily. On a few occasions it was also clear that not everyone is at the same level of discussion, but that is entirely normal because the open data community is growing at a rapid pace and many new people are joining the debate.

I think the seminar was very intense and productive, and this was largely due to the excellent work done by the programme committee, and especially by Margot Dor (ETSI) and Thomas Roessler (W3C), in creating the workshop programme from the large number of position papers sent from all over Europe.

As you know, I usually attend this type of event, and this time one of the things that caught my attention was the presence of representatives of large companies in the discussion. Until now it was rare to see Adobe, IBM or Orange interested in the Open Data movement. I strongly believe that this is a good thing, because their software and their position in government IT services can provide solutions that will drive the development of more effective Open Data.

I guess that their presence has much to do with W3C and ETSI. I hope they are here to stay and will contribute much to the debate and the solutions, though for now they are still far from the more advanced group. However, I also believe that it is the responsibility of those of us who have been in the debate for a long time to bring them around to the vision of the main objective of the Big Idea: Open Data, but done well.

I would also highlight the number of national government representatives I could identify in the room (at least from Spain, Denmark, the Netherlands and Finland), as well as the New Zealand representation in the person of Laurence Millar, who described to us the situation in his country, which is enviable in many respects, such as its very active community of developers.

I found the discussion on the pricing of meteorological datasets very interesting, along with the apparently long-running dispute that has now been brought into the open data ring by the Association of Private Meteorological Services (PRIMET). I think it is a good thing that this happened and that these discussions come to enrich the open data debate. There were also several new use cases that caught my attention, such as the very interesting FearSquare, presented by Andrew Garbett, or the impressive Arcticweb that Erin Lynch showed us.

On the other hand, it was a pleasure to hear entrepreneurs like François Bancilhon speaking about his work at Data Publica, or Chris Taggart on his excellent OpenCorporates, which I have been following for a while. The risks that people like them are taking contribute greatly to pushing the boundaries of what can be done, although they will surely face problems, because they are disrupting the established order. They have my most sincere admiration, respect and support to keep going.

On my side, I presented the work that Euroalert is doing to develop our 10ders Information Services platform, which aggregates data on procurement notices across the EU. You can find the slides and a summary of the talk on the Euroalert Blog. I also moderated the second half of the session on Use Cases, where we heard the complaints of the Federation of European Publishers about the difficulties they face in competing with what they still call the culture of free. I was surprised by their approach in the context of Open Data, which I again believe is completely misleading. I hope they will take a more positive position in the future. I was also lucky to hear one of the best quotes of the event, from Hervé Rannou of ITEMS International, who presented the lessons learned in the Open Data project of the City of Marseille: "The use of the data is infrastructure, like roads".

On June 16th, at the 1st Digital Agenda Assembly, we will see the most interesting outcomes and conclusions that the European Commission has harvested from this workshop. I hope it will help us take firm steps towards a more favourable environment for market growth based on the development of new information services; in short, for companies powered by open data, as I like to call them. I also hope that Euroalert will stand out among them as a remarkable Open Data company, both because of the success of its value proposition and because of our contribution to the development of this environment.

To finish this long post, though the occasion deserved it, I will leave some resources that you will find useful to dive into what was said at the workshop. I have used them to review what was said in the last two sessions, which I could not attend. I highly recommend reading the excellent collaborative note-taking, which faithfully reflects the discussions. You can also check out the tweet archive created by the University of Lincoln, the slides used by the speakers, the list of Twitter accounts of attendees, the position papers submitted, or the snapshots of the event.