Saturday, 20 August 2016

Some very simple practices to help with the reuse of open datasets

Note: This article is a translation of a post I wrote in Spanish for the International Open Data Conference 2016 blog. You can read the original post in Spanish, "Algunas prácticas muy sencillas que facilitan la reutilización de conjuntos de datos abiertos", and in English, "Some very simple practices to help with the reuse of open datasets".
In the past few years, a significant number of materials have been published to help data holders with the release of open government data. In his article “The good, the bad…and the best practices”, Martin Alvarez gathered a total of 19 documents, including guides, manuals, good practices, toolkits, cookbooks, etc. Many kinds of authors have invested resources in developing these materials: national governments, regional governments, foundations, standardization bodies, the European Commission, etc.; hence the range of different perspectives.
In addition, a great deal of effort is still going into the development of standards for the release of open government datasets, either for general purposes or for specific domains.
Photo by: Barn Images
Too often, however, very simple rules that make the sustainable reuse of open datasets easier are forgotten when datasets are published. Here are some of the obstacles we often find when we explore a new dataset and assess whether it is worth incorporating into our service:
  1. Records do not include a field with a unique identifier, which makes it very difficult to monitor changes when the dataset is updated.
  2. Records do not contain a field with the date when each record was last updated, which also makes it hard to tell which records have changed from one published version to the next.
  3. Records do not contain a field with the date of creation, which makes it difficult to know when each one was added to the dataset.
  4. Fields do not use commonly agreed standards for the type of data they contain. This often occurs in fields with dates and times, or monetary values, but it is also common in other fields.
  5. Inconsistencies between the content of the dataset and its equivalent published on HTML web pages. Inconsistencies can be of many types, from records published on the website and not exported to the dataset to differences in fields that are published in one format or the other.
  6. Records are published in the dataset much later than on the website. This can make a dataset useless for reuse if the service requires immediacy.
  7. Service Level Agreements on the publication of datasets are not stated explicitly. Whether those agreements are good or bad matters less than whether they are known, as it is very hard to plan data reuse ahead when you do not know what to expect.
  8. No simple description is provided of the content of the fields and the structure of the dataset, or of the relevant criteria needed to interpret that content (lists of allowed values for categorical variables, update criteria, meaning of different states, etc.).
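To illustrate why points 1, 2 and 4 matter so much to a reuser, here is a minimal Python sketch of how two published versions of a dataset could be compared when every record carries a unique identifier and a last-updated date in ISO 8601 format. The field names `id` and `updated_at`, the sample records and the helper `diff_versions` are assumptions made up for this example; they do not come from any particular dataset or portal.

```python
from datetime import datetime


def index_by_id(records, id_field="id"):
    """Map each record's unique identifier to the record itself."""
    return {record[id_field]: record for record in records}


def diff_versions(old_records, new_records, id_field="id", updated_field="updated_at"):
    """Compare two versions of a dataset.

    Assumes every record has a unique identifier and an ISO 8601
    last-updated timestamp (both field names are hypothetical).
    """
    old = index_by_id(old_records, id_field)
    new = index_by_id(new_records, id_field)

    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    # A record counts as changed when its last-updated timestamp moved forward.
    changed = sorted(
        key
        for key in new.keys() & old.keys()
        if datetime.fromisoformat(new[key][updated_field])
        > datetime.fromisoformat(old[key][updated_field])
    )
    return {"added": added, "removed": removed, "changed": changed}


# Made-up sample data standing in for two consecutive publications.
previous_version = [
    {"id": "A1", "updated_at": "2016-08-01T09:00:00+00:00", "title": "Tender 1"},
    {"id": "A2", "updated_at": "2016-08-02T09:00:00+00:00", "title": "Tender 2"},
    {"id": "A3", "updated_at": "2016-08-03T09:00:00+00:00", "title": "Tender 3"},
]
current_version = [
    {"id": "A1", "updated_at": "2016-08-01T09:00:00+00:00", "title": "Tender 1"},
    {"id": "A2", "updated_at": "2016-08-15T12:30:00+00:00", "title": "Tender 2 (amended)"},
    {"id": "A4", "updated_at": "2016-08-16T08:00:00+00:00", "title": "Tender 4"},
]

result = diff_versions(previous_version, current_version)
print(result)
# {'added': ['A4'], 'removed': ['A3'], 'changed': ['A2']}
```

Without the identifier, the reuser would have to guess which records match across versions; without the standardized timestamp, every field of every record would have to be compared to find out what changed.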
As you can see, these practices are not specific to open data work; rather, they come from experience in software development projects, or simply from common sense.

Even though most of them are very easy to implement, they matter a great deal when it comes to convincing somebody to invest their time in an open dataset. As you may know, web scraping can sometimes be more convenient than reusing an open dataset, and these few simple practices can make the difference.
