How to produce and use datasets: lessons learned

Various studies have focused on the complexities of publishing and using (open) data. A number of lessons can be learned from the experiences of (governmental) data providers, policy-makers, data users, entrepreneurs, competitors and researchers.

Data can be provided by the government, crawled from the web, or generated by sensors. Here are 50 lessons learned in the form of tips and guidelines on creating and using high-quality open datasets.

Publishing data

Organizational structure

1. Involve all appropriate stakeholders at an early stage. This secures support for the data initiative from both supply-side and demand-side actors.

2. Clearly explain the reasons for the data initiative and mission statement to everyone involved.

3. Organize roundtables to shape the data initiative, develop a business case and find out which already available data can be restructured and added to the dataset.

4. Release high-value and high-impact data first. Count data requests to see which data are popular. Conduct surveys to rank the public's priorities and interests.

5. Take away concerns about source trustworthiness, data provenance and the legal aspects of re-use. Publish under a trusted username or on a trusted platform. Provide a link to the data maintainer and/or webmaster.

6. Discuss and describe the license and any other legal aspects (such as non-disclosure) clearly and up front. A free and open license for a dataset could be CC0; that way a user is free to do what she wants with the data. Present the conditions and terms of usage. Discuss inside the institution whether the restrictions are in balance with the goals of the data sharing initiative.

7. Grant the data sharing initiative the time and resources required to complete and evaluate it. Make sure there is enough time to get the details and user adoption right. Find, employ or educate human resources to work with the data.

8. Set up a data sharing protocol inside your organization, covering everything from analyzing the data to processing and publishing it.

9. Create a user feedback loop. Be open to and patient with user feedback. Use it to iterate on and improve the quality of future datasets.

Data quality

10. Ensure data quality. Remove duplicates. Remove empty or broken records. Check the dataset for strange outliers due to (measurement) artifacts. Check the dataset for completeness and statistical significance (sample size).
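A minimal sketch of these checks in Python with pandas (the file name and column layout are hypothetical):

```python
import pandas as pd

df = pd.read_csv("dataset.csv")  # hypothetical file name

# Remove exact duplicates and records that are entirely empty.
df = df.drop_duplicates()
df = df.dropna(how="all")

# Flag rows more than three standard deviations out in any numeric
# column; these are candidates for (measurement) artifacts, not proof.
numeric = df.select_dtypes("number")
zscores = (numeric - numeric.mean()) / numeric.std()
outliers = (zscores.abs() > 3).any(axis=1)
print(f"{outliers.sum()} candidate outlier rows out of {len(df)}")
```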

11. Set up a quality data distribution platform and monitor its performance. Restore broken links. Set up descriptive, accessible and valid HTML pages to describe your datasets (provide domain knowledge). Focus on usability and user experience: do not make your users think about how to perform basic actions.
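Monitoring for broken links can be as simple as a periodic script. A sketch, assuming a list of dataset download URLs (the URLs below are placeholders):

```python
import requests

# Placeholder URLs; replace with your own dataset download links.
urls = [
    "https://example.org/data/crime-2014.csv",
    "https://example.org/data/geo-2014.csv",
]

for url in urls:
    try:
        # HEAD keeps the check cheap; follow redirects like a browser would.
        response = requests.head(url, allow_redirects=True, timeout=10)
        status = response.status_code
    except requests.RequestException as exc:
        status = f"error: {exc}"
    print(url, "->", status)
```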

12. Ensure easy-to-understand content and formatting. Attach metadata stating descriptive and legal information, coverage, measurement equipment, timeliness and reliability.

13. Establish compatibility and interoperability of systems. Do not use non-standard, closed-source or single-OS data formats; .CSV files are a very popular format among data users. Offer a data quality feedback system for users and analyze the feedback. Put the data in context.

14. Set up version control. Keep both access to and a history of the raw data and the processed data. Know where to find the files to delete if required by procedure or compliance. Keep notes on the canonical source file and any releases.

15. Hash the data and encrypt it during transfer. Create a hash of the files so anyone can verify their contents. Transfer datasets over a secure connection.
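For example, a SHA-256 checksum published next to the download link lets anyone verify the file. A minimal sketch (file name hypothetical):

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute a SHA-256 checksum without loading the whole file in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Publish this value alongside the dataset so users can verify downloads.
print(sha256_of("dataset.csv"))
```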

16. Check the dataset for privacy-sensitive identifiers. Anonymize or pseudonymize these identifiers by removing or substituting them.
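One common pseudonymization approach is keyed hashing, which replaces an identifier with a stable pseudonym so records can still be linked. A sketch, assuming a hypothetical citizen_id column; note that plain (unkeyed) hashing of identifiers is often reversible by brute force, so the key must stay secret:

```python
import hashlib
import hmac

import pandas as pd

SECRET_KEY = b"replace-with-a-long-random-secret"  # keep out of the dataset!

def pseudonymize(value: str) -> str:
    # HMAC-SHA256 gives a stable pseudonym that cannot be reversed
    # without the key; truncated for readability.
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

df = pd.read_csv("dataset.csv")  # hypothetical file and columns
df["citizen_id"] = df["citizen_id"].astype(str).map(pseudonymize)
df = df.drop(columns=["name", "address"])  # drop direct identifiers entirely
```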

17. Check the dataset for business-sensitive identifiers. Some data may contain information that could reveal a competitive advantage (profit figures, sales numbers, inventory).

User support and communication

18. Plan appropriate outreach campaigns and support conversations around data:

  • Set up social media accounts and a blog.
  • Showcase success stories or interesting technical challenges.
  • Create wikis and tutorials.
  • Post on data science community websites, such as /r/datasets and DataTau.
  • Monitor Twitter and search engine alerts to find out if your dataset is mentioned somewhere.
  • Organize a machine learning competition (on Kaggle).
  • Offer (financial) incentives to work with the data.
  • Produce interesting visualizations and infographics.
  • Contact schools and universities for educational use.
  • Team up with a data journalist.

19. Facilitate interaction, error handling and user feedback through formal processes, coordination mechanisms and dedicated employees.

20. Implement a contact form, a forum, error reporting, a way to report confidentiality concerns, and responsible disclosure. Monitor and log user actions and save error reports. Ask permission to survey your users.

21. Organize events, such as app and machine learning competitions, hackathons and workshops. Also visit the ones organized by other data providers.

22. Develop user skills. Organize boot camps, master classes and e-learning courses. Provide free mentoring and advice. Create clear documentation, tutorials, case studies, how-tos and FAQs.

23. Translate domain knowledge into comprehensible terms. Have domain experts engage users on forums or mailing lists to get them up to speed.

24. Provide additional tools to work with the data: plug-ins, cloud computing infrastructure, software libraries, and conversion and munging scripts.

Sustainability

25. Support the building of a community of data users, like journalists, civic hackers, non-profits, citizens and academics. Organize regular community meet-ups.

26. Create a marketplace for ideas, data services and people looking to team up. This could be a website, a mailing list or a forum.

27. Support deployment of newly developed services, implementations and usage.

28. Fund well-developed apps so they can grow into scalable models. Help competition winners and expert data users get seed grants, jobs, mentoring and advice.

29. Integrate data-driven content and services into organizations and (government) operations. For example: Code Fellowships hosting civic coders in government agencies, media and civil society. Employ data engineer evangelists.

Avoiding adoption barriers

There are numerous institutional barriers to publishing datasets. Managers and policy-makers can help clear such barriers once they are identified inside the organization.

30. Institutional barriers:

  • Emphasis on barriers and neglect of opportunities
  • Unclear trade-offs between public values (e.g. transparency vs. privacy)
  • Risk-averse culture (no entrepreneurship)
  • No uniform policy for publicizing data
  • Making public only non-value-adding data
  • No resources or budget with which to publicize data
  • Fostering the organization's interests at the expense of citizen interests
  • No process for dealing with user input
  • Debatable quality of user input

31. Task complexity barriers:

  • Lack of ability to discover appropriate data or data with potential
  • No access to the original data (only processed data)
  • No information about the context, relevancy and quality of the data
  • Duplication of data, data available in many formats, debate over the source of data

32. Use and participation barriers:

  • There are no incentives for the users and the organization does not react to user input
  • Frustration with the data sharing initiative; little or zero time to worry about the details
  • Unexpectedly escalating costs; legal and privacy concerns
  • Lack of the knowledge or interest needed to make sense of the data

Use Case: Publishing Open Data

Open Data & the Government

Governments have been gathering data for their own use for decades. This includes interesting data on geographical and meteorological matters, as well as environmental pollution, crime and law enforcement.

Traditionally this data was only accessible to government experts. The full potential of this data as a catalyst for app development, democratic transparency, innovation and research was not realized.

Open data, according to the Open Knowledge Foundation, is data that can be freely used, shared and built on by anyone, anywhere, for any purpose.

Some researchers and policy makers add another requirement: the data needs to be structured (machine-readable). Questionnaire data stored away in non-selectable .PDF documents, or big data without an API, is not really open data under that requirement. Researchers working with such tedious datasets need to perform heavy pre-processing or manual labour to "free" the data.

Open can apply to information from any source and about any topic. Anyone can release their data under an open licence for free use by and benefit to the public. Although we may think mostly about government and public sector bodies releasing public information such as budgets or maps, or researchers sharing their results data and publications, any organisation can open information (corporations, universities, NGOs, startups, charities, community groups and individuals). (Open Knowledge Foundation)

In 2011 European Commissioner Neelie Kroes said that "data is the new gold". Expectations were high. The Open Knowledge Foundation started ranking countries on data availability.

The reality, however, included some snags. Organizations needed a change of mindset. Decision trees needed to be implemented to aid faster data sharing. And organizations faced privacy concerns.

Research and Documentation Centre (WODC)

The WODC is a criminal justice knowledge centre in the Netherlands. It aims to make a professional contribution to the development and evaluation of justice policy set by the Ministry of Security and Justice.

Their Statistical Data and Policy Analysis division provides policy information to ministries, the police, the public prosecutor's office, the media and academic researchers.

They further:

  • collect, maintain, integrate, and query judicial data sources,
  • produce crime statistics,
  • monitor development and measure performance within the Dutch Justice chains,
  • produce forecasts of the capacity demand of the Dutch Justice chains,
  • write a statistical yearbook called “Crime and law enforcement”,
  • conduct research on topics such as e-government and cyber crime.

Their problems

They felt the demand and drive for opening data was high. However, for them, opening data had both benefits and drawbacks.

They knew there was a risk of a privacy breach even when data is anonymized or aggregated (see the AOL search log leak). Even with properly anonymized data, individuals may be identified by combining different data sources. This creates a conflict between replicability/completeness and trust/security.

Their solutions:

A data sharing protocol

The WODC now offers three kinds of access: open access, where they publish the data online; restricted access, where privacy-sensitive data is given to selected scientific organizations; and demand-driven access, where highly aggregated data is sent after receiving a WOB request (similar to a Freedom of Information request).

All requests for data are monitored and audited. They’ve established a strict set of procedures for data sharing to ensure privacy is maintained. They are in compliance with standards and security policies.

A data sharing procedure

  1. Analyse the type and content of the data, the purpose of the data publication, and any restrictions.
  2. Preparation: retrieve, process and (pseudo-)anonymize the data.
  3. Publication: transfer the data and establish conditions and rules for data access and reuse.

Methods for privacy and security

Only share personal or privacy-sensitive data with trusted scientists, and only when strictly necessary. For all other purposes, delete any attributes that may lead to disclosure of identity.

Avoid publishing (statistical) data based on a small sample size. Data are shared at the highest level of aggregation possible: statistical and aggregate data are preferred over data on individuals. Use strong encryption when transferring data. Whitelist IPs for external data access. Erase the data after a predefined period.
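A minimal sketch of aggregation with small-cell suppression in pandas (the record-level file, the column names and the threshold of 10 are all hypothetical; pick a threshold that matches your disclosure policy):

```python
import pandas as pd

records = pd.read_csv("cases.csv")  # hypothetical record-level data

# Publish counts per region instead of individual records, and suppress
# cells based on fewer than 10 records (a common small-cell rule).
counts = records.groupby("region").size().rename("n_cases").reset_index()
counts.loc[counts["n_cases"] < 10, "n_cases"] = pd.NA  # suppressed
counts.to_csv("cases_aggregated.csv", index=False)
```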

Using datasets

If you are a data publisher, work through these guidelines as if you were a user, and identify and solve any issues that appear.

Quality control

33. Measure data quality. How many researchers have looked at the data ("given enough eyeballs, all bugs are shallow")? What is the skill level of the data publisher? How easy is independent verification?

34. Establish who the creator and/or maintainer of the dataset is. Perform a background check on the data source as you would when writing an investigative article. Establish the motive for releasing the data (academic, compliance, group effort, commercial, leaking, propaganda etc.). Find and document provenance.

[Figure: streets mapped in OpenStreetMap vs. editor locations in OpenStreetMap]

35. Collect metadata and measurement data. Are any more (column header) descriptions for the dataset available? How was the data measured and gathered? How was the research structured?

36. Make sure you have permission to access the data. Even publicly available data crawled from the web may cause legal problems if you lack permission or break the Terms of Service (for example: no automated access). Check out the license and terms so you are at least aware of them.

37. Set up version control. Store the raw dataset and your pre-processed datasets. You may remove extraneous columns, stem text, or reduce the dataset while working on it. Keep notes on the original raw dataset and any processing you’ve done.

38. Check for quality issues like duplication, missing records and incompleteness. Check for (near) duplicates inside and between datasets (using hashes or fast all-pairs similarity search). Remove, ignore, restore or fix "NA" values (for example by replacing them with the mean).

39. Check for outliers, noise and malformed structure. Print the min and max of value columns. Check statistical relevance, the level of noise and coherence.
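A quick pandas sketch covering the checks in tips 38 and 39 (file name hypothetical):

```python
import pandas as pd

df = pd.read_csv("dataset.csv")  # hypothetical file name

# Tip 38: duplicates and missing values.
print("exact duplicates:", df.duplicated().sum())
df = df.drop_duplicates()
print(df.isna().mean().sort_values(ascending=False))  # fraction missing
numeric_cols = df.select_dtypes("number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

# Tip 39: min/max and other summary statistics expose impossible values
# such as negative ages or placeholder codes like -999.
print(df.describe())
```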

40. Get a feel for the data. Open the datasets and manually inspect a few records. Familiarize yourself with the toolkits for data scientists to get a better research vs. data munging balance (for example: Pandas + Python, RapidMiner, OpenRefine). Read more about the context and domain of the data and the data publisher.

User feedback & Community

41. Rate datasets (and check other data users' ratings). If the data distribution platform allows it, add user feedback through voting and/or leaving a comment. Learn from the other data users.

42. Report successful usage of the data to the data publishers. Did you create a nice app, research article or visualization with the data? Make this known to the data publishers. This rewards their efforts and in turn they may reward you with free marketing and expert insights into the data (or provide access to more data sources).

43. Team up. Work together with other data users to create an ensemble of insights and techniques. Be approachable. Let others know you are working with the data and that you are willing to join forces.

44. Post data munging code and/or conversions for all to use. If the dataset is in .xlsx, then more than one user will have to convert it to .csv. If the dataset contains duplicates, then more than one user benefits from a deduplication script. The same goes for a script that automates API access. Post code and tools to work with the data and check out code posted by other data users.
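For example, a shared one-off conversion script (file names hypothetical; reading .xlsx with pandas requires the openpyxl package):

```python
import pandas as pd

# Convert the published .xlsx to .csv once, deduplicate, and share the
# result so other users do not have to repeat the work.
df = pd.read_excel("dataset.xlsx")
df = df.drop_duplicates()
df.to_csv("dataset.csv", index=False)
```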

45. Report (privacy) issues with the data. Is the dataset incomplete, does it contain many errors or duplicates, does the data make it possible to identify individuals? Responsibly disclose privacy issues and constructively disclose technical issues.

46. Set up a data user forum or data wiki (or participate in an existing one). Building a community and/or marketplace around a data source improves knowledge and cooperation. It also gives data users a stronger (aggregated) voice, making it more likely that data publishers act on the issues you raise.

Enhancing the data and drawing conclusions

47. Combine datasets to build better datasets. If the datasets allow it, join data on unique identifiers to create richer datasets. Beware of problems such as duplicate IDs or relying too much on conversions. Some datasets may offer street addresses, others longitude and latitude; linking such datasets can benefit from crowdsourcing due to their fuzzy and complex nature.
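A minimal pandas sketch of such a join (the files, column names and the many-to-one relationship are hypothetical); the validate argument makes the duplicate-ID failure mode explicit:

```python
import pandas as pd

crimes = pd.read_csv("crimes.csv")    # hypothetical record-level data
regions = pd.read_csv("regions.csv")  # hypothetical lookup table

# A join on a non-unique key silently multiplies rows, so check first.
assert regions["region_id"].is_unique, "duplicate IDs in lookup table"

# validate="m:1" raises if the right-hand keys turn out not to be unique.
enriched = crimes.merge(regions, on="region_id", how="left", validate="m:1")
```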

48. Do not overfit or draw unjustified conclusions from a small sample size. If the feature space is large and the sample size is small, there is an increased risk of overfitting on your dataset, and models won't generalize to other datasets. In a similar vein: do not draw conclusions or create data visualizations from small or untrustworthy data sources. Do your data journalism on correctly and significantly aggregated values.
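A self-contained illustration with scikit-learn on synthetic noise data: with 100 features and only 30 samples, the model fits the training data almost perfectly, while cross-validation reveals chance-level performance:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: 30 samples, 100 features, labels that are pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 100))
y = rng.integers(0, 2, size=30)

model = LogisticRegression(max_iter=1000).fit(X, y)
print("train accuracy:", model.score(X, y))                       # near 1.0
print("cv accuracy:", cross_val_score(model, X, y, cv=5).mean())  # near 0.5
```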

49. Create attractive visualizations, reports and graphs. Popular visualizations require little domain expertise to convey information, conclusions and/or (business) intelligence. Find feature weights (which columns are most indicative). "Raw data is both an oxymoron and a bad idea" (Bowker).

Pragmatic visualization is what we term the technical application of visualization techniques to analyze data. The goal of pragmatic visualization is to explore, analyze, or present information in a way that allows the user to thoroughly understand the data. — Robert Kosara

50. Create a data report for yourself or your superior(s). Present/communicate your results. For example: online, to management, in the data research community, to the data publishers.

[Figure: the semantic web and its links as of 2011]

Use Case: Using open energy data and healthcare data

Linking open energy data

Chris Davis is a postdoc in the Energy & Industry group at TU Delft. He tries to link up the many available data sources, with a focus on industrial ecology and open data.

According to him, energy and sustainability are among the most important topics of our century.

The problems

Researchers repeat a lot of work. Research is very data intensive. To get a clear picture one needs both aggregated and fine-grained data. Connecting all this data is tedious. The energy sector is only slowly embracing data sharing initiatives.

The solutions

Create a platform, enipedia.tudelft.nl, where researchers can cooperate to avoid duplicate work. Working with multiple editors/data users ensures that bugs are found earlier, that facts are double-checked and that huge tasks do not fall on the shoulders of a few individual researchers.

[Figure: Enipedia's technical infrastructure]

Also start a debate among data publishers to clear up any social issues; the technology is already here, it is the social issues that are holding us back. Have debates about data quality and perform research on what constitutes data quality.

Privacy-sensitive healthcare data

The Dutch have started programs to supply heroin addicts with methadone as part of a treatment and risk reduction program. Heroin addicts could register at local distribution points run by the Public Health Authority and would be provided small doses of methadone (to reduce resale and misuse). The addict would be registered with the Public Health Authority and sometimes with their personal physician.

The problems

This data is so privacy sensitive that data sharing initiatives, even among government organizations, were shunned. So when a heroin addict taking part in a methadone program was arrested and taken to jail, the prison doctors had no timely access to this data. Many promising rehabilitation projects were cut short, forcing the addict to go cold turkey. This resulted in an increase in relapses and deaths by overdose shortly after release from prison.

The solutions

In 2013 the Public Health Authority distributed 400,000 doses, including to jailed addicts.

User groups asked the data publishers and maintainers for a temporary identifier with which to identify participants in these programs. Records are deleted, aggregated or anonymized once the participant has completed the program. Such an anonymous chain ID prevents privacy issues while allowing maximal sharing: relevant parties have near on-demand access to the data.

An ID card with an expiration date was issued to the addict. Showing the ID at a distribution point allowed staff to check whether the person matched the photo on record. Showing the ID to the prison doctors served as identification and proof of participation in a program. Datasets and databases relevant to the program all referred to this new chain ID, which made joining and evaluating related data easier.

Further reading, Resources & References

  • Faculty of Technology, Policy and Management – Technical University of Delft (2014). Open data Workshop.
  • Essa, A. (2013) Python for data analysis (intro to the Pandas library). YouTube tutorial.
  • The introduction photo to this post was made by Jer Thorp. It visualizes data about hotels and restaurants in Europe.
  • W3C (2014) Best Practices for publishing linked data
  • Susha, I. (2014) Organizational measures and best practices to facilitate open data use.
  • Meijer, Conradie & Choenni "Reconciling Contradictions of Open Data Regarding Transparency, Privacy, Security and Trust"
  • Van den Braak, Choenni, Meijer & Zuiderwijk (2012) "Trusted third parties for secure and privacy-preserving data integration and sharing in the public sector"
  • Zuiderwijk, A. (2013) Towards an e-infrastructure to support the provision and use of open data. Conference for E-democracy and Open Government.
  • UNDESA (2013) Guidelines on open government data for citizen engagement
  • Krug, S. (2014) Don’t make me think, Revisited – A Common Sense Approach to Web Usability.
  • Janssen, M., Charalabidis, Y., Zuiderwijk, A. (2012) Benefits, Adoption Barriers and Myths of Open Data and Open Government.
  • Zuiderwijk, A., Janssen, M. (2014) Open data policies, their implementation and impact: A framework for comparison & Infomediary Business Models for Connecting Open Data Providers and Users
  • Zuiderwijk, A., Janssen, M., Choenni, S., Meijer, R. & Sheikh Alibaks, R. Socio-technical impediments of Open Data
  • World Bank (n.d.) Demand and Engagement – Open Government Data Toolkit
  • Gurstein, M. (2011) Open data: Empowering the empowered or effective data use for everyone
  • Davies, T. (2012) Supporting open data use through active engagement. W3C “Using open data” Workshop.
  • Lee, G. & Kwak, Y. (2011) An Open Government implementation model: moving to increased public engagement. IBM Center.
  • Bowker, G. C. (2008) Memory practices in the sciences.
  • Schneier, B. (2013) Talks at Google
  • Open Data Institute (n.d.) Guide – Engaging with re-users.
  • Gray, J., Bounegru, L., Chambers, L. (2014) Data journalism handbook
  • Van Veenstra, A & van den Broek, T. (2013) Open Moves – Drivers, Enablers and Barriers of Open Data in a Semi-public Organization. EGOV Conference.
  • Dutch Open Data Portal (2014)
  • Grijpink, J. (1999) Working with process chain automation – Information Strategy for the Data Society; Process chain automation – Theme: Identity Control
