Blog article: Dataset Readiness Criteria

Dataset Readiness Criteria

Toronto’s open data team is working hard to make preparing and releasing open data easier, faster and more efficient than ever. Over the last year, we researched and developed guidelines to improve the quality of our data. To do this, we assessed all 290+ datasets on the current portal and came up with a set of six evaluation criteria. These criteria assess how close a dataset is to being ready for automation and optimization. We are working with data stewards across the City to ensure that our data provides value. Each criterion is listed below with a brief description of what it means.

1. Source System Connection

A source system connection (SSC) refers to how a user accesses a data source. There are many benefits to having an SSC for your data instead of a static file (e.g. an Excel spreadsheet). For one, the SSC serves as a “source of truth” for your data, so data stewards no longer need to update many different file types. It can also help users: some datasets are very large and difficult to download efficiently, and others include more information than a user needs, so a connection that lets users retrieve only the part they need addresses both problems. The open data team will help guide City data stewards who don’t already have an SSC to set one up for their open dataset.

2. Open Data Readiness

An open dataset must readily import into data visualization and analysis tools like Tableau or PowerBI. Open file formats that support this include CSV, JSON, XML, and GeoJSON. File format isn’t the only factor that makes a dataset machine readable; the structure of the dataset matters too. The Open Data team will work with data stewards to improve the structure of their datasets. Structural improvements include removing merged cells, formulas, and summary data. Style elements like colours, fonts, and formatting should also be removed, because they can hinder the machine readability of your data. As a general rule, the first row of every column should be a heading that describes the values in that column, and each subsequent row should describe a single data entry.
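
As an illustration of the structural rules above, here is a minimal Python sketch that scans CSV text for a few common problems. The function name `find_structure_issues`, the specific checks, and the sample data are assumptions made for this example; they are not the Open Data team’s actual tooling.

```python
import csv
import io

def find_structure_issues(csv_text):
    """Return a list of structural problems found in the given CSV text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ["file is empty"]
    header, records = rows[0], rows[1:]
    issues = []
    # The first row should name every column, with no blanks or repeats.
    if any(not name.strip() for name in header):
        issues.append("header row has a blank column name")
    if len(set(header)) != len(header):
        issues.append("header row has duplicate column names")
    for line_no, record in enumerate(records, start=2):
        # Each row should be one data entry with one value per column.
        if len(record) != len(header):
            issues.append(
                f"row {line_no} has {len(record)} fields, expected {len(header)}"
            )
        # Cells that start with "=" are treated as leftover spreadsheet formulas.
        if any(cell.startswith("=") for cell in record):
            issues.append(f"row {line_no} contains a formula")
    return issues

# Hypothetical sample data for illustration only.
clean = "ward,year,permits\n1,2019,40\n2,2019,35\n"
messy = "ward,year,permits\n1,2019,40\nTotal,,=SUM(C2:C2)\n2,2019\n"

print(find_structure_issues(clean))
print(find_structure_issues(messy))
```

The clean sample produces an empty list, while the messy one is flagged for a formula in a summary row and for a row with a missing value. Merged cells and styling cannot be detected this way, because they do not survive export to CSV.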

3. User Demand

We want to make sure that data stewards are ready when data is requested. By looking at site analytics, search terms, and current events, the Open Data team can get a general sense of how ‘in-demand’ a dataset is. Just because a dataset doesn’t have many hits on the Open Data portal doesn’t mean that it’s not important or relevant. We also consider requests for datasets as an important factor.

4. Freshness

Data freshness refers not only to how often a dataset is updated, but also to how accurately the metadata represents the refresh rate. For example, if a dataset says that it is updated on a weekly basis, but the last data entry was 8 months ago, the dataset would have a lower rating. Please note that some datasets are updated less frequently by design. An example is a survey or evaluation that occurs every 10 years. Regardless, it’s important to ensure that metadata correctly represents how often a user can expect to see updates.
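
The comparison between a stated refresh rate and the actual last update can be sketched in a few lines of Python. The refresh-rate labels, the `EXPECTED_GAP` table, the `grace` multiplier, and the dates below are all hypothetical; the rating method the team actually uses is not described here.

```python
from datetime import date, timedelta

# Hypothetical mapping from a stated refresh rate to the expected gap
# between updates. Real metadata may use different labels.
EXPECTED_GAP = {
    "daily": timedelta(days=1),
    "weekly": timedelta(days=7),
    "monthly": timedelta(days=31),
    "annually": timedelta(days=366),
}

def is_fresh(refresh_rate, last_updated, today, grace=2):
    """Return True if the last update is within `grace` times the stated interval."""
    allowed = EXPECTED_GAP[refresh_rate] * grace
    return today - last_updated <= allowed

today = date(2020, 3, 1)  # fixed date so the example is reproducible

# Stated as weekly and updated six days ago: metadata matches reality.
print(is_fresh("weekly", date(2020, 2, 24), today))

# Stated as weekly but last updated eight months ago: metadata is misleading.
print(is_fresh("weekly", date(2019, 7, 1), today))

# Same eight-month gap, but the metadata says annually: still fresh.
print(is_fresh("annually", date(2019, 7, 1), today))
```

The third call shows why the check is relative to the metadata: an infrequent update is fine as long as the stated refresh rate says so.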

5. Data Granularity

Data should be as detailed as possible: non-aggregated, providing only raw values. This allows users to visualize and analyze the data as they need. Providing raw data rather than summary data (e.g. totals) makes it easy for users to use the data in innovative and creative ways. Aggregated data may be provided on a case-by-case basis, for example where it is impossible to publish granular data for privacy, technical, or legal reasons.
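
To see why raw values are more useful than totals, consider this small Python sketch. The records, field names, and the `count_by` helper are invented for illustration. From raw rows a user can compute any summary they like, whereas a published total cannot be broken back down.

```python
from collections import defaultdict

# Hypothetical raw records: one row per permit issued.
raw_records = [
    {"ward": "1", "month": "2019-01", "permit_type": "building"},
    {"ward": "1", "month": "2019-02", "permit_type": "demolition"},
    {"ward": "2", "month": "2019-01", "permit_type": "building"},
    {"ward": "2", "month": "2019-01", "permit_type": "building"},
]

def count_by(records, field):
    """Count records grouped by the value of one field."""
    totals = defaultdict(int)
    for record in records:
        totals[record[field]] += 1
    return dict(totals)

# The same raw data answers several different questions.
print(count_by(raw_records, "ward"))
print(count_by(raw_records, "month"))
print(count_by(raw_records, "permit_type"))
```

If only the per-ward totals had been published, the breakdowns by month and by permit type would not be recoverable.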

6. Proprietary Formats

A majority of the current open data catalogue is only available in proprietary formats. Proprietary formats, such as Excel spreadsheets, are file types that are the property of a particular software company like Microsoft. This limits who can access the data, as the end user typically requires a paid software license to open these files. In some cases, the files may not render correctly in visualization tools. Luckily, there are many universal open formats, such as CSV, that can be used instead and do not require special software to open or access. That’s why we will be moving to publishing in open formats only.
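
A rough way to find datasets that are only available in proprietary formats is to look at file extensions. The Python sketch below assumes a hypothetical catalogue listing and treats the formats named in this article (CSV, JSON, XML, GeoJSON) as open. It is an illustration, not the portal’s actual audit process.

```python
from pathlib import PurePath

# Extensions for the open formats named in this article.
OPEN_EXTENSIONS = {".csv", ".json", ".xml", ".geojson"}

def has_open_format(filenames):
    """Return True if at least one file has an open-format extension."""
    return any(
        PurePath(name).suffix.lower() in OPEN_EXTENSIONS for name in filenames
    )

# Hypothetical catalogue: dataset name mapped to its published files.
catalogue = {
    "building-permits": ["permits.xlsx"],
    "ward-boundaries": ["wards.geojson", "wards.xlsx"],
    "budget-summary": ["budget.XLS"],
}

proprietary_only = [
    name for name, files in catalogue.items() if not has_open_format(files)
]
print(proprietary_only)
```

Here `building-permits` and `budget-summary` are flagged, while `ward-boundaries` is not because it already offers a GeoJSON file alongside the spreadsheet.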