3 Questions…

Why reinvent the wheel? We have some frequently asked questions from the Reddit forum our team ran a few years back. We’ve sifted through them and picked three of the more general questions we’ve received over the years. Also, check out our FAQ page for some other common questions. Do you have a specific question for our team? Send it to us via opendata@toronto.ca.

What do you recommend as a tutorial for exploring your data? For example, when someone first stumbles upon a dataset, how do you recommend going about extracting valuable information from it?

Everybody has their own approach to data exploration, since it’s a bit of art and a bit of science. Although we don’t have a video tutorial, we are creating “data stories” to share how we analyze data. These are a new concept, and we are still refining them.

Further, here’s roughly what we suggest when you first stumble upon a new dataset:

  1. Learn about the context. Why was the data collected, and how? What are its known limitations? This usually helps minimize confusion later in the process (e.g. maybe some data is missing, some values are defaults, or the collection method changed at some point, so standardization will be needed).
  2. Review the data attributes (e.g. columns in a table) to get an idea of what the data contains. Make note of datetime fields at this stage, because durations (i.e. the time between datetimes) may be possible to derive.
  3. Identify questions you would like to answer. The focus is not on what can actually be answered; rather, this keeps your thinking from becoming too narrow at the beginning. If working in a team, do this separately first to generate better and more diverse ideas.
  4. Narrow down to the questions you think can be answered with the data in the timeframe available. These aren’t really “final”, but they provide guidance while exploring the data; without them it’s too easy to get caught in a never-ending data exploration cycle, so the questions also serve as a finish line.
  5. Prepare the data. This includes initial cleaning, such as standardizing date formats and ensuring attributes are typed correctly (e.g. numbers are not treated as text); reshaping it so it can be visualized; and transforming it by creating new attributes as needed, such as durations from date fields (see the first sketch after this list).
  6. View each attribute individually to better understand it and identify outliers (e.g. if it’s a number, look at the distribution of values). Profiling tools such as the Python Pandas Profiling library (https://github.com/pandas-profiling/pandas-profiling) make this easy; we are also working on the Pandas Exploration Toolkit, but it’s at a very early stage and still customized for our use: https://github.com/open-data-toronto/petk. A short profiling sketch follows the list.
  7. Visualize the data, now that it’s prepared to work with your software, using the research questions to guide the exploration. As you learn more about the data throughout this process, update your questions and assumptions. Here’s an example of a visualization dashboard for data exploration from our first data story, built in Tableau: https://public.tableau.com/profile/carlos.hernandez#!/vizhome/BuildingPermits-SampleExplorationDashboard/BuildingPermits-DataExplorationDashboard (a minimal Python alternative is sketched after the list).
  8. Make note of all the exceptions, assumptions, and questions about the data that come up during steps 5-7, to bring to the dataset’s subject matter expert, if you’re fortunate enough to have access to one.
  9. After all this, with a much better understanding of the data, the full-fledged analysis starts. We’ve depicted this as a linear process for ease of communication, but in practice it’s very cyclical.
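
To make step 5 concrete, here’s a minimal pandas sketch of that kind of preparation. The file name and column names (issued_date, completed_date, est_const_cost) are hypothetical, chosen only to illustrate the cleaning and transformation steps:

    import pandas as pd

    # Hypothetical file and column names, for illustration only.
    df = pd.read_csv("permits.csv")

    # Standardize date formats: parse text columns into proper datetimes.
    df["issued_date"] = pd.to_datetime(df["issued_date"], errors="coerce")
    df["completed_date"] = pd.to_datetime(df["completed_date"], errors="coerce")

    # Ensure numbers are not treated as text.
    df["est_const_cost"] = pd.to_numeric(df["est_const_cost"], errors="coerce")

    # Transform: derive a duration attribute from the two date fields.
    df["duration_days"] = (df["completed_date"] - df["issued_date"]).dt.days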
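For step 6, a profiling report can be generated in a couple of lines. This sketch assumes the df prepared above and uses the Pandas Profiling API (the exact call may vary by library version):

    from pandas_profiling import ProfileReport

    # Generate an HTML report with per-column distributions, missing
    # values, and correlations; the output file name is just an example.
    profile = ProfileReport(df, title="Dataset profile")
    profile.to_file("profile_report.html")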
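And for step 7, while our dashboard was built in Tableau, a quick exploratory chart can also be sketched in Python. This hypothetical example plots the distribution of the duration_days attribute derived earlier:

    import matplotlib.pyplot as plt

    # Histogram of the derived duration attribute, to spot outliers.
    df["duration_days"].plot.hist(bins=50)
    plt.xlabel("Duration (days)")
    plt.title("Distribution of durations")
    plt.show()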

How can we get more involved with using Open Data?

Come out to meetups, hackathons, events, and co-design sessions! We’d suggest finding a social issue you’re interested in, such as ridesharing, and thinking about the ways in which you can participate meaningfully. Everyone brings a breadth of skills that can make them integral to planning processes, whether that’s research, analysis, product design, or facilitating conversations.

Since there’s a lot of public concern surrounding data collection, analysis, and usage, do you publish a document with best practices for your purposes?

Our mandate is to help release good-quality data that contains no confidential or private content. The data released is collected from across the City and our agencies. The divisions are the data stewards and subject matter experts for the data collected and maintained in City repositories.

When we initially embark with a division on releasing a dataset, we collect all sorts of metadata about that particular data: collection method, storage location, descriptions, limitations, data dictionaries, etc. We publish readme files or data dictionaries that help users understand the content of the data. If a user needs further clarification, we help them get in touch with the division that supplied the data for more information or analysis. We also host our policy document on the open data site.