Blog Post: Data Wrangling and Visualization

In this blog post, you’ll create several interesting, interactive data graphics using the NOAA climate data that we’ve explored in the first several weeks of lectures.

General Requirements

Your post should include not only code but also outputs (i.e. tables, figures) and expository writing that explains what you’re doing. Your target audience is a student who has completed PIC16A but hasn’t taken PIC16B yet (i.e. you before the start of the quarter).

See the “Specifications” section at the bottom for a detailed list of specs.

§1. Create a Database

First, create a database with three tables: temperatures, stations, and countries. Refer back to lecture notes on how to access country names and relate them to temperature readings. Keep these as three separate tables in your database.

Make sure to close the database connection after you are finished constructing it.
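
If you’re not sure where to start, here is a minimal sketch of the construction. The file names below are placeholders for wherever you saved the NOAA data, and you will likely want to clean and stack the temperature data (as in lecture) before writing it to the database:

import sqlite3
import pandas as pd

# open a connection to climate.db; the file is created if it doesn't exist yet
conn = sqlite3.connect("climate.db")

# write each data set to its own table in the database
# (placeholder file names: use the paths where you saved the NOAA data)
pd.read_csv("temperatures.csv").to_sql("temperatures", conn, if_exists="replace", index=False)
pd.read_csv("station-metadata.csv").to_sql("stations", conn, if_exists="replace", index=False)
pd.read_csv("countries.csv").to_sql("countries", conn, if_exists="replace", index=False)

# don't forget this step!
conn.close()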

§2. Write a Query Function

Write a Python function called query_climate_database() which accepts four arguments:

  • country, a string giving the name of a country (e.g. ‘South Korea’) for which data should be returned.
  • year_begin and year_end, two integers giving the earliest and latest years for which data should be returned.
  • month, an integer giving the month of the year for which data should be returned.

The return value of query_climate_database() is a Pandas dataframe of temperature readings for the specified country, in the specified date range, in the specified month of the year. This dataframe should have columns for:

  • The station name.
  • The latitude of the station.
  • The longitude of the station.
  • The name of the country in which the station is located.
  • The year in which the reading was taken.
  • The month in which the reading was taken.
  • The average temperature at the specified station during the specified year and month. (Note: the temperatures in the raw data are already monthly averages, so you don’t have to do any aggregation at this stage.)

For example:

query_climate_database(country = "India", 
                       year_begin = 1980, 
                       year_end = 2020,
                       month = 1)
               NAME  LATITUDE  LONGITUDE  Country  Year  Month   Temp
0     PBO_ANANTAPUR    14.583     77.633    India  1980      1  23.48
1     PBO_ANANTAPUR    14.583     77.633    India  1981      1  24.57
2     PBO_ANANTAPUR    14.583     77.633    India  1982      1  24.19
3     PBO_ANANTAPUR    14.583     77.633    India  1983      1  23.51
4     PBO_ANANTAPUR    14.583     77.633    India  1984      1  24.81
...             ...       ...        ...      ...   ...    ...    ...
3147     DARJEELING    27.050     88.270    India  1983      1   5.10
3148     DARJEELING    27.050     88.270    India  1986      1   6.90
3149     DARJEELING    27.050     88.270    India  1994      1   8.10
3150     DARJEELING    27.050     88.270    India  1995      1   5.60
3151     DARJEELING    27.050     88.270    India  1997      1   5.70

3152 rows × 7 columns
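
One possible shape for this function is sketched below. The SQL assumes the schema from lecture (in particular, that station IDs begin with the two-character FIPS 10-4 country code stored in the countries table, and that the countries table has a column named "FIPS 10-4"), so adjust the joins and column names to match your own database:

import sqlite3
import pandas as pd

def query_climate_database(country, year_begin, year_end, month):
    """
    Return a dataframe of temperature readings for the given country,
    in the given span of years, in the given month of the year.
    """
    conn = sqlite3.connect("climate.db")
    # assumes the lecture schema: change the column names and join
    # conditions if your tables are organized differently
    cmd = """
    SELECT S.name, S.latitude, S.longitude, C.name AS Country,
           T.year, T.month, T.temp
    FROM temperatures T
    LEFT JOIN stations S ON T.id = S.id
    LEFT JOIN countries C ON SUBSTR(T.id, 1, 2) = C."FIPS 10-4"
    WHERE C.name = ? AND T.year BETWEEN ? AND ? AND T.month = ?
    """
    df = pd.read_sql_query(cmd, conn, params=(country, year_begin, year_end, month))
    conn.close()
    return df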

§3. Write a Geographic Scatter Function for Yearly Temperature Increases

In this part, you will write a function to create visualizations that address the following question:

How does the average yearly change in temperature vary within a given country?

Write a function called temperature_coefficient_plot(). This function should accept five explicit arguments, and an undetermined number of keyword arguments.

  • country, year_begin, year_end, and month should be as in the previous part.
  • min_obs, the minimum required number of years of data for any given station. Only data for stations with at least min_obs years’ worth of data in the specified month should be plotted; the others should be filtered out. df.transform() plus filtering is a good way to achieve this task (see the sketch after this list).
  • **kwargs, additional keyword arguments passed to px.scatter_mapbox(). These can be used to control the colormap used, the mapbox style, etc.
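
For instance, the min_obs filter might look something like the sketch below. The column names follow the example output in §2, so adjust them if yours differ:

# count the rows each station contributes (one per year, since we queried a
# single month), then keep only the stations with at least min_obs of them
df["obs"] = df.groupby("NAME")["Year"].transform("count")
df = df[df["obs"] >= min_obs]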

The output of this function should be an interactive geographic scatterplot, constructed using Plotly Express, with a point for each station, such that the color of the point reflects an estimate of the yearly change in temperature during the specified month and time period at that station. A reasonable way to do this is to compute the first coefficient of a linear regression model at that station, as illustrated in lecture.
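
As in lecture, one way to obtain that estimate is to fit a simple linear regression of temperature against year within each station’s group of rows and keep the slope. A sketch, again using the column names from the §2 example:

from sklearn.linear_model import LinearRegression

def coef(data_group):
    """Fit Temp ~ Year on one station's rows and return the slope (degrees per year)."""
    X = data_group[["Year"]]   # 2D table of predictors, as sklearn expects
    y = data_group["Temp"]
    LR = LinearRegression()
    LR.fit(X, y)
    return LR.coef_[0]

# one estimated yearly change in temperature per station
coefs = df.groupby(["NAME", "LATITUDE", "LONGITUDE"]).apply(coef)
coefs = coefs.reset_index(name="Estimated Yearly Increase (°C)")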

For example, after writing your function, you should be able to create a plot of estimated yearly increases in temperature during the month of January, in the interval 1980-2020, in India, as follows:

# assumes you have imported necessary packages
color_map = px.colors.diverging.RdGy_r # choose a colormap

fig = temperature_coefficient_plot("India", 1980, 2020, 1, 
                                   min_obs = 10,
                                   zoom = 2,
                                   mapbox_style="carto-positron",
                                   color_continuous_scale=color_map)

fig.show()

Please pay attention to the following details (a sketch of one way to handle them follows this list):

  • The station name is shown when you hover over the corresponding point on the map.
  • The estimates shown in the hover are rounded to a sober number of significant figures.
  • The colorbar and overall plot have professional titles.
  • The colorbar is centered at 0, so that the “middle” of the colorbar (white, in this case) corresponds to a coefficient of 0.
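
These details can all be handled through the arguments of px.scatter_mapbox(), plus a little rounding beforehand. A sketch, using the illustrative column names from above:

import plotly.express as px

# round the estimates so the hover shows tidy values
coefs["Estimated Yearly Increase (°C)"] = coefs["Estimated Yearly Increase (°C)"].round(4)

fig = px.scatter_mapbox(coefs,
                        lat = "LATITUDE",
                        lon = "LONGITUDE",
                        color = "Estimated Yearly Increase (°C)",  # also labels the colorbar
                        hover_name = "NAME",                       # station name in the hover
                        color_continuous_midpoint = 0,             # centers the colorbar at 0
                        title = "Estimated yearly change in temperature",
                        **kwargs)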

It’s not necessary for your plot to look exactly like mine, but please attend to details such as these. Feel free to be creative about these labels, as well as the choice of colors, as long as your result is polished overall.

You are free (and indeed encouraged) to define additional functions as needed.

Once you have one plot for 1980-2020 India that looks similar to this example, make one more plot that looks at a different country and/or time period.

§4. Create Two More Interesting Figures

Create at least one more SQL query function and at least two more complex and interesting interactive data visualizations using the same data set. These plots must be of different types (e.g. line and bar, scatter and histogram, etc.). The code to construct each visualization should be wrapped in functions, such that a user could create visualizations for different parts of the data by calling these functions with different arguments. At least one of these plots must involve multiple facets (i.e. multiple panels, each of which shows a subset of the data).
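
If you haven’t used facets in Plotly Express before, they are created with the facet_col (or facet_row) argument. The snippet below is purely illustrative, with placeholder dataframe and column names:

# purely illustrative: one panel per month, with temperature over time in each
fig = px.line(df,
              x = "Year",
              y = "Temp",
              color = "NAME",
              facet_col = "Month",   # one facet (panel) per value of Month
              title = "Average monthly temperature over time")
fig.show()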

Alongside the plots, you should clearly state a question that the plot addresses, similar to the question that we posed in §3. The questions for your two additional plots should be meaningfully different from each other and from the §3 question. You will likely want to define different query functions for extracting data for these new visualizations.

It is not necessary to create geographic plots for this part. Scatterplots, histograms, and line plots (among other choices) are all appropriate. Please make sure that they are complex, engaging, professional, and targeted to the questions you posed. In other words, push yourself! Don’t hesitate to ask your peers or talk to me if you’re having trouble coming up with questions or identifying plots that might be suitable for addressing those questions.

Once you’re done, commit and push your post to publish it to the web. Then, print the webpage as a PDF and submit it on Gradescope.

Specifications

Please remember that all specifications must be met in order for the blog post to earn credit.

Coding Problem

  1. query_climate_database() function is correctly defined according to the prompt.
  2. There are two geographic scatterplots, one for 1980-2020 India that looks similar to the provided example, and another one for a different time and/or country.
  3. The geographic scatterplots are correctly constructed and professionally labeled, including a title, hovers with rounded estimates, and a correctly centered colorbar.
  4. There is at least one more SQL query function defined.
  5. There are two other interactive plots constructed using Plotly.
  6. One of these plots involves the use of multiple facets.
  7. The two other plots are wrapped in appropriate, user-friendly functions.
  8. Each of the two other interactive plots has a descriptive title and a centered color map when appropriate.

Style and Documentation

  1. Repeated operations are enclosed in functions.
  2. For-loops are minimized by making full use of vectorized operations on NumPy arrays and Pandas data frames.
  3. Helpful comments are supplied throughout the code. Docstrings are supplied for any functions and classes you define.

Writing

  1. The overall post is written in engaging and unambiguous English prose. There is written explanation throughout the post, such that a PIC16A student could learn to perform the demonstrated tasks by reading the post.
  2. Each block of code has a clearly-explained purpose.
  3. The post is organized into clearly delimited sections using markdown headers (#), making it easier for the reader to navigate.
  4. The post has a different title from “Blog Post: Data Wrangling and Visualization” or “Blog Post 1”. Please rename it to something more relevant and specific to your data analysis.
Written on March 29, 2022