Blog Post: Data Wrangling and Visualization
In this blog post, you’ll create several interesting, interactive data graphics using the NOAA climate data that we’ve explored in the first several weeks of lectures.
General Requirements
Your post should include not only code but also outputs (i.e. tables, figures) and expository writing that explains what you’re doing. Your target audience is a a student who has completed PIC16A but hasn’t taken PIC16B yet (i.e. you before the start of the quarter).
See the “Specifications” section at the bottom for a detailed list of specs.
§1. Create a Database
First, create a database with three tables: temperatures
, stations
, and countries
. Refer back to lecture notes on how to access country names and relate them to temperature readings. Keep these as three separate tables in your database.
Make sure to close the database connection after you are finished constructing it.
§2. Write a Query Function
Write a Python function called query_climate_database()
which accepts four arguments:
country
, a string giving the name of a country (e.g. ‘South Korea’) for which data should be returned.year_begin
andyear_end
, two integers giving the earliest and latest years for which should be returned.month
, an integer giving the month of the year for which should be returned.
The return value of query_climate_database()
is a Pandas dataframe of temperature readings for the specified country, in the specified date range, in the specified month of the year. This dataframe should have columns for:
- The station name.
- The latitude of the station.
- The longitude of the station.
- The name of the country in which the station is located.
- The year in which the reading was taken.
- The month in which the reading was taken.
- The average temperature at the specified station during the specified year and month. (Note: the temperatures in the raw data are already averages by month, so you don’t have to do any aggregation at this stage.)
For example:
query_climate_database(country = "India",
year_begin = 1980,
year_end = 2020,
month = 1)
NAME | LATITUDE | LONGITUDE | Country | Year | Month | Temp | |
---|---|---|---|---|---|---|---|
0 | PBO_ANANTAPUR | 14.583 | 77.633 | India | 1980 | 1 | 23.48 |
1 | PBO_ANANTAPUR | 14.583 | 77.633 | India | 1981 | 1 | 24.57 |
2 | PBO_ANANTAPUR | 14.583 | 77.633 | India | 1982 | 1 | 24.19 |
3 | PBO_ANANTAPUR | 14.583 | 77.633 | India | 1983 | 1 | 23.51 |
4 | PBO_ANANTAPUR | 14.583 | 77.633 | India | 1984 | 1 | 24.81 |
... | ... | ... | ... | ... | ... | ... | ... |
3147 | DARJEELING | 27.050 | 88.270 | India | 1983 | 1 | 5.10 |
3148 | DARJEELING | 27.050 | 88.270 | India | 1986 | 1 | 6.90 |
3149 | DARJEELING | 27.050 | 88.270 | India | 1994 | 1 | 8.10 |
3150 | DARJEELING | 27.050 | 88.270 | India | 1995 | 1 | 5.60 |
3151 | DARJEELING | 27.050 | 88.270 | India | 1997 | 1 | 5.70 |
3152 rows × 7 columns
§3. Write a Geographic Scatter Function for Yearly Temperature Increases
In this part, you will write a function to create visualizations that address the following question:
How does the average yearly change in temperature vary within a given country?
Write a function called temperature_coefficient_plot()
. This function should accept five explicit arguments, and an undetermined number of keyword arguments.
country
,year_begin
,year_end
, andmonth
should be as in the previous part.min_obs
, the minimum required number of years of data for any given station. Only data for stations with at leastmin_obs
years worth of data in the specified month should be plotted; the others should be filtered out.df.transform()
plus filtering is a good way to achieve this task.**kwargs
, additional keyword arguments passed topx.scatter_mapbox()
. These can be used to control the colormap used, the mapbox style, etc.
The output of this function should be an interactive geographic scatterplot, constructed using Plotly Express, with a point for each station, such that the color of the point reflects an estimate of the yearly change in temperature during the specified month and time period at that station. A reasonable way to do this is to compute the first coefficient of a linear regression model at that station, as illustrated in lecture.
For example, after writing your function, you should be able to create a plot of estimated yearly increases in temperature during the month of January, in the interval 1980-2020, in India, as follows:
# assumes you have imported necessary packages
color_map = px.colors.diverging.RdGy_r # choose a colormap
fig = temperature_coefficient_plot("India", 1980, 2020, 1,
min_obs = 10,
zoom = 2,
mapbox_style="carto-positron",
color_continuous_scale=color_map)
fig.show()
Please pay attention to the following details:
- The station name is shown when you hover over the corresponding point on the map.
- The estimates shown in the hover are rounded to a sober number of significant figures.
- The colorbar and overall plot have professional titles.
- The colorbar is centered at 0, so that the “middle” of the colorbar (white, in this case) corresponds to a coefficient of 0.
It’s not necessary for your plot to look exactly like mine, but please attend to details such as these. Feel free to be creative about these labels, as well as the choice of colors, as long as your result is polished overall.
You are free (and indeed encouraged) to define additional functions as needed.
Once you have one plot for 1980-2020 India that looks similar to this example, make one more plot that looks at a different country and/or time period.
§4. Create Two More Interesting Figures
Create at least one more SQL query function and at least two more complex and interesting interactive data visualizations using the same data set. These plots must be of different types (e.g. line and bar, scatter and histogram, etc). The code to construct each visualization should be wrapped in functions, such that a user could create visualizations for different parts of the data by calling these functions with different arguments. At least one of these plots must involve multiple facets (i.e. multiple axes (in the sense of facets), each of which shows a subset of the data).
Alongside the plots, you should clearly state a question that the plot addresses, similar to the question that we posed in §3. The questions for your two additional plots should be meaningfully different from each other and from the §3 question. You will likely want to define different query functions for extracting data for these new visualizations.
It is not necessary to create geographic plots for this part. Scatterplots, histograms, and line plots (among other choices) are all appropriate. Please make sure that they are complex, engaging, professional, and targeted to the questions you posed. In other words, push yourself! Don’t hesitate to ask your peers or talk to me if you’re having trouble coming up with questions or identifying plots that might be suitable for addressing those questions.
Once you’re done, commit and push your post to publish it to the web. Then, print the webpage as a PDF and submit it on Gradescope.
Specifications
Please remember that all specifications must be met in order for the blog post to earn credit.
Coding Problem
query_climate_database()
function is correctly defined according to the prompt.- There are two geographic scatterplots, one for 1980-2020 India that looks similar to the provided example, and another one for a different time and/or country.
- The geographic scatterplots are correctly constructed and professionally labeled, including a title, hovers with rounded estimates, and a correctly centered colorbar.
- There is at least one more sql query function defined.
- There are two other interactive plots constructed using Plotly.
- One of these plots involves the use of multiple facets.
- The two other plots are wrapped in appropriate, user-friendly functions.
- Each of the two other interactive plots have descriptive titles and centered color maps when appropriate.
Style and Documentation
- Repeated operations should are enclosed in functions.
- For-loops are minimized by making full use of vectorized operations for Numpy arrays and Pandas data frames.
- Helpful comments are supplied throughout the code. Docstrings are supplied for any functions and classes you define.
Writing
- The overall post is written in engaging and unambiguous English prose. There is written explanation throughout the post, such that a PIC16A student could learn to perform the demonstrated tasks by reading the post.
- Each block of code has a clearly-explained purpose.
- The post is organized into clearly delimited sections using markdown headers (#), making it easier for the reader to navigate.
- The post has a different title from “Blog Post: Data Wrangling and Visualization” or “Blog Post 1”. Please rename it to something more relevant and specific to your data analysis.