HW2: Web Scraping
What’s your favorite movie or TV show? Wouldn’t it be nice to find more shows that you might like to watch, based on ones you know you like? Tools that address questions like this are often called “recommender systems.” Powerful, scalable recommender systems are behind many modern entertainment and streaming services, such as Netflix and Spotify. While most recommender systems these days involve machine learning, there are also ways to make recommendations that don’t require such complex tools.
In this Blog Post, you’ll use web scraping to answer the following question:

What movies or TV shows share actors with your favorite movie or show?
The idea of this question is that, if TV show Y has many of the same actors as TV show X, and you like X, you might also enjoy Y.
This post has two parts. In the first, larger part, you’ll write a web scraper for finding shared actors on IMDB. In the second, smaller part, you’ll use the results from your scraper to make recommendations.
Don’t forget to check the Specifications for a complete list of what you need to do to obtain full credit. As usual, this Blog Post should be printed as PDF from your PIC16B Blog.
§1. Setup
§1.1. Locate the Starting IMDB Page
Pick your favorite movie or TV show, and locate its IMDB page. For example, my favorite TV show is Person of Interest. Its IMDB page is at:
https://www.imdb.com/title/tt1839578/
Save this URL for a moment.
§1.2. Dry-Run Navigation
Now, we’re just going to practice clicking through the navigation steps that our scraper will take.
First, click on the All Cast & Crew link. This will take you to a page with a URL of the form
<original_url>fullcredits/
Next, scroll until you see the Series Cast section. Click on the portrait of one of the actors. This will take you to a page with a different-looking URL. For example, the URL for Amy Acker is
https://www.imdb.com/name/nm0009918/
Finally, scroll down until you see the actor’s Filmography section. Note the titles of a few movies and TV shows in this section.
Our scraper is going to replicate this process. Starting with your favorite movie or TV show, it’s going to look at all the actors in that movie or TV show, and then log all the other movies or TV shows that they worked on.
§1.3 Initialize Your Project
- Create a new GitHub repository, and sync it with GitHub Desktop. This repository will house your scraper. You should commit and push each time you make significant changes to your code.
- Open a terminal in the location of your repository on your laptop, and type:
```
conda activate PIC16B
scrapy startproject IMDB_scraper
cd IMDB_scraper
```
This will create quite a lot of files, but you don’t really need to touch most of them.
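For orientation, the generated project should look roughly like this (the exact set of files can vary a little across scrapy versions):

```
IMDB_scraper/
├── scrapy.cfg
└── IMDB_scraper/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py
```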
§1.4 Tweak Settings
For now, add the following line to the file `settings.py`:

```python
CLOSESPIDER_PAGECOUNT = 20
```
This line just prevents your scraper from downloading too much data while you’re still testing things out. You’ll remove this line later.
§2. Write Your Scraper
Create a file called `imdb_spider.py` inside the `spiders` directory. Add the following lines to the file:
```python
# to run
# scrapy crawl imdb_spider -o movies.csv

import scrapy

class ImdbSpider(scrapy.Spider):
    name = 'imdb_spider'

    start_urls = ['https://www.imdb.com/title/tt1839578/']
```
Replace the entry of `start_urls` with the URL corresponding to your favorite movie or TV show.

Now, implement three parsing methods for the `ImdbSpider` class:
- `parse(self, response)` should assume that you start on a movie page, and then navigate to the Cast & Crew page. Remember that this page has a URL of the form `<movie_url>fullcredits`. Once there, `parse_full_credits(self, response)` should be called, by specifying this method in the `callback` argument to a yielded `scrapy.Request`. The `parse()` method does not return any data. This method should be no more than 5 lines of code, excluding comments and docstrings. (A sketch of this method appears just after this list.)
- `parse_full_credits(self, response)` should assume that you start on the Cast & Crew page. Its purpose is to yield a `scrapy.Request` for the page of each actor listed on the page. Crew members are not included. The yielded request should specify that the method `parse_actor_page(self, response)` should be called when the actor’s page is reached. The `parse_full_credits()` method does not return any data. This method should be no more than 5 lines of code, excluding comments and docstrings.
- `parse_actor_page(self, response)` should assume that you start on the page of an actor. It should yield a dictionary with two key-value pairs, of the form `{"actor" : actor_name, "movie_or_TV_name" : movie_or_TV_name}`. The method should yield one such dictionary for each of the movies or TV shows on which that actor has worked. Note that you will need to determine both the name of the actor and the name of each movie or TV show. This method should be no more than 15 lines of code, excluding comments and docstrings.
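To make the structure concrete, here is a minimal sketch of what `parse()` might look like; treat it as one possible shape rather than the required implementation. (`response.urljoin` resolves the relative path against the movie page’s URL.)

```python
def parse(self, response):
    """
    Assumes we start on a movie page. Navigates to the Cast & Crew
    page and hands it off to parse_full_credits(). Yields no data.
    """
    # the Cast & Crew page lives at <movie_url>fullcredits
    yield scrapy.Request(response.urljoin("fullcredits"),
                         callback=self.parse_full_credits)
```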
Provided that these methods are correctly implemented, you can run the command

```
scrapy crawl imdb_spider -o results.csv
```

to create a `.csv` file with a column for actors and a column for movies or TV shows.
§2.1. Hints
Here are a few hints for the two trickier parsing methods.
`parse_full_credits()`

On the Cast & Crew page, the list comprehension

```python
[a.attrib["href"] for a in response.css("td.primary_photo a")]
```

will create a list of relative paths, one for each actor. This command mimics the process of clicking on the headshots on this page.
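Putting this together, one possible shape for the method (a sketch, not the only valid approach) is:

```python
def parse_full_credits(self, response):
    """
    Assumes we start on the Cast & Crew page. Yields one request per
    actor, each parsed by parse_actor_page(). Yields no data itself.
    """
    # one relative path per actor headshot; crew members have no headshot cell
    for path in [a.attrib["href"] for a in response.css("td.primary_photo a")]:
        yield response.follow(path, callback=self.parse_actor_page)
```

Here `response.follow` is used because it accepts relative paths directly; you could equally construct absolute URLs and yield `scrapy.Request` objects.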
`parse_actor_page()`

This is likely to be the most challenging method to implement. I’ll leave the challenge of figuring out the actor name to you. As for the titles of movies and TV shows, here are CSS selectors that I found very helpful. You might be able to use them in your own solution.

```python
element.css("::attr(id)")
element.css("div.filmo-row")
element.css("a::text")
```

Note that these are not in the order in which you should call them. Experimentation in the scrapy shell is strongly recommended. Don’t forget that you can also use the Developer Tools in your browser to inspect individual HTML elements.
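To show how those selectors might fit together, here is one hypothetical skeleton. The actor-name selector is a placeholder I have not verified against the current page layout (finding a working one is the challenge left to you), and the `startswith("act")` filter assumes filmography rows carry ids like `actor-tt0106145` or `actress-tt0106145`:

```python
def parse_actor_page(self, response):
    """
    Assumes we start on an actor's page. Yields one dictionary per
    movie or TV show in that actor's acting filmography.
    """
    # placeholder name selector -- replace with one you verify in the scrapy shell
    actor_name = response.css("span.itemprop::text").get()
    for row in response.css("div.filmo-row"):
        # assumed: acting credits have row ids beginning with "actor"/"actress"
        if row.css("::attr(id)").get(default="").startswith("act"):
            yield {
                "actor": actor_name,
                "movie_or_TV_name": row.css("a::text").get(),
            }
```

You can sanity-check each selector interactively: run `scrapy shell https://www.imdb.com/name/nm0009918/` and try `response.css("div.filmo-row")` before committing to any of them.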
§3. Make Your Recommendations
Once your spider is fully written, comment out the line

```python
CLOSESPIDER_PAGECOUNT = 20
```

in the `settings.py` file. Then, the command

```
scrapy crawl imdb_spider -o results.csv
```

will run your spider and save a CSV file called `results.csv`, with columns for actor names and the movies and TV shows on which they worked.
Once you’re happy with the operation of your spider, compute a sorted list of the top movies and TV shows that share actors with your favorite movie or TV show. For example, here’s a list one might get for Star Trek: Deep Space Nine (https://www.imdb.com/title/tt0106145/). Of course, most shows will “share” the most actors with themselves. A sketch of one way to compute such a table appears just after it.
  | movie | number of shared actors
---|---|---
0 | Star Trek: Deep Space Nine | 538 |
1 | ER | 116 |
2 | Star Trek: Voyager | 111 |
3 | NYPD Blue | 111 |
4 | Star Trek: The Next Generation | 108 |
5 | L.A. Law | 104 |
6 | JAG | 99 |
7 | Murder, She Wrote | 85 |
8 | Hunter | 82 |
9 | The Practice | 78 |
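For reference, here is a minimal sketch of one way to compute such a table and a quick bar chart with pandas and matplotlib, assuming `results.csv` has the `actor` and `movie_or_TV_name` columns described above; the output column names are just display choices:

```python
import pandas as pd
import matplotlib.pyplot as plt

# load the scraped data
results = pd.read_csv("results.csv")

# count the distinct actors appearing in each movie or TV show
counts = (results.groupby("movie_or_TV_name")["actor"]
                 .nunique()
                 .sort_values(ascending=False)
                 .reset_index(name="number of shared actors")
                 .rename(columns={"movie_or_TV_name": "movie"}))
print(counts.head(10))

# horizontal bar chart of the ten shows sharing the most actors
top = counts.head(10)
plt.barh(top["movie"], top["number of shared actors"])
plt.xlabel("number of shared actors")
plt.gca().invert_yaxis()  # put the largest count on top
plt.tight_layout()
plt.show()
```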
§4. Blog Post
In your blog post, you should describe how your scraper works, as well as the results of your analysis. When describing your scraper, I recommend dividing it up into the three distinct parsing methods, and discussing them one-by-one. For example:
In this blog post, I’m going to make a super cool web scraper… Here’s a link to my project repository… Here’s how we set up the project…
<implementation of parse()>
This method works by…
<implementation of parse_full_credits()>
To write this method, I…
In addition to describing your scraper, your Blog Post should include a table or visualization of numbers of shared actors like the one shown above, as well as a link to your GitHub repository.
Remember that this post is still a tutorial, in which you guide your reader through the process of setting up and running the scraper. Don’t forget to tell them how to create the project and run the scraper!
§5. Specifications
Coding Problem
- Each of the three parsing methods appears logically and correctly implemented.
- `parse()` is implemented in no more than 5 lines.
- `parse_full_credits()` is implemented in no more than 5 lines.
- `parse_actor_page()` is implemented in no more than 15 lines.
- A table, list of results, or pandas data frame is shown.
- A visualization made with `matplotlib`, `plotly`, or `seaborn` is shown.
- The code is housed in a GitHub repository.
Style and Documentation
- Each of the three `parse` methods has a short docstring describing its assumptions (e.g., what kind of page it is meant to parse) and its effect, including navigation and data outputs.
- Each of the three `parse` methods has helpful comments for understanding how each chunk of code operates.
Writing
- The blog post is written in tutorial format, in engaging and clear English. Grammar and spelling errors are acceptable within reason.
- The blog post explains clearly how to set up the project, run the scraper, and access the results.
- The blog post explains how each of the three `parse` methods works.
- The blog post includes a link to the GitHub repository for the scraper project.
- The blog post has a descriptive title.