Web scraping with R


Web scraping with the rvest, httr, and jsonlite packages.

Basil Okola https://github.com/Bokola
05-05-2021

Web scraping in R

HTML Data structure

Web pages are styled with CSS (Cascading Style Sheets) files that determine the layout of the page. CSS selectors can be used to locate the HTML elements of interest. One handy tool for discovering them is the SelectorGadget Google Chrome extension. You need to install it in your browser before proceeding.

To use it, open the page you want to scrape, then:

  1. Click on the element you want to select. SelectorGadget will make a first guess at what css selector you want. It’s likely to be bad since it only has one example to learn from, but it’s a start. Elements that match the selector will be highlighted in yellow.

  2. Click on elements that shouldn’t be selected. They will turn red. Click on elements that should be selected. They will turn green.

  3. Iterate until only the elements you want are selected. SelectorGadget isn’t perfect and sometimes won’t be able to find a useful css selector. Sometimes starting from a different element helps. More at tidyverse/rvest

For example, suppose we want the actors listed on an IMDb movie page, e.g. The Shawshank Redemption.

HTML tags can be passed to functions to retrieve the web page elements of interest.
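As a minimal sketch of this idea, rvest can apply a CSS selector to any parsed HTML document; here a tiny in-memory page with made-up names stands in for a real site:

```r
library(rvest)

# a small in-memory page standing in for a real website (hypothetical content)
page <- minimal_html('
  <table class="cast_list">
    <tr><td class="primary_photo"></td><td><a href="/name/nm1/">Jane Doe</a></td></tr>
    <tr><td class="primary_photo"></td><td><a href="/name/nm2/">John Roe</a></td></tr>
  </table>')

# apply a CSS selector (as found with SelectorGadget), then pull out the text
page %>% html_nodes(".primary_photo+ td a") %>% html_text(trim = TRUE)
```

The same two-step pattern (select nodes, then extract their contents) is what we use on the live IMDb page below.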

rvest

rvest is for scraping (harvesting) data from the web into a structured format that can be used in further analysis.

rvest functions

library(rvest)

# specify the url
url = 'https://www.imdb.com/title/tt0111161/'
# read the html code from the page
webpage = read_html(url)
webpage
{html_document}
<html xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; cha ...
[2] <body id="styleguide-v2" class="fixed">\n            <img heigh ...

Once we have determined the CSS selector, we use it to extract the information we want

cast = html_nodes(webpage, ".primary_photo+ td a")
length(cast)
[1] 15
cast[1:2]
{xml_nodeset (2)}
[1] <a href="/name/nm0000209/?ref_=tt_cl_t1"> Tim Robbins\n</a>
[2] <a href="/name/nm0000151/?ref_=tt_cl_t2"> Morgan Freeman\n</a>

Finally, we extract the text from the selected HTML nodes.

html_text(cast, trim = TRUE)
 [1] "Tim Robbins"       "Morgan Freeman"    "Bob Gunton"       
 [4] "William Sadler"    "Clancy Brown"      "Gil Bellows"      
 [7] "Mark Rolston"      "James Whitmore"    "Jeffrey DeMunn"   
[10] "Larry Brandenburg" "Neil Giuntoli"     "Brian Libby"      
[13] "David Proval"      "Joseph Ragno"      "Jude Ciccolella"  

Extracting a table

all_tables = html_table(html_nodes(webpage, "table"), header = FALSE)
casttable = html_table(html_nodes(webpage, ".cast_list"), header = FALSE)[[1]]
head(casttable)
                                 X1                                X2
1 Cast overview, first billed only: Cast overview, first billed only:
2                      \n                  \n Tim Robbins\n          
3                      \n               \n Morgan Freeman\n          
4                      \n                   \n Bob Gunton\n          
5                      \n               \n William Sadler\n          
6                      \n                 \n Clancy Brown\n          
                                 X3
1 Cast overview, first billed only:
2   \n              ...\n          
3   \n              ...\n          
4   \n              ...\n          
5   \n              ...\n          
6   \n              ...\n          
                                                                       X4
1                                       Cast overview, first billed only:
2            \n            Andy Dufresne \n                  \n          
3 \n            Ellis Boyd 'Red' Redding \n                  \n          
4            \n            Warden Norton \n                  \n          
5                  \n            Heywood \n                  \n          
6           \n            Captain Hadley \n                  \n          
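The scraped table is cluttered with newlines and repeated spaces. A small base-R helper (a sketch, not part of the original post) collapses the whitespace before further analysis:

```r
# collapse runs of whitespace and trim the ends
clean <- function(x) trimws(gsub("\\s+", " ", x))

clean("  \n Morgan Freeman\n  ")   # "Morgan Freeman"
```

Applied column-wise, e.g. clean(casttable$X2), this turns the raw cells into tidy actor names.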

Attributes of an element

If, say, we are also interested in extracting the links to the actors’ pages, we can access the HTML attributes of the selected nodes using html_attrs().

cast_attrs = html_attrs(cast)
cast_attrs[1:2]
[[1]]
                            href 
"/name/nm0000209/?ref_=tt_cl_t1" 

[[2]]
                            href 
"/name/nm0000151/?ref_=tt_cl_t2" 

As we can see, there is only one attribute, href, which contains the relative URL to the actor’s page. We can extract it using html_attr(), indicating the name of the attribute of interest. Relative URLs can be turned into absolute URLs using url_absolute().

cast_rel_urls = html_attr(cast, "href")
length(cast_rel_urls)
[1] 15
cast_rel_urls[1:2]
[1] "/name/nm0000209/?ref_=tt_cl_t1" "/name/nm0000151/?ref_=tt_cl_t2"
cast_abs_urls = html_attr(cast, "href") %>%
  url_absolute(url)
cast_abs_urls[1:2]
[1] "https://www.imdb.com/name/nm0000209/?ref_=tt_cl_t1"
[2] "https://www.imdb.com/name/nm0000151/?ref_=tt_cl_t2"

Making API Requests in R

httr

httr provides functions named after HTTP verbs; GET() retrieves a resource from a URL.

library(httr)

res = GET('http://api.open-notify.org/astros.json')
res
Response [http://api.open-notify.org/astros.json]
  Date: 2021-05-04 22:51
  Status: 200
  Content-Type: application/json
  Size: 355 B
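A Status of 200 means the request succeeded; in real code it is worth checking this before parsing the body. A defensive sketch, assuming the same endpoint is still reachable:

```r
library(httr)

res <- GET("http://api.open-notify.org/astros.json")

# stop early on HTTP errors (4xx/5xx) instead of parsing an error page
if (http_error(res)) {
  stop("Request failed with status ", status_code(res))
}
```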

JSON Format

[
  { "name": "Miguel", "student_id": 1, "exam_1": 85, "exam_2": 86 },
  { "name": "Sofia", "student_id": 2, "exam_1": 94, "exam_2": 93 }
]

jsonlite

fromJSON() parses a JSON string into R objects, simplifying JSON arrays of objects into data frames where possible.
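For instance, the student records from the JSON snippet above parse directly into a data frame:

```r
library(jsonlite)

students <- fromJSON('[
  {"name": "Miguel", "student_id": 1, "exam_1": 85, "exam_2": 86},
  {"name": "Sofia",  "student_id": 2, "exam_1": 94, "exam_2": 93}
]')
students
# a 2-row data frame with columns name, student_id, exam_1, exam_2
```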

library(jsonlite)
rawToChar(res$content)
[1] "{\"number\": 7, \"message\": \"success\", \"people\": [{\"name\": \"Mark Vande Hei\", \"craft\": \"ISS\"}, {\"name\": \"Oleg Novitskiy\", \"craft\": \"ISS\"}, {\"name\": \"Pyotr Dubrov\", \"craft\": \"ISS\"}, {\"name\": \"Thomas Pesquet\", \"craft\": \"ISS\"}, {\"name\": \"Megan McArthur\", \"craft\": \"ISS\"}, {\"name\": \"Shane Kimbrough\", \"craft\": \"ISS\"}, {\"name\": \"Akihiko Hoshide\", \"craft\": \"ISS\"}]}"
data = fromJSON(rawToChar(res$content))
data
$number
[1] 7

$message
[1] "success"

$people
             name craft
1  Mark Vande Hei   ISS
2  Oleg Novitskiy   ISS
3    Pyotr Dubrov   ISS
4  Thomas Pesquet   ISS
5  Megan McArthur   ISS
6 Shane Kimbrough   ISS
7 Akihiko Hoshide   ISS

Read directly from a URL

data <- fromJSON("https://api.github.com/users/hadley/orgs")
class(data)
[1] "data.frame"
names(data)
 [1] "login"              "id"                 "node_id"           
 [4] "url"                "repos_url"          "events_url"        
 [7] "hooks_url"          "issues_url"         "members_url"       
[10] "public_members_url" "avatar_url"         "description"       

toJSON() does the reverse, serializing R objects to a JSON string.

jsondata = toJSON(data)
head(jsondata)
[1] "[{\"login\":\"ggobi\",\"id\":423638,\"node_id\":\"MDEyOk9yZ2FuaXphdGlvbjQyMzYzOA==\",\"url\":\"https://api.github.com/orgs/ggobi\",\"repos_url\":\"https://api.github.com/orgs/ggobi/repos\",\"events_url\":\"https://api.github.com/orgs/ggobi/events\",\"hooks_url\":\"https://api.github.com/orgs/ggobi/hooks\",\"issues_url\":\"https://api.github.com/orgs/ggobi/issues\",\"members_url\":\"https://api.github.com/orgs/ggobi/members{/member}\",\"public_members_url\":\"https://api.github.com/orgs/ggobi/public_members{/member}\",\"avatar_url\":\"https://avatars.githubusercontent.com/u/423638?v=4\",\"description\":\"\"},{\"login\":\"rstudio\",\"id\":513560,\"node_id\":\"MDEyOk9yZ2FuaXphdGlvbjUxMzU2MA==\",\"url\":\"https://api.github.com/orgs/rstudio\",\"repos_url\":\"https://api.github.com/orgs/rstudio/repos\",\"events_url\":\"https://api.github.com/orgs/rstudio/events\",\"hooks_url\":\"https://api.github.com/orgs/rstudio/hooks\",\"issues_url\":\"https://api.github.com/orgs/rstudio/issues\",\"members_url\":\"https://api.github.com/orgs/rstudio/members{/member}\",\"public_members_url\":\"https://api.github.com/orgs/rstudio/public_members{/member}\",\"avatar_url\":\"https://avatars.githubusercontent.com/u/513560?v=4\",\"description\":\"\"},{\"login\":\"rstats\",\"id\":722735,\"node_id\":\"MDEyOk9yZ2FuaXphdGlvbjcyMjczNQ==\",\"url\":\"https://api.github.com/orgs/rstats\",\"repos_url\":\"https://api.github.com/orgs/rstats/repos\",\"events_url\":\"https://api.github.com/orgs/rstats/events\",\"hooks_url\":\"https://api.github.com/orgs/rstats/hooks\",\"issues_url\":\"https://api.github.com/orgs/rstats/issues\",\"members_url\":\"https://api.github.com/orgs/rstats/members{/member}\",\"public_members_url\":\"https://api.github.com/orgs/rstats/public_members{/member}\",\"avatar_url\":\"https://avatars.githubusercontent.com/u/722735?v=4\"},{\"login\":\"ropensci\",\"id\":1200269,\"node_id\":\"MDEyOk9yZ2FuaXphdGlvbjEyMDAyNjk=\",\"url\":\"https://api.github.com/orgs/ropensci\",\"repos_url\
":\"https://api.github.com/orgs/ropensci/repos\",\"events_url\":\"https://api.github.com/orgs/ropensci/events\",\"hooks_url\":\"https://api.github.com/orgs/ropensci/hooks\",\"issues_url\":\"https://api.github.com/orgs/ropensci/issues\",\"members_url\":\"https://api.github.com/orgs/ropensci/members{/member}\",\"public_members_url\":\"https://api.github.com/orgs/ropensci/public_members{/member}\",\"avatar_url\":\"https://avatars.githubusercontent.com/u/1200269?v=4\",\"description\":\"\"},{\"login\":\"rjournal\",\"id\":3330561,\"node_id\":\"MDEyOk9yZ2FuaXphdGlvbjMzMzA1NjE=\",\"url\":\"https://api.github.com/orgs/rjournal\",\"repos_url\":\"https://api.github.com/orgs/rjournal/repos\",\"events_url\":\"https://api.github.com/orgs/rjournal/events\",\"hooks_url\":\"https://api.github.com/orgs/rjournal/hooks\",\"issues_url\":\"https://api.github.com/orgs/rjournal/issues\",\"members_url\":\"https://api.github.com/orgs/rjournal/members{/member}\",\"public_members_url\":\"https://api.github.com/orgs/rjournal/public_members{/member}\",\"avatar_url\":\"https://avatars.githubusercontent.com/u/3330561?v=4\"},{\"login\":\"r-dbi\",\"id\":5695665,\"node_id\":\"MDEyOk9yZ2FuaXphdGlvbjU2OTU2NjU=\",\"url\":\"https://api.github.com/orgs/r-dbi\",\"repos_url\":\"https://api.github.com/orgs/r-dbi/repos\",\"events_url\":\"https://api.github.com/orgs/r-dbi/events\",\"hooks_url\":\"https://api.github.com/orgs/r-dbi/hooks\",\"issues_url\":\"https://api.github.com/orgs/r-dbi/issues\",\"members_url\":\"https://api.github.com/orgs/r-dbi/members{/member}\",\"public_members_url\":\"https://api.github.com/orgs/r-dbi/public_members{/member}\",\"avatar_url\":\"https://avatars.githubusercontent.com/u/5695665?v=4\",\"description\":\"R + 
databases\"},{\"login\":\"RConsortium\",\"id\":15366137,\"node_id\":\"MDEyOk9yZ2FuaXphdGlvbjE1MzY2MTM3\",\"url\":\"https://api.github.com/orgs/RConsortium\",\"repos_url\":\"https://api.github.com/orgs/RConsortium/repos\",\"events_url\":\"https://api.github.com/orgs/RConsortium/events\",\"hooks_url\":\"https://api.github.com/orgs/RConsortium/hooks\",\"issues_url\":\"https://api.github.com/orgs/RConsortium/issues\",\"members_url\":\"https://api.github.com/orgs/RConsortium/members{/member}\",\"public_members_url\":\"https://api.github.com/orgs/RConsortium/public_members{/member}\",\"avatar_url\":\"https://avatars.githubusercontent.com/u/15366137?v=4\",\"description\":\"The R Consortium, Inc was established to provide support to the R Foundation and R Community, using maintaining and distributing R software.\"},{\"login\":\"tidyverse\",\"id\":22032646,\"node_id\":\"MDEyOk9yZ2FuaXphdGlvbjIyMDMyNjQ2\",\"url\":\"https://api.github.com/orgs/tidyverse\",\"repos_url\":\"https://api.github.com/orgs/tidyverse/repos\",\"events_url\":\"https://api.github.com/orgs/tidyverse/events\",\"hooks_url\":\"https://api.github.com/orgs/tidyverse/hooks\",\"issues_url\":\"https://api.github.com/orgs/tidyverse/issues\",\"members_url\":\"https://api.github.com/orgs/tidyverse/members{/member}\",\"public_members_url\":\"https://api.github.com/orgs/tidyverse/public_members{/member}\",\"avatar_url\":\"https://avatars.githubusercontent.com/u/22032646?v=4\",\"description\":\"The tidyverse is a collection of R packages that share common principles and are designed to work together 
seamlessly\"},{\"login\":\"r-lib\",\"id\":22618716,\"node_id\":\"MDEyOk9yZ2FuaXphdGlvbjIyNjE4NzE2\",\"url\":\"https://api.github.com/orgs/r-lib\",\"repos_url\":\"https://api.github.com/orgs/r-lib/repos\",\"events_url\":\"https://api.github.com/orgs/r-lib/events\",\"hooks_url\":\"https://api.github.com/orgs/r-lib/hooks\",\"issues_url\":\"https://api.github.com/orgs/r-lib/issues\",\"members_url\":\"https://api.github.com/orgs/r-lib/members{/member}\",\"public_members_url\":\"https://api.github.com/orgs/r-lib/public_members{/member}\",\"avatar_url\":\"https://avatars.githubusercontent.com/u/22618716?v=4\",\"description\":\"\"},{\"login\":\"rstudio-education\",\"id\":34165516,\"node_id\":\"MDEyOk9yZ2FuaXphdGlvbjM0MTY1NTE2\",\"url\":\"https://api.github.com/orgs/rstudio-education\",\"repos_url\":\"https://api.github.com/orgs/rstudio-education/repos\",\"events_url\":\"https://api.github.com/orgs/rstudio-education/events\",\"hooks_url\":\"https://api.github.com/orgs/rstudio-education/hooks\",\"issues_url\":\"https://api.github.com/orgs/rstudio-education/issues\",\"members_url\":\"https://api.github.com/orgs/rstudio-education/members{/member}\",\"public_members_url\":\"https://api.github.com/orgs/rstudio-education/public_members{/member}\",\"avatar_url\":\"https://avatars.githubusercontent.com/u/34165516?v=4\",\"description\":\"\"}]"

And back from JSON to a data frame:

backagain = fromJSON(jsondata)
identical(data, backagain)
[1] TRUE
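toJSON() has options worth knowing; pretty = TRUE, for instance, indents the output for readability. A quick illustration on a small made-up data frame:

```r
library(jsonlite)

# data frames serialize as JSON arrays of objects, one object per row
df <- data.frame(x = 1:2, y = c("a", "b"))
toJSON(df, pretty = TRUE)
```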

API with Query Parameters

The ISS and the Brooklyn Bridge

Query parameters are supplied through the query argument of GET(); here we ask when the ISS will next pass over the Brooklyn Bridge (latitude 40.7, longitude -74).

res = GET("http://api.open-notify.org/iss-pass.json",
          query = list(lat = 40.7, lon = -74))
res
Response [http://api.open-notify.org/iss-pass.json?lat=40.7&lon=-74]
  Date: 2021-05-04 22:51
  Status: 200
  Content-Type: application/json
  Size: 518 B
{
  "message": "success", 
  "request": {
    "altitude": 100, 
    "datetime": 1620167911, 
    "latitude": 40.7, 
    "longitude": -74.0, 
    "passes": 5
  }, 
  "response": [
...
data = fromJSON(rawToChar(res$content))
data$response
  duration   risetime
1      564 1620188939
2      654 1620194681
3      592 1620200551
4      566 1620206436
5      632 1620212262
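The risetime values are Unix timestamps (seconds since 1970-01-01 UTC); base R converts them to readable date-times:

```r
# convert the first risetime to a human-readable UTC date-time
as.POSIXct(1620188939, origin = "1970-01-01", tz = "UTC")
# "2021-05-05 04:28:59 UTC"
```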


Citation

For attribution, please cite this work as

Okola (2021, May 5). Basil Okola: Web scraping with R. Retrieved from https://bokola214.netlify.app/posts/2021-05-05-web-scraping-with-r/

BibTeX citation

@misc{okola2021web,
  author = {Okola, Basil},
  title = {Basil Okola: Web scraping with R},
  url = {https://bokola214.netlify.app/posts/2021-05-05-web-scraping-with-r/},
  year = {2021}
}