Web scraping with R


Web scraping with the rvest, httr, and jsonlite packages.

Basil Okola https://github.com/Bokola
05-05-2021

Web scraping in R

HTML Data structure

Web pages are styled with CSS (Cascading Style Sheets) files that determine the layout of the page. CSS selectors can be used to locate the HTML elements of interest. One handy tool for discovering them is the SelectorGadget Google Chrome extension. You need to install it in your browser before proceeding.

To use it, open the page you want to scrape, then:

  1. Click on the element you want to select. SelectorGadget will make a first guess at what css selector you want. It’s likely to be bad since it only has one example to learn from, but it’s a start. Elements that match the selector will be highlighted in yellow.

  2. Click on elements that shouldn’t be selected. They will turn red. Click on elements that should be selected. They will turn green.

  3. Iterate until only the elements you want are selected. SelectorGadget isn’t perfect and sometimes won’t be able to find a useful css selector. Sometimes starting from a different element helps. More at tidyverse/rvest

For example, suppose we want the actors listed on an IMDb movie page, e.g. The Shawshank Redemption.

HTML tags can be passed to functions to retrieve the web page elements of interest.
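As a minimal sketch of this idea, rvest can apply a CSS selector to any parsed HTML document; here a tiny in-memory page with made-up names stands in for a real site:

```r
library(rvest)

# a small in-memory page standing in for a real website (hypothetical content)
page <- minimal_html('
  <table class="cast_list">
    <tr><td class="primary_photo"></td><td><a href="/name/nm1/">Jane Doe</a></td></tr>
    <tr><td class="primary_photo"></td><td><a href="/name/nm2/">John Roe</a></td></tr>
  </table>')

# apply a CSS selector (as found with SelectorGadget), then pull out the text
page %>% html_nodes(".primary_photo+ td a") %>% html_text(trim = TRUE)
```

The same two-step pattern (select nodes, then extract their contents) is what we use on the live IMDb page below.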

rvest

rvest is for scraping (harvesting) data from the web into a structured format that can be used in further analysis.

rvest functions

library(rvest)

# specify the url
url = 'https://www.imdb.com/title/tt0111161/'
# read the html code from the page
webpage = read_html(url)
webpage
{html_document}
<html xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; cha ...
[2] <body id="styleguide-v2" class="fixed">\n            <img heigh ...

Once we have determined the CSS selector, we use it to extract the information we want

cast = html_nodes(webpage, ".primary_photo+ td a")
length(cast)
[1] 15
cast[1:2]
{xml_nodeset (2)}
[1] <a href="/name/nm0000209/?ref_=tt_cl_t1"> Tim Robbins\n</a>
[2] <a href="/name/nm0000151/?ref_=tt_cl_t2"> Morgan Freeman\n</a>

Finally, we extract the text from the selected HTML nodes.

html_text(cast, trim = TRUE)
 [1] "Tim Robbins"       "Morgan Freeman"    "Bob Gunton"       
 [4] "William Sadler"    "Clancy Brown"      "Gil Bellows"      
 [7] "Mark Rolston"      "James Whitmore"    "Jeffrey DeMunn"   
[10] "Larry Brandenburg" "Neil Giuntoli"     "Brian Libby"      
[13] "David Proval"      "Joseph Ragno"      "Jude Ciccolella"  

Extracting a table

all_tables = html_table(html_nodes(webpage, "table"), header = FALSE)
casttable = html_table(html_nodes(webpage, ".cast_list"), header = FALSE)[[1]]
head(casttable)
                                 X1                                X2
1 Cast overview, first billed only: Cast overview, first billed only:
2                      \n                  \n Tim Robbins\n          
3                      \n               \n Morgan Freeman\n          
4                      \n                   \n Bob Gunton\n          
5                      \n               \n William Sadler\n          
6                      \n                 \n Clancy Brown\n          
                                 X3
1 Cast overview, first billed only:
2   \n              ...\n          
3   \n              ...\n          
4   \n              ...\n          
5   \n              ...\n          
6   \n              ...\n          
                                                                       X4
1                                       Cast overview, first billed only:
2            \n            Andy Dufresne \n                  \n          
3 \n            Ellis Boyd 'Red' Redding \n                  \n          
4            \n            Warden Norton \n                  \n          
5                  \n            Heywood \n                  \n          
6           \n            Captain Hadley \n                  \n          
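The scraped table is cluttered with newlines and repeated spaces. A small base-R helper (a sketch, not part of the original post) collapses the whitespace before further analysis:

```r
# collapse runs of whitespace and trim the ends
clean <- function(x) trimws(gsub("\\s+", " ", x))

clean("  \n Morgan Freeman\n  ")   # "Morgan Freeman"
```

Applied column-wise, e.g. clean(casttable$X2), this turns the raw cells into tidy actor names.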

Attributes of an element

If, say, we are also interested in extracting the links to the actors’ pages, we can access the HTML attributes of the selected nodes using html_attrs().

cast_attrs = html_attrs(cast)
cast_attrs[1:2]
[[1]]
                            href 
"/name/nm0000209/?ref_=tt_cl_t1" 

[[2]]
                            href 
"/name/nm0000151/?ref_=tt_cl_t2" 

As we can see, there is only one attribute, href, which contains the relative URL to the actor’s page. We can extract it using html_attr(), indicating the name of the attribute of interest. Relative URLs can be turned into absolute URLs using url_absolute().

cast_rel_urls = html_attr(cast, "href")
length(cast_rel_urls)
[1] 15
cast_rel_urls[1:2]
[1] "/name/nm0000209/?ref_=tt_cl_t1" "/name/nm0000151/?ref_=tt_cl_t2"
cast_abs_urls = html_attr(cast, "href") %>%
  url_absolute(url)
cast_abs_urls[1:2]
[1] "https://www.imdb.com/name/nm0000209/?ref_=tt_cl_t1"
[2] "https://www.imdb.com/name/nm0000151/?ref_=tt_cl_t2"

Making API Requests in R

httr

httr provides functions named after HTTP verbs; GET() retrieves a resource from a URL.

library(httr)

res = GET('http://api.open-notify.org/astros.json')
res
Response [http://api.open-notify.org/astros.json]
  Date: 2021-05-04 22:51
  Status: 200
  Content-Type: application/json
  Size: 355 B
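A Status of 200 means the request succeeded; in real code it is worth checking this before parsing the body. A defensive sketch, assuming the same endpoint is still reachable:

```r
library(httr)

res <- GET("http://api.open-notify.org/astros.json")

# stop early on HTTP errors (4xx/5xx) instead of parsing an error page
if (http_error(res)) {
  stop("Request failed with status ", status_code(res))
}
```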

JSON Format

[
  { "name": "Miguel", "student_id": 1, "exam_1": 85, "exam_2": 86 },
  { "name": "Sofia", "student_id": 2, "exam_1": 94, "exam_2": 93 }
]

jsonlite

fromJSON() parses a JSON string into R objects, simplifying JSON arrays of objects into data frames where possible.
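For instance, the student records from the JSON snippet above parse directly into a data frame:

```r
library(jsonlite)

students <- fromJSON('[
  {"name": "Miguel", "student_id": 1, "exam_1": 85, "exam_2": 86},
  {"name": "Sofia",  "student_id": 2, "exam_1": 94, "exam_2": 93}
]')
students
# a 2-row data frame with columns name, student_id, exam_1, exam_2
```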

library(jsonlite)
rawToChar(res$content)
[1] "{\"number\": 7, \"message\": \"success\", \"people\": [{\"name\": \"Mark Vande Hei\", \"craft\": \"ISS\"}, {\"name\": \"Oleg Novitskiy\", \"craft\": \"ISS\"}, {\"name\": \"Pyotr Dubrov\", \"craft\": \"ISS\"}, {\"name\": \"Thomas Pesquet\", \"craft\": \"ISS\"}, {\"name\": \"Megan McArthur\", \"craft\": \"ISS\"}, {\"name\": \"Shane Kimbrough\", \"craft\": \"ISS\"}, {\"name\": \"Akihiko Hoshide\", \"craft\": \"ISS\"}]}"
data = fromJSON(rawToChar(res$content))
data
$number
[1] 7

$message
[1] "success"

$people
             name craft
1  Mark Vande Hei   ISS
2  Oleg Novitskiy   ISS
3    Pyotr Dubrov   ISS
4  Thomas Pesquet   ISS
5  Megan McArthur   ISS
6 Shane Kimbrough   ISS
7 Akihiko Hoshide   ISS

Read directly from a URL

data <- fromJSON("https://api.github.com/users/hadley/orgs")
class(data)
[1] "data.frame"
names(data)
 [1] "login"              "id"                 "node_id"           
 [4] "url"                "repos_url"          "events_url"        
 [7] "hooks_url"          "issues_url"         "members_url"       
[10] "public_members_url" "avatar_url"         "description"       

toJSON() does the reverse, serializing R objects to a JSON string.

jsondata = toJSON(data)
head(jsondata)
[1] "[{\"login\":\"ggobi\",\"id\":423638,\"node_id\":\"MDEyOk9yZ2FuaXphdGlvbjQyMzYzOA==\",\"url\":\"https://api.github.com/orgs/ggobi\",\"repos_url\":\"https://api.github.com/orgs/ggobi/repos\",\"events_url\":\"https://api.github.com/orgs/ggobi/events\",\"hooks_url\":\"https://api.github.com/orgs/ggobi/hooks\",\"issues_url\":\"https://api.github.com/orgs/ggobi/issues\",\"members_url\":\"https://api.github.com/orgs/ggobi/members{/member}\",\"public_members_url\":\"https://api.github.com/orgs/ggobi/public_members{/member}\",\"avatar_url\":\"https://avatars.githubusercontent.com/u/423638?v=4\",\"description\":\"\"},{\"login\":\"rstudio\",\"id\":513560,\"node_id\":\"MDEyOk9yZ2FuaXphdGlvbjUxMzU2MA==\",\"url\":\"https://api.github.com/orgs/rstudio\",\"repos_url\":\"https://api.github.com/orgs/rstudio/repos\",\"events_url\":\"https://api.github.com/orgs/rstudio/events\",\"hooks_url\":\"https://api.github.com/orgs/rstudio/hooks\",\"issues_url\":\"https://api.github.com/orgs/rstudio/issues\",\"members_url\":\"https://api.github.com/orgs/rstudio/members{/member}\",\"public_members_url\":\"https://api.github.com/orgs/rstudio/public_members{/member}\",\"avatar_url\":\"https://avatars.githubusercontent.com/u/513560?v=4\",\"description\":\"\"},{\"login\":\"rstats\",\"id\":722735,\"node_id\":\"MDEyOk9yZ2FuaXphdGlvbjcyMjczNQ==\",\"url\":\"https://api.github.com/orgs/rstats\",\"repos_url\":\"https://api.github.com/orgs/rstats/repos\",\"events_url\":\"https://api.github.com/orgs/rstats/events\",\"hooks_url\":\"https://api.github.com/orgs/rstats/hooks\",\"issues_url\":\"https://api.github.com/orgs/rstats/issues\",\"members_url\":\"https://api.github.com/orgs/rstats/members{/member}\",\"public_members_url\":\"https://api.github.com/orgs/rstats/public_members{/member}\",\"avatar_url\":\"https://avatars.githubusercontent.com/u/722735?v=4\"},{\"login\":\"ropensci\",\"id\":1200269,\"node_id\":\"MDEyOk9yZ2FuaXphdGlvbjEyMDAyNjk=\",\"url\":\"https://api.github.com/orgs/ropensci\",\"repos_url\
":\"https://api.github.com/orgs/ropensci/repos\",\"events_url\":\"https://api.github.com/orgs/ropensci/events\",\"hooks_url\":\"https://api.github.com/orgs/ropensci/hooks\",\"issues_url\":\"https://api.github.com/orgs/ropensci/issues\",\"members_url\":\"https://api.github.com/orgs/ropensci/members{/member}\",\"public_members_url\":\"https://api.github.com/orgs/ropensci/public_members{/member}\",\"avatar_url\":\"https://avatars.githubusercontent.com/u/1200269?v=4\",\"description\":\"\"},{\"login\":\"rjournal\",\"id\":3330561,\"node_id\":\"MDEyOk9yZ2FuaXphdGlvbjMzMzA1NjE=\",\"url\":\"https://api.github.com/orgs/rjournal\",\"repos_url\":\"https://api.github.com/orgs/rjournal/repos\",\"events_url\":\"https://api.github.com/orgs/rjournal/events\",\"hooks_url\":\"https://api.github.com/orgs/rjournal/hooks\",\"issues_url\":\"https://api.github.com/orgs/rjournal/issues\",\"members_url\":\"https://api.github.com/orgs/rjournal/members{/member}\",\"public_members_url\":\"https://api.github.com/orgs/rjournal/public_members{/member}\",\"avatar_url\":\"https://avatars.githubusercontent.com/u/3330561?v=4\"},{\"login\":\"r-dbi\",\"id\":5695665,\"node_id\":\"MDEyOk9yZ2FuaXphdGlvbjU2OTU2NjU=\",\"url\":\"https://api.github.com/orgs/r-dbi\",\"repos_url\":\"https://api.github.com/orgs/r-dbi/repos\",\"events_url\":\"https://api.github.com/orgs/r-dbi/events\",\"hooks_url\":\"https://api.github.com/orgs/r-dbi/hooks\",\"issues_url\":\"https://api.github.com/orgs/r-dbi/issues\",\"members_url\":\"https://api.github.com/orgs/r-dbi/members{/member}\",\"public_members_url\":\"https://api.github.com/orgs/r-dbi/public_members{/member}\",\"avatar_url\":\"https://avatars.githubusercontent.com/u/5695665?v=4\",\"description\":\"R + 
databases\"},{\"login\":\"RConsortium\",\"id\":15366137,\"node_id\":\"MDEyOk9yZ2FuaXphdGlvbjE1MzY2MTM3\",\"url\":\"https://api.github.com/orgs/RConsortium\",\"repos_url\":\"https://api.github.com/orgs/RConsortium/repos\",\"events_url\":\"https://api.github.com/orgs/RConsortium/events\",\"hooks_url\":\"https://api.github.com/orgs/RConsortium/hooks\",\"issues_url\":\"https://api.github.com/orgs/RConsortium/issues\",\"members_url\":\"https://api.github.com/orgs/RConsortium/members{/member}\",\"public_members_url\":\"https://api.github.com/orgs/RConsortium/public_members{/member}\",\"avatar_url\":\"https://avatars.githubusercontent.com/u/15366137?v=4\",\"description\":\"The R Consortium, Inc was established to provide support to the R Foundation and R Community, using maintaining and distributing R software.\"},{\"login\":\"tidyverse\",\"id\":22032646,\"node_id\":\"MDEyOk9yZ2FuaXphdGlvbjIyMDMyNjQ2\",\"url\":\"https://api.github.com/orgs/tidyverse\",\"repos_url\":\"https://api.github.com/orgs/tidyverse/repos\",\"events_url\":\"https://api.github.com/orgs/tidyverse/events\",\"hooks_url\":\"https://api.github.com/orgs/tidyverse/hooks\",\"issues_url\":\"https://api.github.com/orgs/tidyverse/issues\",\"members_url\":\"https://api.github.com/orgs/tidyverse/members{/member}\",\"public_members_url\":\"https://api.github.com/orgs/tidyverse/public_members{/member}\",\"avatar_url\":\"https://avatars.githubusercontent.com/u/22032646?v=4\",\"description\":\"The tidyverse is a collection of R packages that share common principles and are designed to work together 
seamlessly\"},{\"login\":\"r-lib\",\"id\":22618716,\"node_id\":\"MDEyOk9yZ2FuaXphdGlvbjIyNjE4NzE2\",\"url\":\"https://api.github.com/orgs/r-lib\",\"repos_url\":\"https://api.github.com/orgs/r-lib/repos\",\"events_url\":\"https://api.github.com/orgs/r-lib/events\",\"hooks_url\":\"https://api.github.com/orgs/r-lib/hooks\",\"issues_url\":\"https://api.github.com/orgs/r-lib/issues\",\"members_url\":\"https://api.github.com/orgs/r-lib/members{/member}\",\"public_members_url\":\"https://api.github.com/orgs/r-lib/public_members{/member}\",\"avatar_url\":\"https://avatars.githubusercontent.com/u/22618716?v=4\",\"description\":\"\"},{\"login\":\"rstudio-education\",\"id\":34165516,\"node_id\":\"MDEyOk9yZ2FuaXphdGlvbjM0MTY1NTE2\",\"url\":\"https://api.github.com/orgs/rstudio-education\",\"repos_url\":\"https://api.github.com/orgs/rstudio-education/repos\",\"events_url\":\"https://api.github.com/orgs/rstudio-education/events\",\"hooks_url\":\"https://api.github.com/orgs/rstudio-education/hooks\",\"issues_url\":\"https://api.github.com/orgs/rstudio-education/issues\",\"members_url\":\"https://api.github.com/orgs/rstudio-education/members{/member}\",\"public_members_url\":\"https://api.github.com/orgs/rstudio-education/public_members{/member}\",\"avatar_url\":\"https://avatars.githubusercontent.com/u/34165516?v=4\",\"description\":\"\"}]"

And back from JSON to a data frame:

backagain = fromJSON(jsondata)
identical(data, backagain)
[1] TRUE
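toJSON() has options worth knowing; pretty = TRUE, for instance, indents the output for readability. A quick illustration on a small made-up data frame:

```r
library(jsonlite)

# data frames serialize as JSON arrays of objects, one object per row
df <- data.frame(x = 1:2, y = c("a", "b"))
toJSON(df, pretty = TRUE)
```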

API with Query Parameters

The ISS and the Brooklyn Bridge

Query parameters are supplied through the query argument of GET(); here we ask when the ISS will next pass over the Brooklyn Bridge (latitude 40.7, longitude -74).

res = GET("http://api.open-notify.org/iss-pass.json",
          query = list(lat = 40.7, lon = -74))
res
Response [http://api.open-notify.org/iss-pass.json?lat=40.7&lon=-74]
  Date: 2021-05-04 22:51
  Status: 200
  Content-Type: application/json
  Size: 518 B
{
  "message": "success", 
  "request": {
    "altitude": 100, 
    "datetime": 1620167911, 
    "latitude": 40.7, 
    "longitude": -74.0, 
    "passes": 5
  }, 
  "response": [
...
data = fromJSON(rawToChar(res$content))
data$response
  duration   risetime
1      564 1620188939
2      654 1620194681
3      592 1620200551
4      566 1620206436
5      632 1620212262
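The risetime values are Unix timestamps (seconds since 1970-01-01 UTC); base R converts them to readable date-times:

```r
# convert the first risetime to a human-readable UTC date-time
as.POSIXct(1620188939, origin = "1970-01-01", tz = "UTC")
# "2021-05-05 04:28:59 UTC"
```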


Citation

For attribution, please cite this work as

Okola (2021, May 5). Basil Okola: Web scraping with R. Retrieved from https://bokola214.netlify.app/posts/2021-05-05-web-scraping-with-r/

BibTeX citation

@misc{okola2021web,
  author = {Okola, Basil},
  title = {Basil Okola: Web scraping with R},
  url = {https://bokola214.netlify.app/posts/2021-05-05-web-scraping-with-r/},
  year = {2021}
}