Web Scraping with the rvest, httr and jsonlite packages
Web pages are styled with CSS (Cascading Style Sheets), which determine the layout of the page. CSS selectors can be used to look for HTML elements of interest. One tool for finding selectors is the SelectorGadget Google Chrome extension. You need to install it in your browser before proceeding.
To use it, open the page you want to scrape, then:
1. Click on the element you want to select. SelectorGadget will make a first guess at what CSS selector you want. It's likely to be bad, since it only has one example to learn from, but it's a start. Elements that match the selector will be highlighted in yellow.
2. Click on elements that shouldn't be selected; they will turn red. Click on elements that should be selected; they will turn green.
3. Iterate until only the elements you want are selected. SelectorGadget isn't perfect and sometimes won't be able to find a useful CSS selector; starting from a different element sometimes helps. More at tidyverse/rvest.
For example, suppose we want the actors listed on an IMDb movie page, e.g. The Shawshank Redemption.
The CSS selectors we find can then be passed to rvest functions to retrieve the web page elements of interest.
rvest
For scraping (harvesting) data from the web into a structured format that can be used in further analysis.
read_html(): reads the HTML code of a web page
html_nodes(): extracts the pieces matching a CSS selector
html_text(): extracts the text of the relevant pieces
html_attrs(): extracts the attributes of the relevant pieces
# specify url
url = 'https://www.imdb.com/title/tt0111161/'
# read the html code from the url
webpage = read_html(url)
webpage
{html_document}
<html xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; cha ...
[2] <body id="styleguide-v2" class="fixed">\n <img heigh ...
Once we have determined the CSS selector, we use it to extract the information we want
cast = html_nodes(webpage, ".primary_photo+ td a")
length(cast)
[1] 15
cast[1:2]
{xml_nodeset (2)}
[1] <a href="/name/nm0000209/?ref_=tt_cl_t1"> Tim Robbins\n</a>
[2] <a href="/name/nm0000151/?ref_=tt_cl_t2"> Morgan Freeman\n</a>
Finally, we extract the text from the selected HTML nodes.
html_text(cast, trim = TRUE)
[1] "Tim Robbins" "Morgan Freeman" "Bob Gunton"
[4] "William Sadler" "Clancy Brown" "Gil Bellows"
[7] "Mark Rolston" "James Whitmore" "Jeffrey DeMunn"
[10] "Larry Brandenburg" "Neil Giuntoli" "Brian Libby"
[13] "David Proval" "Joseph Ragno" "Jude Ciccolella"
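The steps above can also be chained into a single pipeline. A minimal sketch, assuming rvest (and its magrittr pipe) is loaded and that IMDb still serves the same markup:

```r
library(rvest)

# url -> html -> nodes -> text, in one pipeline
cast_names = read_html('https://www.imdb.com/title/tt0111161/') %>%
  html_nodes(".primary_photo+ td a") %>%
  html_text(trim = TRUE)
```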
all_tables = html_table(webpage, header = FALSE)
casttable = html_table(html_nodes(webpage, ".cast_list"), header = FALSE)[[1]]
head(casttable)
X1 X2
1 Cast overview, first billed only: Cast overview, first billed only:
2 \n \n Tim Robbins\n
3 \n \n Morgan Freeman\n
4 \n \n Bob Gunton\n
5 \n \n William Sadler\n
6 \n \n Clancy Brown\n
X3
1 Cast overview, first billed only:
2 \n ...\n
3 \n ...\n
4 \n ...\n
5 \n ...\n
6 \n ...\n
X4
1 Cast overview, first billed only:
2 \n Andy Dufresne \n \n
3 \n Ellis Boyd 'Red' Redding \n \n
4 \n Warden Norton \n \n
5 \n Heywood \n \n
6 \n Captain Hadley \n \n
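The scraped table is littered with newlines and repeated whitespace. A base-R cleaning sketch (the column positions X2 and X4 are taken from the printout above; cast_df is a hypothetical name):

```r
# keep the actor (X2) and character (X4) columns and drop the header row
cast_df = casttable[-1, c("X2", "X4")]
names(cast_df) = c("actor", "character")
# collapse runs of whitespace and trim the edges
cast_df[] = lapply(cast_df, function(x) gsub("\\s+", " ", trimws(x)))
```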
If, say, we are also interested in extracting the links to the actors' pages, we can access the HTML attributes of the selected nodes using html_attrs().
cast_attrs = html_attrs(cast)
cast_attrs[1:2]
[[1]]
href
"/name/nm0000209/?ref_=tt_cl_t1"
[[2]]
href
"/name/nm0000151/?ref_=tt_cl_t2"
As we can see, there is only one attribute, href, which contains the relative URL to the actor's page. We can extract it using html_attr(), indicating the name of the attribute of interest. Relative URLs can be turned into absolute URLs using url_absolute().
cast_rel_urls = html_attr(cast, "href")
length(cast_rel_urls)
[1] 15
cast_rel_urls[1:2]
[1] "/name/nm0000209/?ref_=tt_cl_t1" "/name/nm0000151/?ref_=tt_cl_t2"
cast_abs_urls = html_attr(cast, "href") %>%
url_absolute(url)
cast_abs_urls[1:2]
[1] "https://www.imdb.com/name/nm0000209/?ref_=tt_cl_t1"
[2] "https://www.imdb.com/name/nm0000151/?ref_=tt_cl_t2"
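With absolute URLs we could follow each link and scrape the actors' own pages. A hedged sketch — the "h1" selector for the page heading is an assumption, since IMDb's markup changes over time; pausing between requests is basic scraping etiquette:

```r
library(rvest)

# visit the first actor's page and pull its main heading as a sanity check
actor_page = read_html(cast_abs_urls[1])
actor_heading = actor_page %>%
  html_nodes("h1") %>%
  html_text(trim = TRUE)

# when looping over many pages, be polite to the server:
# Sys.sleep(1)
```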
httr and jsonlite
httr
Create a request with the GET() function. The input is a URL, which specifies the address of the server.
Example: current number of people in space
res = GET('http://api.open-notify.org/astros.json')
res
Response [http://api.open-notify.org/astros.json]
Date: 2021-05-04 22:51
Status: 200
Content-Type: application/json
Size: 355 B
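Beyond printing the response, httr provides helpers to inspect it programmatically; status_code(), http_type() and content() are all part of the httr package:

```r
library(httr)

res = GET('http://api.open-notify.org/astros.json')
status_code(res)   # 200 means the request succeeded
http_type(res)     # "application/json"
# content() parses the body; for a JSON response it returns an R list
parsed = content(res, as = "parsed")
```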
jsonlite
The jsonlite package provides parser and generator functions:
fromJSON(): converts a JSON string, file or URL into R objects
toJSON(): converts R objects into JSON
fromJSON()
rawToChar(res$content)
[1] "{\"number\": 7, \"message\": \"success\", \"people\": [{\"name\": \"Mark Vande Hei\", \"craft\": \"ISS\"}, {\"name\": \"Oleg Novitskiy\", \"craft\": \"ISS\"}, {\"name\": \"Pyotr Dubrov\", \"craft\": \"ISS\"}, {\"name\": \"Thomas Pesquet\", \"craft\": \"ISS\"}, {\"name\": \"Megan McArthur\", \"craft\": \"ISS\"}, {\"name\": \"Shane Kimbrough\", \"craft\": \"ISS\"}, {\"name\": \"Akihiko Hoshide\", \"craft\": \"ISS\"}]}"
data = fromJSON(rawToChar(res$content))
data
$number
[1] 7
$message
[1] "success"
$people
name craft
1 Mark Vande Hei ISS
2 Oleg Novitskiy ISS
3 Pyotr Dubrov ISS
4 Thomas Pesquet ISS
5 Megan McArthur ISS
6 Shane Kimbrough ISS
7 Akihiko Hoshide ISS
Read directly from a URL
data <- fromJSON("https://api.github.com/users/hadley/orgs")
class(data)
[1] "data.frame"
names(data)
[1] "login" "id" "node_id"
[4] "url" "repos_url" "events_url"
[7] "hooks_url" "issues_url" "members_url"
[10] "public_members_url" "avatar_url" "description"
toJSON()
jsondata = toJSON(data)
head(jsondata)
[1] "[{\"login\":\"ggobi\",\"id\":423638,\"node_id\":\"MDEyOk9yZ2FuaXphdGlvbjQyMzYzOA==\",\"url\":\"https://api.github.com/orgs/ggobi\",\"repos_url\":\"https://api.github.com/orgs/ggobi/repos\",\"events_url\":\"https://api.github.com/orgs/ggobi/events\",\"hooks_url\":\"https://api.github.com/orgs/ggobi/hooks\",\"issues_url\":\"https://api.github.com/orgs/ggobi/issues\",\"members_url\":\"https://api.github.com/orgs/ggobi/members{/member}\",\"public_members_url\":\"https://api.github.com/orgs/ggobi/public_members{/member}\",\"avatar_url\":\"https://avatars.githubusercontent.com/u/423638?v=4\",\"description\":\"\"},{\"login\":\"rstudio\",\"id\":513560,\"node_id\":\"MDEyOk9yZ2FuaXphdGlvbjUxMzU2MA==\",\"url\":\"https://api.github.com/orgs/rstudio\",\"repos_url\":\"https://api.github.com/orgs/rstudio/repos\",\"events_url\":\"https://api.github.com/orgs/rstudio/events\",\"hooks_url\":\"https://api.github.com/orgs/rstudio/hooks\",\"issues_url\":\"https://api.github.com/orgs/rstudio/issues\",\"members_url\":\"https://api.github.com/orgs/rstudio/members{/member}\",\"public_members_url\":\"https://api.github.com/orgs/rstudio/public_members{/member}\",\"avatar_url\":\"https://avatars.githubusercontent.com/u/513560?v=4\",\"description\":\"\"},{\"login\":\"rstats\",\"id\":722735,\"node_id\":\"MDEyOk9yZ2FuaXphdGlvbjcyMjczNQ==\",\"url\":\"https://api.github.com/orgs/rstats\",\"repos_url\":\"https://api.github.com/orgs/rstats/repos\",\"events_url\":\"https://api.github.com/orgs/rstats/events\",\"hooks_url\":\"https://api.github.com/orgs/rstats/hooks\",\"issues_url\":\"https://api.github.com/orgs/rstats/issues\",\"members_url\":\"https://api.github.com/orgs/rstats/members{/member}\",\"public_members_url\":\"https://api.github.com/orgs/rstats/public_members{/member}\",\"avatar_url\":\"https://avatars.githubusercontent.com/u/722735?v=4\"},{\"login\":\"ropensci\",\"id\":1200269,\"node_id\":\"MDEyOk9yZ2FuaXphdGlvbjEyMDAyNjk=\",\"url\":\"https://api.github.com/orgs/ropensci\",\"repos_url\
":\"https://api.github.com/orgs/ropensci/repos\",\"events_url\":\"https://api.github.com/orgs/ropensci/events\",\"hooks_url\":\"https://api.github.com/orgs/ropensci/hooks\",\"issues_url\":\"https://api.github.com/orgs/ropensci/issues\",\"members_url\":\"https://api.github.com/orgs/ropensci/members{/member}\",\"public_members_url\":\"https://api.github.com/orgs/ropensci/public_members{/member}\",\"avatar_url\":\"https://avatars.githubusercontent.com/u/1200269?v=4\",\"description\":\"\"},{\"login\":\"rjournal\",\"id\":3330561,\"node_id\":\"MDEyOk9yZ2FuaXphdGlvbjMzMzA1NjE=\",\"url\":\"https://api.github.com/orgs/rjournal\",\"repos_url\":\"https://api.github.com/orgs/rjournal/repos\",\"events_url\":\"https://api.github.com/orgs/rjournal/events\",\"hooks_url\":\"https://api.github.com/orgs/rjournal/hooks\",\"issues_url\":\"https://api.github.com/orgs/rjournal/issues\",\"members_url\":\"https://api.github.com/orgs/rjournal/members{/member}\",\"public_members_url\":\"https://api.github.com/orgs/rjournal/public_members{/member}\",\"avatar_url\":\"https://avatars.githubusercontent.com/u/3330561?v=4\"},{\"login\":\"r-dbi\",\"id\":5695665,\"node_id\":\"MDEyOk9yZ2FuaXphdGlvbjU2OTU2NjU=\",\"url\":\"https://api.github.com/orgs/r-dbi\",\"repos_url\":\"https://api.github.com/orgs/r-dbi/repos\",\"events_url\":\"https://api.github.com/orgs/r-dbi/events\",\"hooks_url\":\"https://api.github.com/orgs/r-dbi/hooks\",\"issues_url\":\"https://api.github.com/orgs/r-dbi/issues\",\"members_url\":\"https://api.github.com/orgs/r-dbi/members{/member}\",\"public_members_url\":\"https://api.github.com/orgs/r-dbi/public_members{/member}\",\"avatar_url\":\"https://avatars.githubusercontent.com/u/5695665?v=4\",\"description\":\"R + 
databases\"},{\"login\":\"RConsortium\",\"id\":15366137,\"node_id\":\"MDEyOk9yZ2FuaXphdGlvbjE1MzY2MTM3\",\"url\":\"https://api.github.com/orgs/RConsortium\",\"repos_url\":\"https://api.github.com/orgs/RConsortium/repos\",\"events_url\":\"https://api.github.com/orgs/RConsortium/events\",\"hooks_url\":\"https://api.github.com/orgs/RConsortium/hooks\",\"issues_url\":\"https://api.github.com/orgs/RConsortium/issues\",\"members_url\":\"https://api.github.com/orgs/RConsortium/members{/member}\",\"public_members_url\":\"https://api.github.com/orgs/RConsortium/public_members{/member}\",\"avatar_url\":\"https://avatars.githubusercontent.com/u/15366137?v=4\",\"description\":\"The R Consortium, Inc was established to provide support to the R Foundation and R Community, using maintaining and distributing R software.\"},{\"login\":\"tidyverse\",\"id\":22032646,\"node_id\":\"MDEyOk9yZ2FuaXphdGlvbjIyMDMyNjQ2\",\"url\":\"https://api.github.com/orgs/tidyverse\",\"repos_url\":\"https://api.github.com/orgs/tidyverse/repos\",\"events_url\":\"https://api.github.com/orgs/tidyverse/events\",\"hooks_url\":\"https://api.github.com/orgs/tidyverse/hooks\",\"issues_url\":\"https://api.github.com/orgs/tidyverse/issues\",\"members_url\":\"https://api.github.com/orgs/tidyverse/members{/member}\",\"public_members_url\":\"https://api.github.com/orgs/tidyverse/public_members{/member}\",\"avatar_url\":\"https://avatars.githubusercontent.com/u/22032646?v=4\",\"description\":\"The tidyverse is a collection of R packages that share common principles and are designed to work together 
seamlessly\"},{\"login\":\"r-lib\",\"id\":22618716,\"node_id\":\"MDEyOk9yZ2FuaXphdGlvbjIyNjE4NzE2\",\"url\":\"https://api.github.com/orgs/r-lib\",\"repos_url\":\"https://api.github.com/orgs/r-lib/repos\",\"events_url\":\"https://api.github.com/orgs/r-lib/events\",\"hooks_url\":\"https://api.github.com/orgs/r-lib/hooks\",\"issues_url\":\"https://api.github.com/orgs/r-lib/issues\",\"members_url\":\"https://api.github.com/orgs/r-lib/members{/member}\",\"public_members_url\":\"https://api.github.com/orgs/r-lib/public_members{/member}\",\"avatar_url\":\"https://avatars.githubusercontent.com/u/22618716?v=4\",\"description\":\"\"},{\"login\":\"rstudio-education\",\"id\":34165516,\"node_id\":\"MDEyOk9yZ2FuaXphdGlvbjM0MTY1NTE2\",\"url\":\"https://api.github.com/orgs/rstudio-education\",\"repos_url\":\"https://api.github.com/orgs/rstudio-education/repos\",\"events_url\":\"https://api.github.com/orgs/rstudio-education/events\",\"hooks_url\":\"https://api.github.com/orgs/rstudio-education/hooks\",\"issues_url\":\"https://api.github.com/orgs/rstudio-education/issues\",\"members_url\":\"https://api.github.com/orgs/rstudio-education/members{/member}\",\"public_members_url\":\"https://api.github.com/orgs/rstudio-education/public_members{/member}\",\"avatar_url\":\"https://avatars.githubusercontent.com/u/34165516?v=4\",\"description\":\"\"}]"
And back from JSON to a data frame:
backagain = fromJSON(jsondata)
identical(data, backagain)
[1] TRUE
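The round trip works for objects we build ourselves as well; a small self-contained sketch (the data here is made up for illustration):

```r
library(jsonlite)

# data frame -> JSON string -> data frame again
df = data.frame(name = c("Ada", "Grace"), score = c(90, 95))
json = toJSON(df)
df2 = fromJSON(json)
all.equal(df, df2)
```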
The ISS and the Brooklyn Bridge
res = GET("http://api.open-notify.org/iss-pass.json",
query = list(lat = 40.7, lon = -74))
res
Response [http://api.open-notify.org/iss-pass.json?lat=40.7&lon=-74]
Date: 2021-05-04 22:51
Status: 200
Content-Type: application/json
Size: 518 B
{
"message": "success",
"request": {
"altitude": 100,
"datetime": 1620167911,
"latitude": 40.7,
"longitude": -74.0,
"passes": 5
},
"response": [
...
data = fromJSON(rawToChar(res$content))
data$response
duration risetime
1 564 1620188939
2 654 1620194681
3 592 1620200551
4 566 1620206436
5 632 1620212262
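The risetime values are Unix timestamps (seconds since 1970-01-01 UTC); base R's as.POSIXct() converts them to date-times. A sketch using the values printed above (the origin argument is required in older R versions):

```r
# convert the Unix epoch seconds returned by the API into UTC date-times
risetimes = c(1620188939, 1620194681, 1620200551, 1620206436, 1620212262)
as.POSIXct(risetimes, origin = "1970-01-01", tz = "UTC")
```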
For attribution, please cite this work as:
Okola (2021, May 5). Basil Okola: Web scraping with R. Retrieved from https://bokola214.netlify.app/posts/2021-05-05-web-scraping-with-r/

BibTeX citation:

@misc{okola2021web,
  author = {Okola, Basil},
  title = {Basil Okola: Web scraping with R},
  url = {https://bokola214.netlify.app/posts/2021-05-05-web-scraping-with-r/},
  year = {2021}
}