The World Wide Web (WWW) has revolutionised the way we access information and opened the doors to both searching for and sharing data. Web data is a key asset for business, academia and NGOs alike, paving the way to better strategic decisions. This is especially true, though by no means only, when analysing the social aspects of data.

As a real-world example, when working with Web data you usually have to deal with unclassified, non-tabular data. Such data are mostly neither clean, organised nor structured. Because of the variety of platforms and compatibility issues, a method that works on one website might not work on another. However, this should not scare you, or us, because influential programmers and advisors have spent many sleepless nights settling on standards that make the process less painful. Nevertheless, one must say that your hands inevitably get dirty once you start handling Web data.

Welcome aboard

Web data handling has two main phases. First, we deal with the protocols (HTTP, with methods such as GET and POST, …) and procedures (cookies, authentication, forms, …). Second, we parse (HTML, XML, JSON), extract and clean the data we obtained through those protocols and procedures.

This blog post will go over how to work with Web data in R, putting the httr, xml2, jsonlite, and rvest packages into action. We will request and analyse a Wikipedia page, “List of chocolate bar brands”, in multiple ways. By the end of the post, you will be familiar with how Web data is handled in R and how it is transformed into something useful in the R environment.

Before anything else, let’s load the tidyverse package, as we will need it many times.

library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr

Have a go-fast boat with an “API”

Although it is possible to read .csv files from static sources, obtaining website data (usually from social network sites) in a structured way can be cumbersome. An API (application programming interface) defines protocols and structured responses that facilitate interaction between computers, helping, for example, an analyst obtain all the necessary, most up-to-date data.

With API client packages, you can read such data directly as R objects. Many API client packages are available on CRAN that essentially wrap the data for your use. Things can be a little harder (but just a little) when there is no API client for your target website's data. In that case, you have to handle the API yourself, a process we will also cover below.

HTTP (Hypertext Transfer Protocol) is the main protocol on the web. A client sends a request to the server, and the server returns a response; that is how server data appears in our web browser. A request consists of four main parts: the URL, the method, the headers, and the body.

Let’s consider the method: it specifies what the client wants from the server. The two most common are GET, asking the server to retrieve a resource, and POST, asking the server to create a new resource. We will mostly use GET() requests, as we want to get the data specified in the URL. POST() requests are also useful in authentication processes.

The httr package is designed to give you control over HTTP and is organised around the HTTP verbs (GET(), POST(), etc.). We use httpbin, a free open-source HTTP request and response testing service, to try out our queries. First of all, make a GET request to http://httpbin.org/get and display its structure with the str() function, using the list.len option to limit the number of list elements shown.

#Load the httr package
library("httr")
get_http <- GET("http://httpbin.org/get")
str(get_http, list.len = 3)
## List of 10
##  $ url        : chr "http://httpbin.org/get"
##  $ status_code: int 200
##  $ headers    :List of 10
##   ..$ connection                      : chr "keep-alive"
##   ..$ server                          : chr "meinheld/0.6.1"
##   ..$ date                            : chr "Sat, 14 Oct 2017 18:13:06 GMT"
##   .. [list output truncated]

You can use GET() with the query argument. Create a list with country and manufacturer elements, then make a parameter-based call to httpbin, passing arg_pars as the query parameters. Inspect the "args" section of the response below:

arg_pars <- list(country = "UK",
    manufacturer = "Mars")
arg_resp <- GET("https://httpbin.org/get", query = arg_pars)
print(arg_resp)
## Response [https://httpbin.org/get?country=UK&manufacturer=Mars]
##   Date: 2017-10-14 18:13
##   Status: 200
##   Content-Type: application/json
##   Size: 409 B
## {
##   "args": {
##     "country": "UK",
##     "manufacturer": "Mars"
##   },
##   "headers": {
##     "Accept": "application/json, text/xml, application/xml, */*",
##     "Accept-Encoding": "gzip, deflate",
##     "Connection": "close",
##     "Host": "httpbin.org",
## ...

Note that Status: 200 means the request was successful, while Status: 404, as you probably know very well, means the requested resource was not found. The full list of HTTP status codes is worth browsing.
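
As a minimal sketch, httr also provides helpers for reading and acting on the status code; here we reuse the get_http response from above:

# inspect the status of the earlier GET response
status_code(get_http)          # returns 200 for a successful request
http_status(get_http)$message  # a human-readable status message
stop_for_status(get_http)      # throws an R error if the status is 4xx or 5xx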

We can also make a POST request to http://httpbin.org/post with the body “This message is from strboul”, and then we print it for inspection.

post_http <- POST(url="http://httpbin.org/post", body="This message is from strboul")
print(post_http)
## Response [http://httpbin.org/post]
##   Date: 2017-10-14 18:13
##   Status: 200
##   Content-Type: application/json
##   Size: 448 B
## {
##   "args": {},
##   "data": "This message is from strboul",
##   "files": {},
##   "form": {},
##   "headers": {
##     "Accept": "application/json, text/xml, application/xml, */*",
##     "Accept-Encoding": "gzip, deflate",
##     "Connection": "close",
##     "Content-Length": "28",
## ...

You are, of course, expected to be reasonable and respect the API service[1]. APIs use access tokens and rate limits to control misuse of their services, typically capping the number of requests a single user or session can make so that the servers do not become overloaded or locked. For example, Twitter controls the number of requests to its servers by rate-limiting, allowing GET requests in two tiers: “15 calls every 15 minutes, and 180 calls every 15 minutes.” To respect such limits, you can use the Sys.sleep() function inside a for loop to space your requests out over a given time interval.
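
A minimal sketch of that idea, using hypothetical stand-in URLs against httpbin rather than a real rate-limited API:

# hypothetical example: pause between calls to stay under a rate limit
urls <- paste0("http://httpbin.org/get?page=", 1:3)  # stand-in URLs
responses <- vector("list", length(urls))
for (i in seq_along(urls)) {
  responses[[i]] <- GET(urls[i])
  Sys.sleep(60)  # wait 60 seconds before sending the next request
}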

Next, we define a simple R function that works as an API client and retrieves the content of a given Wikipedia page. For this we turn to the MediaWiki API web service.

The MediaWiki action API is a web service that provides convenient access to wiki features, data, and meta-data over HTTP, via a URL usually at api.php. Clients request particular “actions” by specifying an action parameter, mainly action=query to get information. It was known as the MediaWiki API, but there are now other web APIs available that connect to MediaWiki such as RESTBase and the Wikidata query service.

There are two types of URL. First, directory-based URLs separated by slashes, e.g. https://wikipedia.org/wiki/post/article/ ; second, parameter-based URLs, such as Google Analytics’s UTM links: http://www.example.com/?utm_source=adsite&utm_campaign=adcampaign .
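
As a small aside, httr can take a parameter-based URL apart for you. A quick sketch with parse_url(), applied to the illustrative UTM link from this paragraph:

library(httr)
parts <- parse_url("http://www.example.com/?utm_source=adsite&utm_campaign=adcampaign")
parts$hostname  # "www.example.com"
parts$query     # a named list: utm_source = "adsite", utm_campaign = "adcampaign"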

This is a generic parameter-based url that tells Wikipedia’s web service API to send the content of the “main page”:

https://en.wikipedia.org/w/api.php?action=query&titles=Main%20Page&prop=revisions&rvprop=content&format=json

We can deconstruct such a URL with the modify_url() function, which modifies a URL by first parsing it and then replacing its components with the arguments you supply. We use it together with the query argument, which swaps in new query components.
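
As a minimal sketch, rebuilding the URL above from its components with modify_url() (main_page_url is just an illustrative name):

library(httr)
main_page_url <- modify_url(
  "https://en.wikipedia.org/w/api.php",
  query = list(
    action = "query",
    titles = "Main Page",
    prop   = "revisions",
    rvprop = "content",
    format = "json"
  )
)
main_page_url  # should reproduce the parameter-based URL shown above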

This URL tells English Wikipedia’s web service API to send you the content of the main page: Use any programming language to make an HTTP GET request for that URL (or just visit that link in your browser), and you’ll get a JSON document which includes the current wiki markup for the page titled “Main Page”. Changing the format to jsonfm will return a “pretty-printed” HTML result good for debugging.

We define an R function, page_source(), that fetches the page source:

library(httr)
page_source <- function(url_name, parse_format) {
  url <- modify_url(
    "https://en.wikipedia.org/w/api.php",
    query = list(
      action = "query",
      titles = url_name,
      prop   = "revisions",
      rvprop = "content",
      format = parse_format
    )
  )
  response <- GET(url)
  if (http_error(response)) {
    stop("The request failed")
  } else {
    result <- content(response)
    return(result)
  }
}

url_name is the name of the article page as it appears in the address bar. It is better to copy and paste it from there than to type the article title shown in the HTML body, since Wikipedia may serve disambiguation pages.

Let’s run the function:

a <- page_source("List_of_chocolate_bar_brands", "json")
str(a)
## List of 2
##  $ batchcomplete: chr ""
##  $ query        :List of 2
##   ..$ normalized:List of 1
##   .. ..$ :List of 2
##   .. .. ..$ from: chr "List_of_chocolate_bar_brands"
##   .. .. ..$ to  : chr "List of chocolate bar brands"
##   ..$ pages     :List of 1
##   .. ..$ 18509922:List of 4
##   .. .. ..$ pageid   : int 18509922
##   .. .. ..$ ns       : int 0
##   .. .. ..$ title    : chr "List of chocolate bar brands"
##   .. .. ..$ revisions:List of 1
##   .. .. .. ..$ :List of 3
##   .. .. .. .. ..$ contentformat: chr "text/x-wiki"
##   .. .. .. .. ..$ contentmodel : chr "wikitext"__truncated__

You can also use jsonfm, instead of json, as the format argument of the page_source() function to get a pretty-printed, HTML-wrapped version of the JSON. As the MediaWiki API:Main Page says,

Changing the format to jsonfm will return a “pretty-printed” HTML result good for debugging.

Passing "jsonfm" to the page_source() function:

a01 <- page_source("List_of_chocolate_bar_brands", "jsonfm")
print(a01)
## {xml_document}
## <html class="client-nojs" lang="en" dir="ltr">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns--1 ns-sp ...

I love fresh air, sea salt and the JSON & XML

As we have seen in the example above, the Wikipedia API function retrieved the page in JSON format. So what is that? Both JSON (JavaScript Object Notation) and XML (Extensible Markup Language), which we will see below, are very popular formats for storing data on the web. They store data as nested, list-like structures, unlike R’s rectangular data frames and matrices.

The jsonlite package offers a good implementation for working with JSON in the R environment. You can work with JSON data once you have parsed it into an R object: the fromJSON() function takes JSON text and returns the corresponding R object. Two of its arguments control simplification: simplifyVector = TRUE converts JSON arrays of numbers or strings into vectors, and simplifyDataFrame = TRUE converts JSON arrays of objects (records) into data frames.

library(jsonlite)
##
## Attaching package: 'jsonlite'

## The following object is masked from 'package:purrr':
##
##     flatten
chocolate_brands <- c('
  [
    {
      "name" : "Dove Bar",
      "founded" : 1956
    },
    {
      "name" : "Toblerone",
      "founded" : 1908
    }
  ]')

choco_json <- fromJSON(chocolate_brands, simplifyVector = F) # parse that with fromJSON()
print(choco_json)
## [[1]]
## [[1]]$name
## [1] "Dove Bar"
##
## [[1]]$founded
## [1] 1956
##
##
## [[2]]
## [[2]]$name
## [1] "Toblerone"
##
## [[2]]$founded
## [1] 1908
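
For comparison with the simplification arguments mentioned above: with the default simplifyDataFrame = TRUE, the same JSON parses straight into a data frame (a small sketch; choco_df is just an illustrative name):

choco_df <- fromJSON(chocolate_brands)  # defaults: simplifyVector = TRUE, simplifyDataFrame = TRUE
print(choco_df)  # a two-row data frame with columns `name` and `founded`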

The dplyr package can manipulate and reshape these parsed JSON objects. Binding the records into a tibble with bind_rows() and extracting the name column with select():

name <- choco_json %>%
              bind_rows %>%
              select(name)
print(name)
## # A tibble: 2 x 1
##        name
##       <chr>
## 1  Dove Bar
## 2 Toblerone

The ballad of the XML

What if the data is not provided to us in any tabular form, or through an API? We can simply scrape the web page and extract the desired data ourselves, although it may not be as organized as in the previous examples.

Scraped web data can be stored in XML format. A sample XML document looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<products>
  <product>
    <name country = "UK">Dove Bar</name>
    <manufacturer>Mars</manufacturer>
  </product>
  <product>
    <name country = "USA">Toblerone</name>
    <manufacturer>Kraft Foods</manufacturer>
  </product>
</products>

XML is hierarchical and can be seen as a tree. In the code above, <name> and <manufacturer> are children of the <product> tag, and <name> and <manufacturer> are siblings. Tags can have attributes, such as country="...".

The xml2 package is useful for parsing XML and HTML documents. First, load the package. We parse the Wikipedia page listing chocolate bar brands: calling read_xml() (or read_html()) on the page URL returns the HTML page in the form of an XML document.

library(xml2)
url <- "https://en.wikipedia.org/wiki/List_of_chocolate_bar_brands"  # the page URL
choco_xml <- read_xml(url)
print(choco_xml)

## {xml_document}
## <html class="client-nojs" lang="en" dir="ltr">
## [1] <head>\n \n List of chocolate bar br ...
## [2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-sub ...

Optionally, if you want to save the XML locally, you can use the write_xml() function. Changing the options argument to "as_html" forces HTML output.

write_xml(choco_xml, "path/to/chocolates.xml", options = "as_xml")

We are going to use XPath (XML Path Language), which allows you to write queries for XML, especially to locate parts of a document (node locations work a bit like file paths: /movies/movie/title).

XPath syntax is quite simple:

  • / - selects from the root node at this level
  • // - selects nodes matching the selection anywhere at or below the current level
  • . - selects the current node
  • .. - selects the parent of the current node
  • @ - selects attributes

If you have unknown nodes,

  • * - Any element node
  • @* - Any attribute node

If you want to combine several paths, use the | operator: //node1 | //node2
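
Before applying XPath to the Wikipedia page, here is a small sketch on the sample <products> document from the previous section (rebuilt here as a string), using xml_find_all(), which is introduced properly just below:

library(xml2)
products <- read_xml('<products>
  <product>
    <name country="UK">Dove Bar</name>
    <manufacturer>Mars</manufacturer>
  </product>
  <product>
    <name country="USA">Toblerone</name>
    <manufacturer>Kraft Foods</manufacturer>
  </product>
</products>')
xml_text(xml_find_all(products, "//name"))                          # "Dove Bar"  "Toblerone"
xml_text(xml_find_all(products, "/products/product/manufacturer"))  # "Mars"  "Kraft Foods"
xml_attr(xml_find_all(products, "//name"), "country")               # "UK"  "USA"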

The xml2::xml_find_all() function finds all nodes matching an XPath expression. (It is also possible to use XPath with the html_node() function from the rvest package.) When we inspect the element in the page source, we see that the table is declared as <table class="wikitable sortable" ...>. The expression //table selects table elements anywhere in the document, and the [@class='wikitable sortable'] predicate keeps only those whose class attribute is 'wikitable sortable'.

r <- xml_find_all(choco_xml, "//table[@class='wikitable sortable']")
head(r)
## {xml_nodeset (1)}
## [1] <table class="wikitable sortable" style="width:100%;" cellpadding="5 ...

You can always inspect the page source in your browser.

Use xml_children() to navigate the family tree of an XML document. If you want to see the number of children, use xml_length().

a <- xml_children(r)
head(a)
## {xml_nodeset (6)}
## [1] <tr>\n  <th>Name</th>\n  <th class="unsortable">Image</th>\n  <th>Di ...
## [2] <tr>\n  <td>\n    <a href="/wiki/100_Grand_Bar" title="100 Grand Bar ...
## [3] <tr>\n  <td>\n    <a href="/wiki/3_Musketeers_(candy)" class="mw-red ...
## [4] <tr>\n  <td>\n    <a href="/wiki/3_Musketeers_(candy)" class="mw-red ...
## [5] <tr>\n  <td>\n    <a href="/wiki/5th_Avenue_(candy)" title="5th Aven ...
## [6] <tr>\n  <td>\n    <a href="/w/index.php?title=98%25_Cocoa_Stevia_Bar ...

xml_text() extracts the text from the parsed nodes and returns a character vector, whereas xml_double() returns a numeric vector and xml_integer() returns an integer vector.

x <- xml_text(a)
head(x)
## [1] "NameImageDistributionManufacturerDescription"                                                                                          
## [2] "100 Grand BarUnited StatesNestléCaramel and crisped rice"                                                                              
## [3] "3 MusketeersUnited States, CanadaMarsAerated chocolate-flavored nougat with milk chocolate coating; also available in mint and caramel"
## [4] "3 Musketeers Truffle CrispUnited StatesMars"                                                                                           
## [5] "5th AvenueUnited StatesHersheyHoneycombed crunchy peanut butter candy center in milk chocolate"                                        
## [6] "98% Cocoa Stevia BarUnited StatesDante ConfectionsHighest cocoa content chocolate bar, sweetened with stevia"

The rest of the job is data cleaning. I will not dive into that here, as the main purpose of this section is to show the XML parsing; however, I leave some practice snippets below for anyone interested (note that I have not tested them). Data cleaning and organization makes up the majority of the job (roughly 60% to 80% of the time) and should never be underestimated.

At first sight, our chocolate data appears to consist of row strings whose values are separated by the \n character. All we need to do is split them into separate columns and transform the result into a data frame (but check first whether the separator is consistent across all rows).

A sample data cleaning code might be:

# split each row string on the newline separator
z01 <- strsplit(x, "\n", fixed = TRUE)
# build a data frame, setting the column names from the first element `z01[[1]]`
df <- setNames(data.frame(matrix(unlist(z01), nrow = length(z01), byrow = TRUE)), z01[[1]])
# the first row became the column names, so drop it
df <- df[-1, ]
# the "Image" column is not needed, so remove it
df$Image <- NULL
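
Alternatively, a hedged sketch (untested here, like the snippet above): the rvest package, mentioned at the start of the post, can parse the wikitable straight into a data frame and skip the manual splitting:

library(rvest)
page  <- read_html("https://en.wikipedia.org/wiki/List_of_chocolate_bar_brands")
node  <- html_node(page, "table.wikitable.sortable")  # CSS selector for the table
choco <- html_table(node)                             # parse the HTML table into a data frame
choco$Image <- NULL                                   # drop the Image column, as above
head(choco)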

Captain, our amazing trip is done!

Parsing, cleaning and transforming data may not be an easy task, but it is certainly not the hardest part of the analysis procedure. Analysing and interpreting the data to turn it into meaningful insights is much more challenging.

Anyway, keep going and have fun with the process!


Footnotes

[1]: For more, see API manipulation.

References

Cooksey, B. Chapter 2: Protocols. Retrieved on Oct 11, 2017 from https://zapier.com/learn/apis/chapter-2-protocols/

Keyes, O., Wickham, C. Working with Web Data in R. Retrieved on Oct 10, 2017 from https://www.datacamp.com/courses/working-with-web-data-in-r

Meissner, P. (2015). A Fast-Track-Overview on Web Scraping with R. UseR! 2015, Comparative Parliamentary Politics Working Group, University of Konstanz.

Ooms, J. (2014). The jsonlite Package: A Practical and Consistent Mapping Between JSON Data and R Objects. Retrieved from arXiv:1403.2805 [stat.CO] https://arxiv.org/abs/1403.2805.

Wickham, H. (2016). httr: Tools for Working with URLs and HTTP. R package version 1.2.1. Retrieved from https://CRAN.R-project.org/package=httr

Wickham, H., Hester J., & Ooms, J. (2017). xml2: Parse XML. R package version 1.1.1. Retrieved from https://CRAN.R-project.org/package=xml2