Easy Test Analysis
Introduction
The easy test covers basic XML parsing with the xml2 package. The code below loads the package, parses a small XML document describing two movies, and extracts the director's name for the second movie.
Setting Up the Environment
Section 1: Loading Libraries and XML String
library(xml2)
library(stringr)
xml_string <- c('<?xml version="1.0" encoding="UTF-8"?>',
'<movies>',
'<movie mins="126" lang="eng">',
'<title>Good Will Hunting</title>',
'<director>',
'<first_name>Gus</first_name>',
'<last_name>Van Sant</last_name>',
'</director>',
'<year>1998</year>',
'<genre>drama</genre>',
'</movie>',
'<movie mins="106" lang="spa">',
'<title>Y tu mama tambien</title>',
'<director>',
'<first_name>Alfonso</first_name>',
'<last_name>Cuaron</last_name>',
'</director>',
'<year>2001</year>',
'<genre>drama</genre>',
'</movie>',
'</movies>')
Explanation:
- The xml2 library is loaded to handle XML data in R.
- The stringr library is loaded for string manipulation, though it’s not used in this snippet.
- An XML document describing two movies is defined as a character vector, one element per line of XML, with each movie's title, director, year, and genre.
Section 2: Parsing the XML Document
doc <- read_xml(paste(xml_string, collapse = ''))
doc
## {xml_document}
## <movies>
## [1] <movie mins="126" lang="eng">\n <title>Good Will Hunting</title>\n <dir ...
## [2] <movie mins="106" lang="spa">\n <title>Y tu mama tambien</title>\n <dir ...
Explanation:
- The paste function with collapse = '' concatenates the elements of the xml_string vector into a single string before parsing.
- The read_xml function from the xml2 package parses that string into an XML document object.
- The parsed XML document is stored in the variable doc. Printing it shows the root node <movies> and its two <movie> children.
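The collapse-then-parse step can be checked on a smaller input. The sketch below is only an illustration: xml_lines, single_string, and doc_small are hypothetical names that do not appear in the original code, and the tiny document stands in for xml_string.

```r
library(xml2)

# Hypothetical stand-in for xml_string: one vector element per line of XML
xml_lines <- c(
  "<movies>",
  "<movie><title>A</title></movie>",
  "<movie><title>B</title></movie>",
  "</movies>"
)

# paste(..., collapse = "") turns the four-element vector into one string
single_string <- paste(xml_lines, collapse = "")
length(xml_lines)      # 4
length(single_string)  # 1

# Parse the single string and inspect the root node
doc_small <- read_xml(single_string)
xml_name(doc_small)    # "movies"
xml_length(doc_small)  # 2 child elements
```

Checking xml_name and xml_length right after parsing is a quick way to confirm that the document has the expected root and number of children before navigating further.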
Section 3: Navigating the XML Document
tu_mama <- xml_child(doc, search = 2)
tu_mama
## {xml_node}
## <movie mins="106" lang="spa">
## [1] <title>Y tu mama tambien</title>
## [2] <director>\n <first_name>Alfonso</first_name>\n <last_name>Cuaron</last ...
## [3] <year>2001</year>
## [4] <genre>drama</genre>
xml_children(tu_mama)
## {xml_nodeset (4)}
## [1] <title>Y tu mama tambien</title>
## [2] <director>\n <first_name>Alfonso</first_name>\n <last_name>Cuaron</last ...
## [3] <year>2001</year>
## [4] <genre>drama</genre>
Explanation:
- The xml_child function selects a single child node of doc by its index. Here search = 2 selects the second <movie> element, which is stored in tu_mama.
- The xml_children function lists all child elements of the selected movie node: title, director, year, and genre.
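Selecting by position can also be written as an XPath query. The sketch below is an alternative, not the approach used in the original code, and it assumes a shortened version of the movie document; doc_small, second_by_index, and second_by_xpath are hypothetical names.

```r
library(xml2)

# Shortened, hypothetical version of the movies document
doc_small <- read_xml(paste0(
  '<movies>',
  '<movie lang="eng"><title>Good Will Hunting</title></movie>',
  '<movie lang="spa"><title>Y tu mama tambien</title></movie>',
  '</movies>'
))

# Positional selection, as in the section above
second_by_index <- xml_child(doc_small, search = 2)

# Equivalent XPath selection: the second <movie> under the root <movies>
second_by_xpath <- xml_find_first(doc_small, "/movies/movie[2]")

# Both routes reach the same node
xml_text(xml_child(second_by_index, "title"))  # "Y tu mama tambien"
xml_text(xml_child(second_by_xpath, "title"))  # "Y tu mama tambien"

# Attributes of the selected node are read with xml_attr
xml_attr(second_by_xpath, "lang")              # "spa"
```

The XPath form tends to be easier to extend, for example to select a movie by attribute rather than by position, while xml_child keeps simple positional access short.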
Section 4: Extracting Director Information
director <- xml_child(tu_mama, "director")
director
## {xml_node}
## <director>
## [1] <first_name>Alfonso</first_name>
## [2] <last_name>Cuaron</last_name>
xml_contents(director)
## {xml_nodeset (2)}
## [1] <first_name>Alfonso</first_name>
## [2] <last_name>Cuaron</last_name>
xml_text(director)
## [1] "AlfonsoCuaron"
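Explanation:
- The xml_child function is used again, this time with a name instead of an index, to select the “director” child node of the selected movie.
- The xml_contents function lists all nodes within the “director” node. Unlike xml_children, it would also include text nodes, but here the node contains only the first_name and last_name elements.
- The xml_text function extracts the text content of the “director” node. The text of its descendants is concatenated without a separator, so the result is "AlfonsoCuaron" rather than "Alfonso Cuaron".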
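Because xml_text on the parent node joins the two names without a space, a readable full name needs one more step. The following is a minimal sketch, assuming a standalone copy of the director node; director_small, name_parts, and full_name are hypothetical names introduced for illustration.

```r
library(xml2)

# Hypothetical standalone copy of the <director> node
director_small <- read_xml(paste0(
  "<director>",
  "<first_name>Alfonso</first_name>",
  "<last_name>Cuaron</last_name>",
  "</director>"
))

# Text of the parent node: child texts are joined with no separator
xml_text(director_small)  # "AlfonsoCuaron"

# Text of each child element separately, as a character vector
name_parts <- xml_text(xml_children(director_small))
name_parts                # "Alfonso" "Cuaron"

# Join the parts with a space to get a readable name
full_name <- paste(name_parts, collapse = " ")
full_name                 # "Alfonso Cuaron"
```

Applied to the director node from the section above, the same two calls, xml_text(xml_children(director)) followed by paste with collapse = " ", should give the director's name with a space between first and last name.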