Saturday, November 08, 2014

R Code for Reading an XML File And Extracting Data From It

Let us say you use the services of a cloud provider who gives you data in XML format via a web site and you want to periodically look at that data and extract some information to make your decisions. The file is big. So loading it in Excel and manipulating the data is cumbersome and error prone. You can spend a lot of money to build a special software for it or you can write some simple R code to extract the data yourself. This is how you can do it.

Let us assume that the file I am working with is the master data file of 100,000 employees. At any given point of time I want to find out how many employees live in a certain zip code.

Step 1 is to load the web address of the XML file in a vector.
fileURL < - "http://www.website.com/filename.xml"

Step 2 is to load all the content of the XML file in another vector.
documentcontent <- xmlTreeParse(fileURL, userInternal=TRUE)

Step 3 is to parse the root node of the XML content and store it in another vector.
rootNode <- xmlRoot(documentcontent)

Step 4 is to extract all zip codes into a vector.
allzipcodes <- xpathSApply(rootNode, "//zipcode", xmlValue)

Step 5 is to count the number of people who have the zip code "90210".
sum(allzipcodes == "90210")

In 5 simple steps you have performed meaninful data extraction from XML data, which normally requires very sophisticated and costly tools.

To perform  data extraction like this, you will need some basic understanding of XML and some logical thinking. If you are a cloud professional services or an SAP ERP HCM functional consultant, I believe you can perform basic data extraction like the one I described below using R, with a little bit of effort .

No comments:

Post a Comment

Related Posts Plugin for WordPress, Blogger...