Friday, June 5, 2015

How to parse XML using XPath with Nokogiri in Ruby: A Beginning in Web Crawling

XPath is a language for finding information in XML or HTML documents. It is used to navigate through the elements and attributes of an XML document, and it can be used from Ruby to traverse an XML file. For that purpose we use Nokogiri, a Ruby gem.



XPath is a very important tool for fetching relevant information and for reading the attributes and elements of an XML file.

Before you read this post, I suggest you learn a bit about XPath from here.

For the demo, we will use the following XML file, saved as employee.xml, which holds information about employees.



<?xml version="1.0"?>
<Employees>
    <Employee id="1111" type="admin">
        <firstname>John</firstname>
        <lastname>Watson</lastname>
        <age>30</age>
        <email>johnwatson@sh.com</email>
    </Employee>
    <Employee id="2222" type="admin">
        <firstname>Sherlock</firstname>
        <lastname>Holmes</lastname>
        <age>32</age>
        <email>sherlock@sh.com</email>
    </Employee>
    <Employee id="3333" type="user">
        <firstname>Jim</firstname>
        <lastname>Moriarty</lastname>
        <age>52</age>
        <email>jim@sh.com</email>
    </Employee>
    <Employee id="4444" type="user">
        <firstname>Mycroft</firstname>
        <lastname>Holmes</lastname>
        <age>41</age>
        <email>mycroft@sh.com</email>
    </Employee>
</Employees>


If we go through the XML, we can see there are four employees. Each Employee element has two attributes, id and type, and four child nodes: firstname, lastname, age and email.
Let's now start with the code. We will use Nokogiri, a Ruby gem that provides a wonderful API to parse documents and search them via XPath.


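Ex 1: Read the firstname of all employees
require 'nokogiri'

# Parse employee.xml; the block form of File.open closes the file afterwards.
doc = File.open("employee.xml") { |f| Nokogiri::XML(f) }

puts "== First name of all employees"
# Select every firstname element under /Employees/Employee.
expression = "/Employees/Employee/firstname"
nodes = doc.xpath(expression)

nodes.each do |node|
  p node.text
end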



Output:

== First name of all employees
"John"
"Sherlock"
"Jim"
"Mycroft"




Ex 2: Read the firstname of all employees who are older than 40 years
# The [age>40] predicate keeps only employees whose age is greater than 40.
expression = "/Employees/Employee[age>40]/firstname"
nodes = doc.xpath(expression)
nodes.each do |node|
  p node.text
end



Output: 

"Jim"
"Mycroft"


That is all for today. This is just a small beginning, and I will write more about the different steps of web crawling. Thanks for reading.