XPath is a language for finding information in XML or
HTML files. It is used to navigate through the elements and attributes of
an XML document. XPath can also be used to traverse an XML file in
Ruby; we use Nokogiri, a Ruby gem, for that purpose.
XPath is a very useful tool for fetching relevant
information and for reading the attributes and elements of an XML file.
Before you start reading this post, I suggest you learn a bit about XPath.
We will use the following XML file, employee.xml, for
the demo. It holds information about employees.
<?xml version="1.0"?>
<Employees>
  <Employee id="1111" type="admin">
    <firstname>John</firstname>
    <lastname>Watson</lastname>
    <age>30</age>
    <email>johnwatson@sh.com</email>
  </Employee>
  <Employee id="2222" type="admin">
    <firstname>Sherlock</firstname>
    <lastname>Homes</lastname>
    <age>32</age>
    <email>sherlock@sh.com</email>
  </Employee>
  <Employee id="3333" type="user">
    <firstname>Jim</firstname>
    <lastname>Moriarty</lastname>
    <age>52</age>
    <email>jim@sh.com</email>
  </Employee>
  <Employee id="4444" type="user">
    <firstname>Mycroft</firstname>
    <lastname>Holmes</lastname>
    <age>41</age>
    <email>mycroft@sh.com</email>
  </Employee>
</Employees>
If we go through the file, we can see there are four employees. Each Employee element has two attributes, id and type, and four child nodes: firstname, lastname, age and email. Let's now start with the code. We will use Nokogiri, a Ruby gem that provides a wonderful API to parse and search documents via XPath.
Ex 1: Read the firstname of all employees
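require 'nokogiri'

# Parse employee.xml; the block form closes the file once parsing is done.
doc = File.open("employee.xml") { |f| Nokogiri::XML(f) }

puts "== First name of all employees"

# Select the firstname element of every Employee.
expression = "Employees/Employee/firstname"
nodes = doc.xpath(expression)

# Print the text content of each matched node.
nodes.each do |node|
  p node.text
end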
Output:
== First name of all employees
"John"
"Sherlock"
"Jim"
"Mycroft"
Ex 2: Read the firstname of all employees who are older than 40 years
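# Reuses the doc object parsed in Ex 1.
# The predicate [age>40] keeps only employees whose age child is greater than 40.
expression = "/Employees/Employee[age>40]/firstname"
nodes = doc.xpath(expression)

nodes.each do |node|
  p node.text
end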
Output:
"Jim"
"Mycroft"
That is all for today. I will write more about the different steps of web crawling; this is just a small beginning. Thanks for reading.