How to Parse HTML with PHP XPath Queries
I do a lot of PHP programming. I find it easier to get started with than most other languages because it needs almost no setup: on Windows, a package called XAMPP gives you just about everything you need to start writing PHP. Because of that simplicity, I've built many projects in PHP, and it's fun for small ones. Enough intro; here are some handy tips for when you want to parse HTML.
1. To get started, use the file_get_contents function to fetch the HTML of the page you want to parse, and save it in a variable. I call it $contents, but you can name it whatever you like.
$contents = file_get_contents("https://example.com/the-page");
2. Create a DOMDocument object and store it in a variable. I use $dochtml, but again the name is up to you.
$dochtml = new DOMDocument();
3. Call loadHTML and pass it the contents returned by file_get_contents.
$dochtml->loadHTML($contents);
4. Create a DOMXPath object and pass the DOMDocument object to its constructor.
$xpath = new DOMXPath($dochtml);
5. As an example, suppose the page we're going to parse looks like this:
<html>
<body>
<div class="some-class my-class-0">
<div class="some-class-child">Content 1</div>
</div>
<div class="some-class my-class-1">
<div class="some-class-child">Content 2</div>
</div>
</body>
</html>
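Say we want to get "Content 2". Note that contains(@class, "some-class") is a substring test, so it would also match the "some-class-child" divs, and item(1) would be the first child div instead of the second parent. Matching the whole class name avoids that:
$getSomeClasses = $xpath->query('//div[contains(concat(" ", normalize-space(@class), " "), " some-class ")]');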
$nodeClass0 = $getSomeClasses->item(0); //you can skip this
$nodeClass1 = $getSomeClasses->item(1);
$getClass1Child = $xpath->query('.//div[contains(@class, "some-class-child")]', $nodeClass1);
$child = $getClass1Child->item(0);
$myText = $child->textContent; //"Content 2"
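// : selects matching nodes anywhere in the document, starting from the root, even if you pass a context node
.// : is relative; it only selects matching nodes inside the context node you pass as the second argument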
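The steps above can be put together as one small script. This is only a sketch under a few assumptions of mine: the sample markup is inlined in a heredoc instead of being fetched with file_get_contents, the libxml_use_internal_errors call is an extra step I added to keep parse warnings quiet on messy real-world pages, and the variable names $loose, $strict, $parent, $relative and $absolute are made up for illustration. It also compares a substring class match against a whole-name match, and `//` against `.//` with the same context node.

```php
<?php
// Sample markup from step 5, inlined so the script runs without a network call.
$contents = <<<HTML
<html>
<body>
<div class="some-class my-class-0">
<div class="some-class-child">Content 1</div>
</div>
<div class="some-class my-class-1">
<div class="some-class-child">Content 2</div>
</div>
</body>
</html>
HTML;

// Assumption (not in the steps above): collect libxml warnings internally
// instead of printing them, since real pages are often malformed.
libxml_use_internal_errors(true);

$dochtml = new DOMDocument();
$dochtml->loadHTML($contents);
libxml_clear_errors();

$xpath = new DOMXPath($dochtml);

// Substring test: also matches class="some-class-child".
$loose = $xpath->query('//div[contains(@class, "some-class")]');

// Whole-name test: pad the class list with spaces and look for " some-class ".
$strict = $xpath->query(
    '//div[contains(concat(" ", normalize-space(@class), " "), " some-class ")]'
);

// Second parent div (class="some-class my-class-1").
$parent = $strict->item(1);

// ".//" searches only inside $parent; "//" ignores it and searches the whole document.
$relative = $xpath->query('.//div', $parent);
$absolute = $xpath->query('//div', $parent);

// item() returns null when nothing matched, so check before reading textContent.
$child = $relative->item(0);
$myText = $child !== null ? $child->textContent : null;

echo "loose matches: ", $loose->length, "\n";
echo "strict matches: ", $strict->length, "\n";
echo "relative divs: ", $relative->length, "\n";
echo "absolute divs: ", $absolute->length, "\n";
echo "text: ", $myText, "\n";
```

On this sample the substring query finds all four divs while the whole-name query finds only the two parents, which is why the whole-name form is the safer one to index with item(1).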
That's all there is to it. It's a little tricky and confusing at first, but it's quite handy if you do a lot of HTML parsing. I learned it the hard way: only recently, after several years of PHP programming, did I finally understand this XPath thing.