A News Website by Web Scraping/Crawling

Web scraping (or web crawling) is a simple way to extract data from web pages without using an API. It works by pulling the data straight out of the page’s HTML.
One of the easiest crawlers to use is Goutte, a PHP library. This tutorial walks step by step through creating your own news website by crawling your favorite news websites and pulling data from them.

Step 1

Start by installing the required tools on your system. If you’re working on a host already configured with Goutte, you’re good to go. Follow the steps here for installation.

Step 2

Make a list of the news websites you wish to crawl. This may take time, as we need to know the structure of each news item we want to return. We’ll go through one example using the BBC’s website.


There are many different sections on the web page. It’s up to us to use as many of them as we like. For this example, we pull data from the “Top Stories” section of the page. To understand the HTML page structure, select “Inspect Element” from the right-click menu.


This takes us to the HTML code of the page. Simply hover over the code to highlight the corresponding visual part of the web page. This should help you understand the relationship between the code and the page.


Select the “Top Stories” code and paste it somewhere for your own convenience. If you’re comfortable working in the inspector itself, that’s fine too 🙂

<div class="buzzard-item" data-entityid="container-top-stories#1">
<a href="/news/world-middle-east-35336707" class="title-link">
<h3 class="title-link__title">

<span class="title-link__title-text">'New chapter' as Iran sanctions end</span></h3>
</a>
<div class="buzzard__image">
<div class="responsive-image responsive-image--16by9">

<img src="http://ichef-1.bbci.co.uk/news/410/cpsprodpb/7B2F/production/_87753513_87753512.jpg" class="js-image-replace" alt="An Iranian woman walks past a mural in Tehran, 16 January" width="976" height="549">
</div>
</div>
<div class="buzzard__body">
<p class="buzzard__summary">Iran "has opened a new chapter" in its ties with the world, President Hassan Rouhani says, hours after economic sanctions on Tehran are lifted.</p>

<div class="buzzard__info-list">
<ul class="mini-info-list">
	<li class="mini-info-list__item">
<div class="date date--v2 relative-time" data-seconds="1453012649" data-datetime="17 January 2016" data-timestamp-inserted="true">1 hour ago</div></li>
	<li class="mini-info-list__item"><span class="mini-info-list__section-desc off-screen">From the section </span><a href="/news/world/middle_east" class="mini-info-list__section" data-entityid="section-label">Middle East</a></li>
</ul>
</div>
</div>
<div class="buzzard__links-list">
<h4 class="off-screen">Related content</h4>
<ul class="links-list__list">
	<li class="links-list__item"><a href="/news/world-middle-east-33521655" class="links-list__link"> Key details of deal</a></li>
	<li class="links-list__item"><a href="/news/world-middle-east-35335166" class="links-list__link">

<span class="badge-icon-only badge-icon-only--video"><span class="svg-icon svg-icon--video-light"><span class="off-screen"> Video</span></span></span>


'Iran has fulfilled its commitment'</a></li>
	<li class="links-list__item"><a href="/news/world-middle-east-35333656" class="links-list__link"> Iran frees US reporter</a></li>
	<li class="links-list__item"><a href="/news/business-35324289" class="links-list__link">

<span class="badge-icon-only badge-icon-only--video"><span class="svg-icon svg-icon--video-light"><span class="off-screen"> Video</span></span></span>


Sanctions deal in 60 seconds</a></li>
</ul>
</div>
<a href="/news/world-middle-east-35336707" class="faux-block-link__overlay-link" tabindex="-1" aria-hidden="true">Full article 'New chapter' as Iran sanctions end</a>
</div>

It’s a looooong piece of code 😕

We don’t want all of that data; we just need a few pieces of it. Next, we look for the identifiers of the tags we are interested in. For this example, we will extract the heading, image, and a summarized description of the news item.

To create a path for traversing the tags and picking out the ones we need, we build a chain of tags, classes, IDs and so on that leads to each item. Studying the code, we need the following hierarchies.

Heading: class="buzzard-item" -> <h3> -> <span>

Image: class="buzzard-item" -> class="buzzard__image" -> <img>

Description: class="buzzard-item" -> class="buzzard__body" -> class="buzzard__summary"

Note that there can be multiple ways to reach any required item. Choose the simplest path you see.
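The three hierarchies can be tried out before writing any crawler code. The sketch below is not Goutte: it uses PHP's built-in DOMDocument and DOMXPath on a trimmed copy of the markup above, so it runs without installing anything (it assumes the DOM extension is enabled). The helper name `has_class` is made up for illustration.

```php
<?php
// Trimmed copy of the "Top Stories" markup shown earlier.
$html = <<<'HTML'
<div class="buzzard-item">
<a href="/news/world-middle-east-35336707" class="title-link">
<h3 class="title-link__title"><span class="title-link__title-text">'New chapter' as Iran sanctions end</span></h3>
</a>
<div class="buzzard__image">
<img src="http://ichef-1.bbci.co.uk/news/410/cpsprodpb/7B2F/production/_87753513_87753512.jpg" alt="">
</div>
<div class="buzzard__body">
<p class="buzzard__summary">Iran "has opened a new chapter" in its ties with the world, President Hassan Rouhani says, hours after economic sanctions on Tehran are lifted.</p>
</div>
</div>
HTML;

// Hypothetical helper: XPath condition meaning "this element carries the given class".
function has_class(string $class): string
{
    return "contains(concat(' ', normalize-space(@class), ' '), ' " . $class . " ')";
}

libxml_use_internal_errors(true); // real-world HTML is rarely strict; collect warnings quietly
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

$item = '//*[' . has_class('buzzard-item') . ']';

// Heading: class="buzzard-item" -> <h3> -> <span>
$heading = $xpath->evaluate('string(' . $item . '//h3/span)');
// Image: class="buzzard-item" -> class="buzzard__image" -> <img>
$image = $xpath->evaluate('string(' . $item . '//*[' . has_class('buzzard__image') . ']//img/@src)');
// Description: class="buzzard-item" -> class="buzzard__body" -> class="buzzard__summary"
$summary = $xpath->evaluate(
    'string(' . $item . '//*[' . has_class('buzzard__body') . ']//*[' . has_class('buzzard__summary') . '])'
);

echo $heading, PHP_EOL, $image, PHP_EOL, $summary, PHP_EOL;
```

Each XPath expression mirrors one hierarchy from the list, so if a path returns an empty string here, the same chain is likely to come back empty in the crawler too.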

Step 3

Now that we know what information we need to pull, and from where, we can move on to writing our crawling code. Start by including vendor/autoload.php, which is what lets PHP find Goutte and the Symfony components it depends on. Refer to the installation guide mentioned in Step 1 for details.

date_default_timezone_set('Asia/Karachi'); // setting site time to local time
ini_set('max_execution_time', 300); // allowing the script up to 300 seconds to run
require_once("../vendor/autoload.php"); // including the autoload

Create a new Goutte client.

$client = new \Goutte\Client();

The getting-started stage is now complete.

Step 4

Now we write the function that uses the information from Step 2 to pull the data and return it. The desired tags can be accessed with CSS selectors, using a syntax similar to jQuery. This is one of the best things about Goutte: no extra syntax to learn.


function get_news() {
    global $client;
    $top_news = [];
    $crawler = $client->request('GET', "http://www.bbc.com/news/");
    // filtering the tags with class="buzzard-item"; this gives us each Top Stories <div>
    $crawler->filter('.buzzard-item')->each(function ($node) use (&$top_news) {
        // checking whether the items we are looking for exist or not
        if (count($node->filter('h3 > span')) != 0 && count($node->filter('.buzzard__body > .buzzard__summary')) != 0) {
            $news = []; // array to store each news item
            $news['heading'] = $node->filter('h3 > span')->text(); // pulling out the heading
            $news['summary'] = $node->filter('.buzzard__body > .buzzard__summary')->text(); // pulling out the summary
            $link = $node->filter('a.title-link'); // the link is relative, so we prefix the site address
            $news['url'] = count($link) != 0 ? 'http://www.bbc.com' . $link->attr('href') : '#';
            if (count($node->filter('img')) != 0) {
                $news['image_url'] = $node->filter('img')->first()->attr('src'); // pulling the image url
            } else {
                $news['image_url'] = "images/default.png"; // fallback when the item has no image
            }
            $top_news[] = $news; // saving each news item to the result array
        }
    });
    return $top_news;
}

The crawler is now ready to use. The get_news() function can be used anywhere to collect the data required.

Step 5

This sample code shows how to use the crawler function we created.

<div>
<h6> <a href="#">Business </a></h6>
<?php
$news = get_news();
$index = 0; // show the first news item
?>
<div class="col-sm-4 col-lg-4 col-md-4">
<div class="thumbnail">
<img src="<?= $news[$index]['image_url'] ?>" alt="No Image Available">
<div class="caption">
<h4><a href="<?= $news[$index]['url'] ?? '#' ?>"><?= $news[$index]['heading'] ?></a></h4>
<?= substr($news[$index]['summary'], 0, 100) . "..." ?>
</div>
</div>
</div>
</div>
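The sample above shows a single item. To display every item, the same markup can be produced in a loop. The following is only a sketch: `render_news_item` is a hypothetical helper, and the `$news` array holds placeholder values in the shape that get_news() returns (heading, summary, image_url) so that it runs without a crawl. It also escapes the scraped text before printing it, since that text comes from a third-party page.

```php
<?php
// Hypothetical helper: build the thumbnail markup for one news item.
function render_news_item(array $item, int $maxLength = 100): string
{
    // Escape scraped text so stray markup in it cannot break our page.
    $esc = fn(string $s): string => htmlspecialchars($s, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

    // Shorten long summaries, as the single-item sample does with substr().
    $summary = $item['summary'];
    if (strlen($summary) > $maxLength) {
        $summary = substr($summary, 0, $maxLength) . '...';
    }

    return '<div class="thumbnail">'
        . '<img src="' . $esc($item['image_url']) . '" alt="No Image Available">'
        . '<div class="caption"><h4>' . $esc($item['heading']) . '</h4>'
        . $esc($summary) . '</div></div>';
}

// Placeholder data standing in for the result of get_news().
$news = [
    ['heading' => 'First <b>story</b>', 'summary' => str_repeat('a', 120), 'image_url' => 'images/default.png'],
    ['heading' => 'Second story', 'summary' => 'Short summary', 'image_url' => 'images/default.png'],
];

$html = '';
foreach ($news as $item) {
    $html .= render_news_item($item) . PHP_EOL;
}
echo $html;
```

On a real page, `$news = get_news();` would replace the placeholder array.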

Takeaways

You’re now ready to build your own website by crawling the websites of your choice. A similar news website I made earlier for my coursework can be found here. It reads news from CNN, BBC, The Tribune and the like, and displays it by category.


Note that this method collects the data dynamically and displays it in a format of your choice. Because every page load triggers a fresh crawl, it can be slow and cumbersome. One modification is to save the data as it is collected, so the sites do not have to be re-crawled every single time.
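One way to avoid re-crawling on every request is a small file cache with an expiry time. This is a minimal sketch under assumptions, not part of Goutte: `cached_news`, the cache file location and the 10-minute lifetime are all made up for illustration, and a stand-in function replaces get_news() so the example runs without network access.

```php
<?php
// Hypothetical helper: return cached items while they are fresh, otherwise crawl and store.
function cached_news(callable $fetch, string $cacheFile, int $ttlSeconds = 600): array
{
    clearstatcache(true, $cacheFile); // make sure we see the file's current state
    if (is_file($cacheFile) && time() - filemtime($cacheFile) < $ttlSeconds) {
        $cached = json_decode(file_get_contents($cacheFile), true);
        if (is_array($cached)) {
            return $cached; // cache hit: no crawl needed
        }
    }
    $items = $fetch(); // cache missing, stale or unreadable: crawl again
    file_put_contents($cacheFile, json_encode($items), LOCK_EX);
    return $items;
}

// Stand-in for get_news() that counts how often a "crawl" happens.
$calls = 0;
$fakeCrawl = function () use (&$calls): array {
    $calls++;
    return [['heading' => 'Sample heading', 'summary' => 'Sample summary', 'image_url' => 'images/default.png']];
};

$cacheFile = sys_get_temp_dir() . '/news_cache_' . uniqid() . '.json';
$first = cached_news($fakeCrawl, $cacheFile);  // runs the crawl and writes the cache
$second = cached_news($fakeCrawl, $cacheFile); // served from the cache file
unlink($cacheFile); // clean up the temporary file

echo 'Crawls performed: ', $calls, PHP_EOL;
```

With the real crawler, the call would look something like `cached_news('get_news', __DIR__ . '/cache/top_news.json')`, where the path is only an example. A database table would work just as well as a file if the site already uses one.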

Hope this tutorial was helpful. Good luck 😀

 

 
