Sunday, December 12, 2010

How To Build A Basic Web Crawler To Pull Information From A Website (Part 1)


Web crawlers, sometimes called scrapers, automatically scan the Internet, attempting to glean the context and meaning of the content they find. The web as we know it wouldn’t function without them: crawlers are the backbone of search engines which, combined with clever algorithms, work out the relevance of your page to a given keyword set.

The Google web crawler will enter your domain and scan every page of your website, extracting page titles, descriptions, keywords, and links – then report back to Google HQ and add the information to their huge database.

Today, I’d like to teach you how to make your own basic crawler – not one that scans the whole Internet, though, but one that is able to extract all the links from a given webpage.


Generally, you should make sure you have permission before scraping random websites, as most people consider it to be a very grey legal area. Still, as I say, the web wouldn’t function without these kinds of crawlers, so it’s important you understand how they work and how easy they are to make.

To make a simple crawler, we’ll be using one of the most common server-side programming languages on the web – PHP. Don’t worry if you’ve never programmed in PHP – I’ll be taking you through each step and explaining what each part does. I am going to assume a very basic knowledge of HTML though, enough that you understand how a link or image is added to an HTML document.

Before we start, you will need a server that can run PHP.

Getting Started

We’ll be using a helper class called Simple HTML DOM. Download this zip file, unzip it, and upload the simple_html_dom.php file contained within to your website first (in the same directory you’ll be running your programs from). It contains functions we will be using to traverse the elements of a webpage more easily. That zip file also contains today’s example code.

First, let’s write a simple program that will check if PHP is working or not. We’ll also import the helper file we’ll be using later. Make a new file in your web directory, and call it example1.php – the actual name isn’t important, but the .php ending is. Copy and paste this code into it:

<?php
include_once('simple_html_dom.php');
phpinfo();
?>

Access the file through your internet browser. If you don’t have a server set up, you can still run the program from my server if you want. If everything has gone right, you should see a big page of debug and server information printed out – all from that one little line of code! It’s not really what we’re after, but at least we know everything is working.


The first and last lines simply tell the server we are going to be using PHP code. This is important because we can actually include standard HTML on the page too, and it will render just fine. The second line pulls in the Simple HTML DOM helper we will be using. The phpinfo(); line is the one that printed out all that debug info, but you can go ahead and delete that now. Notice that in PHP, every statement must end with a semicolon (;). The most common mistake of any PHP beginner is to forget that little bit of punctuation.
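To make the point about mixing HTML and PHP concrete, here is a minimal sketch of a page that does both. The $message variable and the text it holds are made up for illustration; they are not part of the tutorial’s download.

```php
<h1>Mixing HTML and PHP</h1>
<p>This paragraph is plain HTML, sent to the browser untouched.</p>
<?php
// Everything between the opening and closing tags is run as PHP.
// $message is a hypothetical variable used only for this illustration.
$message = 'This line was printed by PHP.'; // each statement ends with a semicolon
echo '<p>' . $message . '</p>';
?>
<p>Back to plain HTML again.</p>
```

Save it with a .php ending, just like example1.php, and the browser should show three paragraphs, with the middle one produced by the echo statement.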

One typical task that Google performs is to pull all the links from a page and see which sites they are endorsing. Try the following code next, in a new file if you like.

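<?php
// Pull in the Simple HTML DOM helper.
include_once('simple_html_dom.php');

// The page we want to examine.
$target_url = "http://www.tokyobit.com/";

// Create a DOM object and load the target page into it.
$html = new simple_html_dom();
$html->load_file($target_url);

// Print the href attribute of every <a> element, one per line.
foreach($html->find('a') as $link){
    echo $link->href."<br />";
}
?>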

Again, you can run that from my server too if you don’t have your own set up. You should get a page full of URLs! Wonderful. Most of them will be internal links, of course. In a real-world situation, Google would ignore internal links and simply look at which other websites you’re linking to, but that’s outside the scope of this tutorial.
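For the curious, here is a rough sketch of how internal links might be separated from external ones. It is not part of the original example code: external_links() is a hypothetical helper name, the sample URLs are invented, and the sketch assumes PHP 7 or later. It relies only on PHP’s built-in parse_url() function and treats any link whose host differs from the site’s own host as external.

```php
<?php
// Hypothetical helper: keep only links whose host differs from the site's own host.
function external_links(array $hrefs, string $site_host): array
{
    $external = [];
    foreach ($hrefs as $href) {
        // parse_url() gives no host for relative links such as "/contact",
        // so those are treated as internal and skipped.
        $host = parse_url($href, PHP_URL_HOST);
        if (is_string($host) && strcasecmp($host, $site_host) !== 0) {
            $external[] = $href;
        }
    }
    return $external;
}

// Invented sample data standing in for the hrefs collected by the crawler.
$hrefs = [
    'http://www.tokyobit.com/about',
    '/contact',
    'http://www.example.com/page',
];

foreach (external_links($hrefs, 'www.tokyobit.com') as $href) {
    echo $href . "<br />";
}
```

A real search engine does far more than this, of course; the sketch simply compares host names and ignores subtleties such as subdomains.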

If you’re running on your own server, go ahead and change the $target_url variable to your own webpage or any other website you’d like to examine.

That code was quite a jump from the last example, so let’s go through it in pseudo-code to make sure you understand what’s going on.

Include once the simple HTML DOM helper file.

Set the target URL as http://www.tokyobit.com.

Create a new simple HTML DOM object to store the target page

Load our target URL into that object

For each link <a> that we find on the target page

- Print out the HREF attribute
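If you’d like to try those steps without downloading anything or touching a live website, the sketch below follows the same outline using PHP’s built-in DOMDocument class in place of Simple HTML DOM. This is a swapped-in technique, not the tutorial’s method: extract_links() is a hypothetical helper name, the sample HTML string is made up, and the sketch assumes PHP 7 or later with the standard DOM extension enabled.

```php
<?php
// Hypothetical helper: return the href of every <a> element in an HTML string.
function extract_links(string $html): array
{
    $doc = new DOMDocument();
    // The @ suppresses warnings about imperfect markup, which real pages often have.
    @$doc->loadHTML($html);

    $links = [];
    // For each link <a> that we find in the document...
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        // ...collect the href attribute, skipping anchors that have none.
        if ($anchor->hasAttribute('href')) {
            $links[] = $anchor->getAttribute('href');
        }
    }
    return $links;
}

// A made-up page standing in for the target URL.
$sample = '<html><body><p><a href="http://www.tokyobit.com/">Home</a> and '
        . '<a href="/about">About</a></p></body></html>';

foreach (extract_links($sample) as $href) {
    echo $href . "<br />";
}
```

Because the page is just a string here, you can change the sample HTML and rerun the script to see how the list of links changes.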

That’s it for today, but if you’d like a bit of a challenge – try to modify the second example so that instead of searching for links (<a> elements), it grabs images instead (<img> elements). Remember, it is the src attribute of an image that specifies the URL for that image, not href.

Would you like to learn more? Let me know in the comments if you’re interested in reading a part 2 (complete with homework solution!), or even if you’d like a back-to-basics PHP tutorial – and I’ll rustle one up next time for you. I warn you though – once you get started with programming in PHP, you’ll start making plans to create the next Facebook, and all those latent desires for world domination will soon consume you. Programming is fun.

