PHP web spider


Post by ampakine on Sat Jul 02, 2011 8:58 pm

Here's a PHP spider I made:

Code: Select all
<?php
//         OPEN THE XML FILE
$xml_file = "filelist.xml";
$xml = simplexml_load_file($xml_file) or die("Could not load XML file.");

//         COUNT THE NUMBER OF FILES ON THE LIST
$num = count($xml->file);   // the list is made of <file> elements


$i=0;
while ($i < $num) {
   
   $file = $xml->file[$i];
   $code = $file['code'];
   $url = "http://www.website.com/dir/script.php?id=$code";



//            GET THE EXTERNAL FILES

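   // file_get_contents() returns the whole page as one string, or false on failure
   $contents = file_get_contents($url);
   if ($contents === false) { $i++; continue; }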
   
//            CLEAN UP THE EXTERNAL FILES

   $pattern = '/<div class="tab">.+<!-- CONTENTS END -->/s';

   if (preg_match($pattern,$contents,$match)) {
      
      $start = $match[0];
      
      $cleaned2 = preg_replace('/(<br \/>)|(<p>)|(<\/p>)|(<\/li>)|(<ul>)|(<\/ul>)|(width="200")|(<div class="tab">)|(<\/div>)/','',$start);
      $cleaned3 = preg_replace('/<div class="further_info">.*<!-- CONTENTS END -->/s','',$cleaned2);
      // NOTE: this never matches, because <div class="tab"> was already stripped into $cleaned2 above
      $cleaned4 = preg_replace('/<div class="tab">/','    <info>',$cleaned3);
      $cleaned5 = preg_replace('/<li>/','   &#8226;',$cleaned4);
      $cleaned6 = preg_replace('/</','[',$cleaned5);
      $cleaned7 = preg_replace('/>/',']',$cleaned6);
      $cleaned8 = preg_replace('/a href="/','URL=',$cleaned7);
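      $cleaned9 = preg_replace('/\[\/a\]/','[/URL]',$cleaned8);   // only the closing tag, not "/a" inside URLs
      $cleaned10 = preg_replace('/\[(\/?)h4\b/','[$1S3',$cleaned9);   // only h4 tags, not "h4" in the text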
      $cleaned11 = preg_replace('/\[em\]/','[I]',$cleaned10);
      $cleaned12 = preg_replace('/\[\/em\]/','[/I]',$cleaned11);
      // "<" has already been turned into "[" above, so the pattern has to look for "[img"
      $cleaned13 = preg_replace('/\[img height="59" alt="N\.D\.P\.\/E\.U\. Structural Funds Logos" src="\.\.\//','[PIC=',$cleaned12);
      $cleaned14 = preg_replace('/("])|(\/])/',']',$cleaned13);
//      echo "<p><pre>$cleaned14;</pre></p>";

//                  UPDATE THE XML FILE
   
      $xml->file[$i]->addChild("info","$cleaned14") or die("Could not add child");
      // asXML() with a filename writes the document to disk and returns false on failure
      $xml->asXML($xml_file) or die("Couldn't write to file");

   }
   
   $i++;

}
?>

And here's the XML file to go with it:
Code: Select all
<?xml version="1.0"?>
<files>
    <file code="page id goes here">
        <info>Text saved from each page will go here.</info>
        <!-- NOTE: the <info> element is only added once the spider visits the page. -->
    </file>
</files>


The list of pages for the spider to visit is stored in an XML file. The script downloads the full HTML source of every page on the list, then the series of preg_replace calls cleans each one up by deleting the unwanted HTML (and other code) and replacing some of the tags with my own BBCode tags. All of those regular expressions are specific to the site I pointed the spider at, so they'll have to be modified for other sites. This seems like a fairly sloppy way to clean up the downloaded pages. Does anyone know a better way to do it?

The script then creates a new element in the XML file and puts the cleaned-up string into it. Finally, it rewrites the XML document so that it contains the info downloaded from each page on the list.
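On the question of a cleaner way to strip the HTML: one option is to parse the page with PHP's DOM extension and walk the nodes, rather than chaining regexes. The following is only a rough sketch under assumptions. The function names htmlToBbcode and nodeToBbcode are made up for illustration, and the tag mapping (a, em, h4, li) just mirrors the BBCode tags used in the spider above. Every other tag is dropped and only its text is kept.

```php
<?php
// Hypothetical helper: turn one DOM node (and its children) into BBCode-style text.
function nodeToBbcode(DOMNode $node): string
{
    if ($node instanceof DOMText) {
        return $node->nodeValue;            // plain text is kept as-is
    }
    if (!($node instanceof DOMElement)) {
        return '';                          // comments and other non-element nodes are dropped
    }

    $inner = '';
    foreach ($node->childNodes as $child) {
        $inner .= nodeToBbcode($child);
    }

    // Assumed mapping, copied from the tags the regex version produces.
    switch ($node->tagName) {
        case 'a':  return '[URL=' . $node->getAttribute('href') . ']' . $inner . '[/URL]';
        case 'em': return '[I]' . $inner . '[/I]';
        case 'h4': return '[S3]' . $inner . '[/S3]';
        case 'li': return "   \u{2022}" . $inner . "\n";
        default:   return $inner;           // unknown tags: keep the text, drop the tag
    }
}

// Hypothetical helper: convert an HTML fragment to BBCode-style text.
function htmlToBbcode(string $html): string
{
    libxml_use_internal_errors(true);       // real-world HTML is rarely valid; keep warnings quiet
    $doc = new DOMDocument();
    // The XML prolog is a common trick to make loadHTML() treat the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();

    $body = $doc->getElementsByTagName('body')->item(0);
    return $body === null ? '' : trim(nodeToBbcode($body));
}

echo htmlToBbcode('<h4>Title</h4><p>See <a href="http://example.com/about">this <em>page</em></a></p>');
```

The part that stays site-specific is picking which element to convert (for example the div with class "tab"), which could be done with DOMXPath before calling the helper.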

I'm fairly new to XML, so I dunno if I used it properly there, but it was pretty handy not having to deal with MySQL, so I'm starting to see why XML is so popular.
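For the XML side, here is a minimal sketch of the same update step written with foreach instead of a counter. It makes a few assumptions that are not in the script above: the helper name addInfoToFiles and the $fetchInfo callback are hypothetical stand-ins for the download-and-clean step, files that already have an info element are skipped, and the text is set through property assignment, which as far as I know escapes characters like & on its own (addChild does not).

```php
<?php
// Hypothetical helper: give every <file> without an <info> child one.
// $fetchInfo is an assumed callback that maps a page code to the cleaned-up text.
function addInfoToFiles(SimpleXMLElement $xml, callable $fetchInfo): int
{
    $added = 0;
    foreach ($xml->file as $file) {
        if (isset($file->info)) {
            continue;                       // already filled in on an earlier run
        }
        $code = (string) $file['code'];     // attributes are objects; cast to get the string
        $file->info = $fetchInfo($code);    // creates <info> and escapes &, < and >
        $added++;
    }
    return $added;
}

// Small in-memory document so the sketch runs without filelist.xml or a network.
$xml = simplexml_load_string(
    '<files><file code="12"/><file code="34"><info>done</info></file></files>'
);

$added = addInfoToFiles($xml, function (string $code): string {
    return "Text for page $code & more";
});

echo "Added $added info element(s)\n";
echo $xml->asXML();                         // pass a filename to asXML() to save to disk instead
```

Writing the file once after the loop, rather than on every pass, also avoids rewriting filelist.xml for each page.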