21 Line Python Web Crawler


Post by Th3_M4d_H4tt3r on Tue Jun 18, 2013 8:30 am

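Hi, Th3_M4d_H4tt3r here. I'm new and I want to show off my skills :)
This web crawler fetches a site's homepage, prints every link it finds there, then follows those links and prints the links on each of those pages, and then does the same one level deeper (three levels in total). It needs Python 2 and BeautifulSoup 4 (bs4).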
Code:
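from bs4 import BeautifulSoup
from urlparse import urljoin
import urllib2, sys
target = sys.argv[1]
def spider(url):    # yield every link found on the page at url
    f = urllib2.urlopen(url)
    soup = BeautifulSoup(f.read(), 'html.parser')
    for tag in soup.find_all('a'):
        href = tag.get('href')
        if href:    # skip anchors that have no href attribute
            yield urljoin(url, href)    # make relative links absolute
for first in spider(target):
    try:
        print first
        for second in spider(first):
            print second
            for third in spider(second):
                print third
    except Exception:    # urlopen failed, e.g. a dead link or a javascript:/mailto: URL
        print "Above URL is broken or JavaScript."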

usage: python spider.py http://www.hackthissite.org
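
For anyone who wants to go deeper than three levels, the nested loops can be swapped for a queue with a depth counter. What follows is only a rough sketch and not the script above: it assumes Python 3, uses the standard library's html.parser instead of BeautifulSoup, and the names LinkParser, extract_links, fetch_page and crawl are made up for illustration. It visits pages breadth-first, so the output order differs from the nested loops, and it reports each URL only once.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:  # skip anchors without an href
                self.hrefs.append(href)


def extract_links(base_url, html):
    """Return absolute http(s) links in html, resolved against base_url."""
    parser = LinkParser()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        # Resolve relative links and drop any #fragment part.
        absolute = urldefrag(urljoin(base_url, href)).url
        # Keep only fetchable schemes; this drops javascript: and mailto: links.
        if urlparse(absolute).scheme in ("http", "https"):
            links.append(absolute)
    return links


def fetch_page(url):
    """Hypothetical helper: download a page and decode it leniently."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, max_depth=3, fetch=fetch_page):
    """Yield each discovered URL once, following links max_depth levels deep."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth > 0:
            yield url  # the start page itself is not reported
        if depth == max_depth:
            continue  # deepest level: report the link but do not fetch it
        try:
            html = fetch(url)
        except Exception:
            continue  # broken link: skip it and keep crawling
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))


if __name__ == "__main__":
    import sys

    for found in crawl(sys.argv[1]):
        print(found)
```

With max_depth=3 it covers the same three levels as the nested loops. The fetch parameter is there so the crawl logic can be tried against a dictionary of fake pages without touching the network.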

Moved to Programming section of the forums. Please post in the proper section next time. ~Cent