
a python web crawler

Posted: Mon Apr 06, 2015 5:37 am
by pretentious
I made this a while ago. Actually made a thread here asking if I'd get in trouble for harassing Yahoo with 15,000 page requests :P

I abandoned this project when I realized that I would get infinite recursion because of a lack of regex finesse on relative addressing of hyperlinks, and I just couldn't be fucked #YOLO

So here's yet another coding failure for the world to see:
What this code does is pull hyperlinks from web pages and then follow those links for more hyperlinks. There's also a data file that collects the whole of the internet, because data mining intrigued me, though you likely won't find anything interesting in it.
Code:
#! /usr/bin/python
from urllib2 import Request, urlopen, URLError
import re

# Restore how far the last run got, if there was one.
try:
    initial_count = open("url_count", "r")
    iterator = int(initial_count.read())
    initial_count.close()
except IOError:
    iterator = 0

# Restore the list of known links, or seed it with a starting page.
try:
    initial_link_file = open("links", "r")
    links = [line.strip() for line in initial_link_file.readlines()]
    if len(links) == 0:
        links.append('http://au.yahoo.com/')
    initial_link_file.close()
except IOError:
    links = ['http://au.yahoo.com/']

link_file = open("links", "a")
content = open("content", "a")
count = open("url_count", "w")

def crawl(url):
    # Fetch a page and return its body, or 'fail' if the request blows up.
    req = Request(url)
    try:
        response = urlopen(req)
        return response.read()
    except URLError:
        return 'fail'

repeat = 0  # how many pages to fetch before asking again
while iterator < len(links):
    print "trying " + links[iterator]
    data = crawl(links[iterator])
    if data != 'fail':
        print 'hit'
        content.write(data)
        # Grab everything that looks like href="..." or href='...'
        result = re.findall('href=["\']\S*?["\']', data)
        for link in result:
            print link
            # Strip the leading href=" and the trailing quote.
            formated_link = link[6:len(link) - 1]
            if formated_link == '/':
                continue
            if formated_link[:1] == '/':
                formated_link = formated_link[1:]
            if formated_link[:4] != 'http':
                # Relative link: naively glue it onto the current URL.
                # (This is the part that generates junk URLs and loops.)
                if links[iterator] + formated_link not in links:
                    link_file.write(links[iterator] + formated_link + '\n')
                    links.append(links[iterator] + formated_link)
            else:
                # Absolute link: keep it if we haven't seen it before.
                if formated_link not in links:
                    link_file.write(formated_link + '\n')
                    links.append(formated_link)
    print 'done'
    # Save the crawl position so the run can be resumed later.
    count.seek(0)
    count.write(str(iterator + 1))
    print str(len(links)) + ' links'
    if iterator + 1 < len(links):
        print 'next url is ' + links[iterator + 1] + ' (' + str(iterator + 1) + ' in the list)'
    if repeat == 0:
        print 'continue?'
        answer = raw_input()
        if answer == 'no':
            break
        elif answer.isdigit():
            repeat = int(answer)
    iterator += 1
    if repeat > 0:
        repeat = repeat - 1
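
For what it's worth, the relative-addressing mess I mentioned is exactly what urlparse.urljoin in the standard library is for. The script above doesn't use it (hence the junk URLs), but this is roughly the idea -- just a sketch, the base URL and hrefs are made up:
Code:
#! /usr/bin/python
# Sketch only: resolve hrefs against the page they were found on,
# instead of gluing strings together like the crawler above does.
from urlparse import urljoin

base = 'http://au.yahoo.com/news/'  # the page the links came from
for href in ['/weather', 'article.html', 'http://example.com/']:
    # '/weather'     -> 'http://au.yahoo.com/weather'
    # 'article.html' -> 'http://au.yahoo.com/news/article.html'
    # absolute URLs pass through unchanged
    print urljoin(base, href)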

Re: a python web crawler

Posted: Sun Apr 12, 2015 11:44 am
by limdis
Nice. My Python is a little rough (Ruby ftw) so I'm not sure about the layout of the output. Will it list in a tree format?

EX:
link1
--link1
--link2
link2
--link1
--link2
--link3
link3
link4

I also like that you are removing duplicates. +1

Re: a python web crawler

Posted: Tue Apr 14, 2015 5:29 am
by pretentious
limdis wrote:Will it list in a tree format?

From memory, nah. It was just a two-hour-or-so job that pastes raw URLs line by line. It probably wouldn't be too difficult to adjust it to suit your needs though.
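
Off the top of my head, something like this would get you the indented layout (untested sketch, the names are just made up for the example): remember which page each link was found on, then walk the children recursively when printing.
Code:
#! /usr/bin/python
# Untested sketch: record parent -> children during the crawl, print as a tree.
children = {}  # page url -> list of links found on that page

def record(parent, child):
    children.setdefault(parent, []).append(child)

def print_tree(url, depth=0):
    # a real crawl would also need a visited check here to avoid loops
    print '--' * depth + url
    for child in children.get(url, []):
        print_tree(child, depth + 1)

# toy data standing in for crawl results
record('link1', 'link1a')
record('link1', 'link1b')
record('link2', 'link2a')
for top in ['link1', 'link2', 'link3']:
    print_tree(top)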

Re: a python web crawler

Posted: Wed Apr 15, 2015 6:37 pm
by cyberdrain
Reading through the code, your try block will probably error out.

Re: a python web crawler

Posted: Fri Apr 17, 2015 8:45 am
by pretentious
cyberdrain wrote:Reading through the code, your try block will probably error out.

Do you mean that the lines reading files which don't exist yet would throw an exception? Yeah, it looks dodgy, but I've tested the code and it works by itself. Nuff said ;)
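
On the first run there's no "url_count" or "links" file yet, so open() throws IOError and the except branch just falls back to the defaults. Stripped down, the pattern is just:
Code:
#! /usr/bin/python
# Fall back to a default when the state file doesn't exist yet.
try:
    f = open("url_count", "r")
    iterator = int(f.read())
    f.close()
except IOError:
    iterator = 0
print iterator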