a python web crawler

A place to submit all custom code, scripts, and programs.
Forum rules
Do NOT post malicious code or programs. Please review all code posted in this forum before downloading or running any of the code or programs here.


Post by pretentious on Mon Apr 06, 2015 5:37 am

I made this a while ago. I actually made a thread here asking if I'd get in trouble for harassing Yahoo with 15,000 page requests :P

I abandoned the project when I realised I'd get infinite recursion, because my regex wasn't fine-grained enough to handle relative addressing of hyperlinks, and I just couldn't be fucked #YOLO

So here's yet another coding failure for the world to see.
What the code does is pull hyperlinks from web pages and then follow those links for more hyperlinks. There's also a data file that collects everything it downloads, because data mining intrigued me, though you likely won't find anything interesting in it.
Code:
#! /usr/bin/python
# Python 2
from urllib2 import Request, urlopen, URLError
import re

# init count: resume from the url_count file if it exists
try:
    initial_count = open("url_count", "r")
    iterator = int(initial_count.read())
    initial_count.close()
except IOError:
    iterator = 0

# init links: resume from the links file, or seed with yahoo
try:
    initial_link_file = open("links", "r")
    links = [line.strip() for line in initial_link_file.readlines()]
    initial_link_file.close()
    if len(links) == 0:
        links.append('http://au.yahoo.com/')
except IOError:
    links = ['http://au.yahoo.com/']

link_file = open("links", "a")
content = open("content", "a")
count = open("url_count", "w")

def crawl(url):
    # fetch a page, or return 'fail' on any URL error
    req = Request(url)
    try:
        response = urlopen(req)
        return response.read()
    except URLError:
        return 'fail'

repeat = 0
while iterator < len(links):
    print "trying " + links[iterator]
    data = crawl(links[iterator])
    if data != 'fail':
        print 'hit'
        content.write(data)
        # grab href="..." and href='...' attributes
        result = re.findall('href=["\']\S*["\']', data)
        for link in result:
            # strip the leading href=" and the trailing quote
            formated_link = link[6:len(link) - 1]
            if formated_link == '/':
                continue
            if formated_link[:1] == '/':
                formated_link = formated_link[1:]
            if formated_link[:4] != 'http':
                # naive relative-link handling: append to the current URL
                if links[iterator] + formated_link not in links:
                    link_file.write(links[iterator] + formated_link + '\n')
                    links.append(links[iterator] + formated_link)
            else:
                if formated_link not in links:
                    link_file.write(formated_link + '\n')
                    links.append(formated_link)
    print 'done'
    count.seek(0)
    count.write(str(iterator + 1))
    print str(len(links)) + ' links'
    if iterator + 1 < len(links):
        print 'next url is ' + links[iterator + 1] + ' (' + str(iterator + 1) + ' in the list)'
    if repeat == 0:
        print 'continue?'
        answer = raw_input()
        if answer == 'no':
            break
        elif answer.isdigit():
            repeat = int(answer)
    iterator += 1
    if repeat > 0:
        repeat = repeat - 1
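For what it's worth, the relative-addressing problem that killed the project is exactly what urljoin solves; a minimal sketch in modern Python (it lives in urlparse in Python 2):

```python
from urllib.parse import urljoin  # urlparse.urljoin in Python 2

# resolve relative hrefs against the page they were found on
base = 'http://au.yahoo.com/news/'
print(urljoin(base, '/weather'))    # root-relative
print(urljoin(base, 'sport.html'))  # relative to the current directory
print(urljoin(base, '../about'))    # parent-relative
```

Dropping that in place of the string-slicing above would stop the same page getting queued under endless slightly-different URLs.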
Goatboy wrote:Oh, that's simple. All you need to do is dedicate many years of your life to studying security.

IF you feel like exchanging ASCII arrays, let me know ;)
Can you say brainwashing It's a non stop disco


Re: a python web crawler

Post by limdis on Sun Apr 12, 2015 11:44 am

Nice. My Python is a little rough (Ruby ftw), so I'm not sure about the layout of the output. Will it list in a tree format?

EX:
link1
--link1
--link2
link2
--link1
--link2
--link3
link3
link4

I also like that you are removing duplicates. +1
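To get that tree layout, the crawler would need to remember which page each link came from. A hypothetical sketch (the parent-to-children map isn't something the original code tracks):

```python
def tree_lines(children, roots):
    # flatten a parent -> children map into indented lines, one level deep
    lines = []
    for root in roots:
        lines.append(root)
        for child in children.get(root, []):
            lines.append('--' + child)
    return lines

children = {'link1': ['link1', 'link2'],
            'link2': ['link1', 'link2', 'link3']}
for line in tree_lines(children, ['link1', 'link2', 'link3', 'link4']):
    print(line)
```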
"The quieter you become, the more you are able to hear..."
"Drink all the booze, hack all the things."


Re: a python web crawler

Post by pretentious on Tue Apr 14, 2015 5:29 am

limdis wrote:Will it list in a tree format?

From memory, nah. It was just a two-hour-or-so job that pastes raw URLs line by line. It probably wouldn't be too difficult to adjust it to suit your needs, though.


Re: a python web crawler

Post by cyberdrain on Wed Apr 15, 2015 6:37 pm

Reading through the code, your try block will probably error out.
Free your mind / Think clearly


Re: a python web crawler

Post by pretentious on Fri Apr 17, 2015 8:45 am

cyberdrain wrote:Reading through the code, your try block will probably error out.

Do you mean that the lines reading files which don't exist would throw an exception? Yeah, it looks dodgy, but I've tested the code and it works: the except IOError clause catches the failed open and falls back to the defaults. Nuff said ;)
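The pattern in question, as a standalone sketch: the open fails on the first run, the except falls back to a default, and later runs resume from the file.

```python
def load_count(path):
    # resume the crawl position, or start at 0 if no state file exists yet
    try:
        with open(path) as f:
            return int(f.read())
    except IOError:  # in Python 3, FileNotFoundError is a subclass of this
        return 0
```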


