Python Web Crawler Tutorial – 17 – Running the Final Program

Comments

47 responses to “Python Web Crawler Tutorial – 17 – Running the Final Program”

  1. jad fady Avatar

    Hey Bucky, I like the work you have done with all these tutorials. I have a small problem with this: as soon as I run it I get an error that I'm not able to solve.

    First spider crawling http://huntourage.com
    Queue: 1 | Crawled: 0
    Traceback (most recent call last):
    File "C:/Users/Vik/Desktop/Jad/Python/theNewBostonWebCrawlers/main.py", line 14, in <module>
    Spider(PROJECT_NAME, HOMEPAGE, DOMAIN_NAME)
    File "C:UsersVikDesktopJadPythontheNewBostonWebCrawlersspider.py", line 25, in _init_
    self.crawl_page('First spider', Spider.base_url)
    File "C:UsersVikDesktopJadPythontheNewBostonWebCrawlersspider.py", line 39, in crawl_page
    Spider.add_links_to_queue(Spider.gather_links(page_url))
    File "C:UsersVikDesktopJadPythontheNewBostonWebCrawlersspider.py", line 61, in add_links_to_queue
    for url in links:
    TypeError: 'type' object is not iterable

    Process finished with exit code 1

    I have the exact same code but I always keep getting this same error. Any idea?
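
    A likely cause (just a guess from the traceback): the error branch in gather_links is returning the bare set class instead of an empty set instance. Below is a minimal, simplified sketch of the difference; the function names mirror spider.py, but the bodies are stand-ins, not Bucky's actual code:

    def gather_links_broken(page_url):
        try:
            raise IOError('connection failed')   # simulate a request that fails
        except Exception as e:
            print(str(e))
            return set           # BUG: returns the set *class* itself

    def gather_links_fixed(page_url):
        try:
            raise IOError('connection failed')
        except Exception as e:
            print(str(e))
            return set()         # empty set *instance*: iterable, so the caller just skips it

    def add_links_to_queue(links):
        for url in links:        # iterating the set class raises "'type' object is not iterable"
            print(url)

    add_links_to_queue(gather_links_fixed('http://huntourage.com'))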

  2. Samiha Abrita Avatar

    I have followed every step but it only crawls the home page. I don't understand why it is not crawling other pages or links.

  3. Mak S Avatar

    Thanks a ton, Bucky.

  4. Mak S Avatar

    I'm amazed by your work…
    excellent… exceptional…

  5. Drashya Kushwah Avatar

    Thanx a ton Bucky 🙂

  6. Matt O'Toole Avatar

    Excellent series. Well worth watching. Thank you for doing this! 🙂

  7. Bárbara Perim Avatar

    Thank you! This is a very good tutorial (:

  8. Mike Kosewski Avatar

    Everything seems to work, but I keep getting this error in cmd and PyCharm:

    File "C:/Users/Jino/PycharmProjects/PyScripts/main.py", line 45
    create jobs()
    ^
    SyntaxError: invalid syntax

    Process finished with exit code 1

    =[ I am also using Python 2.7
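
    The caret points at `create jobs()`, so most likely an underscore got dropped when typing line 45 of main.py; the function defined earlier in the series is called create_jobs(), if I remember right. A tiny illustration:

    # A function name cannot contain a space, so `create jobs()` is a SyntaxError.
    def create_jobs():
        print('creating jobs...')

    create_jobs()   # underscore, not a space

    Also, as far as I remember the series is written against Python 3 (urllib.request, the lowercase queue module), so on Python 2.7 you will likely hit more errors even after fixing this line.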

  9. Siddhant Bagga Avatar

    How do I make use of these links?

  10. Cri Dane Avatar

    Do you want to collaborate and create a big project to see how big it can get?

  11. Arvie San Avatar

    Hi Bucky, I'm wondering: I think Google is not using txt files to save their crawled pages, so I want to know, would it be more efficient if I modify your project to use an SQLite database instead?
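
    For what it's worth, here is a rough sketch of what an SQLite version of the storage could look like. The table and function names are invented for illustration; they are not part of Bucky's project:

    import sqlite3

    def create_tables(db_path):
        conn = sqlite3.connect(db_path)
        conn.execute('CREATE TABLE IF NOT EXISTS queue (url TEXT PRIMARY KEY)')
        conn.execute('CREATE TABLE IF NOT EXISTS crawled (url TEXT PRIMARY KEY)')
        conn.commit()
        return conn

    def add_to_queue(conn, url):
        # PRIMARY KEY + OR IGNORE de-duplicates, which the txt files only get
        # by loading everything into sets first
        conn.execute('INSERT OR IGNORE INTO queue (url) VALUES (?)', (url,))
        conn.commit()

    def mark_crawled(conn, url):
        conn.execute('DELETE FROM queue WHERE url = ?', (url,))
        conn.execute('INSERT OR IGNORE INTO crawled (url) VALUES (?)', (url,))
        conn.commit()

    conn = create_tables('spider.db')
    add_to_queue(conn, 'https://example.com/')
    mark_crawled(conn, 'https://example.com/')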

  12. Ricardo Medley Avatar

    It seems like the crawler is useless; the only thing I was able to crawl was your site, and it only returned the URL of the homepage.

  13. Ricardo Medley Avatar

    Is there any way to crawl news articles based on a keyword? If so, how?

  14. Jeremy Collins Avatar

    This is a great tutorial! Thanks! I have one question. I tried crawling a wordpress site. Does this only work with HTML pages? I seem to remember you saying it would work with PHP, but I'm not sure.

  15. Jimmy Rodriguez Avatar

    It's giving me an error… telling me to import Queue instead of queue.
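
    That error usually means Python 2 is in use: the standard-library module is Queue (capital Q) on Python 2 and queue on Python 3, which is what this series uses. A small compatibility shim, offered as a suggestion rather than anything from the videos:

    try:
        import queue              # Python 3, what the series uses
    except ImportError:
        import Queue as queue     # Python 2 name

    q = queue.Queue()
    q.put('https://example.com/')
    print(q.get())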

  16. Pranjal Gupta Avatar

    Hey, if someone has a problem with the code implementation, you may have a look here at my GitHub source code:

    https://github.com/PranjalGupta3105/Website_Crawler

  17. DarkF0x Avatar

    Please, does anyone know how to set up my crawler to crawl my page on the deep web?

  18. Joshua Lee Avatar

    Can this web crawler extract reviews from travel websites?

  19. Zheng Luo Avatar

    The crawler costs too much I/O… Maybe we can modify the code to update the files less frequently.
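
    One hedged way to cut the I/O: keep the links in memory and only rewrite the files every N crawled pages instead of after every single page. The names below are illustrative, not from the project (set_to_file mimics the helper in general.py):

    FLUSH_EVERY = 50
    pages_since_flush = 0

    def set_to_file(links, file_name):
        # rewrite the whole file, like the helper in general.py
        with open(file_name, 'w', encoding='utf-8') as f:
            for link in sorted(links):
                f.write(link + '\n')

    def maybe_update_files(queue_set, crawled_set):
        global pages_since_flush
        pages_since_flush += 1
        if pages_since_flush >= FLUSH_EVERY:        # flush to disk only every 50 pages
            set_to_file(queue_set, 'queue.txt')
            set_to_file(crawled_set, 'crawled.txt')
            pages_since_flush = 0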

  20. Xavi Beltran Avatar

    Nice tutorial 🙂

  21. Slim Shah Avatar

    I downloaded the source code from GitHub, and there is also a problem: the program freezes, and I have to stop main.py and rerun it to process the queued items. Can anyone please help me with this? I don't know what the problem is.

  22. oPerfectionist Avatar

    How can I make a sitemap using this crawler?

  23. איתי כצנלסון Avatar

    Hey man, I saw your Python web crawler, and as a project for my school I want to add something like a visual tree that shows the site's tree map, or something close to that…
    How can I add that option to your project?
    Thanks a lot if you help me.

  24. Cody Spate Avatar

    I am able to crawl just fine up to about 3510 pages, then it just freezes with over 1000 pages in the queue. Why is this happening? Is the connection closing?

  25. Daniel Field Avatar

    Getting a UnicodeEncodeError when I try to crawl Wikipedia 🙁
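
    A hedged guess at the cause: Wikipedia pages and URLs contain non-ASCII characters, and on Windows both files and the console default to an encoding like cp1252 that cannot represent them. Opening the data files explicitly as UTF-8 usually helps. The sketch below assumes a helper along the lines of append_to_file in general.py:

    def append_to_file(path, data):
        # explicit UTF-8 instead of the Windows default (often cp1252)
        with open(path, 'a', encoding='utf-8') as f:
            f.write(data + '\n')

    append_to_file('queue.txt', 'https://en.wikipedia.org/wiki/Łódź')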

  26. Mike Zamayias Avatar

    File "g:DocumentsPythoncrawlermain.py", line 19
    for _ in range(NUMBER_OF_THREADS)
    ^
    SyntaxError: invalid syntax

    This is what I get… I can't understand why..
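
    When the caret sits at the very end of a for line like that, the colon is almost always what's missing. Compare:

    NUMBER_OF_THREADS = 8

    # for _ in range(NUMBER_OF_THREADS)      <- SyntaxError: invalid syntax
    for _ in range(NUMBER_OF_THREADS):       # the trailing colon is required
        pass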

  27. Shubham Pandey Avatar

    I am using my university's wifi, which requires a username and password for web connectivity… I am always getting the output "Error: can not crawl page." Can anyone please tell me how to include that in this project?

  28. Celio Souza Avatar

    Great Tutorial!!! Thanks a lot, brother

  29. Matsaru Yuko Avatar

    Hey, can anyone help me? I'm still getting this error: "addinfourl instance has no attribute 'getheader'"
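
    addinfourl is the response object that Python 2's urllib returns, and it has no .getheader(); the series' code calls response.getheader('Content-Type'), which exists only on Python 3's response object. A hedged compatibility sketch:

    try:
        from urllib.request import urlopen          # Python 3
    except ImportError:
        from urllib2 import urlopen                 # Python 2

    response = urlopen('http://example.com/')
    try:
        content_type = response.getheader('Content-Type')             # Python 3
    except AttributeError:
        content_type = response.info().getheader('Content-Type')      # Python 2
    print(content_type)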

  30. Akshat Uppal Avatar

    I am using Python 2.7 on my system. When I execute my crawler it shows this error:
    descriptor '__init__' of 'super' object needs an argument
    Help me!
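
    Bare super() only works on Python 3; on 2.7 it needs the class and the instance passed explicitly, which is exactly what that error message is complaining about. A generic illustration with hypothetical Base/Child names (in the series the subclass is LinkFinder(HTMLParser), and on Python 2 you may also have to fall back to calling HTMLParser.__init__(self) directly, since the old-style parent there rejects super() entirely):

    class Base(object):
        def __init__(self):
            self.ready = True

    class Child(Base):
        def __init__(self):
            # super().__init__()             # Python 3 only -> the error above on 2.7
            super(Child, self).__init__()    # works on both 2.7 and 3
            self.links = set()

    print(Child().ready)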

  31. Wassan Hayale Avatar

    Hi, I ran the code and it works. I just want to ask: if I want to use the code to crawl just one link on the homepage, for example from giphy.com, then to categories, then to emotion, and under emotion I need to crawl 7 emotion links, then this code will never work for a case like this, right? Because we need base_url to be the homepage URL, and we can't set it to anything else.

  32. ianpooley89 Avatar

    Thanks a lot man, precious.

  33. farpras Avatar

    Please make Godot Engine tutorial! ><

  34. Fang He Avatar

    Thank you so much : ) Keep it up~

  35. Fang He Avatar

    Will there be problems using multiple threads without any synchronization?
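
    The job handoff in the series goes through queue.Queue, which is thread-safe by itself, but the shared queue/crawled sets and the file writes are not protected. One hedged option is to guard them with a lock; this is my own suggestion, not something from the videos:

    import threading

    lock = threading.Lock()
    queue_set = set()
    crawled_set = set()

    def record_crawled(url):
        with lock:       # only one thread touches the sets / files at a time
            queue_set.discard(url)
            crawled_set.add(url)
            with open('crawled.txt', 'a', encoding='utf-8') as f:
                f.write(url + '\n')

    threads = [threading.Thread(target=record_crawled, args=('https://example.com/%d' % i,))
               for i in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(len(crawled_set))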

  36. D. Refaeli Avatar

    Wow – great job bucky! Thanks!!!

  37. liang lu Avatar

    You'll find an issue with duplicated URLs saved in crawled.txt; it's due to an error in set_to_file. You can find Bucky's update on GitHub to fix it >> https://github.com/buckyroberts/Spider/blob/master/general.py

  38. Shawn Cong Avatar

    I guess there may be thread-safety problems with the file I/O, because the file size of my 'crawled.txt' doesn't grow steadily, but jumps up and down 🙁
    I'm trying to fix it.

  39. Ojisan Pass-by (路过的欧吉桑) Avatar

    Wow, nice tutorial, but my crawled.txt has hundreds of links and my queue.txt has thousands of links. And some errors came up while crawling.

  40. Μάλλον Ακίνδυνος Avatar

    Make a tutorial on Selenium, and how to avoid getting timed out.

  41. Blueprint Avatar

    Hi Bucky! I'm using your code to crawl 'http://www.reddit.com/subreddits', but it doesn't work properly; sometimes urllib.open works and sometimes it doesn't. Is there any way I can make this work? The problem comes from a timeout exception from Reddit's website.
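
    One hedged workaround for the timeouts: pass a timeout to urlopen and retry a couple of times with a short backoff before giving up. The wrapper below is illustrative, not part of the series' code:

    import socket
    import time
    from urllib.error import URLError
    from urllib.request import urlopen

    def fetch_with_retries(url, attempts=3, timeout=10):
        for attempt in range(attempts):
            try:
                return urlopen(url, timeout=timeout)
            except (URLError, socket.timeout) as e:
                print('attempt %d failed: %s' % (attempt + 1, e))
                time.sleep(2 ** attempt)       # simple backoff: 1s, 2s, 4s
        return None

    response = fetch_with_retries('https://www.reddit.com/subreddits')
    print('ok' if response else 'gave up')

    Reddit is also known to throttle the default urllib User-Agent, so building the request with a custom User-Agent header may help as well.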

  42. jimmys_popcorn Avatar

    Thanks so much! Favorite YouTube tutorial so far, by far 😀

    <3

  43. kennyPAGC Avatar

    Can this crawler be used to make a sitemap? How would you do it?
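
    A hedged sketch of one way to do it: read crawled.txt (one URL per line, which is how the project writes it) and emit a minimal sitemap.xml:

    from xml.sax.saxutils import escape

    def crawled_to_sitemap(crawled_path='crawled.txt', sitemap_path='sitemap.xml'):
        with open(crawled_path, encoding='utf-8') as f:
            urls = sorted({line.strip() for line in f if line.strip()})
        with open(sitemap_path, 'w', encoding='utf-8') as out:
            out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
            out.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
            for url in urls:
                out.write('  <url><loc>%s</loc></url>\n' % escape(url))
            out.write('</urlset>\n')

    crawled_to_sitemap()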

  44. kennyPAGC Avatar

    I ran the crawler on a small blog and hit maximum recursion depth. Anything wrong? :/ Edit: I changed the number of threads from 8 to 4 (which is the number of cores my PC has) and it worked fine. Why is that?

  45. PixelRebellion Avatar

    Is there any way I could turn this into something like an executable file where people can enter the link of a website and set the directory where they want the crawled and queued files to be, and then the spider does its thing?
