Python Web Crawler Tutorial – 17 – Running the Final Program
47 responses to “Python Web Crawler Tutorial – 17 – Running the Final Program”
-
Hey Bucky, I like the work you have done with all these tutorials. I have a small problem with this: as soon as I run it I get an error which I am not able to solve.
First spider crawling http://huntourage.com
Queue: 1 | Crawled: 0
Traceback (most recent call last):
File "C:/Users/Vik/Desktop/Jad/Python/theNewBostonWebCrawlers/main.py", line 14, in <module>
Spider(PROJECT_NAME, HOMEPAGE, DOMAIN_NAME)
File "C:UsersVikDesktopJadPythontheNewBostonWebCrawlersspider.py", line 25, in _init_
self.crawl_page('First spider', Spider.base_url)
File "C:UsersVikDesktopJadPythontheNewBostonWebCrawlersspider.py", line 39, in crawl_page
Spider.add_links_to_queue(Spider.gather_links(page_url))
File "C:UsersVikDesktopJadPythontheNewBostonWebCrawlersspider.py", line 61, in add_links_to_queue
for url in links:
TypeError: 'type' object is not iterableProcess finished with exit code 1
I have the exact same code but I always keep getting this same error. Any idea?
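A likely cause, assuming spider.py follows the tutorial: gather_links() returns the built-in set type instead of an empty set() from its exception handler, so add_links_to_queue() ends up looping over the type object itself. A minimal sketch of the corrected method (the names are the tutorial's):

@staticmethod
def gather_links(page_url):
    html_string = ''
    try:
        response = urlopen(page_url)
        if 'text/html' in response.getheader('Content-Type'):
            html_string = response.read().decode('utf-8')
        finder = LinkFinder(Spider.base_url, page_url)
        finder.feed(html_string)
    except Exception as e:
        print(str(e))
        return set()   # an empty set, not the bare name "set"
    return finder.page_links()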
-
WOW!!!
-
I have followed every step but it only crawls the home page. I don't understand why it is not crawling other pages or links.
-
Thanks a ton Bucky
-
I'm amazed by your work ….
excellent….. exception…… -
Thanx a ton Bucky 🙂
-
Excellent series. Well worth watching. Thank you for doing this! 🙂
-
Thank you! This is a very good tutorial (:
-
Thank you so much
-
Everything seems to work but I keep getting this error in cmd and PyCharm:
File "C:/Users/Jino/PycharmProjects/PyScripts/main.py", line 45
    create jobs()
                ^
SyntaxError: invalid syntax
Process finished with exit code 1
=[ I am also using Python 2.7
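`create jobs()` is two names with a space between them, which is what the SyntaxError points at; the function defined earlier in the tutorial's main.py is, as far as I recall, create_jobs(), so the call most likely just needs the underscore:

create_jobs()   # not "create jobs()"

Also worth noting that the series is written for Python 3, so Python 2.7 will hit further errors (the queue import, among others).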
-
How do I make use of these links?
-
Do you want to collaborate and create a big project to see how big it can get?
-
Hi Bucky, I'm wondering: I think Google is not using txt files to save their crawled pages, so I want to know, would it be efficient if I modify your project to use an SQLite database instead?
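Swapping the text files for SQLite should work and saves rewriting the whole set on every update. A minimal sketch with made-up table and function names (not part of the tutorial); for the multithreaded spider you would also need one connection per thread, or connect with check_same_thread=False and guard the calls with a lock:

import sqlite3

conn = sqlite3.connect('crawler.db')
conn.execute('CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, crawled INTEGER DEFAULT 0)')

def add_to_queue(url):
    # INSERT OR IGNORE keeps URLs unique without rewriting a whole file
    conn.execute('INSERT OR IGNORE INTO pages (url) VALUES (?)', (url,))
    conn.commit()

def mark_crawled(url):
    conn.execute('UPDATE pages SET crawled = 1 WHERE url = ?', (url,))
    conn.commit()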
-
It seems like the crawler is useless; the only thing I was able to crawl was your site, and it only returned the URL of the homepage.
-
Is there any way to crawl news articles based on a keyword? If so, how?
-
This is a great tutorial! Thanks! I have one question. I tried crawling a wordpress site. Does this only work with HTML pages? I seem to remember you saying it would work with PHP, but I'm not sure.
-
It's giving me an error… to import queue from Queue.
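That usually means the script is running under Python 2, where the module is called Queue; the tutorial uses Python 3's queue. If upgrading isn't an option, a compatibility import works under both:

try:
    import queue           # Python 3
except ImportError:
    import Queue as queue  # Python 2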
-
Hey, if someone has a problem with the code implementation, you may have a look at my GitHub source code:
-
Please, does someone know how to set up the crawler to crawl my page on the deep web?
-
Can this web crawler extract reviews from travel websites?
-
The crawler uses too much I/O… Maybe we can modify the code to make the file updates less frequent.
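One way to do that, as a rough sketch: flush the queue/crawled files only every N pages instead of after every single page. The counter and threshold names below are made up, and update_files() is assumed to be the tutorial's existing save routine:

PAGES_PER_SAVE = 25       # hypothetical batch size
pages_since_save = 0

def maybe_update_files():
    # call this where the spider currently calls update_files()
    global pages_since_save
    pages_since_save += 1
    if pages_since_save >= PAGES_PER_SAVE:
        update_files()
        pages_since_save = 0

The trade-off is that a crash loses up to the last N pages of progress.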
-
Nice tutorial 🙂
-
I downloaded the source code from GitHub. And there is also a problem: the program freezes, and I have to stop main.py and rerun it to process the queued items. Can anyone please help me with this? I don't know what the problem is.
-
How can I make a sitemap using this crawler?
-
Hey man, I saw your Python web crawler, and as a project for my school I want to add something like a visual tree that shows the site map, or something close to that…
How can I add that option to your project?
Thanks a lot if you help me -
I am able to crawl just fine up to about 3510 pages, then it just freezes with over 1000 pages in the queue. Why is this happening? Is the connection closing?
-
Getting a UnicodeEncodeError when I try to crawl Wikipedia 🙁
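That error usually comes from writing non-ASCII URLs to the queue/crawled files (or printing them on a Windows console). Forcing UTF-8 when the files are written is one likely fix; a sketch, assuming the set_to_file() helper from the tutorial's general.py:

def set_to_file(links, file_name):
    # UTF-8 handles the non-ASCII characters that appear in Wikipedia URLs
    with open(file_name, 'w', encoding='utf-8') as f:
        for link in sorted(links):
            f.write(link + '\n')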
-
File "g:DocumentsPythoncrawlermain.py", line 19
for _ in range(NUMBER_OF_THREADS)
^
SyntaxError: invalid syntaxThis is what I get… I can't understand why..
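A caret at the end of a "for _ in range(...)" header is almost always a missing colon; the line has to end with ":" before the indented body, roughly as the tutorial's create_workers() does:

for _ in range(NUMBER_OF_THREADS):   # note the trailing colon
    t = threading.Thread(target=work)
    t.daemon = True
    t.start()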
-
I am using my university's wifi and thus require a username and password for web connectivity… I always get the output "Error: can not crawl page." Can anyone please tell me how to include that in this project?
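If the network forces everything through an authenticated proxy, one option is to install a ProxyHandler before the spider starts, so the urlopen() calls go through it. A sketch; the proxy address, username and password are placeholders you would replace with your university's values:

from urllib.request import ProxyHandler, build_opener, install_opener

proxy = ProxyHandler({
    'http':  'http://username:password@proxy.example.edu:8080',
    'https': 'http://username:password@proxy.example.edu:8080',
})
install_opener(build_opener(proxy))   # later urlopen() calls now use the proxy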
-
Great Tutorial!!! Thanks a lot, brother
-
Hey, can anyone help me? I'm still getting this error: "addinfourl instance has no attribute 'getheader'"
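That is another Python 2 symptom: urllib2's response object (addinfourl) has no getheader() method, so the tutorial's response.getheader('Content-Type') fails. Running under Python 3 is the simplest fix; if you have to stay on Python 2, the header is read through response.info() instead, roughly:

content_type = response.info().getheader('Content-Type')   # Python 2 form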
-
I am using Python 2.7 on my system. When I execute my crawler, it shows this error:
descriptor '__init__' of 'super' object needs an argument
Help me! -
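That error comes from the zero-argument super().__init__() call, which only exists on Python 3. On Python 2.7 the parent initializer has to be called explicitly, and since HTMLParser is an old-style class there, the direct call is the safer form. A sketch of how link_finder.py would look on Python 2 (attribute names follow the tutorial); running the project under Python 3, as the series does, avoids all of this:

from HTMLParser import HTMLParser   # the module is html.parser on Python 3

class LinkFinder(HTMLParser):
    def __init__(self, base_url, page_url):
        HTMLParser.__init__(self)   # zero-argument super() is Python-3-only
        self.base_url = base_url
        self.page_url = page_url
        self.links = set()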
Hi, I ran the code and it works. I just want to ask: if I want to use the code to crawl just one link on the homepage, for example from giphy.com to categories, then to emotion, and under emotion I need to crawl 7 emotion links, then this code will never work for a case like that, right? Because we need base_url to be the home page URL and we can't set it to anything else.
-
Thanks a lot man, precious.
-
Please make Godot Engine tutorial! ><
-
Thank you so much : ) Keep doing~
-
Will there be a problem using multithreading without any synchronization?
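There can be: the worker threads all end up calling the same helpers that rewrite queue.txt and crawled.txt, so two threads can write at the same time. A minimal way to serialize just the file updates is a threading.Lock (a sketch, not the tutorial's code; update_files() is assumed to be the existing save routine):

import threading

file_lock = threading.Lock()

def update_files_safely():
    # only one thread at a time rewrites the shared files
    with file_lock:
        Spider.update_files()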
-
Wow – great job bucky! Thanks!!!
-
You'll find an issue with duplicated URLs saved in crawled.txt; it's due to an error in set_to_file. You can find Bucky's update on GitHub to fix it >> https://github.com/buckyroberts/Spider/blob/master/general.py
-
I guess there may be thread-safety problems with the file I/O, because the file size of my 'crawled.txt' doesn't grow constantly but jumps up and down 🙁
I'm trying to fix it. -
Wow, nice tutorial, but my crawled.txt has hundreds of links and my queue.txt has thousands of links. And some errors were found while crawling.
-
Make a tutorial on Selenium, and how to avoid getting timed out.
-
Hi Bucky! I'm using your code to crawl 'http://www.reddit.com/subreddits' but it doesn't work properly; sometimes urllib.open works and sometimes it doesn't. Is there any way I can make this work? The problem comes from a timeout exception from Reddit's website.
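Two things that usually help with reddit: pass a timeout to urlopen() so a stalled request raises quickly instead of hanging a worker thread, and send a non-default User-Agent, since reddit throttles the stock urllib one. A sketch of the request part of gather_links (the User-Agent string is a placeholder):

from urllib.request import Request, urlopen

req = Request(page_url, headers={'User-Agent': 'my-study-crawler/0.1'})
response = urlopen(req, timeout=10)   # seconds; raises instead of hanging forever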
-
Thanks so much! Favorite YouTube tutorial so far, by far 😀
<3
-
Can this crawler be used to make a sitemap? How would you do it?
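One way, as a rough sketch: crawled.txt already holds one URL per line, so it can be converted into a basic XML sitemap once the crawl finishes (the output file name is made up):

def write_sitemap(crawled_file='crawled.txt', out_file='sitemap.xml'):
    # turn the crawler's one-URL-per-line output into a minimal sitemap
    with open(crawled_file, 'r') as f:
        urls = sorted(line.strip() for line in f if line.strip())
    with open(out_file, 'w') as out:
        out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        out.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for url in urls:
            out.write('  <url><loc>' + url + '</loc></url>\n')
        out.write('</urlset>\n')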
-
I ran the crawler on a small blog and reached max recursion depth. Anything wrong? :/ Edit: I changed the number of threads from 8 to 4 (which is the number of cores my PC has) and it worked fine. Why is that?
-
Is there any way I could turn this into something like an executable file where people can enter the link of a website and set the directory where they want the crawled and queued files to be, and then the spider does its thing?