I am writing a Django web app which uses Scrapy. It works great locally, but I wonder how to set up a production environment where my spiders are launched periodically and automatically (I mean that once a spider completes its job, it gets relaunched after a certain time... for example 24 hours later). Currently I launch my spiders using a custom Django command, whose main goal is to let Django's ORM store the scraped items, so I run:
python manage.py scrapy crawl myspider
and the results are stored in my PostgreSQL database. I have installed scrapyd, because it seems to be the preferred way of running Scrapy in production, but unfortunately I can't use it without writing a monkey patch (which I would like to avoid), because its web-service API uses JSON and I get a "ModelX is not JSON serializable" exception. I also looked at django-dynamic-scraper, but it does not seem to be as flexible and customizable as Scrapy; in fact its docs say:
Since it simplifies things, DDS is not usable for all kinds of scrapers, but it is well suited for the relatively common case of regularly scraping a website with a list of updated items.
I also thought about using a cron job to schedule my spiders... but at what interval should I run them? And if my EC2 instance (I am using Amazon Web Services to host my code) needs a reboot, I have to run all my spiders again manually... mmmh... things get complicated... So... what could be an effective setup for a production environment? How do you handle it? What is your advice?
That was the question that got me thinking. Here I explain what I did in my own project and how I handled each of its points.
Currently I launch my spiders using a custom Django command, whose main goal is to let Django's ORM store the scraped items.
This sounded very familiar: I also wanted to use the Django ORM inside my Scrapy spiders, so I imported django and set up its settings before the scraping started. I think this is unnecessary if you already call Scrapy from within a running Django context.
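For reference, a minimal sketch of that kind of setup, assuming a recent Django version, a settings module called myproject.settings, and a hypothetical model myapp.models.Article; it goes at the top of the Scrapy project's settings.py (or any module imported before the Django models):

    # Scrapy project settings.py -- initialise Django before any model import.
    import os
    import sys
    import django

    # Assumption: the Django project root is one directory above this file.
    sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))
    os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'myproject.settings')
    django.setup()  # on very old Django versions setting the env variable is enough

    # A Scrapy pipeline that stores items through the Django ORM.
    from myapp.models import Article  # hypothetical Django model

    class DjangoWriterPipeline(object):
        def process_item(self, item, spider):
            Article.objects.update_or_create(
                url=item['url'],
                defaults={'title': item['title']},
            )
            return item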
I have installed scrapyd, because it seems to be the preferred way of running Scrapy in production, but unfortunately I can't use it without writing a monkey patch (which I would like to avoid).
My idea was to use subprocess.Popen, with stdout and stderr redirected to pipes, then take both the stdout and stderr output and process them. I did not need to collect the items from the output, because the spiders were already writing the results to the database through their pipelines. Calling Scrapy from Django this way gets a bit recursive: Django launches the Scrapy process, and the Scrapy process sets up the Django context again so that it can use the ORM.
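In rough form the launcher looked something like this (a sketch rather than my exact code; the spider name and the custom "scrapy" management command are the ones from the question above):

    import subprocess

    def run_spider(spider_name):
        # Run the crawl in its own process so the Twisted reactor never
        # collides with the Django process that launched it.
        proc = subprocess.Popen(
            ['python', 'manage.py', 'scrapy', 'crawl', spider_name],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )
        out, err = proc.communicate()  # block until the spider finishes
        # Items are already in the database via the pipelines; stdout and
        # stderr are only kept for logging and error detection.
        return proc.returncode, out, err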
Then I tried Scrapyd and yes, it does let you deploy and schedule jobs, but it does not tell you when a job is finished or whether it is still pending; that part you have to check yourself, and it seems to me that this is where the monkey patch would come in.
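One way to get that missing feedback without patching Scrapyd is to poll its listjobs.json endpoint until the job shows up as finished. A sketch, assuming Scrapyd on its default port, a project named myproject, and Python 3:

    import json
    import time
    from urllib.request import urlopen

    LISTJOBS_URL = 'http://localhost:6800/listjobs.json?project=myproject'

    def wait_for_job(job_id, poll_every=60):
        # Ask Scrapyd for its job lists until our job id reaches "finished".
        while True:
            data = json.loads(urlopen(LISTJOBS_URL).read().decode('utf-8'))
            if any(j['id'] == job_id for j in data.get('finished', [])):
                return True
            still_there = data.get('pending', []) + data.get('running', [])
            if not any(j['id'] == job_id for j in still_there):
                return False  # the job disappeared without finishing
            time.sleep(poll_every)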
I also thought about using cron to schedule my spiders... but at what interval should I run them? And if my EC2 instance (I am using Amazon Web Services to host my code) needs a reboot, I have to run all my spiders again manually... mmmh... things get complicated... So... what could be an effective setup for a production environment? How do you handle it? What is your advice?
I am currently using cron to schedule the crawls. It is not something that users can change, even if they want to, but that also has its pros: this way I am sure nobody shortens the interval and that several scrapers never run at the same time.
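For reference, this is roughly how such a cron-driven setup can be wired up; it is only a sketch: the spider names, the project path in the crontab line, and the run_spiders command name are made up for the example, and it reuses the custom "scrapy" management command from the question.

    # myapp/management/commands/run_spiders.py
    #
    # Crontab entry (assumed path and a once-a-day schedule):
    #   0 3 * * * cd /home/user/myproject && python manage.py run_spiders
    import subprocess

    from django.core.management.base import BaseCommand

    SPIDERS = ['myspider', 'anotherspider']  # hypothetical spider names

    class Command(BaseCommand):
        help = 'Run every spider one after another so crawls never overlap.'

        def handle(self, *args, **options):
            for name in SPIDERS:
                # Sequential calls guarantee a single crawl at a time.
                subprocess.call(['python', 'manage.py', 'scrapy', 'crawl', name])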
I am a bit worried about adding unnecessary links to the chain; Scrapyd would be that middle link, and it seems to be doing its job for now, but it could also become a weak link if it cannot keep up in production.
Keeping in mind that you posted this a while ago, I would be grateful to hear what your final solution was for the whole Django-Scrapy-Scrapyd integration.
Cheers