I need a crawler that runs on Linux, is easy to install on multiple computers if needed, and crawls through a list of recipe sites that I provide. It should have the following features:
1. Download the page, including any images (recipe pictures etc.), and store them in a folder whose name is specified in the recipe database.
2. Process the downloaded page: put the ingredients in one database field, the description in another field, and other information in another field.
It should look like this: recipe ID 1, then each ingredient linked to recipe ID 1, with amount, quantity etc. Have a look at phprecipebook. I want to mirror that structure for processing the data and storing it in a MySQL database, but with a few extra fields for source name, source URL, image URL and that sort of information.
3. It should be able to store quantities as well when they sit inside a text box, as they do on some sites.
4. It should record only recipes. I want to build a database of millions of recipes, so this would essentially be a giant Google-style crawler (but only for recipes).
5. It should be speed limited, but also work in round-robin fashion. Instead of overloading one site by crawling it quickly, I should be able to keep a list of base domains with URLs under each domain. The crawler should fetch one URL from the first domain, then move on to the second domain, then the third, and so on, so it gathers lots of information very quickly but from different domains, if that makes sense.
It should be semi-template-based so it's easy to add new recipe sites and to change what information is recorded if the layout of a site changes.
6. It should be able to crawl recipe sites directly, or work through numerous proxy sites if my IP gets blocked. If it crawls through proxy sites, it should still record the source URL of the page being downloaded, without the proxy URL. So say it goes through [url removed, login to view], it should record the source as [url removed, login to view]
That's what I mean. I will provide a big list of recipe sites I want the system to crawl, and I want it to extract all information, including ingredients (one by one in the database), description, images, categories, related recipes, and any other descriptions of recipes like starter, dessert etc., or gluten free etc.
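To make the round-robin idea in point 5 more concrete, here is a rough Python sketch (Python is only my choice for illustration, and the function names `round_robin` and `wait_time` are made up). It groups URLs by domain and hands them out one domain at a time, and it works out how long to wait before hitting the same domain again.

```python
from collections import deque
from urllib.parse import urlparse


def round_robin(urls):
    """Yield URLs so that consecutive requests go to different domains."""
    queues = {}
    for url in urls:
        domain = urlparse(url).netloc
        queues.setdefault(domain, deque()).append(url)

    ring = deque(queues)  # domains in first-seen order
    while ring:
        domain = ring.popleft()
        yield queues[domain].popleft()
        if queues[domain]:
            ring.append(domain)  # domain still has work, back of the line


def wait_time(last_hit, domain, now, min_delay):
    """Seconds to wait before `domain` may be requested again (0.0 if due)."""
    return max(0.0, last_hit.get(domain, float("-inf")) + min_delay - now)
```

A crawl loop would take the next URL from `round_robin(...)`, sleep for `wait_time(...)` seconds, fetch the page, and then record the fetch time for that domain in `last_hit`.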
All information other than images should be stored in the MySQL database; images should be stored in a folder and referenced from the database. You can use open-source crawlers or tools, but it needs to be easy to run, easy to add new recipe sites to, and run on Linux. (Maybe even PHP is an idea? Up to you.)
Edit: It can run on Windows if needed, but Linux is preferred!
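As a rough picture of the storage layout I have in mind, here is a small Python sketch. It uses SQLite from the standard library purely as a stand-in for MySQL, and the table and column names are my own guesses for illustration, not the actual phprecipebook schema. Each ingredient is its own row linked to the recipe ID, and images are referenced by folder name only.

```python
import sqlite3

# Hypothetical layout: one row per recipe, one row per ingredient.
SCHEMA = """
CREATE TABLE recipes (
    id           INTEGER PRIMARY KEY,
    title        TEXT NOT NULL,
    description  TEXT,
    source_name  TEXT,
    source_url   TEXT UNIQUE,
    image_folder TEXT            -- images live on disk, not in the database
);
CREATE TABLE ingredients (
    id        INTEGER PRIMARY KEY,
    recipe_id INTEGER NOT NULL REFERENCES recipes(id),
    name      TEXT NOT NULL,
    quantity  REAL,
    unit      TEXT
);
"""


def save_recipe(conn, recipe):
    """Insert one recipe plus its ingredients and return the new recipe id."""
    cur = conn.execute(
        "INSERT INTO recipes (title, description, source_name, source_url, image_folder) "
        "VALUES (?, ?, ?, ?, ?)",
        (
            recipe["title"],
            recipe.get("description"),
            recipe.get("source_name"),
            recipe.get("source_url"),
            recipe.get("image_folder"),
        ),
    )
    recipe_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO ingredients (recipe_id, name, quantity, unit) VALUES (?, ?, ?, ?)",
        [
            (recipe_id, item["name"], item.get("quantity"), item.get("unit"))
            for item in recipe.get("ingredients", [])
        ],
    )
    conn.commit()
    return recipe_id
```

The `UNIQUE` constraint on `source_url` is one simple way to stop the same recipe page being stored twice when several machines crawl at once.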
I have updated the list of sites I would like to crawl. We may as well keep it simple to start with and aim to keep costs low, as this is a small home project with a small budget. This is the list of sites I would like to crawl, getting all recipe information from the entire domain.
The information to collect is all the recipe information, including title, ingredients, description / summary, serving sizes, notes, categories, recipe types (dinner, supper etc.), recipe page URL, recipe source, any other recipe information, and any nutritional information. Basically, any part of the site that is used for the recipe. Contact me for more info.
The downloaded information (the entire page, images etc.) should be stored in the database first. That information should then be processed and stored in the database with your own fields, and finally taken from there and inserted into the phprecipebook database I mentioned earlier.
So the database the HTML page is stored in would be processed and put into your own database table, which would include ingredients, descriptions, instructions, titles, sources, images, categories etc. Then the information from that table should be inserted into phprecipebook.
What will happen is that in the future I will contact you to add new sites to the crawler. I want to be able to run the crawler on my computer, or on multiple computers if possible, but I also want to keep costs down, so do whatever you can.
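To show what I mean by semi-template-based, here is a rough Python sketch, assuming each site marks its recipe fields with recognisable CSS classes. The domain `recipes.example`, the class names, and the `TEMPLATES` / `extract` names are all made up for illustration. Adding a new site, or coping with a layout change, would then only mean editing one template entry.

```python
from html.parser import HTMLParser

# Hypothetical per-site templates: field name -> CSS class that holds it.
TEMPLATES = {
    "recipes.example": {
        "title": "recipe-title",
        "ingredients": "ingredient",
        "description": "summary",
    },
}

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ClassTextExtractor(HTMLParser):
    """Collect the text of every element carrying one of the wanted classes."""

    def __init__(self, class_to_field):
        super().__init__()
        self.class_to_field = class_to_field
        self.fields = {field: [] for field in class_to_field.values()}
        self.current = None  # field currently being captured, if any
        self.depth = 0       # nesting depth inside the captured element
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:  # no closing tag, so never counted in depth
            if self.current is not None:
                self.buffer.append(" ")
            return
        if self.current is not None:
            self.depth += 1
            return
        for cls in (dict(attrs).get("class") or "").split():
            if cls in self.class_to_field:
                self.current = self.class_to_field[cls]
                self.depth = 1
                self.buffer = []
                break

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.current is None:
            return
        self.depth -= 1
        if self.depth == 0:
            text = " ".join("".join(self.buffer).split())
            self.fields[self.current].append(text)
            self.current = None

    def handle_data(self, data):
        if self.current is not None:
            self.buffer.append(data)


def extract(html, domain):
    """Apply the template for `domain` and return {field: [texts]}."""
    template = TEMPLATES[domain]
    parser = ClassTextExtractor({cls: field for field, cls in template.items()})
    parser.feed(html)
    parser.close()
    return parser.fields
```

A real crawler would probably use a proper HTML library with CSS or XPath selectors instead; the point is only that the site-specific part stays in a small piece of data rather than in code.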
IT MUST BE ABLE TO WORK THROUGH PROXY SITES SO MY HOME IP ADDRESS DOES NOT GET BLOCKED, BUT IT MUST NOT RECORD ANY HTML FROM THOSE PROXY SITES, AND ANY LINKS / IMAGES ETC. STORED MUST REFERENCE THE FOOD RECIPE SITE AND NOT INCLUDE THE PROXY INFORMATION / URLS.
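As a rough illustration of keeping proxy details out of the stored data, here is a Python sketch. It assumes a web proxy that carries the real address in a `url` query parameter. That format, the host `proxy.example`, and the function names are assumptions for the example only, and real proxy sites may encode the target differently.

```python
from urllib.parse import parse_qs, urljoin, urlparse

# Hypothetical list of proxy hosts the crawler is configured to use.
PROXY_HOSTS = {"proxy.example"}


def original_url(fetched_url, param="url"):
    """Return the recipe-site URL hidden inside a proxied URL.

    URLs that do not point at a known proxy host are returned unchanged.
    """
    parts = urlparse(fetched_url)
    if parts.netloc not in PROXY_HOSTS:
        return fetched_url
    targets = parse_qs(parts.query).get(param)
    return targets[0] if targets else fetched_url


def resolve_link(fetched_url, link):
    """Resolve a link or image src found on a page against the real site."""
    return urljoin(original_url(fetched_url), original_url(link))
```

Every source URL, link and image URL would pass through these functions before being written to the database, so only recipe-site addresses are ever stored.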