I need a crawler that runs on Linux, is easy to install on multiple computers if needed, and crawls through a list of recipe sites I provide. It should have a number of features:
1. Download the page, including any images (recipe pictures etc.), and store them in a folder whose name is specified in the recipe database.
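To make the storage layout concrete, here is a minimal sketch of how each downloaded image could be mapped to a path inside the recipe's folder. The storage root and the rule of keeping the image's original filename are my assumptions, not requirements from the brief.

```python
import os
from urllib.parse import urlparse

def image_save_path(storage_root, recipe_folder, image_url):
    """Build the local path for an image: the recipe's folder (taken from
    the database) under the storage root, keeping the original filename."""
    filename = os.path.basename(urlparse(image_url).path) or "image"
    return os.path.join(storage_root, recipe_folder, filename)
```

The returned path is what would be stored in the database so the image can be referenced later.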
2. Process the downloaded page: put the ingredients in one database field, the description in another, and other information in a third.
It should work like this: recipe id 1, with each ingredient linked to recipe id 1, along with amount, quantity, etc. Have a look at phprecipebook; I want to mirror that structure for processing the data and storing it in a MySQL database, but with a few extra fields for source name, source URL, image URL, and that sort of information.
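As a rough sketch of the two-table structure described above (a recipes table plus an ingredients table linked by recipe id, extended with source/image fields), here is a minimal schema. The table and column names are my guesses at a phprecipebook-like layout, not its actual schema; it is shown with sqlite3 purely for illustration, while production would use MySQL (e.g. `AUTO_INCREMENT` instead of sqlite's `INTEGER PRIMARY KEY`).

```python
import sqlite3

# Hypothetical phprecipebook-style schema, extended with source/image fields.
DDL = """
CREATE TABLE recipes (
    recipe_id   INTEGER PRIMARY KEY,   -- AUTO_INCREMENT in MySQL
    title       TEXT NOT NULL,
    description TEXT,
    category    TEXT,                  -- starter, dessert, gluten free, ...
    source_name TEXT,                  -- name of the site crawled
    source_url  TEXT,                  -- original page URL, never the proxy URL
    image_url   TEXT,
    image_path  TEXT                   -- local folder where the image was saved
);
CREATE TABLE ingredients (
    ingredient_id INTEGER PRIMARY KEY,
    recipe_id     INTEGER NOT NULL REFERENCES recipes(recipe_id),
    amount        REAL,               -- numeric quantity, e.g. 2.5
    unit          TEXT,               -- cup, g, tbsp, ...
    name          TEXT NOT NULL       -- the ingredient itself
);
"""

def create_schema(conn):
    """Create the recipe tables on an open connection."""
    conn.executescript(DDL)
    return conn

if __name__ == "__main__":
    conn = create_schema(sqlite3.connect(":memory:"))
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
    print(tables)  # → ['ingredients', 'recipes']
```

Each ingredient row points back at its recipe via `recipe_id`, which is the one-by-one ingredient storage described in the brief.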
3. It should be able to store quantities as well when they appear inside a text box, as some sites do.
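Splitting a free-text ingredient line into amount, unit, and name could look like the sketch below. The unit list and the regular expression are illustrative only; real sites will need per-site tweaks.

```python
import re
from fractions import Fraction

# Illustrative, not exhaustive, set of units.
UNITS = {"cup", "cups", "tbsp", "tsp", "g", "kg", "ml", "l", "oz", "lb"}

# Matches an optional amount ("2", "2.5", "1/2", "2 1/2"), an optional
# next word (possibly a unit), and the rest of the line.
LINE_RE = re.compile(r"^\s*(\d+(?:\s+\d+/\d+|/\d+|\.\d+)?)?\s*(\S+)?\s*(.*)$")

def parse_ingredient(line):
    """Split an ingredient line into (amount, unit, name)."""
    m = LINE_RE.match(line)
    amount_txt, word, rest = m.group(1), m.group(2) or "", m.group(3)
    amount = None
    if amount_txt:
        # Fraction handles "2", "1/2" and "2.5"; summing handles "2 1/2".
        amount = float(sum(Fraction(part) for part in amount_txt.split()))
    if word.lower() in UNITS:
        unit, name = word.lower(), rest.strip()
    else:
        unit, name = None, (word + " " + rest).strip()
    return amount, unit, name
```

For example, `parse_ingredient("2 1/2 cups flour")` yields the amount, unit, and ingredient name as separate values ready for the ingredients table.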
4. It should only record recipes. I want to build a database of millions of recipes, so this would essentially be a giant Google-style crawler, but for recipes only.
5. It should be speed-limited, but should also work in round-robin fashion. Instead of overloading one site by crawling it quickly, I should be able to supply a list of base domains and, under each domain, its URLs; the crawler should fetch one URL from the first domain, then move on to the next domain, then the third, and so on, so it gathers lots of information very quickly but spreads the load across different domains, if that makes sense.
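The round-robin behaviour described above can be sketched as one queue per base domain, visited in rotation with a minimum per-domain delay. This is only a scheduling skeleton under those assumptions; the actual downloading is left out.

```python
import time
from collections import deque
from urllib.parse import urlparse

class RoundRobinScheduler:
    """One URL queue per domain, served in rotation with a per-domain delay."""

    def __init__(self, urls, per_domain_delay=5.0):
        self.delay = per_domain_delay
        self.queues = {}       # domain -> deque of URLs still to crawl
        self.last_fetch = {}   # domain -> timestamp of last fetch
        for url in urls:
            domain = urlparse(url).netloc
            self.queues.setdefault(domain, deque()).append(url)
        self.rotation = deque(self.queues)  # domains in round-robin order

    def next_url(self):
        """Return the next URL to fetch, rotating across domains.

        Returns None when no domain currently has a URL ready (either
        everything is crawled, or every domain was hit too recently)."""
        for _ in range(len(self.rotation)):
            domain = self.rotation[0]
            self.rotation.rotate(-1)        # move this domain to the back
            if not self.queues[domain]:
                continue                    # domain exhausted
            wait = self.last_fetch.get(domain, 0) + self.delay - time.time()
            if wait > 0:
                continue                    # hit too recently; try another domain
            self.last_fetch[domain] = time.time()
            return self.queues[domain].popleft()
        return None
```

With two domains queued, successive calls alternate between them, which is exactly the "one URL from the first domain, then the next domain" pattern requested.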
It should be semi-template-based, so it is easy to add new recipe sites, and easy to modify what information is recorded if a site's layout changes.
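One way to read the "semi template" requirement is a small per-site config mapping database fields to CSS selectors, chosen by domain. The site name and selectors below are invented examples, not the markup of any real site; applying the selectors to downloaded HTML would be done with a parser such as lxml or BeautifulSoup (not shown).

```python
from urllib.parse import urlparse

# One template per site: database field -> CSS selector (all hypothetical).
# Adding a site means adding an entry; a layout change means editing one line.
SITE_TEMPLATES = {
    "www.example-recipes.com": {
        "title":       "h1.recipe-title",
        "description": "div.summary",
        "ingredients": "ul.ingredients li",
        "image":       "img.recipe-photo",
    },
    # "www.another-site.com": {...},
}

def template_for(url):
    """Pick the extraction template for a URL's domain (None = unknown site)."""
    return SITE_TEMPLATES.get(urlparse(url).netloc)
```

Unknown domains return None, so the crawler can skip sites it has no template for instead of storing junk.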
6. It should be able to crawl recipe sites directly, or work through numerous proxy sites if my IP gets blocked. When it crawls through a proxy, it should still record the source URL of the page being downloaded, without the proxy URL: so if it goes through [url removed, login to view], it should record the source as [url removed, login to view].
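Recovering the real source URL depends on how the proxy wraps it. Many simple web proxies carry the target in a query parameter; the parameter names below are common guesses (some proxies embed the target in the path instead, which would need per-proxy handling). A hedged sketch:

```python
from urllib.parse import urlparse, parse_qs, unquote

# Query parameters some web proxies use for the wrapped target URL (guesses).
PROXY_PARAMS = ("q", "u", "url", "target")

def real_source_url(fetched_url):
    """If fetched_url looks like a proxy-wrapped URL, return the wrapped
    target so it can be stored as source_url; otherwise return it unchanged."""
    qs = parse_qs(urlparse(fetched_url).query)
    for param in PROXY_PARAMS:
        for value in qs.get(param, []):
            candidate = unquote(value)
            if candidate.startswith(("http://", "https://")):
                return candidate
    return fetched_url
```

So a fetch routed through a proxy records the original page's URL in the database, while a direct fetch records its own URL, per the requirement.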
That's what I mean. I will provide a big list of recipe sites I want the system to crawl, and I want it to extract all information, including ingredients (stored one by one in the database), description, images, categories, related recipes, and any other descriptions of the recipes, such as starter, dessert, gluten free, etc.
All information other than images should be stored in the MySQL database; images should be stored in a folder and referenced within the database. You can use open-source crawlers or tools, but it needs to be easy to run, easy to add new recipe sites to, and it must run on Linux. (Maybe even PHP is an idea? Up to you.)