This project will develop a set of crawlers based on the Scrapy framework that can download and synchronize all product firmware (including all versions) from the websites of a predefined list of vendors and store the firmware information (metadata) in a PostgreSQL database. The final number of crawlers will be ~100. Project milestones are defined per vendor; each milestone pays a maximum of €65, payable after we verify the completeness of the crawler and observe no errors.
The mandatory metadata fields are Manufacturer, Model, Version, Type, Name, Release Date (if available), Download Link, and the calculated SHA-2 hash of the file, e.g. (Cisco, Video Surveillance 6030 IP Camera, 2.7.0, IP Camera, [login to view URL], 21/08/2015, "link", "SHA-2"). There is also a boolean field indicating whether the device is discontinued, depending on the availability of that information on the vendor's website. The firmware files themselves are stored in the file system and referenced from PostgreSQL.
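For illustration only, the metadata record could be modeled as below. This is a stdlib sketch: the field names are assumptions, and the real DB schema and code templates are provided by the client.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

@dataclass
class FirmwareRecord:
    """Illustrative shape of one firmware metadata row.

    Field names are assumptions for this sketch; the authoritative
    schema is supplied with the project templates.
    """
    manufacturer: str
    model: str
    version: str
    device_type: str
    name: str
    download_link: str
    release_date: Optional[str] = None  # not every vendor publishes this
    sha2: Optional[str] = None          # filled in after the file is downloaded
    discontinued: bool = False          # set from vendor site info, if available

def sha2_of(data: bytes) -> str:
    """SHA-256 hex digest of a downloaded firmware blob."""
    return hashlib.sha256(data).hexdigest()
```

The hash is computed over the downloaded file bytes and stored alongside the row, so later runs can detect unchanged files.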
The developer is required to extend an existing scraping framework, partially developed on top of Scrapy, and to follow the DB schema and code templates provided by us. It is also the developer's responsibility to test each crawler and ensure the completeness of the solution in terms of full coverage of the firmware files and product pages. There are no GUI components on the server that runs the crawlers, so headless browsing mode must be used.
1. Crawlers will be written per vendor. This is required because each vendor website will have its own implementation of the firmware download page.
2. The user should be able to pause and resume crawling jobs.
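Scrapy supports pause/resume natively through persistent job state. A typical invocation (the spider name and directory path are placeholders) looks like:

```shell
# Start the crawl with a job directory to persist scheduler state.
scrapy crawl vendor_spider -s JOBDIR=crawls/vendor_spider-1

# Pause by sending a single Ctrl-C (or SIGTERM) and waiting for a
# graceful shutdown; a second Ctrl-C forces an unclean stop.

# Resume later with the SAME command and the SAME JOBDIR; Scrapy
# picks up the pending requests from disk.
scrapy crawl vendor_spider -s JOBDIR=crawls/vendor_spider-1
```

Each job needs its own `JOBDIR`; reusing a directory across logically different runs corrupts the persisted queue.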
3. Crawlers should detect previously downloaded files and only download new and updated content and firmware files. On its first execution, each crawler will download all available firmware files; subsequent runs will only download firmware files added since the last run.
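The skip logic for repeat runs can be sketched as follows. This assumes deduplication keys are SHA-256 digests loaded from the PostgreSQL table at startup; the helper names are hypothetical, not part of the provided framework.

```python
import hashlib
from pathlib import Path

def is_new_firmware(blob: bytes, known_hashes: set) -> bool:
    """Return True if this firmware blob has not been stored before.

    In practice known_hashes would be loaded from the PostgreSQL
    metadata table before the crawl starts.
    """
    return hashlib.sha256(blob).hexdigest() not in known_hashes

def store_if_new(blob: bytes, dest: Path, known_hashes: set) -> bool:
    """Write the blob to the file system only when its hash is unseen.

    Returns True if the file was stored, False if it was a duplicate.
    """
    digest = hashlib.sha256(blob).hexdigest()
    if digest in known_hashes:
        return False
    dest.write_bytes(blob)
    known_hashes.add(digest)
    return True
```

A cheaper first-line check is to compare download URLs and version strings against the database before fetching at all, reserving the hash comparison for files that were actually downloaded.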
4. The developer is required to manually analyze each provided vendor site before writing a crawler to identify the following required information:
a. URLs for the firmware download page including all of the firmware versions for each product
b. URLs/files for each product that include the following information, required to be scraped: "Manufacturer", "Model", "Version", "Type", "Release Date", "if the product is discontinued"
c. Credential Requirements (Simple Signups, Specific Signups, No Signups)
d. Any Captcha on the page
e. Any honeypot traps
5. If a vendor site requires credentials for firmware download, the developer is required to sign up for an account using an email address dedicated to this project
6. The script will imitate human-like behaviour (within limits) while scraping the web pages, and will route traffic through Tor if required
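Human-like pacing maps naturally onto Scrapy's built-in throttling settings. The setting names below are real Scrapy options; the values are illustrative assumptions to be tuned per vendor.

```python
# Illustrative per-spider settings for polite, human-like crawling.
CUSTOM_SETTINGS = {
    "DOWNLOAD_DELAY": 3.0,                   # base pause between requests
    "RANDOMIZE_DOWNLOAD_DELAY": True,        # jitter the delay (0.5x-1.5x)
    "AUTOTHROTTLE_ENABLED": True,            # back off when the site slows down
    "AUTOTHROTTLE_TARGET_CONCURRENCY": 1.0,  # roughly one request in flight
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,     # never hammer a single vendor
    # Tor, if required, is usually reached through a local proxy in
    # front of the Tor SOCKS port, set per request, e.g.:
    #   request.meta["proxy"] = "http://127.0.0.1:8118"  # assumed local setup
}
```

Applying these via a spider's `custom_settings` attribute keeps the pacing per-vendor, since each vendor site tolerates a different request rate.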
The developer MUST test the completeness of each crawler before delivery and present evidence of test completion in the form of a PostgreSQL database populated for that vendor.
*An NDA and a contract must be signed before the beginning of the project. A copy of the developer's identification document is required to verify the identity.
*Please apply only if you have fully read and understood the project and agree to the conditions.