-- Scope --
Create a command-line Python program capable of scraping place information from the ‘Satellite + old places' map type on the Wikimapia Beta website – [url removed, login to view] – given a bounding box.
The bounding box is defined by two corner coordinates – latitude and longitude (decimal degrees) in the WGS84 coordinate system – in the following format: (minimum latitude, minimum longitude), (maximum latitude, maximum longitude).
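A minimal sketch of how the bounding-box argument could be parsed (the function name and exact tolerance for whitespace are assumptions, not part of the spec; shown in Python 3 style for brevity):

```python
import re

def parse_bbox(text):
    """Parse '(min_lat, min_lon), (max_lat, max_lon)' into four floats.

    Hypothetical helper; the exact argument syntax is up to the implementer.
    """
    numbers = [float(n) for n in re.findall(r'-?\d+(?:\.\d+)?', text)]
    if len(numbers) != 4:
        raise ValueError('expected four coordinates, got %d' % len(numbers))
    min_lat, min_lon, max_lat, max_lon = numbers
    if not (min_lat <= max_lat and min_lon <= max_lon):
        raise ValueError('minimum corner must not exceed maximum corner')
    return min_lat, min_lon, max_lat, max_lon
```

Validating the corner ordering up front keeps later subdivision and retrieval steps simple.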
-- Required Knowledge --
Python – good OO design and memory management skills; experience with Beautiful Soup (or equivalent) is recommended.
Some experience with Google Maps API might be useful.
-- Specifications --
-Target Operating Systems – Windows XP, Debian, Ubuntu
-Language – Python 2.5(+)
-Data Output Format – TSV, UTF-8
-Geometries Format – Well-Known Text (WKT) strings (see [url removed, login to view])
-Coordinate System – Latitude and Longitude decimal degrees on WGS84
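Since the geometries must be emitted as WKT strings on WGS84 latitude/longitude, small formatting helpers along these lines could be used (function names are illustrative; note WKT lists coordinates in X Y order, i.e. longitude before latitude):

```python
def wkt_point(lat, lon):
    # WKT uses X Y order, i.e. longitude first
    return 'POINT (%s %s)' % (lon, lat)

def wkt_polygon(ring):
    """Build a WKT POLYGON from a list of (lat, lon) vertices.

    WKT requires a closed ring, so the first vertex is repeated at the
    end if the input ring is open.
    """
    if ring[0] != ring[-1]:
        ring = ring + [ring[0]]
    coords = ', '.join('%s %s' % (lon, lat) for lat, lon in ring)
    return 'POLYGON ((%s))' % coords
```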
-- Deliverables --
(See also ‘Project Milestones' below.)
- Python script that fetches Wikimapia data for places in a given geographical area defined by a bounding-box;
- Comprehensive documentation – user manual, setup and commented code;
- Installer scripts for Windows XP, Debian and Ubuntu – listing any external dependencies and their setup procedures.
-- Requirements --
Small Memory/Disk Usage Foot-Print – the program has to use memory and disk space efficiently, with built-in house-keeping procedures so that it neither leaves temporary files behind nor consumes large amounts of memory unnecessarily.
No Wikimapia DoS – the program has to insert random time intervals between requests to the Wikimapia website and/or take other measures to avoid over-stressing Wikimapia's resources.
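The randomised-delay requirement could be met with a helper like the following (the bounds are illustrative defaults, not values from the spec, and should be tuned to whatever the site tolerates):

```python
import random
import time

def polite_pause(min_s=1.0, max_s=4.0, sleep=time.sleep):
    """Sleep a random interval between requests to avoid hammering the site.

    The default bounds are assumptions; the injectable `sleep` makes the
    helper easy to test without actually waiting.
    """
    delay = random.uniform(min_s, max_s)
    sleep(delay)
    return delay
```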
Completeness - the program has to account for the complete set of places existing in the given bounding-box. The places retrieval mechanism has to be aware that map levels differ in content – not all places, if any, appear at every map level – and has to record information about every place present in the bounding-box once (and only once).
Tasks Script File - the program has to be able to sub-divide a task into smaller tasks – e.g.: by sub-dividing the original bounding-box into smaller bounding-boxes – generating a tasks script.
In order to distribute a task across several machines, the program has to be able to interpret this tasks script – or a subset of it – and process the sequence of tasks it describes. The tasks script can be passed to the command-line program as an argument – the path to a text file – and, when present, replaces the bounding-box argument.
The aggregation of results from the processing of several subsets of a tasks script by distinct program copies has to be equal to the processing of the complete tasks script by a sole copy of the program.
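One straightforward way to generate the tasks script is a regular grid subdivision of the original bounding-box; a sketch (the grid shape and function name are assumptions, and each returned tuple could become one line of the script):

```python
def split_bbox(min_lat, min_lon, max_lat, max_lon, rows=2, cols=2):
    """Subdivide a bounding box into a rows x cols grid of smaller boxes.

    Returns (min_lat, min_lon, max_lat, max_lon) tuples; writing one
    tuple per line would yield a simple tasks script.
    """
    d_lat = (max_lat - min_lat) / float(rows)
    d_lon = (max_lon - min_lon) / float(cols)
    boxes = []
    for r in range(rows):
        for c in range(cols):
            boxes.append((min_lat + r * d_lat,
                          min_lon + c * d_lon,
                          min_lat + (r + 1) * d_lat,
                          min_lon + (c + 1) * d_lon))
    return boxes
```

Because the sub-boxes tile the original box exactly, processing all of them on any number of machines covers the same area as one machine processing the whole box, which is what the aggregation requirement demands (subject to de-duplicating places that straddle sub-box edges).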
Log File – the program has to have the ability to record (with time-stamps) its steps, warnings and errors in order to guarantee the possibility to restart a task from a specific point.
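A timestamped log file of the kind described can be produced with the standard `logging` module; a sketch (logger name, format, and tab-separated layout are all assumptions):

```python
import logging

def make_logger(path):
    """Configure a timestamped log file so an interrupted task can be resumed.

    One record per line: timestamp, level, message.
    """
    logger = logging.getLogger('wikimapia_scraper')
    handler = logging.FileHandler(path, encoding='utf-8')
    handler.setFormatter(logging.Formatter(
        '%(asctime)s\t%(levelname)s\t%(message)s'))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

Logging each completed sub-task (e.g. each bounding-box processed) gives the restart mechanism a well-defined resume point.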
Data to Scrape - the place information to extract from Wikimapia is as follows:
-Label – map place tooltip (equivalent to Google Maps API GMarker Title);
-Outline or Envelope – polygon that defines the boundaries of the place (Note: ‘old places' have envelopes while other places have outlines, but both are polygons);
-Centroid – coordinates shown in the top right corner of the info window, converted to decimal degrees;
-Categories – text after “Category: “ on info window;
-Description – description in info window;
-Permalink – permalink URL in info window;
-Languages – language acronym in bottom right corner of info window;
-Last Edit Date – converted to year/month/day format from the text after “Edited: “ in the bottom left corner of the info window.
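The last-edit-date conversion might look like this, assuming the info window renders dates as “day MonthName year” (that layout is an assumption; the pattern must be adapted to whatever Wikimapia actually displays):

```python
import re

MONTHS = {'jan': 1, 'feb': 2, 'mar': 3, 'apr': 4, 'may': 5, 'jun': 6,
          'jul': 7, 'aug': 8, 'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12}

def edit_date_to_yyyymmdd(text):
    """Convert an 'Edited: ...' string to a yyyymmdd integer.

    Assumes a 'day MonthName year' layout in the info window.
    """
    match = re.search(r'(\d{1,2})\s+([A-Za-z]{3})\w*\s+(\d{4})', text)
    if not match:
        raise ValueError('unrecognised date: %r' % text)
    day, mon, year = match.groups()
    return int(year) * 10000 + MONTHS[mon.lower()] * 100 + int(day)
```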
Output Format - the collected data is to be exported to a UTF-8 tab separated values file with 8 fields:
-“label” – text;
-“envelope” – WKT polygon string;
-“centroid” – WKT point string;
-“categories” – text; if multiple categories exist, separate them with semi-colons;
-“description” – text;
-“permalink” – text;
-“languages” – text; if multiple languages exist, separate them with semi-colons;
-“last_edit_date” – number, format ‘yyyymmdd'.
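A sketch of the TSV export (the field order follows the list above; sanitising embedded tabs and newlines to spaces, and omitting a header row, are assumptions the implementer should confirm):

```python
import codecs

FIELDS = ('label', 'envelope', 'centroid', 'categories',
          'description', 'permalink', 'languages', 'last_edit_date')

def write_tsv(path, places):
    """Write place dicts as UTF-8 tab-separated values, one line per place.

    Tabs/newlines inside free-text fields are replaced by spaces so the
    record structure survives.
    """
    with codecs.open(path, 'w', encoding='utf-8') as out:
        for place in places:
            row = []
            for field in FIELDS:
                value = u'%s' % place.get(field, u'')
                row.append(value.replace(u'\t', u' ').replace(u'\n', u' '))
            out.write(u'\t'.join(row) + u'\n')
```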
-- Project Milestones --
If the developer agrees, partial payment will be processed on delivery and acceptance of the following working scripts:
-[40%] Create a program that, given a bounding-box defined by a pair of coordinates:
1. Retrieves the above mentioned ‘data to scrape' for the places present in the highest level that encompasses the bounding-box, and;
2. Produces a UTF-8 tab separated values file with the above mentioned ‘output format' and fills it with the scraped data.
-[20%] Create an evolution of the previous program that:
1. Retrieves the above mentioned ‘data to scrape' for all the places present in every level that encompasses the bounding-box;
2. Registers, if requested, the steps, warnings and errors of the previous task in a ‘log file' – one record per line with time-stamp, and;
3. Produces a UTF-8 tab separated values file with the above mentioned ‘output format' with one (and only one) record (line) of scraped data per place.
-[40%] Create the final version of the program, which is able to:
1. Generate a ‘tasks script file' given a bounding-box, describing the data scraping process in modular (atomic) steps in such a way that subsets (lines) of that ‘tasks script file' may be processed by independent machines (using the same final version of the program);
2. Retrieve the above mentioned ‘data to scrape' for all the places present in every level that encompasses a bounding-box or the corresponding ‘tasks script file' (or subset of it);
3. Register, if requested, the steps, warnings and errors of the previous task in a ‘log file' – one record per line with time-stamp;
4. Collate scraped data resulting from the processing of different but related subsets of a ‘tasks script file', and;
5. Produce a UTF-8 tab separated values file with the above mentioned ‘output format', with each place that exists in the bounding-box – or equivalent ‘tasks script file' – recorded once (and only once).
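The collation step could de-duplicate on the permalink column, since it is the most plausible stable per-place identifier in the output (that choice of key, and the column index, are assumptions):

```python
def collate(result_files):
    """Merge TSV fragments from distributed runs, keeping each place once.

    Uses the permalink column (index 5 in the 8-field layout) as the
    unique key; any stable per-place identifier would do.
    """
    seen = set()
    merged = []
    for path in result_files:
        for line in open(path, encoding='utf-8'):
            line = line.rstrip('\n')
            if not line:
                continue
            key = line.split('\t')[5]
            if key not in seen:
                seen.add(key)
                merged.append(line)
    return merged
```

Keyed de-duplication makes the merged output independent of how the tasks script was split across machines, satisfying the aggregation-equivalence requirement.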