script language: PHP
NO FRAME WORKS
id (auto insert)
createddate (auto insert)
modifieddate (auto update)
Table A (entities)
state (varchar) 2 letter us state code
type varchar(city, county, school district, university, college)
table B (url)
datasource (url or query where the data came from)
maxlayers (defaults 2)
statusname (values would include "found match", "no match")
table C (curl data)
statusname (values would include "undefined" (default), "good", "bad", "review")
The 1st script will:
1 parse the LEA_NAME column for unique values for "school district" names from here - [login to view URL], get the state, school district name, & url. 25,000 results
2 parse the "county names" from here - [login to view URL], get state name, convert it to 2 letter code, 3,098 records
3 parse the "city names" from here - [login to view URL] grab city name, usps (state). 29,000 records.
4 parse the "US college/university" from here - [login to view URL] - & grab college/university name, & url. 2,073 records.
5 populate table A with the name, type, state code (2 digit) while skipping duplicate. convert the state name to 2 digit.
The 1st script will be a one time script, run from linux cli.
The 2nd script will:
1 Loop through table A, & attempt to find the url that matches with a google search, if one was not present from the datasource. The logic must skip certain false positives such as a domain with the word "weather" or "census" or "zillow" or "google" in it or url with ".jpg" or ".asp"
2 populate the record in table B, with :
datasource = the url of the data source above
url = url (skip duplicate)
statusname = null
googleposition = 1-20 (first page of google results only)
The 2nd script will return 35,000 - 200,000 results.
The 2nd script will run periodically from linux cli, on a crontab, & will be rerun, in the future, when additional excemptions are added.
The 2nd script should be multi threaded, & should cap out above a 100mb/second connection
The 3rd php script will:
1 Loop through table B, use curl to retrieve the web page
2 Loop through each of the child pages, for the value in the maxlayers column
3 Look for a particular pattern of text, including a case insensitive search for "bids" "request for proposal" "rfp" "rfq" "request for bids" "proposals"
4 Compare the curl returned html against the keywords
if there is match - insert a record into table C with the url (skip non unique url), the retrieved html, what keyword caused a match, & update table B statusname to "found match"
if there is no match, updated table B maxlayers count upward 1, & updated the statusname to "no match"
Each record from table B may have multiple records in table C
The 3rd script will run be run periodically from the linux cli
The 3rd script should be multi threaded, & should cap out above a 100mb/second connection
The 4th script will be ANDROID PHONE FRIENDLY:
1 Define an sql query which should return the top 10 selection from table C, sorted by modifieddate ASC, WHERE type != "bad" & != "good"
2 Provide a simple html table view front end to review each of the url, which should have columns for all values from table B.
3 An additional column will show a update status button, which when pressed shows the values (as buttons) from table C, which when pressed, update record in table C
The 4th script view is intended for a quality check employee to review all results from, & log if url matches our ultimate criteria or not.