Joining and mapping government data sets

Posted on November 9, 2016February 18, 2020 by Ben North

A while ago, the principal at the primary school my children attend remarked at how time-consuming it was for her to manually collate two different sources of information on schools, and cross-reference against a map of schools’ locations. Her task, as I understand it, was to find nearby schools who might want to pool their ‘resource hours’ with our school and thereby share a resource teacher. To allow schools to do this, the relevant state bodies publish PDFs listing the resource hours allocated to each school, as well as the number of ‘permanent resource posts’, which is also relevant to the pooling discussions. The department also have on their website a mechanism for finding a school by map.

Separately, I’d come across Carto, which is a site/service for presenting data-sets on maps. I thought it would be an interesting project to try to tie all these together, and this is the result, as hosted by the INTO:

Ideally, the Department of Education and Skills would provide this, since they have all the data already there in machine-friendly form. Perhaps they will do so in subsequent years.

Aside: School ‘Roll Number’

After a bit of experimenting, it looks to me very much like a school’s Roll Number uses the same check-letter algorithm as the Personal Public Service Number, which is interesting and, as far as I could find, undocumented.

Implementation of map

Some notes on the process I went through to create this are below. For a revised version, I did the join externally (in Python) to get more appropriate handling of missing values, but the major steps are the same.

Gather data

School location data

This data-set gives the core school details including latitude and longitude.

Source

Go to http://www.education.ie/en/find-a-school and select:

School Level: Primary
Geographical Area: All

Before you click ‘Search’, open Chrome Developer Tools (or equivalent in other browser), and select the ‘Network’ tab. Then, back on the main browser window, click ‘Search’. The ‘network’ tab will fill with requests. Look down the Requests for ‘GetSchoolsMap’, which is a POST request. Right-click on it, and choose ‘Copy Response’. You now have a 1.6MB lump of JSON. Paste it into a new text file schools-list.json.

The department does offer a spreadsheet of this information, but it lacks the latitude/longitude information. It would be useful if a person in the department could add the lat/long information to the spreadsheet.

Convert to CSV

The schools_data.py Python program (instructions contained in it) will turn this into a CSV. You should now have schools.csv.

Base posts

We need the ‘permanent resource posts’ information, which the department does make available, but as a PDF. We need to extract the data from the PDF into a CSV.

It would be useful if a person in the department could make the underlying spreadsheet available rather than just a PDF.

Source

Start at http://www.education.ie/ and go to:

Schools & Colleges (red tab)
Allocation of Teachers to Schools (first bullet under ‘Services’)
Primary (big blue button)
Appendix C (under ‘Circular 0007/2016 Information and relevant forms’)

Just now, the final PDF was located at

http://www.education.ie/en/Circulars-and-Forms/Active-Circulars/cl0007_2016_appendix_C.pdf

but this will vary as it’s revised, I expect.

The column we need is ‘Base Schools for permanent Resource posts’.

Convert to CSV

To pull the actual data out of the PDF, I used the Free Software application Tabula. Download it and install. Then open a web browser and go to http://127.0.0.1:8080/ to actually use Tabula. ‘Import’ the ‘Appendix C’ PDF just downloaded, and ‘Extract Data’ from it. Drag out a rectangle over the main body of the table on the first page (excluding the headers), adjust by dragging its edges if required, and click ‘Repeat this selection’. Then click ‘Preview & Export Extracted Data’. Choose the ‘Stream’ extraction method and check everything looks OK (e.g., no groups of numbers all in one cell). Then export as CSV to base-posts.csv.

The Python program base_posts.py adds headers. It contains instructions. Finally we get base-posts-headed.csv.

(An alternative would be to open base-posts.csv in a spreadsheet program and add the headers yourself.)

Resource hours

The National Council for Special Education website contains this data, but as a PDF.

It would be useful if a person in the Council could make the underlying spreadsheet available rather than just a PDF.

Source

Go to http://www.ncse.ie/ and

Hover over ‘For Schools’
Click on ‘Resource Teaching and SNA Allocations’
Click on ‘Resource Teaching Hours Allocations 2016/2017 Primary’ (under ‘2016/2017 School Allocations – September 2016’)

Just now, the final PDF was located at

http://ncse.ie/wp-content/uploads/2016/05/NCSE-Allocation-of-RT-to-Primary-Schools-September-2016.pdf

but I imagine this will vary as it’s revised.

Convert to CSV

Use Tabula again to get resource-hours.csv and use the Python program resource_hours.py to turn this into resource-hours-headed.csv. Or, open the first CSV in a spreadsheet program and add the headers yourself.

Create Carto account

Go to https://carto.com/signup and create an account.

Upload data-sets to Carto

At the top of your Carto dashboard, choose Datasets, and then ‘New Dataset’. Upload the CSVs:

Schools dataset

The ‘schools’ data-set needs some care to avoid mangling the phone numbers. Make sure the ‘Let CARTO automatically guess data types and content on import’ tick-box is NOT ticked, then browse to schools.csv, and click ‘connect dataset’. Everything from the CSV should appear as ‘string’, with the phone numbers keeping their leading ‘0’. Unfortunately, this means Carto does not infer the lat/long information, so we must do that manually.

Add lat/long information into column ‘`the_geom`‘

Go to the ‘SQL’ tab in the toolbar on the right. Run the following query:

    UPDATE schools
    SET the_geom = cdb_latlng(CAST(lat AS NUMERIC), CAST(long AS NUMERIC));

and you should see ‘the_geom’ now contains (lat, long) information. Clicking on ‘map view’ should now show the schools’ locations.

(St Columba’s National School, Galway, has missing lat/long information so shows up off the coast of Africa.)

Base resource posts dataset

Create a new Carto dataset from the base-posts-headed.csv file. This time, do tick the ‘automatically guess types’ box.

Resource hours dataset

Again, create a dataset from resource-hours-headed.csv with ‘automatically guess types’.

Create one dataset with all information in it

Go to ‘your datasets’, and choose the ‘schools’ dataset.

Merge in ‘permanent resource posts’ column

Under ‘edit’, choose ‘merge with dataset’, then ‘column join’, then choose the base-posts dataset from the right-hand drop-down. Choose the ‘rollnumber’ entry in both datasets (schools and base-posts), then ‘next step’.

Select only the ‘permanent_resource_posts’ entry from the right-hand set, then ‘merge datasets’. This will create a dataset with ‘_merge’ appended to the name, and with the permanent_resource_posts column added.

Merge in ‘resource hours’ column

Start with the ‘_merge’ dataset. Go through the same process to add the ‘hours_sep_2016’ column from the ‘resource hours’ dataset.

Rename final dataset

Now you have a ‘schools_merge_merge’ dataset. Rename it to something more sane, say ‘schools_with_resource_info’, by clicking on the name.

Create a map from final dataset

Click on the ‘+’ tab (the top one in the toolbar on the right). It says ‘A map is required to add layers’; click ‘ok, create map’.

Add the ‘schools_with_resource_info’ dataset by selecting it from the list, and clicking ‘add layer’.

At this point you should have a map of Ireland covered in orange dots.

Customize info pop-up

Click on the speech-bubble tool in the toolbar on the right (‘infowindow’). Click on the ” at the right; this lets you provide custom HTML. Replace the contents there with

<div class="cartodb-popup v2">
  <a href="#close" class="cartodb-popup-close-button close">x</a>
  <div class="cartodb-popup-content-wrapper">
    <div class="cartodb-popup-content">
      <h3 style="color:#22a">{{name}}</h3>
      <p>{{rollnumber}} &mdash; {{address1}}</p>
      <p>{{hours_sep_2016}} hrs{{#permanent_resource_posts}};
      {{permanent_resource_posts}}
      full post/s{{/permanent_resource_posts}}</p>
      <p>{{principal}}</p><p>{{phone}}</p>
    </div>
  </div>
  <div class="cartodb-popup-tip-container"></div>
</div>

or adjust as you like.

Publish

When ready, click the ‘publish’ button at top-right.

Joining and mapping government data sets

Aside: School ‘Roll Number’

Implementation of map

Gather data

School location data

Source

Convert to CSV

Base posts

Source

Convert to CSV

Resource hours

Source

Convert to CSV

Create Carto account

Upload data-sets to Carto

Schools dataset

Add lat/long information into column ‘`the_geom`‘

Base resource posts dataset

Resource hours dataset

Create one dataset with all information in it

Merge in ‘permanent resource posts’ column

Merge in ‘resource hours’ column

Rename final dataset

Create a map from final dataset

Customize info pop-up

Publish

Recent Posts

Archives

Aside: School ‘Roll Number’

Implementation of map

Gather data

School location data

Source

Convert to CSV

Base posts

Source

Convert to CSV

Resource hours

Source

Convert to CSV

Create Carto account

Upload data-sets to Carto

Schools dataset

Add lat/long information into column ‘the_geom‘

Base resource posts dataset

Resource hours dataset

Create one dataset with all information in it

Merge in ‘permanent resource posts’ column

Merge in ‘resource hours’ column

Rename final dataset

Create a map from final dataset

Customize info pop-up

Publish

Recent Posts

Archives

Add lat/long information into column ‘`the_geom`‘