THE INTRONAUT [Easy: Success]—Great. There is a dataset that I find interesting in The Puso Project from the Data Engineering Pilipinas Facebook Group (DEP). It came from the Department of Education, and they called it the master list of schools.
It’s a PDF.
Yes, a PDF. A goddamn spreadsheet saved as a 544-page PDF. Why did they do this? Is it that hard to save as the native format they’re already working on? It was already in Microsoft Excel! To make it look more official?? To give *me* a hard time parsing it???
Bah! There’s no point in finding intent in this. The author is dead. Or uncontactable. Now, back to work.
The dataset. Back to the dataset.
It contains a list of all of the basic education schools, from elementary to senior high school, all resolved at the address level. Municipality, barangay, elementary or high school or both, urban or rural, or halfway. Lots of information here.
THE NIHILIST [Easy: Success]—But what do I want to do with this? Even after parsing this dataset, I doubt there are things I can do with the data. Count of SHS? How many elementary schools are in urban settlements? Number of JHS in Region V? These are useless. The school dimension is singular. The dataset exists for itself alone. I need a metric of some sort. I need a fact.
THE GENERATOR [Easy: Success]—What if I benchmark it against something? Against other countries? How do other countries build schools? How many schools are they building? Do they count one to five for each district?
THE OBSERVER [Medium: Success]—Well, there are administrative regions in the dataset. I can integrate population data with it.
THE ARCHIVIST [Easy: Success]—Population data is handled by the newly created Department of Economy, Planning, and Development (DEPDev). Under DEPDev is the agency that handles the census: the Philippine Statistics Authority (PSA). They give out *statistics* to anyone who wants them! The PSA maintains the Philippine Standard Geographic Code (PSGC), which provides the official identification of administrative regions based on court rulings. They also hand out datasets that combine census population with the PSGC. These datasets cover regions, provinces, highly urbanized cities (HUCs), independent component cities (ICCs), component cities (CCs), municipalities, and, last and smallest in the hierarchy, the barangays.
That means a marriage of two dimensions: administrative regions and basic education institutions. Population is their wedding ring, the artifact that connects them to each other.
By the way, they call the PSGC dataset the "publication data file".
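A minimal sketch of that marriage, assuming the master list has already been parsed into rows and the publication data file has been reduced to a population lookup. Every name and number below is an invented placeholder, not a value from either dataset, and `schools_per_10k` is a hypothetical helper. It only shows how a school count becomes a fact once population is attached to it.

```python
from collections import Counter

# Hypothetical rows standing in for the parsed master list of schools.
# Field names and values are invented for illustration only.
schools = [
    {"school": "School 1", "municipality": "Municipality A"},
    {"school": "School 2", "municipality": "Municipality A"},
    {"school": "School 3", "municipality": "Municipality B"},
]

# Hypothetical population figures standing in for the publication data file.
population = {
    "Municipality A": 40_000,
    "Municipality B": 10_000,
}


def schools_per_10k(schools, population):
    """Count schools per municipality and scale the count by population."""
    counts = Counter(row["municipality"] for row in schools)
    return {
        name: counts.get(name, 0) * 10_000 / residents
        for name, residents in population.items()
    }


rates = schools_per_10k(schools, population)
for name, rate in rates.items():
    print(f"{name}: {rate:.2f} schools per 10,000 residents")
```

The join key here is the municipality name, which is exactly the part that may not survive contact with the real data.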
THE TECH SENTINEL [Medium: Success]—Be warned. There are possible data issues ahead. I highly doubt that the master list of schools uses the standardized administrative region identifiers, or even the names, from the publication data file.
Just to get my hands a little dirty, I will test and check how easy it is to match values from the master list to the publication.
For manual testing, I will first select a school at random from the master list. I will obtain the location of the school: region, province, district, municipality, and barangay. Then I will check manually whether each one matches the publication one-to-one.
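The same spot check could later be scripted. Below is a rough sketch, assuming the mismatches are the usual suspects: letter case, accents, punctuation, and "City of X" versus "X City". The `normalize` rules are my own guesses and not anything either dataset documents, `check_location` is a hypothetical helper, and the sample names are invented placeholders.

```python
import re
import unicodedata


def normalize(name):
    """Reduce a place name to a comparable key (assumed rules, not official)."""
    # Split accented letters into base letter + mark, then drop the marks.
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.upper()
    # Drop the words "CITY OF" and "CITY" wherever they appear.
    text = re.sub(r"\bCITY OF\b|\bCITY\b", " ", text)
    # Replace every run of non-alphanumeric characters with a single space.
    text = re.sub(r"[^A-Z0-9]+", " ", text)
    return text.strip()


def check_location(school_location, publication_names):
    """Report, per level, whether the school's value matches the publication."""
    report = {}
    for level, value in school_location.items():
        known = {normalize(n) for n in publication_names.get(level, [])}
        report[level] = normalize(value) in known
    return report


# Hypothetical location of one school picked from the master list.
school_location = {
    "region": "Region V",
    "province": "EXAMPLE PROVINCE",
    "municipality": "Sample City",
    "barangay": "Sta. Cruz",
}

# Hypothetical names as they might be written in the publication.
publication_names = {
    "region": ["Region V (Bicol Region)"],
    "province": ["Example Province"],
    "municipality": ["City of Sample"],
    "barangay": ["Santa Cruz"],
}

report = check_location(school_location, publication_names)
for level, matched in report.items():
    print(f"{level}: {'match' if matched else 'NO MATCH'}")
```

With these placeholders the province and municipality match after normalization, while the region and barangay do not. Abbreviations like "Sta." and parenthetical region names would need their own rules, which is the kind of thing the manual check is meant to surface first.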