D3 Guide — Resale Price on Map — Data Processing


This is part 2 of the 4-part guide. In this part, we process the data that we have collected in part 1.

This is a practical step-by-step guide to creating a map visualization in d3. In this 4-part guide, we set off to create a visualization that displays how the resale prices of HDB flats have changed across areas throughout the years. Check out Part 1 here.

Here’s a demo of what you will be creating in this tutorial. Here’s where you can find the source code.


Dataset analysis

First, the data from HDB is broken up into 5 CSVs: 2 are dated based on approval dates and 3 on registration dates. For our purposes, this distinction is irrelevant, so I would like to merge the CSVs into one dataset. Another discrepancy I noticed is that a ‘remaining_lease’ column appears in only 2 of the CSVs.

Dataset cleansing

If we look through the data, the CSV for ‘1990–1999’ has flat models in all-caps while the rest of the CSVs have them in Sentence Case. The CSV for ‘2015–2016’ has remaining_lease as an integer giving the number of years left, while the CSV for ‘2017 onwards’ has remaining_lease in the format ‘60 years 11 months’, which would need to be parsed.

Given that we do not need the lease information and flat types, I am opting to ignore them for the time being, but I will leave them in the collated dataset so that I can still clean them up later if I want to use them.
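(If you do decide to use the lease data later, both formats can be normalized into a single number of months with a small helper. The sketch below is purely illustrative; parseRemainingLease is a hypothetical function and is not used anywhere else in this guide.)

// Hypothetical helper: normalize both remaining_lease formats into months.
// Returns undefined for CSVs that do not carry the column at all.
function parseRemainingLease(value) {
    if (value == undefined || value === "") return undefined
    // '2015-2016' CSV: a plain number of years, e.g. "70"
    if (/^\d+$/.test(value)) return Number(value) * 12
    // '2017 onwards' CSV: e.g. "60 years 11 months" or "61 years"
    let match = value.match(/(\d+)\s*years?(?:\s*(\d+)\s*months?)?/)
    if (!match) return undefined
    return Number(match[1]) * 12 + Number(match[2] || 0)
}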

Data collation

Now, I would like to merge all 5 CSVs into 1 JSON dataset for ease of use. As our dataset is small, this is a viable option. If you are dealing with larger datasets (e.g. 500MB or more), I would recommend that you either reduce the size of the dataset through early aggregation or use streaming methods to process it.
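To give a rough idea of the streaming approach, the sketch below reads a large CSV row by row and keeps only a running aggregate instead of holding the whole file in memory. It assumes csv-parser (the parser that neat-csv wraps) and a placeholder file name; it is not part of the scripts we build in this guide.

// Sketch only: stream a large CSV and aggregate early.
const fs = require('fs')
const csv = require('csv-parser')

let totalsByTown = {}
fs.createReadStream('huge-dataset.csv')   // placeholder path
    .pipe(csv())
    .on('data', row => {
        // keep a running sum per town rather than the raw rows
        totalsByTown[row.town] = (totalsByTown[row.town] || 0) + Number(row.resale_price)
    })
    .on('end', () => console.log(totalsByTown))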

First, we should set up our folder structure.

  • process — where I set up my node project. Processed data sits here as well. I always like to separate my raw datasets and processed datasets, much like cooking, where you keep your raw and cooked food apart for fear of contamination. However, we should never use the words ‘cooking’ and ‘data’ together.
  • rawdata — where I place all my raw datasets.
  • visualization — placeholder for where we will create our visualization
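Put together, the layout looks like this:

├── process          (node project and processed datasets)
├── rawdata          (raw datasets, untouched)
└── visualization    (the d3 visualization)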

Navigate into your process directory and run the following commands to set up a basic node dev environment.

npm init
npm install --save-dev node-fetch neat-csv @turf/turf

We can then set up basic commands that we will be using frequently. In this case, I set up start and collate, which we can run as npm run start and npm run collate. Now we can create the index.js and collate.js files.

"scripts": {
  "test": "echo \"Error: no test specified\" && exit 1",
  "start": "node index.js",
  "collate": "node collate.js"
},

When creating a Node.js file for data processing, I like to set up a basic boilerplate so that I can use async/await and keep my main execution path clean, without callback hell or promise hell.

void async function () {
    try {
        // Your code here
    }
    catch (ex) {
        console.log("Error: ", ex)
    }
}();

We can now bring in our dependencies.

  • neat-csv — a promises wrapper around csv-parser which helps convert CSV to JSON
  • fs.promises — promises-based API of node’s fs APIs, used for read/write of files
  • path — node’s path API, which helps to combine path elements (e.g. “dir/file.ext” on Linux/macOS and “dir\file.ext” on Windows)
  • node-fetch — provides a fetch-based API to be used for RESTful API calls
  • turf — a geospatial analysis library for working with points, polygons and other geographic features (latitudes and longitudes)
const neatCsv = require('neat-csv')
const fsPromises = require('fs').promises
const path = require('path')
const fetch = require('node-fetch')
const turf = require('@turf/turf')

I have extracted the HDB resale prices dataset into a folder named resale-flat-prices. We should load the CSVs’ contents and convert them into arrays, which can then be concatenated into one long array.

let dataDirPath = path.join("..", "rawdata", "resale-flat-prices")
let csv1990Path = path.join(dataDirPath, 'resale-flat-prices-based-on-approval-date-1990-1999.csv')
let csv2000Path = path.join(dataDirPath, 'resale-flat-prices-based-on-approval-date-2000-feb-2012.csv')
let csv2012Path = path.join(dataDirPath, 'resale-flat-prices-based-on-registration-date-from-mar-2012-to-dec-2014.csv')
let csv2015Path = path.join(dataDirPath, 'resale-flat-prices-based-on-registration-date-from-jan-2015-to-dec-2016.csv')
let csv2017Path = path.join(dataDirPath, 'resale-flat-prices-based-on-registration-date-from-jan-2017-onwards.csv')

let csv1990File = await fsPromises.readFile(csv1990Path)
let csv2000File = await fsPromises.readFile(csv2000Path)
let csv2012File = await fsPromises.readFile(csv2012Path)
let csv2015File = await fsPromises.readFile(csv2015Path)
let csv2017File = await fsPromises.readFile(csv2017Path)

let csv1990 = await neatCsv(csv1990File)
let csv2000 = await neatCsv(csv2000File)
let csv2012 = await neatCsv(csv2012File)
let csv2015 = await neatCsv(csv2015File)
let csv2017 = await neatCsv(csv2017File)

let allLines = csv1990.concat(csv2000).concat(csv2012).concat(csv2015).concat(csv2017)
console.log(allLines.length)

Now we have loaded a full list of all 30 years of HDB resale prices from 1990–2020.

We do some pre-processing so that the string values are converted into usable, typed fields. In addition, we calculate the price per sqm so that we can normalize the price of an apartment for easier comparison.

function process(allLines){
    for(let line of allLines){
        line.monthDt = Date.parse(line.month)
        line.resalePriceNbr = Number(line.resale_price)
        line.floorAreaSqmNbr = Number(line.floor_area_sqm)
        line.perSqmPrice = line.resalePriceNbr / line.floorAreaSqmNbr
    }
}

Size of Dataset

So, if you have run the code above, you will see that there are 812,704 records in our dataset. That makes sense; we are, after all, looking at 30 years of HDB resale transactions. Doing some basic math, this works out to roughly 74 transactions per day, every day, for 30 years.

Sure, we could write a script to query the latitudes and longitudes of all of these addresses on OneMap and then map them to their respective districts. However, a quick look at OneMap’s API documentation reveals the following: a maximum of 250 calls per minute.

Another bit of basic math tells us that, done the naive way, it would take our script 54.18 hours to complete (assuming we rate-limit our script properly while calling the API).
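For the curious, the arithmetic is simply: 812,704 addresses ÷ 250 calls per minute ≈ 3,251 minutes ≈ 54.18 hours.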

While those are 54 hours we don’t need to care about, since a script is dutifully doing the work for us, there’s no need to make OneMap’s servers work harder than they have to if we can optimize the queries.

Reducing Work, Beating Naivety

Let’s look at a plan for reducing the number of records we need to query:

For all records, let’s collect their blocks and street names so that we can query OneMap for latitudes/longitudes using unique addresses only. De-duplicating the addresses reduces the number of queries we need to make.

function reduceAddresses(addresses){
    let reduced = {}
    for(let address of addresses){
        reduced[address.block + " " + address.street_name] = {
            block: address.block,
            street_name: address.street_name
        }
    }
    let reducedArr = Object.values(reduced)

    return reducedArr
}
// ...

let allLines = csv1990.concat(csv2000).concat(csv2012).concat(csv2015).concat(csv2017)
console.log(allLines.length)

process(allLines)
let reducedAddresses = reduceAddresses(allLines)
console.log(reducedAddresses.length)

// ...

The output shows 9,161 ✌.

We have reduced our 812,704 queries to 9,161. We can query these slowly, without bursting the limit specified by OneMap, and be done within an hour.
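(9,161 addresses ÷ 240 calls per minute ≈ 38 minutes of querying.)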

In a practical scenario, it is important to optimize and reduce the number of API calls we make, so that we keep processing costs as well as bandwidth costs down.

At this point, before we continue on, please feel free to take a break, you will need it. Go ahead, I’ll wait for you.

☕🍵🍰🍣

GPS coordinates retrieval and augmentation

Welcome back! Let’s continue by sorting out the districts for the transactions in the dataset.

We will now use the OneMap API to augment the records with latitude and longitude. We use a new dependency, bottleneck (install it with npm install --save-dev bottleneck), to ensure that we are not hammering the API: at most 1 concurrent connection, and a new request only every 250 milliseconds (a maximum of 240 requests per minute, below their stated limit of 250/min).

const Bottleneck = require("bottleneck/es5");

// ...

const limiter = new Bottleneck({
    maxConcurrent: 1,
    minTime: 250
});
limiter.on("failed", async (error, jobInfo) => {
    const id = jobInfo.options.id;
    console.warn(`Job ${id} failed: ${error}`);

    if (jobInfo.retryCount < 10) {
        return 250
    }
});
async function augmentLatLong(addresses){
    let retAddresses = []
    for(let address of addresses){
        let tryFull = await limiter.schedule(() => searchAPI(address.block + " " + address.street_name))
        
        if(tryFull.results.length > 0) {
            address.lat = tryFull.results[0].LATITUDE
            address.long = tryFull.results[0].LONGITUDE
        }
        else {
            let tryStreet = await limiter.schedule(() => searchAPI(address.street_name))
            if(tryStreet.results.length > 0) {
                address.lat = tryStreet.results[0].LATITUDE
                address.long = tryStreet.results[0].LONGITUDE
            }
            else {
                console.log("Unable to find " + address.street_name)
            }
        }
        retAddresses.push(address)
    }

    let jsonContent = JSON.stringify(retAddresses)
    await fsPromises.writeFile("reducedAddressesLatLngJson.json", jsonContent, 'utf8')
    return retAddresses
}


async function searchAPI(searchTerm){
    console.log("searching " + searchTerm)

    let url = `https://developers.onemap.sg/commonapi/search?returnGeom=Y&getAddrDetails=Y&pageNum=1&searchVal=${searchTerm}`
    let returnJson = await (await fetch(url)).json()

    return returnJson
}

Notice that I store the results in a JSON file. This allows us to stub out the augmentLatLong function so that we do not need to call the API again every time we re-run the script (as I did many times while iterating on it).

async function augmentLatLong(addresses){
    let reducedAddressesLatLngPath = path.join("reducedAddressesLatLngJson.json")
    let reducedAddressesLatLngFile = await fsPromises.readFile(reducedAddressesLatLngPath)
    let reducedAddressesLatLng = JSON.parse(reducedAddressesLatLngFile)
    return reducedAddressesLatLng
}

// ... (the original limiter, augmentLatLong and searchAPI implementations above
// remain in the file, commented out, so they can be swapped back in) ...

Fill in the Blanks

There were some addresses that could not be filled in automatically. I am guessing these were old blocks or roads that have since been replaced or demolished, and that OneMap no longer keeps track of. So we’ll write some code to dump these addresses into a file so that we can deal with them separately.

let reducedAddressesLatLngJson = reducedAddressesLatLng.filter(ra => ra.lat == undefined || ra.long == undefined)
console.log(reducedAddressesLatLngJson.length)
let reducedAddressesLatLngData = JSON.stringify(reducedAddressesLatLngJson)
await fsPromises.writeFile("reducedAddressesLatLngJsonIssues.json", reducedAddressesLatLngData, 'utf8')

Ah, 28. Not too bad. We will have to roll up our sleeves and figure these out by hand. (Who said that dealing with data was glamorous? 😉) Let’s put on some music and we’ll get through this. (I like to listen to the following playlist while working on data visualizations.)

We will use Google Maps to find out these GPS locations.

To save time, and given that we only need accuracy at the district level, some blocks that were close together on St. George’s Road were given the same location.

However, there were some roads whose locations I could not find in Google Maps, such as “Buangkok South Farmway 1”, “Nile Road” and “Jalan Pasar Baru”. Some digging revealed that those areas were redeveloped, with roads remade and renamed. I retrieved the new addresses from HDB’s site here.

So we fill in the exported JSON file and re-import it.
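For illustration, a filled-in entry in the issues file ends up looking something like this (the block, street and coordinates below are placeholders, not a real address):

[
    {
        "block": "111",
        "street_name": "EXAMPLE ROAD",
        "lat": "1.3000",
        "long": "103.8000"
    }
]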

async function loadIssuesLatLong(){
    let reducedAddressesLatLngPath = path.join("reducedAddressesLatLngJsonIssues.json")
    let reducedAddressesLatLngFile = await fsPromises.readFile(reducedAddressesLatLngPath)
    let reducedAddressesLatLng = JSON.parse(reducedAddressesLatLngFile)
    return reducedAddressesLatLng
}

And now we combine it back with the list from OneMap’s API.

let reducedAddresses = reduceAddresses(allLines)
console.log(reducedAddresses.length)

let reducedAddressesLatLng = await augmentLatLong(reducedAddresses)
let addressesWithIssues = await loadIssuesLatLong()

let finalAddressList = reducedAddressesLatLng.filter(ra => ra.lat != undefined && ra.long != undefined).concat(addressesWithIssues)

Now, we have a full list of all unique addresses with latitudes and longitudes filled up.

Mapping GPS Coordinates to Singapore Planning Districts

With the latitudes and longitudes retrieved, we can now check which district each address resides in. To do this, we can use Turf.js, a library with helper methods for tons of geography-related calculations (please do not reinvent the wheel here; just use a library like this).

We will be using 3 functions from Turf:

  • point — creates a Point feature representing a single position, given as a [longitude, latitude] pair (note the order)
  • multiPolygon — creates a MultiPolygon, an area on a geographical sphere. We use multiPolygon because Singapore planning areas are made up of multiple polygons added to (or subtracted from) each other, mainly where an area spans multiple islands.
  • booleanPointInPolygon — checks whether a point lies inside a polygon/multipolygon

We will compare the geolocations with the TopoJSON file we retrieved as we will be using that to render the final visualization.

function populateDistrict(reducedAddressesLatLng, planningAreas){
    let polygonList = []
    for(let planningArea of planningAreas){
        let polygon = {
            multiPoly: planningArea,
            area: planningArea.properties.PLN_AREA_N
        }
        polygonList.push(polygon)
    }
    
    for(let reducedAddress of reducedAddressesLatLng){
        let pt = turf.point([Number(reducedAddress.long), Number(reducedAddress.lat)]);
        
        for(let polygon of polygonList){
            if (turf.booleanPointInPolygon(pt, polygon.multiPoly)){
                reducedAddress.town = polygon.area
                break
            }
        }

        if(reducedAddress.town == undefined){
            console.log("undefined for point (lat, long)", pt)
        }
    }
}

Mapping Districts to Sales Transactions

With the districts (“towns”) now tagged to the addresses, we can map them back onto the list of sales transactions.

Notice below that I perform some error checking to ensure that every transaction ends up with a district value. It is always important to perform these checks to ensure that your data is clean and complete.

function matchDistrict(allLines, reducedAddressesLatLng){
    for(let line of allLines){
        for(let searchAddress of reducedAddressesLatLng){
            if(searchAddress.block == line.block &&
                searchAddress.street_name == line.street_name){
                line.district = searchAddress.town
            }
        }
        if (line.district == undefined){
            console.log(line.block + " " + line.street_name, " not found in matchDistrict")
        }
    }
}

Closing up and Tying Everything Together

Finally, let’s hook up our code and produce an output file with all the districts set up.

// ...

let finalAddressList = reducedAddressesLatLng.filter(ra => ra.lat != undefined && ra.long != undefined).concat(addressesWithIssues)

populateDistrict(finalAddressList, planningAreas.features)
let finalAddressListJson = JSON.stringify(finalAddressList)
await fsPromises.writeFile("finalAddressListJson.json", finalAddressListJson, 'utf8')
matchDistrict(allLines, finalAddressList)

let jsonContent = JSON.stringify(allLines)
await fsPromises.writeFile("output.json", jsonContent, 'utf8')

// ...

If you run npm run start, you will end up with output.json, which contains the sales transactions of all HDB apartments from 1990–2020 with their districts populated nicely.
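For reference, each record in output.json will look roughly like the following (illustrative values, abridged to the fields we care about):

{
    "month": "1990-01",
    "town": "ANG MO KIO",
    "block": "111",
    "street_name": "EXAMPLE ROAD",
    "floor_area_sqm": "100",
    "resale_price": "75000",
    "monthDt": 631152000000,
    "resalePriceNbr": 75000,
    "floorAreaSqmNbr": 100,
    "perSqmPrice": 750,
    "district": "ANG MO KIO"
}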

A Tie and A Twist

We are not done yet! We still have 800 thousand lines of data (~260MB). What we want is a smaller dataset, consolidated by month/year, so that we can easily load it in an HTML page and feed it into d3.

Let’s set up collate.js. You can follow the same steps we used to set up index.js.

What I am going for is something like this:

  • month 1
    – town 1
    – town 2
  • month 2
    – town 1
    – town 2

This lets me show the distribution of prices across time. In addition, I want to calculate the average price/sqm for each town in each month.
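Concretely, I am aiming for a consolidated structure shaped roughly like this (illustrative values, some fields elided):

[
    {
        "month": "1990-01",
        "monthDt": 631152000000,
        "towns": [
            { "town": "ANG MO KIO", "n": 123, "meanPerSqmPrice": 750, ... },
            { "town": "BEDOK", "n": 98, "meanPerSqmPrice": 700, ... }
        ]
    },
    ...
]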

First, we will load what we prepared above and collect all the distinct months and towns in order to generate the consolidated structure. We prepare two fields: one for the number of samples (n) and another for the running sum of price/sqm.

let dataPath = path.join("output.json")

let salePricesJson = await fsPromises.readFile(dataPath)
let salePrices = JSON.parse(salePricesJson)

let allMonths = [...new Set(salePrices.map(x => x.month))]
let allTowns = [...new Set(salePrices.map(x => x.town))]

let consolidatedPrices = {}
for(let month of allMonths){
  consolidatedPrices[month] = {}
  for(let town of allTowns){
    consolidatedPrices[month][town] = {
      n: 0,
      monthDt: Date.parse(month),
      perSqmPriceSum: 0,
      town,
    }
  }
}

Then, we will populate this consolidated object. We use an object so that it is easier to look up the element we need, instead of having to loop through an array.

for(let salePrice of salePrices){
  let selectedMonthTown = consolidatedPrices[salePrice.month][salePrice.town]
  selectedMonthTown.n += 1
  selectedMonthTown.perSqmPriceSum += salePrice.perSqmPrice
}

Finally (I promise), we calculate the mean price/sqm and roll the objects up into arrays so that they are easier to handle in d3. We also save the data into a JSON file.

for(let month in consolidatedPrices){
  for(let town in consolidatedPrices[month]){
    consolidatedPrices[month][town].meanPerSqmPrice = consolidatedPrices[month][town].perSqmPriceSum / consolidatedPrices[month][town].n
  }
  let rolledTowns = Object.values(consolidatedPrices[month])
  consolidatedPrices[month] = {
    towns: rolledTowns,
    month: month,
    monthDt: rolledTowns[0].monthDt
  }
}
consolidatedPrices = Object.values(consolidatedPrices)

let jsonContent = JSON.stringify(consolidatedPrices)
await fsPromises.writeFile("consolidatedPrices.json", jsonContent, 'utf8')

Voilà 🎉🎉🎈🎊! We are done, and we now have a consolidated data file that is under 3MB. Not too shabby for 30 years of sales transactions.

Next Steps

Next we will start to create the d3 visualization that will make all the above worthwhile.

About the author

Clarence Cai
