Kotlin exercise - reading unsorted input as fast as possible

My app was receiving a large binary file which was not sorted nor indexed in any way. Here I show how I searched in the file as fast as possible.

The file consisted of repeated blocks:

  • latitude (4 bytes)
  • longitude (4 bytes)
  • names count (1 byte)
  • “names count” x the following:
    • name length (1 byte)
    • name (“name length” bytes, UTF-8 encoded)

We needed to read the file and obtain the first name of the place closest to the latitude and longitude provided as input.

The code:

private fun refreshPlace(lat : Double, lon : Double) {
  DataInputStream(getAssets().open("places").buffered()).use { // (1)
    val latToFind = (lat * 10000000).toLong() // (2)
    val lonToFind = (lon * 10000000).toLong()

    var closestDistance = Long.MAX_VALUE

    var selectedNameBytes = ByteArray(UByte.MAX_VALUE.toInt())
    var selectedNameLength : UByte = 0u

    while (true)
    {
      val latReceived = try {
        it.readInt()
      } catch (e: EOFException) {
        break;
      }
      val lonReceived = it.readInt()

      val latDiff = latReceived - latToFind
      val lonDiff = lonReceived - lonToFind
      val distance = latDiff * latDiff + lonDiff * lonDiff // (3)

      val namesCount = it.readByte().toInt()

      repeat(namesCount) { counter ->
        val nameLength = it.readByte().toUByte()

        if (counter == 0 && distance < closestDistance)
        {
          it.readFully(selectedNameBytes, 0, nameLength.toInt()) // (4)
          selectedNameLength = nameLength
          closestDistance = distance
        }
        else
        {
          var toSkip = nameLength.toLong()
          while (toSkip > 0) {
            val skipped = it.skip(toSkip) // (5)
            if (skipped <= 0) {
              throw Exception("Invalid count of skipped bytes: $skipped")
            }
            toSkip -= skipped
          }
        }
      }
    }

    findViewById<TextView>(R.id.parsed_place).setText(String(selectedNameBytes, 0, selectedNameLength.toInt(), StandardCharsets.UTF_8)) // (6)
  }
}

Notes:

  • (1) Use BufferedInputStream (the .bufferred() call). This greatly improves the performance and protects us from platform-dependent performance variations.
  • (2) The latitude and longitude in the file was encoded as a whole number instead of a decimal number, multiplied by 10000000 to avoid losing precision when rounding. We convert the input parameters to the file data format, instead of converting the file data of each entry to the format of the input parameters. So, later we can compare the numbers for each entry quickly - without any conversions, calculations, and using only integers.
  • (3) Use Pythagorean theorem for distance calculation instead of a complicated calculation which would take into account that the Earth is round, but might be slower. Additionally, skip taking the square root - we only want to compare the distances and a < b <=> sqrt(a) < sqrt(b).
  • (4) Reuse the same buffer when the closest name changes.
  • (5) Skip bytes of not needed names without reading them. Beware of the skip method behavior - it can actually skip less bytes than requested, so we sometimes need to call it repeatedly. This was especially the case when .buffered() was used and it did not happen otherwise - which makes it more tricky. But this behavior is well documented.
  • (6) Decode the bytes to UTF-8 string only at the end and for the one single name we actually need. Do not do the UTF-8 decoding unnecessarily for every single item in the data file.

Of course, when repeated searches are needed, it is useful to build an indexed file, a database or an in-memory structure allowing fast searches without looking at all the data.

Written on January 26, 2022