be + direction
Another follow-up to my Location Code musings. I’ve spent a bit of time thinking about different ways to encode a location using everyday words; let’s once again start with a reference implementation so we can discuss what it does.
And now to go in the other direction. As an example, the previously discussed “short” location code for the Twin Cities area, G-7P, now just happens to become the location phrase I’ve used to name this post: be + direction. The reference NYC location (40.716667, -74) can be turned into the location phrase mass new + machine state.
To explain what’s happening here, I’m going to compare and contrast my approach with a similar scheme for encoding a location: what3words. The main difference to note as we go forward is that what3words is narrowly focused on matching locations down to 3x3 meters with 3 word combinations from a specific list of 40,000 words. My approach is more generic: to encode any number of bits (that happen to represent a location in this case) using any given list of words. Or, put another way, somehow what3words got a lot of awards and over $10 million by throwing together a particular list of words. Hell, I could do that for two orders of magnitude less money, and you can save the awards for people who are doing more difficult work in the world.
All I could find what3words disclosing is that they cap their list length at 16 bits per word, so 3 words carry at most 48 bits, or ~24 bits in each dimension. They give no method for adjusting their precision.
As noted in the previous discussion on location codes, for a ZIP code replacement, 15 bits would be enough in each dimension. For the location phrases I implement here, I’m actually using 20-bit values by default (read on to see why). But it all still remains arbitrary, and you can use as many or as few bits as are necessary for whatever application you have in mind. The more bits you use, the more words you’ll use.
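To make the bits-versus-precision trade-off concrete, here is a minimal sketch of one way to turn a coordinate into an n-bit value. It assumes a plain linear scaling of the coordinate range, which may not match the reference implementation exactly, and the names quantize and lat_cell_km are made up for illustration.

```python
def quantize(value, lo, hi, bits):
    """Scale value from [lo, hi] onto an integer in [0, 2**bits - 1].

    Assumption: simple linear scaling; the real implementation may differ.
    """
    cells = 1 << bits
    index = int((value - lo) / (hi - lo) * cells)
    # Clamp so that value == hi (or slightly out-of-range input) stays valid.
    return min(max(index, 0), cells - 1)


def lat_cell_km(bits):
    """Approximate north-south height of one latitude cell, in km."""
    pole_to_pole_km = 20004.0  # roughly half of Earth's meridional circumference
    return pole_to_pole_km / (1 << bits)


# The NYC reference point as two 20-bit strings (latitude, longitude).
lat_bits = format(quantize(40.716667, -90.0, 90.0, 20), "020b")
lon_bits = format(quantize(-74.0, -180.0, 180.0, 20), "020b")
```

Under these assumptions a 15-bit latitude cell is about 0.6 km tall and a 20-bit one is about 19 m, which is the kind of difference those extra 5 bits buy.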
The what3words algorithm for converting from lat/lon to the 3 numbers that index their word list doesn’t appear to be readily available. Given that adjacent locations do not appear to share any common words, perhaps it’s done using a congruence class for modulus 3. Whatever the case, the intent clearly was not to make it easy for people to work with the values as anything other than a hunk of 3 words.
I will continue to use the two lat/lon values separately. Because it is possible for words to be hyphenated, instead of putting a - between the two values as I did for the location code, I’m using a + for location phrases. This has led to the suggestion of possibly calling them CrossWords. I’m not against that idea, but I’m going to stick with calling them location phrases for the purposes of this discussion.
And because we’re representing the actual lat/lon values, our location phrases gracefully degrade. For example, you should not be surprised to find that if you used just mass + machine instead of mass new + machine state, you’d still be talking about a location close to the NYC area. This potentially opens up the possibility of distinguishing a “relative” local location as maybe something like new ^ state, but those sorts of add-on features are beyond the scope of this discussion.
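The graceful degradation follows from the leading words carrying the leading bits, so dropping trailing words only drops trailing bits. Here is a minimal sketch of that, assuming a linear scaling of latitude onto the bit range. The helper cell_bounds is hypothetical, and the bit string is an arbitrary example rather than the actual encoding of any place.

```python
def cell_bounds(bit_string, lo, hi):
    """Return the (low, high) range of values covered by a binary prefix."""
    cells = 1 << len(bit_string)
    index = int(bit_string, 2)
    width = (hi - lo) / cells
    return lo + index * width, lo + (index + 1) * width


# An arbitrary 20-bit latitude value (illustrative only).
full = "10111001111010000111"

fine = cell_bounds(full, -90.0, 90.0)          # both words kept
coarse = cell_bounds(full[:10], -90.0, 90.0)   # second 10-bit word dropped

# The fine cell always sits inside the coarse one.
nested = coarse[0] <= fine[0] and fine[1] <= coarse[1]
```

With 10 bits the latitude cell is 180 / 1024 degrees tall, a bit under 0.18 degrees, so the shortened phrase still lands in the same general area.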
The what3words list is not meaningfully being shared (it could all be reverse engineered, but it wouldn’t be worth the bother). All in all, it’s like they’re going out of their way to be as closed and unfriendly as possible. I can’t imagine why anyone would want to use 3 random words for a location which are completely different from the 3 words that are used to refer to a location 10 feet away. How is it useful to label the Statue of Liberty as planet.inches.most?
For my purposes here, I’m simply using a General Service List of common English words, ordered from the most frequently used to the least. It contains more than 2048 words, so it could be used to encode 11 bits per word. But, of course, I decided to do something impossibly stupid with it instead.
As we saw with base 32 encoding of a location code, it is necessary to pad anything less than 5 bits up to that length so it can be mapped to the character that encodes it. I could do the same sort of padding up to 11 bits, too, but what if I instead decided on no padding at all? What if I used the list in such a way that the bit lengths all got their own “sublist” of the main list? That way, the first two words map to the 1-bit values (0, 1), the next 4 words go to the 2-bits (00, 01, 10, 11), next 8 to the 3-bits, and so on. I’m calling this approach a Flexible Encoding Dictionary (FED).
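The sublist layout described above can be sketched as follows. Since the k-bit sublist comes after 2 + 4 + ... + 2^(k-1) = 2^k - 2 earlier words, a k-bit value v lands at index 2^k - 2 + v. This is a minimal sketch under that reading of the scheme; the function names are hypothetical, and the word list is a placeholder of made-up tokens (w0, w1, ...) rather than the General Service List.

```python
def fed_index(bit_string):
    """Word-list index for a bit string under the FED layout."""
    k = len(bit_string)
    return (1 << k) - 2 + int(bit_string, 2)


def fed_bits(index):
    """Inverse of fed_index: recover the bit string from a list index."""
    k = (index + 2).bit_length() - 1
    value = index + 2 - (1 << k)
    return format(value, "0{}b".format(k))


def encode(bit_string, chunk_lengths, words):
    """Split bit_string into chunks of the given lengths and look up each word."""
    if sum(chunk_lengths) != len(bit_string):
        raise ValueError("chunk lengths must add up to the number of bits")
    phrase, pos = [], 0
    for k in chunk_lengths:
        phrase.append(words[fed_index(bit_string[pos:pos + k])])
        pos += k
    return phrase


def decode(phrase, words):
    """Turn a list of words back into the concatenated bit string."""
    lookup = {word: i for i, word in enumerate(words)}
    return "".join(fed_bits(lookup[word]) for word in phrase)


# Placeholder list: 2046 entries is exactly enough for chunks of 1 to 10 bits.
words = ["w{}".format(i) for i in range(2046)]
phrase = encode("101110011110100", [8, 7], words)
```

Because each word's position in the list reveals its bit length, decode needs no separators or padding to put the bits back together.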
Using a FED is a wasteful thing to do, because it essentially halves the maximum number of useful words for the whole list, whereas padding would only affect the last chunk of bits in a sequence (and the longer the sequence, the less overall waste that is). But if I only wanted to encode 15-bit values in the first place, I would have had to use 2 words anyway. Dropping to a maximum of 10 bits per word still results in 2 words, and since 2 words can now carry 20 bits, I can even add in 5 more bits of precision for free.
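The cost can be put in numbers. Sublists for 1 through k bits take 2 + 4 + ... + 2^k = 2^(k+1) - 2 words in total, while a padded scheme only needs 2^k words for k-bit chunks. A small sketch of both limits, with hypothetical function names:

```python
def max_padded_bits(list_length):
    """Largest k such that 2**k words fit in the list (fixed-width chunks)."""
    return list_length.bit_length() - 1 if list_length > 0 else 0


def max_fed_bits(list_length):
    """Largest k such that sublists for 1..k bits fit: 2**(k+1) - 2 words."""
    return (list_length + 2).bit_length() - 2


# A list of a little over 2048 words (the exact size here is illustrative).
padded = max_padded_bits(2100)
flexible = max_fed_bits(2100)
```

For any list between 2048 and 4093 words, the padded limit is 11 bits per word and the FED limit is 10, which is where the 2 x 10 = 20-bit default comes from.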
But it’s now all arbitrary! Because the FED contains entries for all lower bit lengths, I could have stuck with 15 bits and encoded it with three 5-bit words. Or an 8-bit and a 7-bit word. Or 4-, 7-, and 4-bit words, or any and all combinations to whatever length of bits we need to encode. What we have gained by being “wasteful” is the possibility of generating more memorable (if somewhat longer) phrases for locations. If you don’t like the default bathe powder + track shut for the Statue of Liberty, how does mass trip voyage + she country success diamond sound? If you want to try your hand at putting together a better phrase of your own for a location, give the Location Phrase Playground a try.
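To get a feel for how much room there is to play, here is a minimal sketch that lists every way of splitting a value into word-sized chunks. It assumes the only constraint is a maximum chunk size of 10 bits (my reading of the word list limit); chunkings is a hypothetical helper, not part of the playground.

```python
def chunkings(total_bits, max_bits=10):
    """Yield every ordered split of total_bits into chunks of 1..max_bits."""
    if total_bits == 0:
        yield []
        return
    for first in range(1, min(max_bits, total_bits) + 1):
        for rest in chunkings(total_bits - first, max_bits):
            yield [first] + rest


# All chunk layouts for a 15-bit value, e.g. [5, 5, 5], [8, 7], [4, 7, 4].
options = list(chunkings(15))
```

Each layout yields a different phrase for the same bits, so picking the most memorable one is a matter of browsing the options.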
And keep in mind that this is all without even trying! Nobody has paid me millions of dollars to curate a word list specifically for encoding locations. If they did, I could do an even better job matching lat/lon values with meaningful regional words. This is most easily seen for latitudes, where the binary representations for locations closer to the North and South poles contain longer sequences of 11111111... and 00000000... respectively, and so it would make sense to associate words like “cold”, “arctic”, “frozen”, “tundra” with them, and words for warmer, wet/dry climates with the bit sequences that alternate more. It might even make sense to use different word lists for each dimension (yet another advantage in not mixing the two values together at the representation stage).
In fact, custom lists could be used for all different kinds of data. Since I took an abstract approach, there is nothing I’m doing here that is specific to encoding just locations. Any bit sequence is fair game, and there are amazing steganographic possibilities when you allow for multiple different word combinations to convey the same data.
Another possibility presents itself if you look at how PGP uses word lists. It alternates between two 8-bit word lists as a way to reduce errors in communicating a value. So perhaps it would make sense to do the same sort of thing with a FED using different bit lengths.
Promise kept: location phrases as a followup to location codes. Which one is better is for you to decide. The phrases are obviously longer, but have the potential to be more descriptive and memorable. Let me know if you work out a cool phrase for a well-known location.