Overview

It’s often been said in the commercial real estate industry and brick-and-mortar businesses alike: “It’s about location, location, location.” That statement has never been truer than in the technological world in which we live today.

Get the GIS

If location precision is absolutely vital and “Data is the blood running through the veins of the Networked Society”, then GIS should be considered the occipital lobe, since it largely determines how we see our world. Ever since the advent of GIS (Geographical Information Systems) and spatial analysis over five decades ago, the science (some might say art) of geocoding has significantly shaped our world and how it functions today. For all of the impressive capabilities and benefits that come from having a digitally encoded representation of much of the structures and topography that make up our world, geocoding is not without its shortcomings. These stem partly from technical debt, constraints inherent in the systems themselves, and processes and approaches developed over several decades on questionable “best” practices and obsolete technology.

Geocoding has been a godsend in the areas of cartography and topography, providing benefits to Government and Industry alike. More recently, consumers have benefited as well, by way of automotive navigation systems and the countless applications afforded by the proliferation of the smartphone throughout modern and developing societies.

For the uninitiated, geocoding is the process of converting street address details derived from textual information into viable geospatial data. This data typically consists of latitude, longitude, and accompanying enrichment text captured and stored in a computer-friendly format. Typically, it resides in a relational database management system, where it may be further enhanced with additional attributes through data mappings and/or user input.
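To make that concrete, here is a minimal sketch (in Python) of what a single geocoded record might look like once the address text has been resolved to coordinates. The field names and sample values are illustrative assumptions on my part, not a standard schema.

# A minimal sketch of a geocoded record: free-text address in, coordinates and
# match metadata out. Field names and sample values are illustrative only.
from dataclasses import dataclass

@dataclass
class GeocodedRecord:
    input_address: str   # raw text supplied by the source system
    latitude: float      # decimal degrees
    longitude: float     # decimal degrees
    match_quality: str   # e.g. "rooftop", "street-segment", "ZIP-centroid"

record = GeocodedRecord(
    input_address="1840 Century Park East, Los Angeles, CA",
    latitude=34.0597,    # illustrative coordinates only
    longitude=-118.4142,
    match_quality="street-segment",
)
print(record)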

In a perfect world

Alas, if only it were so simple that all street address data could be robust, resilient, and reliable. In the real world, it is rare that any of these conditions hold true for long. To make matters worse, computer systems fed malformed data produce erroneous results, a notion captured in the phrase “Garbage In, Garbage Out” and abbreviated GIGO. GIS is particularly subject to the law of GIGO, and street address data is considered unusually dirty.

Leading causes of dirty street addresses:

  • Non-standard abbreviations
  • Attribute misorderings
  • Data entry mistakes

Given the various methods of data acquisition and capture, it’s no surprise that address data can wreak havoc upon OSS (Operational Support Systems) and BSS (Business Support Systems) efforts and processes at the most inconvenient times. The result is costly delays across projects and initiatives as essential as market analysis, real estate planning, accounting and taxation, finance management, shipping, customer relations, sales, strategic planning and, goodness forbid the most critical of all, billing and the overall user experience (be afraid, be very afraid).

An issue that needs to be Addressed

Imagine: your day-to-day work is humming along and some of the new strategic corporate directives are beginning to unfold for the new fiscal year. One initiative happens to involve the Finance, Sales, and Marketing groups. Someone in Marketing wants to use the existing customer base to determine whether the company can pursue business opportunities in close proximity to existing company assets. These assets are currently tracked and managed in the Finance systems, and the hope is to identify low-hanging fruit that can help the Sales team hit its Q2 targets. Marketing acquires customer lead data by way of an outside firm, and internal data via historical quotes and existing contracts. A market analyst acquires data extracts from the Finance system to do detailed analysis, only to find they are unable to map and correlate the datasets between the business domains due to excessive street address mismatches. After several unsuccessful attempts to merge the data, an escalation occurs and senior management requests that IT engage to help the business resolve the issues surrounding the datasets so the analyst can proceed with the market analysis.

The Data Warehouse team is then tasked by the CIO to bring the unwieldy street addresses under control. The developers and DBAs, viewing the problem space from a data-centric standpoint, naturally assume that normalization of the street address data is the ideal approach.

 

An old Normal

Normalization may not be the silver bullet many assume; it can instead be the defective ammo that backfires on you. Even when dealing with internal datasets, street address field standards can vary by business domain as well as by the individual tools themselves.

Failing to fully consider the multitude of street address formats when deriving data from an array of non-standard sources can lead to a world of pain for both IT and the business. Both should be seriously prepared to cycle through the work multiple times as they attempt to resolve what can become a major pain point if they are not ready for that type of challenge.

A few of the initial obstacles of normalization are detailed below.

Approach | Challenge
Identifying the component parts of an address. | Multiple address sources tend to vary based on primary business function.
Transform into a standardized format. | Each component of the address must be successfully mapped to its address attributes.
Implement fuzzy matching. | The normalization logic will need to adequately identify the most likely address attributes to associate with each component of the input address (see the sketch below).
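To make the fuzzy-matching row concrete, here is a minimal sketch using only Python’s standard library. The reference list and the 0.8 cutoff are illustrative assumptions, not production values.

import difflib

# Known street suffix attributes the normalizer can map to (a tiny subset).
KNOWN_SUFFIXES = ["STREET", "AVENUE", "BOULEVARD", "DRIVE", "LANE", "COURT"]

def match_suffix(token: str, cutoff: float = 0.8):
    """Return the closest known street suffix for a (possibly misspelled) token."""
    candidates = difflib.get_close_matches(token.upper(), KNOWN_SUFFIXES, n=1, cutoff=cutoff)
    return candidates[0] if candidates else None

print(match_suffix("Boulevrad"))   # -> "BOULEVARD"
print(match_suffix("Qwerty"))      # -> None (no confident match)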


If any of these challenges is not adequately addressed in the design, then successfully tested and implemented, the result can be a multitude of issues, turning low-quality and duplicate street address records into multiple street address records when datasets are merged. That in turn means highly manual effort and labor-intensive research to rectify the situation, while potentially eroding end-user confidence and the user experience. The following street addresses all refer to the same location, but depending upon the context may, or may not, be usable for a particular business case. In addition, depending upon database IDs and key constraints, they may, or may not, result in a key violation. Technical challenges aside, gaining interdepartmental business consensus to resolve duplicate record concerns can be a project in itself.

Address | Business case
1840 Century Park East, # 1200, Los Angeles, CA 90007-21000 | Sales
1840 Century Park E, 1200, Los Angeles, CA, 90007-21000 | Finance
1840 CENTURY PARK E, UNIT 1200, LA, CA | Marketing
E Century Park 1840, Los Angeles, CA, 90007 | Planning


Given the aforementioned challenges, all hope is not lost, although if normalization is not appropriately implemented, the accuracy and precision of the data are likely to be in jeopardy. Next, let’s take a deeper dive into the methodologies and techniques often used.

Substitution-Based Normalization

  • A less complex method that utilizes lookup tables to identify frequently encountered strings and substitute standard values for them.
  • That simplicity constrains its applicability, since its capabilities are limited to correcting abbreviations and stripping out immaterial data.
  • The drawbacks of “tokenization” should not be overlooked, as they can wreak havoc on the end result; in particular, a street address may contain keywords that could also be assigned as an attribute.
  • Ex. “123 Street Drive East”, where neither word is in its expected position: “Street” is normally a post thoroughfare type, yet here it is the street name itself (see the sketch below).
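Here is what that looks like in practice: a minimal sketch of substitution-based normalization with a lookup table. The table below is a tiny illustrative subset, not the full USPS Publication 28 mapping.

# Substitution-based normalization: a lookup table applied token by token.
SUBSTITUTIONS = {
    "ST": "STREET", "ST.": "STREET",
    "AVE": "AVENUE", "AVE.": "AVENUE",
    "E": "EAST", "E.": "EAST",
    "N": "NORTH", "N.": "NORTH",
    "#": "UNIT",
}

def substitute(address: str) -> str:
    """Replace known abbreviations; leave unrecognized tokens untouched."""
    tokens = address.upper().replace(",", " ").split()
    return " ".join(SUBSTITUTIONS.get(tok, tok) for tok in tokens)

print(substitute("1840 Century Park E, # 1200"))
# -> "1840 CENTURY PARK EAST UNIT 1200"
# The weakness called out above remains: in "123 Street Drive East" a lookup
# table alone cannot tell that "Street" is the street name rather than a suffix.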

Context-Based Normalization

  • Not as commonly used for addresses, as it is considered more complex and difficult to implement.
  • Utilizes lexical and syntactic analysis to interrogate the address string components.
  • A key benefit is the ability to re-order out-of-place input elements (illustrated in the sketch after this list).
  • A multistep process requiring extensive logic and expertise to implement, involving scrubbing, character conversion, and lexical and syntactic analysis.
  • Requires highly advanced developer skillsets that are typically the domain of GIS experts, not a standard IT application developer or SQL developer.
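A rough sketch of the idea, assuming a toy lexicon and a fixed expected ordering (number, directional, name, type); a real implementation needs far richer grammars and lexicons than this.

DIRECTIONALS = {"N", "S", "E", "W", "NORTH", "SOUTH", "EAST", "WEST"}
STREET_TYPES = {"ST", "STREET", "AVE", "AVENUE", "DR", "DRIVE"}

def classify(token: str) -> str:
    """Lexical analysis: assign each token a coarse category."""
    t = token.upper().strip(",.")
    if t.isdigit():
        return "number"
    if t in DIRECTIONALS:
        return "directional"
    if t in STREET_TYPES:
        return "type"
    return "name"

def reorder(address: str) -> str:
    """Syntactic step: arrange tokens as number, directional, name(s), type."""
    order = {"number": 0, "directional": 1, "name": 2, "type": 3}
    tokens = sorted(address.split(), key=lambda tok: order[classify(tok)])
    return " ".join(tokens)

# The out-of-order Planning example from the table above becomes number-first:
print(reorder("E Century Park 1840"))   # -> "1840 E Century Park"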

Probability-Based Normalization

Utilizes advanced statistical methodologies to identify address components.

  • Highly effective at handling complex cases, utilizing substitution, reordering, amalgamation, and data removal, though it requires an expert skill level.
  • A fully implemented solution typically draws on data science and machine learning; a language commonly used to describe the address grammar is Backus-Naur Form, or BNF for short.
  • Depending upon the business objective, scope, and budget, it may be considered overkill for the aforementioned scenario.

An example of an address described in BNF

<postal-address> ::= <street-address-part> <locality-part>
<street-address-part> ::= <house-number> <street-name-part> {"," <suite-number> <suite-type>}
<street-name-part> ::= {<pre-directional>} <street-name> <street-type> {<post-directional>}
<locality-part> ::= <town-name> "," <state-code> <USPS-ZIP-Code> {"+" <ZIP-extension>}
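And a toy sketch of the probability-based idea: score competing labelings of an ambiguous address (such as “123 Street Drive East”) and keep the most likely one. The per-token probabilities below are invented purely for illustration; real systems estimate them from large reference datasets and may use them to drive a grammar like the BNF above.

from math import prod

# Invented P(label | token) values for the ambiguous tokens in
# "123 Street Drive East"; real values would be learned from reference data.
token_label_probs = {
    ("Street", "street_name"): 0.30, ("Street", "street_type"): 0.70,
    ("Drive",  "street_name"): 0.20, ("Drive",  "street_type"): 0.80,
    ("East",   "post_directional"): 0.90, ("East", "street_name"): 0.10,
}

# Two candidate labelings a parser might propose for the same input.
candidates = [
    [("Street", "street_name"), ("Drive", "street_type"), ("East", "post_directional")],
    [("Street", "street_type"), ("Drive", "street_name"), ("East", "street_name")],
]

best = max(candidates, key=lambda seq: prod(token_label_probs[pair] for pair in seq))
print(best)   # the first labeling wins: Street = name, Drive = type, East = directional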

May want to think twice before you join

At this point, you may be thinking, “Couldn’t you take a more simplified approach, assuming it suits the business case at hand, along the lines of creating lookup tables and joins?” Perhaps, but that can just as easily become risky business once fully deployed into production, resulting in downgraded overall system performance. Less-than-ideal table utilization and database design can degrade performance significantly, especially for data structures subject to continuous heavy read and write activity. It also places the onus on operations to mitigate the issue in production by staying vigilant about database monitoring, management, and indexing across the servers and applications affected by the implementation.

 

 

So, you think you want to Geocode

 

Geocoding done properly is not for the faint of heart, particularly if the data will be relied upon in mission-critical, revenue-impacting, or moderate-to-high-risk operational use cases. As if address management weren’t challenging enough on its own, and assuming your organization has already established an address management process (at considerable cost and time) while ascending the ladder to world-class Address Management, let’s discuss geocoding.

For the sake of grounding this conversation and to avoid going on ad nauseam, I’ll be succinct and focus solely on the geocoding process as it relates to North America, or more specifically the TIGER (Topologically Integrated Geographic Encoding and Referencing) dataset format provided by the US Census Bureau. In a subsequent discussion, I’ll speak on other global regions as they relate to the topic of GIS in the Address Management space.

 

Breaking up is not easy

One of the initial steps in geocoding is normalization, otherwise known as parsing. The primary objective of parsing is to break up the unformatted input address string into a specifically defined standard format, which in the United States is USPS Publication 28.

EXAMPLE: an input such as “123 N Main St” is parsed into the components “123”, “N”, “Main”, and “ST” (U.S. Postal Service, 2009).
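As a rough sketch of what that parsing step might look like for a well-behaved input, here is a small Python example. The regular expression and the abbreviation set are illustrative assumptions; real parsers must cope with far messier input.

import re

# A deliberately simple pattern for inputs like "123 N Main St".
SIMPLE_ADDRESS = re.compile(
    r"^(?P<number>\d+)\s+"
    r"(?P<predir>[NSEW])?\s*"
    r"(?P<name>[A-Za-z ]+?)\s+"
    r"(?P<type>ST|AVE|DR|BLVD|RD|LN)\.?$",
    re.IGNORECASE,
)

match = SIMPLE_ADDRESS.match("123 N Main St")
if match:
    print(match.groupdict())
    # -> {'number': '123', 'predir': 'N', 'name': 'Main', 'type': 'St'}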

Just to clarify, if you’re going to be “breaking up” you’re going to need input, as in address data. Address data, in general, is fairly abundant; however, the majority of it is of low quality and rarely as complete as it could be, which is something you probably already knew, assumed, or have experienced. So, if you embark on geocoding, please be part of the solution and not the problem: don’t propagate low-grade address data and allow it to proliferate throughout the corporate infrastructure and the data-connected world. To restate a key point, the higher the quality of the reference data you start with, the better your odds of success (remember GIGO).

 

A sense of normalcy

If you accept your mission and decide to stay the course with the plan to geocode, I salute you (it seems only right when you’re going into battle). Here you’ll begin a two-step process of normalization and standardization, where you get the chance to take those “dirty” addresses, clean them up, and hopefully make them into something respectable when all is said and done; however, there’s still a ways to go before you get there. In short, the goal is to map address text to address attributes, which again is not nearly as simple as it may sound. Recall the three previously mentioned normalization techniques. Yes, done correctly they can greatly assist in showing your data who’s boss, the operative phrase being “done correctly”.

Assuming your normalization yielded good results, you’re ready for standardization. Put simply, standardization converts the normalized data into the correct format expected by the subsequent components of an address processing system, such as a geocoder. In layman’s terms, standardization performs a format conversion so the data is finally geocoder-ready. At this point, you’re not quite out of the woods just yet; you’re going to need to be equipped and ready to climb a mountain, and then some.
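For illustration, a minimal sketch of that standardization step, assuming the parsed components from the earlier example; the abbreviation tables are a small illustrative subset of USPS Publication 28, not the full mapping.

# Standardization: render normalized components in the fixed format a
# downstream geocoder expects (abbreviation tables are an illustrative subset).
PUB28_SUFFIX = {"STREET": "ST", "AVENUE": "AVE", "DRIVE": "DR", "BOULEVARD": "BLVD"}
PUB28_DIRECTIONAL = {"NORTH": "N", "SOUTH": "S", "EAST": "E", "WEST": "W"}

def standardize(components: dict) -> str:
    """Render normalized address components as a single geocoder-ready line."""
    predir = components.get("predir", "")
    parts = [
        components["number"],
        PUB28_DIRECTIONAL.get(predir.upper(), predir),
        components["name"],
        PUB28_SUFFIX.get(components["type"].upper(), components["type"]),
    ]
    return " ".join(p for p in parts if p).upper()

print(standardize({"number": "123", "predir": "North", "name": "Main", "type": "Street"}))
# -> "123 N MAIN ST"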

 

Code of the streets

Now, if you’re ready to do some geocoding, you have two options. Your first option is to roll your own geocoder. Your second option is to acquire a COTS (commercial off-the-shelf) geocoder solution.

Option 1 – Roll your own

Creating your own geocoder is an effort that largely comes down to you and your team writing “mapping functions” to translate normalized form data into the target output. Ideally, the transformations are encoded within the mapping functions used for feature matching (points, lines, and polygons). In addition, you’ll need to write code for a rules-based matching engine to identify the best matches against the reference datasets, more specifically known as point datasets, linear datasets, and polygon datasets. Each of the dataset files has its own data structures, which the geocoder needs to interpolate against as well. As you probably gleaned from that brief summation, there is indeed a steep learning curve to endure.
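To give a feel for the work involved, here is a toy sketch of the classic street-segment interpolation a roll-your-own geocoder performs against a linear reference dataset such as TIGER/Line. It assumes a straight segment with evenly spaced house numbers, which real data rarely honors; the coordinates and ranges below are made up.

def interpolate(house_number: int, seg_from: int, seg_to: int,
                start_lonlat: tuple, end_lonlat: tuple) -> tuple:
    """Estimate a point for house_number along a segment addressed seg_from..seg_to."""
    fraction = (house_number - seg_from) / (seg_to - seg_from)
    lon = start_lonlat[0] + fraction * (end_lonlat[0] - start_lonlat[0])
    lat = start_lonlat[1] + fraction * (end_lonlat[1] - start_lonlat[1])
    return (lon, lat)

# A block of Main St addressed 100-198, with made-up endpoint coordinates:
print(interpolate(150, 100, 198, (-118.4200, 34.0590), (-118.4180, 34.0600)))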

Option 2 – Purchase COTS geocoder

Your second option is to acquire a COTS geocoder solution, which comes with its own challenges related to software licensing fees, vendor support fees, hardware costs, and vendor-managed software updates, along with patching and OS requirements and constraints. Oddly enough, address management is typically treated organizationally as an operating expense when, all things considered, it should probably be looked at more as a capital expenditure.

Layers upon layers

Assuming you’ve taken the steps to plan, design, develop, test, and deploy a geocoder, my hat’s off to you. Given that this domain expertise is not exactly abundant in IT organizations outside of GIS software shops, you might end up wearing multiple hats now: database management, application management, platform development, DevOps, production support, and all-around GIS subject matter expert. Don’t forget, your original goal and objective was only Address Management. All you needed originally were quality addresses so you could move forward and focus company resources on other business challenges within your organization. What just happened?

Although I believe there may be a third option, one that may be what’s needed for the Address Management and Location pain that persists to this day.

Standard Point reference dataset

Name | Description | Coverage
E911 Address Points | Emergency management points for addresses | Portions of US
Postal codes | Postal code centroids | US
Census MCD | Minor Civil Division centroids | US
GNIS | Gazetteer of geographic features | US
GeoNames (US National Geospatial-Intelligence Agency 2008) | Gazetteer of geographic features | World, excepting US
ADL (ADL 2008) | Gazetteer of geographic features | World

 

Standard Linear reference dataset

Attribute | Description
Left side street start address number | Beginning of the address range for the left side of the segment
Right side street start address number | Beginning of the address range for the right side of the segment
Left side street end address number | End of the address range for the left side of the segment
Right side street end address number | End of the address range for the right side of the segment
Street prefix directional | Street directional indicator
Feature class code | A code representing the Census class of the feature (FCC)
Street name | Name of the street
Street type | Type of street
Right side ZCTA | ZCTA for addresses on the right side of the street
Left side ZCTA | ZCTA for addresses on the left side of the street
Right side municipality | A code representing the municipality for the right side
Left side municipality | A code representing the municipality for the left side
Right side county code | A code representing the county for the right side
Left side county code | A code representing the county for the left side
Feature class | A code representing the class of the feature

 

Standard Polygon reference dataset

Name | Description | Coverage | Cost
TeleAtlas, NAVTEQ | Building footprints, parcel footprints, 5-digit ZIP codes (US) | Worldwide, but sparse | Expensive
County or Municipal Assessors | Building footprints, parcel footprints | US, but sparse | Relatively inexpensive, but cost and coverage vary by jurisdiction
US Census Bureau | Census block groups, census tracts, ZCTA, MCD, counties, states | US | Free
US Postal Service (USPS) | 5-digit postal ZIP codes (vendors estimate boundaries from Census ZIP Code Tabulation Areas) | US | Free to relatively inexpensive, but cost varies by coverage and age of data

 

 

References

1. https://www.ericsson.com/thinkingahead/the-networked-society-blog/2015/03/25/data-is-the-blood-running-through-the-veins-of-the-networked-society/

2. https://www.selikoff.net/2008/11/19/why-too-much-database-normalization-can-be-a-bad-thing/

3. https://www.quora.com/What-is-the-difference-between-normalization-and-joins

4. Goldberg, D., A Geocoding Best Practices Guide. North American Association of Cancer Registries. 1. Available at: http://www.naaccr.org/flesystem/pdf/Geocoding_Best_Practices.pdf, 2008.

5. https://stackoverflow.com/questions/8166973/what-is-the-best-normalization-for-street-address

6. https://github.com/codeforamerica/address-normalizer/issues/3

7. https://aip.scitation.org/doi/pdf/10.1063/1.4958505

8. https://www.census.gov/geo/maps-data/data/tiger.html

9. https://www.census.gov/geo/about/

 
