Beyond 2020 twitterBeyond 2020 BlogBeyond 2020 Linkedinbeyond 2020 youtube

Tap in to your data’s intelligence

Beyond 20/20 Blog

  • Home
    Home This is where you can find all the blog posts throughout the site.
  • Categories
    Categories Displays a list of categories from this blog.
  • Tags
    Tags Displays a list of tags that has been used in the blog.
  • Bloggers
    Bloggers Search for your favorite blogger from this site.
  • Team Blogs
    Team Blogs Find your favorite team blogs here.
  • Login

Challenge 2: Solving Complex Geography Hierarchies

Posted by on in Blog
  • Font size: Larger Smaller
  • Hits: 13051
  • Subscribe to this entry
  • Print

Issues faced and solutions used to simplify complex geographical data for ease of use

Background Info About the Project


The Census Transportation Planning Products (CTPP) is a special tabulation of the American Community Survey, operated by the US Census. It contains valuable data comprising home and work locations, as well as journey-to-work travel flows, including demographic characteristics, for a variety of state, regional, and local transportation policy and planning efforts. CTPP data also supports corridor and project studies, environmental analyses, and emergency operations management. The source data is contained in more than 15,000 separate files and over 300 tables grouped by state. Once loaded into a database, CTPP represents 2.5 TeraBytes of data.


Beyond 20/20 Inc. was commissioned by the American Association of State Highway and Transportation Officials (AASHTO) to build a web-based data access system that would allow users to interact with the data, perform statistical calculations, build custom aggregations, view charts and maps, and download specific data sets originating from multiple files. 


This article is one of a series that describes some of the key technical challenges we faced in completing CTPP data access system – and describing the solutions we used to conquer the challenges. The first article in the series can be read here: Pushing the Limits of SQL Server to Manage 2.5 TeraBytes of Complex Geographical Data.


You can access the CTPP data driven by our software here:


Challenge 2: Non-Linear Geographic Data


The project presented some key challenges related to the nature of geographic data. Although SQL Server is ideal for handling large volumes of data, it presents certain challenges in managing data that is both voluminous and complex.


The data was delivered in many flat files, in which each record included a code identifying both the geographic level and its unique geographic identifier. For instance, the code C0200US01 represents the state of Alabama, in which “C02” indicates the state level and “01” indicates which state (Alabama).


Geographic levels presented the first instance of complexity. Residence and Workplace had 12 different geographic levels (categories such as “County”, “Metropolitan Statistical Area  (MSA)”, and “Transportation Analysis Zone” (TAZ)), which have orthogonal relationships to each other. For example:


  • A State is divided into multiple “Public Use MicroData Areas” (PUMAs), Counties, and Places, but these three levels have no direct relationships to each other. A County can span multiple PUMAs; and Places can reside within a single County or may span several.
  • Counties and PUMAs are contiguous (every location within a State is within a single County and within a single PUMA), but Places are discrete and do not necessarily aggregate to cover the entire State.
  • Counties, in turn, are subdivided into Minor Civil Divisions (MCD) (in 12 states), and TAZ, which are also independent of each other.

Given the complexity of the geographic data in tables with just Residence or Workplace, you can imagine the complexity involved with tables containing both. These tables are known as “Flow” tables. For these tables, each record has a single code that identifies the flow level, the specific Residence geography, and the specific Workplace geography. Each flow level consists of a Residence level and a Workplace level that are not necessarily at the same level (e.g. state-to-state, county-to-place).  There are a total of 16 different flow levels: 9 where the Residence and Workplace levels are the same, and 7 where they are different.


To present this information to users as a hierarchy would be confusing and difficult, so we needed to design a different approach.  Asymmetrical flows made this problem even more challenging. 


Our solution was to divide each level of geography into a separate cube. As we described in the first article in this series, we established that a small number of cubes was necessary. Now, each of these cubes (identified as “Logical Cubes”) would be split into 12 or 16 “physical cubes”. Logical cubes with a single geography would each be split into 12 different levels, and logical flow cubes would be split into 16 different levels. In addition, in the flow cubes, we created separate Residence and Workplace dimensions rather than keeping the combined code.


This exercise led to a collection of “physical” cubes that contain a single level of a Residence dimension, a single level of a Workplace dimension, or both.


The following diagram demonstrates the concept of how the many source files were split into logical cubes and physical cubes.



Although the data is split across many different physical cubes, there is a requirement for users to be able to interact with data from multiple levels of geography at once. We customized our Perspective software to incorporate special logic so that CTPP users can see data in logical cubes seamlessly, not aware that the different levels are actually in separate physical cubes. We achieved this with a combination of naming conventions and linking attributes. 


The next article in this series will address this in more detail.


For more information, we invite you to contact us.


Rate this blog entry:


  • No comments made yet. Be the first to submit a comment

Leave your comment

Guest Monday, 22 July 2019