Google Cartography
car·tog·ra·phy n. The art or technique of making maps or charts.Google Cartography uses the Google Search API to build a visual representation of the interconnectivity of streets in an area.
This application takes a starting street and finds streets which intersect with it. Traversing the streets in a breadth-first manner, further intersections are discovered. Eventually a connected graph is produced showing the interconnectivity of streets flowing from the starting street.
Example Graphs
Below are some example applets showing the data collected on two of the world's great cities. The data used is collected verbatim from the the full applet below. If you know the streets in the areas shown, you will be able to find inconsistencies introduced by the text parsing process. These graphs are kept small for easy viewing. It is possible to produce considerably more detailed graphs by using more Google API juice than was used to collect the example data.
You require a recent version of the java plugin (1.3.x is NOT new enough). You can download the JRE which includes the java plugin here.
Give it a try
If you have your own Google key (or get one from the Google API page by creating a Google account) you can run the applet against a street of your choice.
You require a recent version of the java plugin (1.3.x is NOT new enough). You can download the JRE which includes the java plugin here.
Be warned: The applet may use large amounts of bandwidth depending on the parameters you enter.
Description
Google Cartography uses the Google API to find web pages referring to street names. Initial street and region criteria are combined to form a search query, which is then executed by the Google API. Each URL from the Google results is fetched and the content of the pages converted into text. The text is then processed using regular expressions designed to capture information relating to the relationship between streets (for example, streets that cross each other or turn into other streets).
For each page a list of vertices is produced where each vertix represents an intersection between two streets. After the results for the initial street have been processed, the list of vertices will hopefully contain some vertices which are intersections with the initial street.
At this point the mapper performs a breadth-first search through the streets that can be reached from the initial street. Each street traversed during this process is subjected to the same process as the initial street, expanding the list of known street intersections. This process continues until no more streets can be reached from the initial street or until the halting criteria is satisfied (for instance, in reaching the maximum amount of Google API key usage).
Once the data collection phase has completed, the intersection vertices are converted into a graph. Each street becomes a vertix of its own, with outgoing edges connecting to the vertices of intersecting streets. The connectivity of this graph is analyzed (using the Jung graph package) to determine the largest maximal subgraph where all pairs of vertices are reachable from each other. This subgraph almost always contains the starting street; the probability of this not occurring should decrease proportionally to the number of streets traversed (which naturally selects for streets that are in the same subgraph as the starting street).
The largest connected subgraph is then visualized using a Radial Layout algorithm provided by the Prefuse graph visualization framework. The graph is initially centered on the start street but will automatically adjust its focus to center around the most recently selected street.
Issues
Ignoring its general lack of usefulness, there are several problems with the current application:- Using regular expressions instead of a custom parser means that parsing mistakes are not uncommon. The online version of the applet has a voting option enabled for particularly error-prone regular expressions, where more than one page must agree on the analysis for it to form part of the final graph. This eliminates some mistakes, but at the cost of many valid intersections. In the future the applet should make this behavior optional.
- Due to the regular expressions used, the application only works with English text and English street naming conventions. In examining the output, it is also obvious that regional variations of English make a difference. American English , British English and Australian English often have slightly different ways of referring to street relationships. I've tried to allow for some of the more common variations.
- The Google API (like Google's standard web interface) has a limit on 10 terms per query. This is limiting in several ways but the inability to effectively filter out previously traversed street names is where the limit stung most. As the street traversal algorithm deliberately tries streets closest to the initial street, the search results returned on those streets often include pages already processed. These duplicate pages are filtered from further processing but getting search results back for already-processed pages wastes precious Google API Key usage juice.
-
Some streets are hard to disambiguate. Highly generic street names such as 'Main Street' or 'High Street' can pollute the connectivity graph. So, for example, if 'Interesting Road' is found to intersect with 'High Street', there may be several 'High Streets' found in the results other than the one coming off 'Interesting Road' - thus producing connectivity via 'High Street' which is not even indirectly connected to 'Interesting Road', and is therefore highly uninteresting.
Having more specific constraints in the search can help, but that will reduce the overall quantity of results.
Another approach would be to prune connections which have low connectivity with other parts of the graph. This will be the case with streets which come off the 'wrong' 'High Street'. Again, this would increase overall quality - but at the cost of quantity, with false positives inevitable.
- The regular expression for street name and type is fairly limited. Examples of streets the current regex will not catch are 'Route N' or 'Highway N' and single name streets like 'Broadway'.
To do
- Bite the bullet and throw away the regex parsing , replacing it with a purpose-built parser. This would be faster overall and has the potential of producing cleaner and more plentiful data from the same corpus.
- In order to move from a curiosity to something useful, the code needs to be extended to extract the spatial relations between streets. For example, the code would need to know that whilst going down street W you will cross street X,Y,Z in THAT ORDER. There is information around that can disambiguate these relationships and I've played with some of it. However, I cannot currently collect enough clean data to disambiguate enough streets to be useful. I may revisit this if I get around to implementing the purpose-built parser.
- Visualizing the graph dynamically as the search proceeds would look cool.
Acknowledgements
Google Cartography relies on several external libraries.- JUNG - The Java Universal Network/Graph Framework. JUNG provides a framework for the modeling, analysis, and visualization of data that can be represented as a graph or network. JUNG was used for making connectedness calculations as well as initial proof-of-concept visualizations.
- Prefuse - Prefuse is a user interface toolkit for building highly interactive visualizations of structured and unstructured data. Prefuse is used to provide the very cool animated radial layout algorithm used to show the final street graph.
- HTMLParser - HTMLParser is a library used to parse HTML documents. HTMLParser was used by the cartography applet to convert web pages into text.


>