List

Imagine you are analyzing data on some administrative level i, as a response to some event that those administrative units experienced differentially. Other than the effect of the event on the unit that was directly affected you might also be interested in the effect the event had on its immediately surrounding neighbors, i.e. nearest neighbors, or on the neighbors of the neighbors, and so on. In order to estimate how the effect propagates across space you would need to first classify each such unit relative to the other units. Are they first degree neighbors, second, third, and so on. This can be done manually for a small number of such units, but as the number of units increases any manual procedure is strictly dominated by an automatic one. And while you could use distance from the treated unit it is important to remember that when the spatial units are of different size and shape the same distance from the center of the treated unit could result in a different number of neighboring units caught in the search radius.

 

I ran across the same problem as I was trying to classify the neighboring relationship of counties in the U.S. and I developed a procedure which starts off with some GIS work done in ArcGIS and then uses those outputs together with some Stata code that I wrote to deliver just that; each county’s relationship to other counties.

Spatial unit of reference in black, first degree neighbors in blue, second degree neighbors in dark cyan, and third degree neighbors in teal. Easy to classify manually for one polygon; but should not be attempted manually for ~3,000 polygons.

In this post I provide the full code and files necessary to reproduce the classification of U.S. counties up to the third degree neighbor, meaning a list of counties with their nearest neighbor, the neighbor of their neighbor, and the neighbor of their neighbor’s neighbor. I also provide the full documentation for the generalized code that can be used on any other administrative setting for whatever number of neighboring degree, as long as you have a shapefile and are willing to get your hands a little dirty in ArcGIS. This post is a long one, and I will probably update it over time once I realize I forgot to explain a step here and there or simply better clarify things based on comments received. While this is a long post, each step by itself is not overly complicated. The following is a step-by-step manual.

 

You will need:

  • ArcGIS 9.3 and above.
  • A shapefile of the polygons for the spatial unit of interest. From this you will produce:
    • A file with the centroids and coordinates of those polygons.
    • A file with the area of each polygon.
    • A file with the distances between the centroids.
    • A file with the lines that construct the polygons (ArcGIS 10 actually has a smoother way of identifying the neighboring relationship but it appears to crash for a large number of units).
  • Stata 10 and above.

First let’s focus on the steps you need to take in ArcGIS. I am assuming you have some rudimentary knowledge of ArcGIS and are familiar with concepts like projections etc. I am also assuming you have a shapefile with the polygons of the spatial unit of interest. Conditional on that proceed to:

  1. Open the ArcGIS Toolbox, go to Data Management >> Features >> Feature To Point. The input features should be the layer of the polygons. Choose a sensible name as the output features and save it in the project’s GIS directory (again, assuming you have one). I advise to save files in a GeoDataBase instead as separate shapefiles but I will leave the convincing to others. Great, we now have the centroids of each polygon.
  2. We want to add the coordinates to each centroid. Right click the layer of the newly created centroids. Choose “Open Attribute Table” and once the table opens up you will see a row of icons at the top of the table window. Click the one at the left and choose “Add Field.” A dialogue box will appear. Fill in the name “longitude” and choose a type of either “float” or “double” but definitely not “integer” as the variable type. Click OK and repeat this step only now create a “latitude” variable (do not forget to change the variable type).
  3. You should now have two empty fields called longitude and latitude. Right click on the column label for the longitude field and choose “Calculate Geometry.” In the “Property:” dropdown menu choose “X Coordinate of Point” and click OK. Repeat for latitude only now choose “Y Coordinate of Point” instead. We now have the centroids with their longitude and latitude coordinates in decimal degrees. The first ingredient is ready.
  4. To calculate the area I suggest converting the projection of the shapefile to a shape preserving projection. I am not going to go into this here but projections matter, a lot, and you should use projections that preserve areas when calculating areas, and projections that preserve distance when calculating distances, and projections that preserve the shape whenever you are making a map. Depending on your geographic area of focus ArcGIS probably has a projection ready that can preserve the needed property. Go the Toolbox >> Data Management Tools >> Projections and Transformations >> Feature >> Project. Use this to project the polygon shapefile to an area preserving projection (useful to save the file as the original file name but with an added suffix of EqArea).
  5. Right click the newly created projected polygons and choose “Open Attribute Table”. Again, choose the icon on the left in the row of icons on the top of the table. Choose “Add Field” and name it “Area” choosing a double or float variable type again. Right click the label of the new field and choose “Calculate Geometry” and notice the default property is “Area.” Verify that the correct (meaning the area preserving) projection is selected, choose your preferred unit, and choose OK. We now have the area of each spatial unit.
  6. Next we want the distance from each centroid the all the other centroids. You should project the polygons to a distance preserving projection (maybe use an “EqDist” suffix), convert the polygons into points (like in step 1 or just convert the projection of the centroids from step 1), and go to Toolbox >> Analysis Tools >> Proximity >> Point Distance. Both the input and near features should be the centroids. Choose the unit to your liking, choose a name for the new file, and hit OK. This step might take a couple of minutes so just let ArcGIS run its course.
  7. A crucial step after the distance table has been created is to properly join the new table with the original points layer, twice. Right click the newly created table. Choose “Join and Relates” and then “Join.” Make sure that “Join attributes from table” is chosen in the dropdown at the top of the dialogue box that just opened. Choose “INPUT_FID” for the option in (1), choose the original layer with the centroids in (2), choose “OBJECTID” in (3). This is matching the information from the original layer to the objects in the distance table (which currently just has arbitrary object ids as assigned by ArcGIS and the distance variable). Again, right click the distance table, choose “Join and Relates” and then “Join.” In (1) choose “NEAR_FID” and choose the same for (2) and (3). Now open the table (right click, open attribute table). You should have that each row is a centroid matched to another centroid (identified both by their object ids and by the data from the centroids layer) with the distance between them.
  8. Finally we are about to create the file that will help us establish neighboring relationships. We need to use Toolbox >> Data Management Tools >> Features >> Polygon to Line. The input features should be the polygon layer of the spatial units, and make sure you have “identify and store polygon neighboring information” checked. Run the tool.
  9. Now we need to export all of the above data into .csv files. The procedure is basically the same for each file. Right click the layer or table you want to export. Choose “Open Attribute Table” or just “Open” if it is already a table. Choose the most left button, and choose “Export.” In the dialogue box choose the file’s destination and choose text file as the file format. Add a “.csv” suffix to the file name, and hit OK. Rinse and repeat for all of the above files we just generated.
  10. You can preform the above steps in a different order and you might even have some of the files ready so you do not need to generate them. It could be that the polygon file already has the area variable but just to make sure it was calculated using a proper projection it is worth repeating that step. In terms of classifying the neighboring relationship only step (8) really matters but for completeness I detail all the steps and the following code delivers output which includes area, distances, and coordinates of each spatial unit.

Check list: you should now have a folder with the following .csv files and some vague recollection of the units used for the area and distances (useful to include them in the file name or variable name):

  1. Coordinates table.
  2. Distance table.
  3. Area table.
  4. Lines that define neighboring relationship (table from step 8.)

Finally, we can start working in Stata. The complete code is available here. The full code (found in the previous link) includes comments that I omit here. This is because I will follow each code segment with explanatory text and removing the comments will help to reduce the length of what is already a too long of a post.

Setting things up.

Lines 1-2 drop any previously saved program and then define it again. On line 4 I define the different arguments the program expects to receive:

  • coordinatesFile – This should be a file name with a .csv suffix (included in the name provided to the program).
  • longitudeVar and latitudeVar – These are the variables in the coordinates table file which identify the coordinates. Your file might have them as X and Y, or as long and lat, instead of longitude latitude. You are asked to provide it only once when calling to the program and this will save you the trouble of renaming variables in your source files or changing the code. Report all variable names in lowercase letters as Stata will convert them to lowercase when it imports the .csv file.
  • coordinatesKEEP_STRINGS – The column numbers in the imported coordinates file that should be preserved as strings even if they contain only numeric values. This is critical if you have leading zeros in the variable that identifies the spatial unit.
  • distanceFile – This is the file with the distances between the different centroids. Should also have the complete file name which includes the .csv suffix.
  • distanceUnit – Tell the program what is the unit of measurement for distance and it will label the variable accordingly.
  • distanceVar – The name of the variable with the distance between centroids, usually it’s “distance.”
  • distanceKEEP_STRINGS – The column numbers in the imported distance file that should be preserved as strings even if they contain only numeric values.
  • areaFile – The file with the different areas.
  • areaUnit – The unit of measurement for the area, this will be used for the variable’s label.
  • areaVar – The name of the of the area variables. If you followed the steps above it should just be “Area.”
  • areaKEEP_STRINGS – The column numbers in the imported area file that should be preserved as strings even if they contain only numeric values.
  • neighboringLinesFile – This is the name of the file from step (8) above.
  • rootDir – The root directory of the project.
  • PATH_GIS – The path from the root directory to the GIS folder.
  • geoid – The variable which identifies the different units, usually “objectid.”
  • idVar – The variable name for the spatial unit identifier (e.g. “fipscode”)
  • maxNeighborDegree – A parameter which sets the number of neighbor degrees the program should return (=1 for nearest neighbor, =2 for neighbor of neighbor, and so on). This is cumulative, meaning that setting it the parameter =2 will return both the nearest neighbor, and the second degree neighbor.

Lines 23-26 define a few temporary files that will be used later. Next we import some of the .csv files.

The above code simply loads the different files which contain the data on the coordinates, distance between centroids, and area of the different spatial units and saves only the relevant data to memory.  Notice that the option stringcols() is used to preserve numeric variables as strings, using the information provided about which columns should be preserved.

 

Notice that in when loading the distance file the first column with the spatial unit identified is given a suffix of “Prime.” This will repeat itself later as well. This simply defines them as the point of reference, meaning that any neighboring degree or distance developed later on is always relative to the prime unit. Next we import the data on the neighboring lines.

We import the .csv file which has the information on the lines that construct the polygons. This file also contains the information from which we will construct the neighboring relationship. The two key variables here are left_fid and right_fid. Whenever they are positive it means that there is another unit to the left or to the right of that line. The value in that variable is the object id which ArcGIS assigns to that unit. This is a bit confusing but the following examples help to grasp what this is representing.

 

In line 4 we keep only the data that helps us identify neighbors, in line 5 we keep only unique data, in lines 6-7 we define the object ids of the spatial units as gisID and the units that are identified to the right of those gisID units as neighborID. We save a temporary file, reload the original data and repeat this to identify the neighboring units to the left of each unit. We then append the two files and save them. This file is now a running list of all the object ids that have neighboring units, gisID, and the object ids of the neighbors, both to the left and to the right, in neighborID. We are going to use the object id to match these data to the other files we imported earlier. We could have preformed a join on the left_fid and right_fid as we did with the distance data, and preform the matching based on the unit identifier. That would have created a bit more work in ArcGIS and as long as all the files originate from the same shapefile then matching based on object ids should not be a problem. Next we are finally going to start with the classification.

We start the process by reloading the data on the coordinates. This serves as the baseline data on which we start layering more information on. In lines 3-6 we add the data about the area of each prime unit. We rename the variables and add to them the “Prime” suffix because everything will be measured relative to them.

 

In line 13 we do a one to many merge because each gisID is unique in the coordinates file but could be connected to multiple neighbors. Keeping only the successfully merged units should not be a problem as long as there are no islands, meaning units that have no other units surrounding them. In lines 17-22 we use the object id of the neighbors to match them to their coordinates data. In lines 24-32 we add the data about the distance between the prime units and the nearest neighbors we just identified, as well as the area of those nearest neighbors. We then finish with renaming the newly created variables with the suffix “Neighbor1.” If all you are interested in are the nearest neighbors then this is basically it, by providing the maxNeighborDegree parameter as 1 the program will basically stop here. But, if you entered maxNeighborDegree>1 then the next steps are going to generalize the procedure.

The first line just defines an empty local where we will store the order of variables on which to sort later. In line 2 we begin to loop over the different neighboring degrees from 2 (because we have already identified the first degree which is the nearest neighbor) up to the maximum number inputted to the program.

 

Inside the the loop, in line 4, we use joinby which creates all the pairwise combinations between the object ids and their neighboring ids. Remember that gisID now holds the object ids for the nearest neighbors. We are now matching the nearest neighbors with their nearest neighbors, meaning the neighbor of the neighbor of the prime unit. In other words, the second degree neighbor. This process is the source of the inefficiency of the code because the pairwise combination is going to create many duplicates that we will have to clean later, but for now let’s focus on replicating the previous steps. Once we have matched the second degree neighbors we go back and add their coordinates data, distance from the prime unit, and area of the second degree units (lines 10-22). We then rename all the new variables with the “Neighbor” suffix which also includes the number which indicates the neighboring degree. Line 30 simply updates the local that will be used in line 33 for sorting at the end of the loop.

 

Assume we specified maxNeighborDegree=3. We should now have the prime units, with their nearest neighbors, their second degree neighbors, and their third degree neighbors; together with the coordinates, distance to the prime unit, and area. The only problem is that I wrote this in a very fast and dirty way and we have a lot of duplicates, meaning units that are repeating themselves and units that are identified as their own neighbor. This is far from ideal and the next steps are going to remove them. Unfortunately, this is the part that scales badly. As we go up in the max neighboring degree we will be going from a runtime of a few hours, to a about a day, and well into “just leave it running for a week or two” realm. If there was any proof needed that I am an ill-trained programmer, this is probably it.

The first loop just drops all the cases where a prime unit was assigned as its own neighbor. This happens because A has A1, A2,… at its neighbors and when we match the neighbors of A1 then A is in that set. Clearly, that is not useful information so we can quickly get rid of it.

 

Next we repeat the elimination for the neighbors, meaning the first degree neighbors should not appear as second degree neighbors as well, and so on. This is much trickier because we build those relationships through pairwise combinations. We need to hold a prime unit constant and scan through all of second degree neighbors and see if they have any unit which was already identified as a first degree neighbor, and if so, drop it. Then scan through all of the third degree neighbors, again, while holding the prime unit constant, and drop all the units that were already identified as second degree units. And this keeps going.

 

We start by looping over the different values of the variable which identifies the different prime units, obtained in line 6 (“qui” just suppresses the output) . We enter a second loop for the number of neighboring degrees, notice we stop one level before the last level. For that prime unit we get all the unit values of the current degree neighbor and enter a third loop where we loop over those values. We then check each value to see if it appears in the next neighboring degree, conditional on the prime unit, and drop it if it does. Notice that there will still be duplicates in the levels up to the highest degree level, but at least we will not have the same unit appearing on several levels. Meaning that first and second degree neighbors might appear on multiple rows for the same prime unit, but the third degree neighbors will be unique (assuming that is the highest level). Since this process might take a while I include code in lines 7-8 and 18-20 that produces a progress report in the console. Now we can wrap up.

Now we just create a variable that will make it easier to merge this file with other files based on the identifying variable, save the temporary file to memory (you can simply change that such that it will save to disk or add a line after running the program which does that), and finally close the program’s defintion.

 

You now have a mapping of the spatial relationship of each unit to its surrounding units based on the neighboring relationship. This includes coordinates data if you wish to spatially adjust your standard errors; includes distance if you want to restrict the sample, create variables based on distance, or calculate inverse distance weights; includes area in case you want to normalize by it or include it as a regressor. I hope this is flexible enough for your needs and I would love to hear about any bugs etc.

 

To conclude, I am attaching the link to the complete .do file which defines the program, the .csv files for the contiguous U.S., the file that executes the program using those files and up to a third degree neighboring degree, and finally, the file that results from running all that so you do not have to run it from scratch. Available as a zipped folder.

 

One final note. VALIDATE! VALIDATE!! VALIDATE!!! Make sure you manually go through some of the results and check that they make sense. Just pick two or three spatial units and see that the final file correctly captures all their neighbors at the different degrees, that none are missing, and that there are no included neighbors that are not neighbors at all.

If you found this useful, please let me know. I need something to justify to myself the work that went into generalizing this program and writing the documentation.

Leave a Reply

Your email address will not be published. Required fields are marked *

  Posts

1 2
November 14th, 2016

Code Riffs: Stata and Regression Tables

September 16th, 2016

Trade Ban on Ivory: Are We Getting it Right?

May 27th, 2016

#WildForLife Campaign Aims to Reduce Demand for Wildlife Trade Prodcuts

May 1st, 2016

Recent Examples of the Large Number of Species Traded Globally

March 18th, 2016

A Bats Housing Boom

February 23rd, 2016

Scraping a Web Form with Multiple Pages Using Selenium – A Neat Trick

December 19th, 2015

Scraping Web Forms with Selenium

December 3rd, 2015

Watch Racing Extinction

December 1st, 2015

Classifying Neighboring Spatial Units

November 19th, 2015

Bacterium Trumps Fungus, Fungus Trumps Bats