GIS Data Formats and My Stubborn Opinions

Taking this break I’ve been looking over my spatial data and trying to figure out how to best organize it. The largest public project I manage is the GeoJSON Ballparks and this one is easy to manage as it is just a Git repository with text files. GeoJSON makes sense here because it is a very simple dataset (x/y) and it has been used for mapping projects mostly which makes the GeoJSON format perfect. I used to maintain a Shapefile version of it in that repository but nobody ever downloaded it so I just killed it eventually.
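For anyone who hasn’t looked inside one of these files, here is a minimal sketch of what a single point record looks like as GeoJSON. The park name and coordinates are made-up placeholders, not values from the actual repository, and the function names are just for illustration.

```python
import json


def make_point_feature(name, lon, lat):
    """Build one GeoJSON Feature for a point. GeoJSON orders coordinates as [lon, lat]."""
    return {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": {"name": name},
    }


def to_geojson(features):
    """Wrap features in a FeatureCollection and serialize to indented text."""
    collection = {"type": "FeatureCollection", "features": list(features)}
    return json.dumps(collection, indent=2)


if __name__ == "__main__":
    # Placeholder values, not a real ballpark.
    print(to_geojson([make_point_feature("Example Park", -100.0, 40.0)]))
```

Because the result is plain text, every change shows up as a readable diff in Git, which is most of why the format works so well for this kind of dataset.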

But my other data projects, things I’ve mapped or worked on in the past, are in a handful of formats:

VECTOR

  • Shapefile
  • File Geodatabase
  • Personal Geodatabase
  • GeoJSON
  • KML
  • SpatiaLite

RASTER

  • TIFF (mostly GeoTIFF)
  • Esri Grid

As you can tell from some of these formats, I haven’t touched these datasets in a long time. Being Mac centric, the Personal Geodatabase is dead to me, and given that the modification dates on that stuff are 2005-2007, I doubt I’ll need it anytime soon. But it does bring up the question of archival. Clearly PGDB isn’t the best format for this and I probably should convert it soon to some other format. Bill Dollins would tell me GeoPackage would be the best, as Shapefile would cause me to lose data given the limits of DBF, but I’m not a big fan of the format, mostly because I’ve never needed to use it. Moving the data to GeoJSON would be good because who doesn’t like text formats, but GeoJSON doesn’t handle curves, and while it might be fine for the Personal Geodatabase data, it doesn’t make a ton of sense for more complex data.

This is as close to a shapefile icon as I could find, tells you everything doesn’t it?

I’ve thought about WKT as an archival format (specifically its binary form, WKB), which might make sense for me given the great WKT/WKB support in databases. But again, could I be making my life harder than it needs to be just to avoid using GeoPackage? Still, there is something about WKT/WKB that makes me comfortable storing data for a long time, given the long-term support of the standard among so many of those databases. The practical method might be everything in GeoJSON except curves, and those can go into WKT/WKB.
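The “GeoJSON unless it has curves” rule is simple enough to sketch. This assumes each geometry is already available as a WKT string, and the list of curve keywords is my assumption about which types matter (the SQL/MM curve types that GeoJSON has no equivalent for). The function names are hypothetical.

```python
import re

# Assumed list of WKT geometry keywords that GeoJSON cannot represent.
CURVE_TYPES = {
    "CIRCULARSTRING",
    "COMPOUNDCURVE",
    "CURVEPOLYGON",
    "MULTICURVE",
    "MULTISURFACE",
}


def wkt_has_curves(wkt):
    """Return True if any curve keyword appears anywhere in the WKT string."""
    words = re.findall(r"[A-Za-z]+", wkt.upper())
    return any(word in CURVE_TYPES for word in words)


def pick_archive_format(wkt):
    """Apply the rule: curves stay in WKT, everything else goes to GeoJSON."""
    return "WKT" if wkt_has_curves(wkt) else "GeoJSON"


if __name__ == "__main__":
    print(pick_archive_format("POINT (1 2)"))
    print(pick_archive_format("CIRCULARSTRING (0 0, 1 1, 2 0)"))
```

Scanning every keyword rather than just the first one means a curve nested inside a geometry collection still gets flagged.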

Raster is much easier given most of that data is in two fairly open formats. GeoTIFF or TIFF will probably be around longer than you or I, and Esri grid formats have been well supported through the years, making both fairly safe. What are some limits to data formats that I do worry about?

  1. File size, do they have limits to how large they can be (e.g. TIFF and 32-bit limit)
  2. File structure, do they have limits to what can be stored (e.g. GeoJSON and curves)
  3. File format issues (e.g. everything about the Shapefile and dbf)
  4. OS centric formats (PGDB working only on Windows)

I think my two biggest fears are the last two, because the first two can be mitigated fairly easily. My plan is the following: convert all vector data into GeoJSON, except where curves are required. I’m punting on curves right now because I only have 3 datasets that require them, and I’ll leave them in their native formats for now. The raster data is fine; TIFF and grid are perfect and I won’t be touching them at all. The other thing I’m doing is documenting the projects and data so that future James (or whoever gets this hard drive eventually) knows what the data is and how it was used. So little of what I have has any documentation; at least I’m lucky enough that the file names make sense and the PDFs help me understand what the layers are used for.

One thing I’ve ignored through this: what to do with those MXDs that I cannot open at all? While I do have PDF versions of those MXDs, I have no tool to open them on Mac, and even if I could, the pathing is probably a mess anyway. It brings up the point that the hardest thing to archive is cartography, especially if it is locked in a binary file like an MXD. At least in that case, it isn’t too hard to find someone with a license of ArcMap to help me out. But boy, it would be nice to have a good cartography archival format that isn’t some CSS thing.

SpatialTau v1.2 – Tilting at the Shapefile

SpatialTau is my weekly newsletter that goes out every Wednesday. The archive shows up in my blog a month after the newsletter is published. If you’d like to subscribe, please do so here.


Tilting at the Shapefile

Now I’m sure if I went back to my blog and searched for how many times I’ve tried to kill off the shapefile, even I would be surprised at how many times I’ve blogged about it.  Thus it seems about right for the second newsletter I’ve ever written to focus on the “Shapefile Problem”.

The Problem

So what exactly is this problem?  I mean what is so bad about a well supported, somewhat open file format?  I’ve told this story before but it never hurts to repeat.  My dad was borrowing my laptop a couple years ago and commented about all these DBF files all over my desktop.  He wondered why on earth I would have a format that he used in the late 80’s and outgrew because of its limitations.  Well I proceeded to explain to him the shapefile and how it worked and he just laughed.  That’s right, my 72 year old dad laughs at us wankers and our shapefile.  The DBF is only half the problem with the shapefile.  It doesn’t understand topology, it only handles simple features (ever try and draw a curve in a shapefile?), it has a puny 2GB file size limitation, and not to mention you can’t combine points, polygons and lines in one file (hence every shapefile name has the word point, line or poly in it).

Oh and it’s anywhere between 3 and 15ish file types/extensions.  Sure 3 are required but the rest just clutter up your folders.  I love the *.shp.xml one especially because clearly they thought so much about how to render metadata.  If I had a penny for every time someone emailed me just the *.shp file without the other two I’d be a rich man.  Heck just the other day I got the *.shp and *.dbf but not the *.shx.  Just typing the sentence makes me cringe.
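Checking for the missing-sidecar problem is easy to automate. A minimal sketch, assuming you have a list of attachment or folder file names; the function name is hypothetical, and the only thing taken as given is that *.shp, *.shx and *.dbf are the three required parts.

```python
from pathlib import PurePath

# The three parts a shapefile needs to be readable.
REQUIRED_EXTENSIONS = (".shp", ".shx", ".dbf")


def missing_shapefile_parts(filenames):
    """Map each shapefile base name to the required extensions that are absent."""
    found = {}
    for name in filenames:
        path = PurePath(name)
        extension = path.suffix.lower()
        if extension in REQUIRED_EXTENSIONS:
            found.setdefault(path.stem, set()).add(extension)
    return {
        stem: [ext for ext in REQUIRED_EXTENSIONS if ext not in present]
        for stem, present in found.items()
        if len(present) < len(REQUIRED_EXTENSIONS)
    }


if __name__ == "__main__":
    # The situation described above: *.shp and *.dbf arrived, *.shx did not.
    print(missing_shapefile_parts(["parks.shp", "parks.dbf"]))
```

Optional files such as *.prj or *.shp.xml are ignored here, since their absence doesn’t stop the data from opening.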

The Contenders

  1. The File Geodatabase (FGDB):  Esri’s default format for their tools.  It’s a spatial database in a folder format.  The less mentioned about the Personal Geodatabase, the better.  But unlike what most companies have built in the past 5 years, it isn’t built on SQLite, but on Esri’s proprietary geodatabase format.  There isn’t anything inherently wrong with Esri taking this path but it means you’re stuck using their software or their APIs to access the file format.  To me this severely limits the FGDB as an interchange file format and I think that is perfectly fine with Esri as they don’t really care too much if the FGDB doesn’t work with others’ software.  I’d link to an Esri page that describes the FGDB but there isn’t one. It’s a secret proprietary format that even Esri doesn’t want to tell you about.
  2. SpatiaLite: SpatiaLite has everything going for it.  It’s a spatial extension to SQLite which means at its core it’s open.  It’s OGC Simple Features compliant.  It is relatively well supported by GIS software (even Esri technically can support it with the help of Safe Software).  Plus it supports all those complex features that the shapefile can’t.  Heck, OGC even chose it as the reference implementation for the GeoPackage (assuming people still care about that).  Heck, it supports rasters too!  But honestly, SpatiaLite was released in 2008 and hasn’t really made a dent in the market.  I can’t ever remember downloading or being sent a SpatiaLite file.  I’m guessing you can’t either.  I mean we all want a format that is similar to PostGIS and easily transferable (one file).  On paper that’s SpatiaLite.  But I think we have to chalk this up to Esri not supporting the format, so it is relegated to niche use.
  3. GML/KML: Ron Lake probably loves that I grouped these together but honestly they’re so similar in basic structure I’ve really just left them together.  My company uses KML quite a bit to share georeferenced photos.  That’s about it, pretty low use.  There is a ton of KML out there but it is mostly points.  There might be a ton of GML out there but I’m not Ron Lake.  KML is nice in the sense it has visualization included in the spec (you can make a line yellow) but it isn’t enough to get excited about.  It’s an OGC standard but as with SpatiaLite that doesn’t really seem to matter in the real world.  Don’t even try and use a different projection.  They have their use in specific cases but the limits of the formats mean you’ll never see them being an interchange format.  Plus XML?  Oh and feel free to email me how GML is powerful because it supports OGC Simple Features, I’ll still include it with KML.
  4. GeoJSON: It’s an open standard, so open in fact that OGC isn’t involved.  That’s a huge plus because mostly what standards organizations do is make complex file formats for simple uses.  That’s not what GeoJSON is.  It can be in many types of projections, it can be points, polygons and lines (with variations of many), it supports topology with the TopoJSON format and it’s JSON so it’s human readable.  But alas it isn’t supported by Esri so we run into the same problem as SpatiaLite.  BUT, Esri has shown interest in GeoJSON so there is hope that it will be well supported soon.  As with the shapefile/KML and unlike SpatiaLite it won’t support curves and other complex geometry or rasters and never will.  Thus it is not well suited as a shapefile replacement.
  5. Well Known Text (WKT): This comes out of the OGC and is used by software such as PostGIS for storage.  WKT supports lots of geometric objects (curves!) and TINs.  I’ve never been limited by WKT for vector files (you can almost feel where the end of this is going though) and many spatial databases from PostGIS and Oracle to SpatiaLite and SQL Server use the WKB (Well Known Binary) equivalent to store information.  But alas, we still don’t support rasters.  It’s a vector format for vector data.  SpatiaLite and the File Geodatabase both support rasters.
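To make the WKT/WKB pairing concrete, here is a minimal sketch that writes the same 2D point both ways using only the standard library. It covers just the plain Point type (geometry type code 1) and the function names are hypothetical; real databases add their own extensions, such as an SRID, on top of this basic layout.

```python
import struct


def point_to_wkt(x, y):
    """Text form: human readable, easy to diff."""
    return f"POINT ({x} {y})"


def point_to_wkb(x, y):
    """Binary form: byte-order flag, geometry type, then x and y as doubles."""
    # 1 = little-endian byte order, 1 = geometry type code for Point.
    return struct.pack("<BIdd", 1, 1, x, y)


def wkb_to_point(wkb):
    """Read a 2D Point back out of its 21-byte WKB encoding."""
    order = "<" if wkb[0] == 1 else ">"
    geometry_type, x, y = struct.unpack(order + "Idd", wkb[1:21])
    if geometry_type != 1:
        raise ValueError("only plain 2D points are handled in this sketch")
    return x, y


if __name__ == "__main__":
    print(point_to_wkt(1.0, 2.0))
    print(point_to_wkb(1.0, 2.0).hex())
```

The binary form is compact and exact, the text form is readable, and both describe the same geometry, which is a large part of the appeal for long-term storage.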

There are many other formats but I think these are the only ones that really have any traction.  I could list formats such as GeoTIFF and say you could use that for rasters but you are limited to 4GB of data.  The vector guy in me wants to just say the heck with it all and use GeoJSON and WKT to solve the problem but given I’m still writing about this subject in December 2014 neither is a good solution.  We’re left with one simple truth…

The Verdict

The shapefile will outlive us all.  Unless Esri stops supporting it with their software at the same time as QGIS, Autodesk, etc it will continue to be the format that everyone uses.  In 2014 I’d wager 80% of all production geospatial data (I’m sandbagging here, probably this number is 95%) is stuck in the shapefile format where it resides comfortably.  Personally I’m a big fan of GeoJSON but I’ve started to get back into WKT lately and love the complex geometry support. If there is one thing I’ve learned in the past 20 years of “professional GIS” I’ve done, the shapefile is king.

Virtual Earth + Shapefile Reader + MSN Messenger

I was just thinking about a couple things today while watching my laptop struggle to execute Kriging.

How cool would it be if someone took the Virtual Earth Shapefile Reader and mashed it up with the MSN Messenger Virtual Earth plugin? Then anyone could send a shapefile to anyone else via an instant message and have it already viewable inside a small browser window. Sounds cool to me. ESRI’s GeoChat seems to be similar to this, but I believe that one requires ArcGIS to work.

Putting Shapefiles into Virtual Earth

Link – Virtual Earth Shapefile Viewer – via Virtual Earth Blog

Interesting, and it works pretty well. Upload any shapefile to the Internet and then just paste the URL into the form and submit. There isn’t any description yet on how this is done or what you need to do to your shapefiles to get them ready for inclusion in Virtual Earth, but it is impressive nonetheless. Hover over the centroid to get a pop-up id of each record. The GIS community has pretty much ignored Virtual Earth since day one, but maybe this is the start of something new.

Update – Brian Flood posted the following in the comments.

pretty slick. for the record:

  1. background transfer of XML encoded point,polyline,polygon shapefile data. I’m not sure if it’s GML or just some quick and dirty xml
  2. javascript (js) parses the xml and either uses a custom class MPolyline to create VML (IE only) for polylines/polygons. For points, it just uses the VE AddPin method. Translation between the XML coords to map coords is handled with VE GetX()/GetY() methods
  3. local javascript from the site adds prototype handlers to the main VE_MapControl that handle the VML placed on top of it
  4. Looks like some symbology is randomly generated.

nice work, whoever they are 😉

ESRI Shapefile to KML

Link – Shape2KML v1.0 via ArcScripts

Shape2KML works within ArcMap 9.x to convert points, lines, and polygons to KML for viewing and manipulation in Google Earth.

Seems like a very simple method to convert shapefiles to KML. Unfortunately there isn’t any source code to see how this was done and make improvements, but I’m sure any feedback to the author would be appreciated. I’m still on vacation and not near my license manager to check this out so anyone who’s tried it, post in the comments what you think.

Thanks to Ray Carnes for pointing this out to me. Good eye!

**Update – Mike points out in my comments a VBA script that converts a feature layer into a KML file. It only exports linear features right now (input can be point, poly, line), but at least this one has the source code. 🙂 **

Email Shapefiles or Geodatabases? Nah, give me e00!

I’ve grown to really dislike emailing datasets to people. Shapefiles have always been a pain as you either have to attach at least 3 files to an email or “zip” it up to ensure that the files are readable on the other end. The Geodatabase did better as it was one file containing one or more datasets, but alas these days emailing a Microsoft Access file is just about worthless as most email systems (and even Outlook to an extent) strip out anything with a *.mdb extension. Yea, one could always use FTP or some other web based system, but email is still the easiest and quickest way to send files.
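The “zip it up” step is the kind of thing worth scripting so nothing gets left behind. A minimal sketch, assuming every part of the shapefile sits in the same folder and shares the same base name; the function name and the example paths are hypothetical.

```python
import zipfile
from pathlib import Path


def zip_shapefile(shp_path, zip_path):
    """Bundle every file that shares the shapefile's base name into one zip."""
    shp = Path(shp_path)
    target = Path(zip_path).resolve()
    prefix = shp.stem + "."
    # Collect the parts before the archive is created so it never includes itself.
    parts = sorted(
        p for p in shp.parent.iterdir()
        if p.is_file() and p.name.startswith(prefix) and p.resolve() != target
    )
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for part in parts:
            archive.write(part, arcname=part.name)
    return [p.name for p in parts]


if __name__ == "__main__":
    # Hypothetical paths; point these at a real shapefile to try it.
    print(zip_shapefile("data/parks.shp", "parks.zip"))
```

Matching on the base name picks up the optional extras (*.prj, *.shp.xml and friends) along with the three required files, so the person on the other end gets the whole thing.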

One format that was never difficult to send (though half the time people didn’t know what to do with it) was the ESRI Export Interchange file (what we mostly now call e00). This single ASCII file was almost always accepted by email systems and could store many different dataset types. Why is it we always take a couple steps back as we move forward? I just can’t stand having to change a Geodatabase extension to *.txt or something else just to get an email system to accept it, or remind people that they need to at least include the *.shx and *.dbf with that shapefile they sent. I’d love to see a new interchange format from ESRI, or just an update to the existing Export Interchange format to handle the newer data types that ESRI is supporting. In retrospect, using a Microsoft Access file format probably wasn’t the best idea for many reasons, but if we had an up to date interchange format, that wouldn’t matter at all.

Open Source Vs Proprietary

We’ve loaded up PostgreSQL and PostGIS up on our Linux server to start playing around with it and I’ll post some thoughts I have of things so far.

Getting PostgreSQL installed wasn’t too much trouble, but PostGIS was a pain. It has to be compiled before installing. My database programmer got it working after a couple hours, but after using ArcSDE for so many years it was an eye opener. I’m sure they will get a compiled version up, but for now we had to do it ourselves. My next thought was to see if ArcCatalog could connect to it. We tried an ODBC driver and an OleDb driver but had no luck. Databases are not my strong point and while we were able to get them to connect to PostgreSQL, we couldn’t seem to connect to PostGIS. There must be something in how PostGIS handles the spatial data that these drivers can’t handle. This is somewhat of a big deal for us as most of our data is in either ArcSDE or Personal Geodatabases and PostGIS only allows loading of data via shapefiles. I was hoping to use ArcCatalog to perform the loading, but I guess we will have to export to shapefiles and then use the shp2pgsql command. It appears that the ArcGIS Data Interoperability Extension supports PostGIS, but if you have to spend over two grand on an extension, what is the point of going to PostGIS? I’m sure we can script something, but I would have rather had the ArcCatalog option open to everyone.
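Scripting the shapefile route mostly comes down to generating one command per exported file. A minimal sketch that only builds the command text rather than running it: the table, database and SRID values are placeholders, and the -s (SRID) and -I (spatial index) options are the ones documented for current shp2pgsql releases, so an older install may behave differently.

```python
import shlex


def shp2pgsql_pipeline(shp_path, table, database, srid=4326):
    """Return the shell pipeline that loads one shapefile into PostGIS."""
    # shp2pgsql writes SQL to standard output; psql reads it and runs it.
    load = ["shp2pgsql", "-s", str(srid), "-I", shp_path, table]
    run = ["psql", "-d", database]
    quoted_load = " ".join(shlex.quote(arg) for arg in load)
    quoted_run = " ".join(shlex.quote(arg) for arg in run)
    return f"{quoted_load} | {quoted_run}"


if __name__ == "__main__":
    # Placeholder names; substitute your own shapefile, table and database.
    print(shp2pgsql_pipeline("parks.shp", "public.parks", "gisdb"))
```

Looping this over a folder of exported shapefiles gets everything loaded, though it is still command line work rather than something you can hand to every ArcCatalog user.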

So what does this mean to our development? Probably not too much except it is a strike against open source GIS. If PostGIS had a Windows driver that allowed ArcCatalog access, things might be different and we could recommend it to our clients; command line isn’t a user friendly proposition. Open source GIS seems to mimic open source in general. It’s getting better, but you still need command line experience to truly get value from it. Since ArcGIS 8, ESRI has really pushed the GUI for GIS, giving even the most greenhorn GIS specialist commands that 10 years ago were run by very experienced GIS Analysts on UNIX.

It is very easy to criticize ESRI for their products but they have really taken the GUI to places where open source is at least 5 if not 10 years away from being. I still think that open source GIS has a place on the server side, but we need to figure out ways to get data loaded from industry standard programs such as ArcCatalog before it will start to take off.