The Importance of Location Independence

When I took Operating Systems in college, one of the concepts the course covered was "location independence" in filesystems. A filesystem is "location independent" if file paths do not reveal the physical location of the file.

Microsoft Windows does not have location independence, since absolute file paths have to begin with a drive letter. You know that C:\WINDOWS is located on the C:\ disk drive, just by reading the path.

UNIX, on the other hand, does have location independence. You can't figure out what drive /usr/bin/ls is on just by reading the path. In fact, it might not even be on a drive at all. Perhaps it is on a ramdisk, or on a network filesystem. You have to look at the mount table to find out.

You could probably quibble with this distinction. You can reassign drive letters in Windows, so that gives you a little bit of independence. It's a clunky and little-used process, however. Also, at least to my knowledge, you can't mount a filesystem within another filesystem as you can with UNIX.

I never understood why my OS textbook spent so much time on this concept. Wasn't it obvious that location independence was a good thing?


Reinventing the C:\ Drive

As it turns out, not everyone got the memo about location independence. People have continued to design systems where the path name contains information about where the files are located.

One such system, unfortunately, is Hadoop. Hadoop doesn't use UNIX-style paths. Instead, it uses "uniform resource locators"-- that's URLs, for short. URLs are better known as the thing you type into your web browser. A path in the Hadoop Distributed Filesystem might look like hdfs://example.com:6000/user/cmccabe/stuff/file.txt.

URLs sound very modern and Web 2.0, but there's a big problem. That very first part of the URL, the "scheme", is encoding the fact that we are using hdfs. The second part, example.com:6000, is called the "authority," for reasons that have something to do with web browsers, and nothing to do with Hadoop. The "authority" encodes which server and port we are using. Because we can learn the server and the port from the path, we have lost location independence.
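
To make this concrete, here is a minimal sketch that pulls apart the example path above using nothing but java.net.URI (plain JDK, not Hadoop's own Path class). The server and port fall right out of the string:

    import java.net.URI;

    public class WhereIsMyFile {
        public static void main(String[] args) {
            URI path = URI.create("hdfs://example.com:6000/user/cmccabe/stuff/file.txt");
            // The "path" itself tells us exactly which server and port hold the data.
            System.out.println("scheme    = " + path.getScheme());     // hdfs
            System.out.println("authority = " + path.getAuthority());  // example.com:6000
            System.out.println("path      = " + path.getPath());       // /user/cmccabe/stuff/file.txt
        }
    }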

The problem with these URLs is that they expose things to the application that would be better off hidden. Why should the application care that the server is on port 6000 of example.com? That should all be handled by a client-side mount table.
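
For comparison, here is a rough sketch of what a client-side mount table could look like. The MountTable class and its entries are hypothetical, not an actual Hadoop API; the point is that the application only ever handles plain paths, and a single table knows which server backs each subtree.

    import java.net.URI;
    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical client-side mount table: the one place that knows which
    // server actually holds each subtree of the namespace.
    public class MountTable {
        private final Map<String, URI> mounts = new HashMap<>();

        public void mount(String prefix, URI target) {
            mounts.put(prefix, target);
        }

        // Longest-prefix match (simplified: no path normalization or
        // component-boundary checks).
        public URI resolve(String path) {
            String best = null;
            for (String prefix : mounts.keySet()) {
                if (path.startsWith(prefix) && (best == null || prefix.length() > best.length())) {
                    best = prefix;
                }
            }
            if (best == null) {
                throw new IllegalArgumentException("no mount covers " + path);
            }
            return URI.create(mounts.get(best).toString() + path.substring(best.length()));
        }

        public static void main(String[] args) {
            MountTable mt = new MountTable();
            mt.mount("/user", URI.create("hdfs://example.com:6000/user"));
            mt.mount("/tmp", URI.create("file:///tmp"));
            // The caller never needs to know about example.com or port 6000.
            System.out.println(mt.resolve("/user/cmccabe/stuff/file.txt"));
        }
    }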

Hadoop URLs have led to a lot of problems. For example, when we were implementing HDFS high availability, we realized that there would no longer be just one metadata server for the client to talk to. The client needed to know the hostname and port of both the primary and standby NameNodes: if one NameNode was inaccessible, the client had to know where to find the other. The obvious solution was to put both of the hostnames into the authority. However, there was too much legacy software parsing URLs. This legacy software expected to see an authority of the form hostname:port, and often misbehaved or failed outright when the authority didn't look like that. So we came up with a hack. The authority would contain a fake hostname-- henceforth called the nameservice ID-- and, optionally, a fake port which would be ignored.
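
For reference, this is roughly what that hack looks like from the client side: the authority is a logical nameservice ID, and separate configuration maps it to the real NameNode addresses. The hostnames below are made up, the exact configuration keys may vary between Hadoop versions, and actually running this requires an HA cluster to be reachable.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class LogicalUriSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // "mycluster" is not a real hostname; it is a nameservice ID that the
            // client resolves to actual NameNode addresses via configuration.
            conf.set("dfs.nameservices", "mycluster");
            conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
            conf.set("dfs.namenode.rpc-address.mycluster.nn1", "nn1.example.com:8020");
            conf.set("dfs.namenode.rpc-address.mycluster.nn2", "nn2.example.com:8020");
            conf.set("dfs.client.failover.proxy.provider.mycluster",
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

            // The path the application sees still carries the fake "hostname".
            FileSystem fs = FileSystem.get(URI.create("hdfs://mycluster/"), conf);
            System.out.println(fs.getUri());  // hdfs://mycluster
        }
    }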

URLs also create problems for symlinks. Prior to symlinks being introduced, software could assume that paths starting with hdfs://example.com would access only files found on the HDFS NameNode at example.com. However, what if hdfs://example.com/foo is a symlink to file:///secret.txt? Issues like this can be a big problem for server processes that run with elevated privileges.
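
A privileged server that resolves paths on behalf of users has to guard against this explicitly. Here is a hypothetical sanity check (not code from Hadoop) that refuses to follow a resolved path onto a different filesystem:

    import java.net.URI;

    public class SymlinkGuard {
        // Hypothetical check: after resolving symlinks, make sure we are still
        // on the filesystem the caller was authorized to access.
        static void checkSameFilesystem(URI expected, URI resolved) {
            boolean sameScheme = expected.getScheme().equals(resolved.getScheme());
            boolean sameAuthority = expected.getAuthority() == null
                ? resolved.getAuthority() == null
                : expected.getAuthority().equals(resolved.getAuthority());
            if (!sameScheme || !sameAuthority) {
                throw new SecurityException("symlink escapes " + expected + ": " + resolved);
            }
        }

        public static void main(String[] args) {
            URI trusted = URI.create("hdfs://example.com:6000/");
            // Suppose hdfs://example.com/foo turned out to be a symlink to a local file.
            checkSameFilesystem(trusted, URI.create("file:///secret.txt"));  // throws
        }
    }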

The "fun" doesn't end there. URLs have more fields that just scheme, authority, and path. For example, they also have fragment, which is that thing that appears after a hash mark. This and other extended URL fields are handled very inconsistently in Hadoop. Some functions strip them off of the Paths they're given. Others don't, passing them back to the user. This creates compatibility headaches whenever you're dealing with path. Did you change a function to strip off the fragment, whereas it didn't before? You might break someone.

Parsing URLs is a complex task. The inefficiency, while galling, isn't the worst part. It's the cognitive overhead for both developers and users of dealing with all the corner cases in the spec. For example, how do you handle paths that include metacharacters such as the pound sign and the question mark? And so forth. At least everyone knows how to parse C:\WINDOWS. I sometimes wonder if anyone fully understands the URL parsing rules.
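
As a small illustration of those corner cases: a file whose name literally contains a pound sign gets silently misparsed unless someone remembers to percent-encode it first. (The file name here is made up; the behavior is plain java.net.URI.)

    import java.net.URI;
    import java.net.URISyntaxException;

    public class MetacharacterTrouble {
        public static void main(String[] args) throws URISyntaxException {
            // A file actually named "report#1.txt" loses its tail to the fragment.
            URI naive = URI.create("hdfs://example.com:6000/data/report#1.txt");
            System.out.println(naive.getPath());      // /data/report
            System.out.println(naive.getFragment());  // 1.txt

            // The multi-argument constructor percent-encodes the path correctly...
            URI escaped = new URI("hdfs", "example.com:6000", "/data/report#1.txt", null, null);
            System.out.println(escaped);              // hdfs://example.com:6000/data/report%231.txt
            // ...but now every caller has to remember which form it is holding.
        }
    }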

Conclusion

These problems are all the result of putting too much information about HDFS's inner workings into the path. In my opinion, the path should just be a plain old string containing path components separated by slashes. We need to learn from the simplicity and elegance of UNIX, not from the kludges of MS-DOS.