Feb 27, 2020

jk diary: filesystem walk

This describes, as best I can remember, the thought process behind the walk procedure in jk’s standard library.

Aim

Required:

A walk procedure which will recursively walk the filesystem and tell you about all the files.

Don’t consume stack (i.e., use JavaScript tail call elimination a.k.a. loops).

Apparatus

You have (existing std library procedures):

info, which gives you the name and path of a file, and if it’s a directory
dir, which gives you the contents of a directory (an info for all the files in it)

Method

Pick some motivating uses:

Find the path to all YAML files under a directory
Print a tree of the directories and their files

Pick some implementations and analyse according to how good at the uses they are
Weigh up which bits are the best

Background data

walk for NodeJS, and walk for NodeJS as a library:

https://gist.github.com/lovasoa/8691344 https://www.npmjs.com/package/walk

There are idiomatic NodeJS, in the sense that you pass callbacks and have to rely on side-effects if you want to calculate a result.

In ES6 we can do better than EventEmitters, since we have both Promises and generators.

os.walk for Python

os.walk(top, topdown=True, onerror=None, followlinks=False)

You start it off with a directory and optionally tell it top-down or bottom-up. It gives you a generator of (dirpath, dirs, files). If operating top-down, you can remove things from dirs to prevent it recursing (but not if bottom-up, since it will already have recursed by the time you see it).

This works nicely since you can compose your own generator on top of it:

def yamels(dir):
    for base, dirs, files in os.walk(dir):
        for f in files:
            if f.endswith('.yaml') or f.endswith('.yml'):
                yield os.path.join(base, f)

Mutating an argument to control the recursion is alittle distasteful (to me anyway); but quite practical, since it gives you a lot of fine control.

In general, to keep track of where you are in the tree, you have to do a calculation on the path.

filepath.Walk in Go

Walk(root, func(string, FileInfo, error) error) error

Files are visited in lexicographic order. To prevent recursing into a particular directory, you return a filepath.SkipDir from your callback. You also get to choose what to do with any problems the driving procedure encounters, which are passed to your callback as an argument.

This has the advantage of being simple and user-pays – you do all the bookkeeping.

Java FileVisitor interface

walkFileTree(Path, FileVisitor)

This is a callback API with a menu of three callbacks:

FileVisitor:
 - preVisitDirectory
 - postVisitDirectory
 - visitFile

You can return one of several sentinel values from preVisitDirectory to tell the library to skip that directory, exit the walk, and other variations.

This is so Java! But { pre, post, visit } give you a lot of control to, e.g., the capability to skip a directory or do some bookkeeping when unwinding the recursion.

As with Go, you must rely on side-effects to build any data structure as you go.

Results

In Python os.walk gives you whole directories at a time, but doesn’t tell you whether it’s going down or up the tree (you need to look at the path for that).

In Go filepath.Walk visits each file, and you are told about directories (and can control recursion into them) as they are encountered; as with Python, you have to figure out where you are by looking at the path.

The Java java.nio.files.Path#walkFileTree procedure uses a callback interface, rather than a callback invoked in different modes, but is pretty similar to the Go formulation otherwise. However it does provide for one thing the others don’t: a post visit hook, so you can know when you are exiting a directory.

Attempt one – naive depth-first traversal

I decided to try using generator functions. It would be a major convenience to be able to just loop over the files.

Putting directories on a stack as we encounter them gives us a variety of depth-first traversal (we’ll visit a directory’s contents before we visit sibling directories).

function* walk(path) {
    const stack = [path];
    while (stack.length > 0) {
        const d = dir(stack.pop());
        for (const f in d.files) {
            if (f.isdir) stack.push(f.path);
            yield f;
        }
    }
}

This has the benefit of being very simple. You can iterate over it to see every file, and filter as you please:

for (const f of walk('.')) {
    if (f.name.endsWith('.yaml') || f.name.endsWith('.yml')) {
        log(f.path);
    }
}

These are trickinesses:

how do you control recursion?
how do you know when you’ve been recursed? Each directory is put on the stack for later, but also yielded; so you see a directory before you see its files, but you don’t know when those files start (without, say, doing some calculation based on the path)

Attempt two – be more Python.

function walk(path) {
    const stack = [path];
    while (stack.length > 0) {
        const d = dir(stack.pop());
        for (const f of d.files) {
            if (f.isdir) stack.push(f.path);
        }
        yield d.files;
    }
}

This differs from os.walk because it just returns all the files (i.e., not files and directories separately). Yielding the array before pushing subdirectories on the stack would let you remove entries to avoid recursing into them, like os.walk does.

The mode of use doesn’t really differ from the first attempt; it just requires a little more work:

for (const files in walk('.')) {
    for (const f in files) {
        if (isYAML(f.name)) log(f.path);
    }
}

An interesting little thing: if you change yield to yield* it’s the same as the first attempt.

Trickinesses:

now you have to do your own iteration over the files (not that it’s difficult)
you still don’t see the directory just before the files in it, as in attempt one.

Attempt three – preorder your walk procedure today

The previous attempts suffered from not knowing when diving into a directory; so you can’t tell when the new file (or files) are under the previous directory, or sibling, or a sibling of the parent.

At least we can do a proper preorder, so that e.g., in the tree:

the files are visited in the order A, B, D, E, C. This way, the contents of a directory follow straight after it.

function* walk(path) {
  const top = dir(path);
  const stack = [];
  let next = top.files;
  while (next !== undefined) {
    for (let i = 0; i < next.length; i++) {
      const f = next[i];
      if (f.isdir) {
        // whenever we see a directory, yield it,
        // and put the remainder on the stack
        stack.push(next.slice(i+1));
        yield f;
        break;
      }
      yield f;
    }
    next = stack.pop();
  }
}

The stack holds lists of files now; they represent the remainder of the current directory’s contents, rather than the directory itself, since it recurses into each directory as it encounters it.

As a side note, if I wasn’t worried about consuming program stack, a preorder walk could be written like this:

function* walk(path) {
    for (f of dir(path).files) {
        yield f;
        if (f.isdir) {
            walk(f.path, opts);
        }
    }
}

You can see the difference using your own stack makes.

you need to encode recursion: in the simple version, you just invoke the function; and in the complex version, you have to push on the stack, then break to change the flow of control.
you need to encode return: in the simple version, this is just falling off the end after the loop, and in the complicated version, it’s popping from the stack.

Trickinesses:

you do get each directory before the files in it, but you don’t know when you’re popping from the stack, so you still have to do work to determine where you are in the tree.
there’s still no way to control the recursion.

Attempt four – fix it up in post

For at least the purpose of printing a tree, it would be convenient to know when the walk is entering a directory, and when it’s leaving a directory. This, and controlling recursion, can be done with post and pre hooks, a bit like the Java walk API.

function walk(path, opts = { pre = always, post = nop }) {...}

The visit part of the interface is still the generator, so there’s a mixed mode of use.

function* walk(path, opts = {}) {
  const { pre = always, post = noop } = opts;
  const top = dir(path);
  // the stack is going to keep lists of files to examine
  const stack = [];
  let next = top.files;
  while (next !== undefined) {
    let i = 0;
    for (; i < next.length; i += 1) {
      const f = next[i];
      if (f.isdir && pre(f)) {
        const d = dir(f.path);
        stack.push(next.slice(i + 1));
        stack.push(d.files);
        yield f;
        break;
      }
      yield f;
    }
    // If we've exhausted the slice, we're popping a directory
    if (i === next.length) post();
    next = stack.pop();
  }
}

I prefer this to supplying three callbacks, since the common mode of use is simple iteration. When a pre callback is needed, it’s often sufficient to have a (stateless) predicate. For example, skipping dotted directories:

for (const f of walk('.', { pre: (f) => !f.name.startsWith('.') })) {
    print(f.path);
}

For bookkeeping, you also get told when the walk is leaving a directory. This is useful if you’re printing a tree structure – you indent when you see a directory, and outdent when you’ve seen all its files.

let indent = '';
const notdotted = (f) => !f.name.startsWith('.')
const outdent = () => indent = indent.substring(2);

for (const f of walk('.', { pre: notdotted, post: outdent })) {
  if (notdotted(f)) print(indent + f.name);
  if (f.isdir) indent = indent + '  ';
}

Why doesn’t it indent in pre? Because the directory is yielded after pre is called, so its name would appear indented (could it be yielded before? Actually, yes).

Conclusions

Attempt four is more or less what I ended up using as the formulation of walk in @jkcfg/std/fs. I like the ergonomics of it, although you have to hold the model in your head if you are doing something that needs bookkeeping.

My two motivating examples come out fairly succinctly, in part because being able to loop over results does a lot of lifting.

Here’s the tree printing using only callbacks:

let indent = '';
const notdotted = (f) => !f.name.startsWith('.')
const outdent = () => indent = indent.substring(2);

walk('.', { pre: notdotted, post: outdent, visit: (f) => {
  if (notdotted(f)) print(indent + f.name);
  if (f.isdir) indent = indent + '  ';
} });

.. which is not that different, truth be told. But if you’re doing something where you might want to abandon the walk, that would have to be built into its protocol (like the Java API); whereas, if you’re looping, you can just break.