jk diary: filesystem walk
This describes, as best I can remember, the thought process behind the
walk
procedure in jk
’s standard library.
Aim
Required:
A walk
procedure which will recursively walk the filesystem and tell
you about all the files.
Don’t consume stack (i.e., use JavaScript tail call elimination a.k.a. loops).
Apparatus
You have (existing std library procedures):
info
, which gives you the name and path of a file, and if it’s a directorydir
, which gives you the contents of a directory (an info for all the files in it)
Method
- Pick some motivating uses:
- Find the path to all YAML files under a directory
- Print a tree of the directories and their files
-
Pick some implementations and analyse according to how good at the uses they are
-
Weigh up which bits are the best
Background data
walk for NodeJS, and walk for NodeJS as a library:
https://gist.github.com/lovasoa/8691344 https://www.npmjs.com/package/walk
There are idiomatic NodeJS, in the sense that you pass callbacks and have to rely on side-effects if you want to calculate a result.
In ES6 we can do better than EventEmitters, since we have both Promises and generators.
os.walk for Python
os.walk(top, topdown=True, onerror=None, followlinks=False)
You start it off with a directory and optionally tell it top-down or
bottom-up. It gives you a generator of (dirpath, dirs, files)
. If
operating top-down, you can remove things from dirs
to prevent it
recursing (but not if bottom-up, since it will already have recursed
by the time you see it).
This works nicely since you can compose your own generator on top of it:
def yamels(dir):
for base, dirs, files in os.walk(dir):
for f in files:
if f.endswith('.yaml') or f.endswith('.yml'):
yield os.path.join(base, f)
Mutating an argument to control the recursion is alittle distasteful (to me anyway); but quite practical, since it gives you a lot of fine control.
In general, to keep track of where you are in the tree, you have to do a calculation on the path.
filepath.Walk in Go
Walk(root, func(string, FileInfo, error) error) error
Files are visited in lexicographic order. To prevent recursing into a
particular directory, you return a filepath.SkipDir
from your
callback. You also get to choose what to do with any problems the
driving procedure encounters, which are passed to your callback as an
argument.
This has the advantage of being simple and user-pays – you do all the bookkeeping.
Java FileVisitor interface
walkFileTree(Path, FileVisitor)
This is a callback API with a menu of three callbacks:
FileVisitor:
- preVisitDirectory
- postVisitDirectory
- visitFile
You can return one of several sentinel values from preVisitDirectory to tell the library to skip that directory, exit the walk, and other variations.
This is so Java! But { pre, post, visit }
give you a lot of
control to, e.g., the capability to skip a directory or do some
bookkeeping when unwinding the recursion.
As with Go, you must rely on side-effects to build any data structure as you go.
Results
In Python os.walk
gives you whole directories at a time, but doesn’t
tell you whether it’s going down or up the tree (you need to look at
the path for that).
In Go filepath.Walk
visits each file, and you are told about
directories (and can control recursion into them) as they are
encountered; as with Python, you have to figure out where you are by
looking at the path.
The Java java.nio.files.Path#walkFileTree
procedure uses a callback
interface, rather than a callback invoked in different modes, but is
pretty similar to the Go formulation otherwise. However it does
provide for one thing the others don’t: a post visit hook, so you
can know when you are exiting a directory.
Attempt one – naive depth-first traversal
I decided to try using generator functions. It would be a major convenience to be able to just loop over the files.
Putting directories on a stack as we encounter them gives us a variety of depth-first traversal (we’ll visit a directory’s contents before we visit sibling directories).
function* walk(path) {
const stack = [path];
while (stack.length > 0) {
const d = dir(stack.pop());
for (const f in d.files) {
if (f.isdir) stack.push(f.path);
yield f;
}
}
}
This has the benefit of being very simple. You can iterate over it to see every file, and filter as you please:
for (const f of walk('.')) {
if (f.name.endsWith('.yaml') || f.name.endsWith('.yml')) {
log(f.path);
}
}
These are trickinesses:
- how do you control recursion?
- how do you know when you’ve been recursed? Each directory is put on the stack for later, but also yielded; so you see a directory before you see its files, but you don’t know when those files start (without, say, doing some calculation based on the path)
Attempt two – be more Python.
function walk(path) {
const stack = [path];
while (stack.length > 0) {
const d = dir(stack.pop());
for (const f of d.files) {
if (f.isdir) stack.push(f.path);
}
yield d.files;
}
}
This differs from os.walk
because it just returns all the files
(i.e., not files and directories separately). Yielding the array
before pushing subdirectories on the stack would let you remove
entries to avoid recursing into them, like os.walk
does.
The mode of use doesn’t really differ from the first attempt; it just requires a little more work:
for (const files in walk('.')) {
for (const f in files) {
if (isYAML(f.name)) log(f.path);
}
}
An interesting little thing: if you change yield
to yield*
it’s
the same as the first attempt.
Trickinesses:
- now you have to do your own iteration over the files (not that it’s difficult)
- you still don’t see the directory just before the files in it, as in attempt one.
Attempt three – preorder your walk procedure today
The previous attempts suffered from not knowing when diving into a directory; so you can’t tell when the new file (or files) are under the previous directory, or sibling, or a sibling of the parent.
At least we can do a proper preorder, so that e.g., in the tree:
A
/ \
B C
/ \
D E
the files are visited in the order A, B, D, E, C. This way, the contents of a directory follow straight after it.
function* walk(path) {
const top = dir(path);
const stack = [];
let next = top.files;
while (next !== undefined) {
for (let i = 0; i < next.length; i++) {
const f = next[i];
if (f.isdir) {
// whenever we see a directory, yield it,
// and put the remainder on the stack
stack.push(next.slice(i+1));
yield f;
break;
}
yield f;
}
next = stack.pop();
}
}
The stack holds lists of files now; they represent the remainder of the current directory’s contents, rather than the directory itself, since it recurses into each directory as it encounters it.
As a side note, if I wasn’t worried about consuming program stack, a preorder walk could be written like this:
function* walk(path) {
for (f of dir(path).files) {
yield f;
if (f.isdir) {
walk(f.path, opts);
}
}
}
You can see the difference using your own stack makes.
- you need to encode recursion: in the simple version, you just
invoke the function; and in the complex version, you have to push
on the stack, then
break
to change the flow of control. - you need to encode
return
: in the simple version, this is just falling off the end after the loop, and in the complicated version, it’s popping from the stack.
Trickinesses:
- you do get each directory before the files in it, but you don’t know when you’re popping from the stack, so you still have to do work to determine where you are in the tree.
- there’s still no way to control the recursion.
Attempt four – fix it up in post
For at least the purpose of printing a tree, it would be convenient to
know when the walk is entering a directory, and when it’s leaving a
directory. This, and controlling recursion, can be done with post
and pre
hooks, a bit like the Java walk API.
function walk(path, opts = { pre = always, post = nop }) {...}
The visit part of the interface is still the generator, so there’s a mixed mode of use.
function* walk(path, opts = {}) {
const { pre = always, post = noop } = opts;
const top = dir(path);
// the stack is going to keep lists of files to examine
const stack = [];
let next = top.files;
while (next !== undefined) {
let i = 0;
for (; i < next.length; i += 1) {
const f = next[i];
if (f.isdir && pre(f)) {
const d = dir(f.path);
stack.push(next.slice(i + 1));
stack.push(d.files);
yield f;
break;
}
yield f;
}
// If we've exhausted the slice, we're popping a directory
if (i === next.length) post();
next = stack.pop();
}
}
I prefer this to supplying three callbacks, since the common mode of
use is simple iteration. When a pre
callback is needed, it’s often
sufficient to have a (stateless) predicate. For example, skipping
dotted directories:
for (const f of walk('.', { pre: (f) => !f.name.startsWith('.') })) {
print(f.path);
}
For bookkeeping, you also get told when the walk is leaving a directory. This is useful if you’re printing a tree structure – you indent when you see a directory, and outdent when you’ve seen all its files.
let indent = '';
const notdotted = (f) => !f.name.startsWith('.')
const outdent = () => indent = indent.substring(2);
for (const f of walk('.', { pre: notdotted, post: outdent })) {
if (notdotted(f)) print(indent + f.name);
if (f.isdir) indent = indent + ' ';
}
Why doesn’t it indent in pre
? Because the directory is yield
ed
after pre
is called, so its name would appear indented (could it
be yielded before? Actually, yes).
Conclusions
Attempt four is more or less what I ended up using as the formulation
of walk
in @jkcfg/std/fs
. I like the ergonomics of it, although
you have to hold the model in your head if you are doing something
that needs bookkeeping.
My two motivating examples come out fairly succinctly, in part because being able to loop over results does a lot of lifting.
Here’s the tree printing using only callbacks:
let indent = '';
const notdotted = (f) => !f.name.startsWith('.')
const outdent = () => indent = indent.substring(2);
walk('.', { pre: notdotted, post: outdent, visit: (f) => {
if (notdotted(f)) print(indent + f.name);
if (f.isdir) indent = indent + ' ';
} });
.. which is not that different, truth be told. But if you’re doing
something where you might want to abandon the walk, that would have to
be built into its protocol (like the Java API); whereas, if you’re
looping, you can just break
.