*nix file systems

Attachments:
Message as email (text/plain)

Author: Victor Odhner
Date:
Subject: *nix file systems

Trent Shipley wrote:
>>Good suggestion. Another way (I did this in a C++ routine) is to keep a
>>list of all the directory inodes you process. Every time you open a new
>>one, you check it against the list. If it matches anything in the list,
>>you skip that one. Of course this would only work for traversing a
>>reasonbly sized file tree.
>>
>>Vaughn
>
>
> Was that an ordered or unordered list? (My instinct would be to use an
> ordered or perhaps an indexed list.)
>
> Actually what you want is a perfect hash. Any collision means that you've
> visited the node already. (When I last delt with this problem I was working
> in PERL and got hashes for free.)
>
> For a modest set of I-nodes you might use an array.
>
> There should also be a way to do it very nicely with a combination of array
> and bit-mask.

Is this all on one filesystem? Symbolic links can
cross filesystems. Either you have to prune any branches that
leave your filesystem, or your checklist has to be referenced
by a filesystem ID + inode, not just the inode by itself.

I gather the task here is to visit each eligible node once.

The most efficient solution here is very likely to be the
most brutally dumb one: a LINEAR SEARCH of all the inodes
encountered in the current branch, and throw away your list
each time you hit a real file. A given symlink can't be part
of more than one chain, so when the chain hits a real file,
you're done with the set of links you just traversed.

Linear searches that average more than a few can add up to
some serious processing time, but I believe the nature of
this situation is that a chain of *any* length is an
exception. Your typical chain of symlinks is not going to
exceed 2. So allow up to 30 in a chain -- you'll never get
there. Each time you push a symlink onto the list, search
backwards to ensure that inode isn't already in the list.
(Backwards reflects my guess that there may be a few
links in the chain, but it'll be the last two that are
pointing to each other.) When you get to a real file,
erase that list and start over.

As far as speed is concerned, doing a linear search of your
2 or 20 inodes in a list will take many orders of magnitude
less time than the I/O you'll be doing in this process. In
the old days engineers went to tremendous trouble to design
a fast multiply operation. When they finally hooked up
probes to figure out how much of its time the computer was
multiplying, the answer was virtually no time at all.
The moral is: First, design it simple. If performance
becomes a problem, then grease the slow spots. (Of course
you have to think about performance from the beginning, but
it is *very* hard to anticipate where the performance
problems are going to be.)

I think you'll find this is dazzlingly fast. If not, you
have invested no time in the simple approach, and you can
always move on to hashes or bit arrays. So let me
put on my propeller beanie, and ...

In Perl (which is all I use nowadays) of course the hash is
for free, and you probably have room for a lot of inodes.
Just store $inodes{$inode} = 1. Again, once you get to
a real file, you can set %inodes = (), just forget that
you ever saw these guys.

In a "real" programming language like C, you could use a bit-array
as the most straightforward solution ... you know, a combination of
character array and bit-masking, directly indexed by the inode.
This would be blazingly fast if you have the memory. However,
modern *nix systems allow for a *lot* of inodes -- I don't know
how many. Divide that by 8, can you spare that many bytes to map
all the possible inodes in the system? Of course this would be
multiplied by the number of filesystems your search could span.
... and any time you wanted to purge this bitmap, that would eat
some time. So I don't think the bitmap is really any good for
this particular problem.

Vic

This message is part of the following thread:
	the complete thread tree sorted by date
	Vaughn Treude at