2008-11-06 01:05 -!- bushman(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3
2008-11-06 01:47 -!- pgquiles(~pgquiles@228.Red-81-35-100.dynamicIP.rima-tde.net) has joined #tux3
2008-11-06 02:13 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3
2008-11-06 08:42 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3
2008-11-06 09:08 -!- mingming(~mingming@32.97.110.51) has joined #tux3
2008-11-06 11:21 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3
2008-11-06 12:51 -!- pgquiles(~pgquiles@228.Red-81-35-100.dynamicIP.rima-tde.net) has joined #tux3
2008-11-06 13:21 -!- ajonat(~ajonat@190.48.97.229) has joined #tux3
2008-11-06 14:06 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3
2008-11-06 15:22 -!- pgquiles_(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3
2008-11-06 15:45 -!- pgquiles__(~pgquiles@228.Red-81-35-100.dynamicIP.rima-tde.net) has joined #tux3
2008-11-06 15:50 hmm, I just hatched the bright idea of changing my "phase" terminology to "delta"
2008-11-06 15:50 so Tux3 syncs to disk by generating a series of deltas from one atomic state to another
2008-11-06 15:50 ah, this feels right
2008-11-06 15:52 hi
2008-11-06 15:52 sounds good
2008-11-06 15:54 thanks for the vote :-)
2008-11-06 15:54 it also gets me closer to giving the new atomic sync algorithm a name
2008-11-06 15:54 tentatively "delta sync"
2008-11-06 15:55 how is it different from normal sync?
2008-11-06 15:56 normal sync provided by the vfs, or as implemented by various filesystems?
2008-11-06 15:56 the semantics of sync are defined by Posix
2008-11-06 15:57 so delta sync is what does it?
2008-11-06 15:57 atomicity is not a requirement of posix
2008-11-06 15:57 ah, i see
2008-11-06 15:57 a delta transitions from one consistent state of a filesystem to the next
2008-11-06 15:58 is there any progress on the implementation?
2008-11-06 15:58 I'd like to do something
2008-11-06 15:58 I just solved a big design issue I think, and am describing the solution
2008-11-06 15:59 I also need to check in my timestamp code
2008-11-06 15:59 later tonight
2008-11-06 15:59 so... serious work in implementing atomic sync starts about now
2008-11-06 15:59 will it be posted to tux3-ml?
2008-11-06 15:59 yes
2008-11-06 15:59 let me talk about it now for a bit
2008-11-06 15:59 helps me write
2008-11-06 16:00 yes
2008-11-06 16:00 the issue is: I had assumed there is a nice clean separation between the cache that each filesystem operation updates, and what has to be transferred to disk for a delta
2008-11-06 16:01 and that all I needed to make the separation perfect was the "fork" operation
2008-11-06 16:01 that does a copy on write of a block that a filesystem operation wants to update, if the block is part of a delta not yet transferred to disk
2008-11-06 16:02 this turned out not to work with namespace operations such as file create, delete, rename
2008-11-06 16:02 the difficulty is with inode table blocks
2008-11-06 16:03 why is it difficult?
2008-11-06 16:03 the "top end" filesystem operation for a create needs to change two blocks: a dirent block and an inode table block
2008-11-06 16:04 yes
2008-11-06 16:04 the "flush" operation at a delta transition (used to be called phase transition) also changes the inode table block
2008-11-06 16:05 the question is: what happens if another file create comes in that wants to change the same inode table block?
2008-11-06 16:05 we can't just "fork" the inode table block in cache
2008-11-06 16:05 um..
2008-11-06 16:05 because the "flush" hasn't completed yet, it may not have stored pointers to the new data extents in the inode table block yet
2008-11-06 16:06 so therefore the file create needs to wait for any flush in progress to complete
2008-11-06 16:06 not to be transferred to disk, but to transfer all necessary information to the inode table block in cache
2008-11-06 16:08 there are two things I don't like about that: 1) it's extra complexity to implement that synchronization between top end namespace operations and the bottom end flush 2) it causes a "bump" while waiting for the flush step to complete
2008-11-06 16:08 now, to be fair, I doubt that this bump would be worse than existing filesystems, which wait on all kinds of things
2008-11-06 16:09 but I'd still rather get rid of it
2008-11-06 16:09 so what I'm proposing is to defer namespace operations in much the same way as we defer write allocations
2008-11-06 16:10 we just let the namespace operation be recorded in cache, as it already is in the stub tux3 kernel code
2008-11-06 16:10 wait... I left out an important detail
2008-11-06 16:11 what is it?
2008-11-06 16:11 we have to select an inode number before making a new entry in a dirent block
2008-11-06 16:12 well, we could possibly make the entry without an inode number, then patch in the inode number later, but that would be pretty messy
2008-11-06 16:12 yes
2008-11-06 16:13 so now, I will propose to just let a file create, create the file in the dentry cache, and the sys_open will just check to make sure the name does not already exist and return as soon as that is done
2008-11-06 16:14 no change to any inode table block
2008-11-06 16:14 what was stored to dentry->d_inode?
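
The "fork" operation described above (copy on write of a block that still belongs to a delta not yet transferred to disk) can be sketched roughly as follows. This is only an illustration under stated assumptions: each cached block is assumed to carry the number of the delta that last dirtied it, and `struct block`, `block_fork`, and `BLOCK_SIZE` are hypothetical names, not Tux3 code. It shows plain fork only; as the discussion notes, this alone is not enough for inode table blocks, because the flush may still be adding extent pointers to the older copy.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK_SIZE 64   /* toy size, not a real filesystem block size */

/* Hypothetical cached block, tagged with the delta that last dirtied it. */
struct block {
    unsigned delta;                 /* delta this copy belongs to */
    int dirty;                      /* still waiting to be transferred to disk */
    unsigned char data[BLOCK_SIZE];
};

/*
 * Return the copy of the block that an operation running in delta `current`
 * may modify. If the block is dirty in an older delta that has not reached
 * disk yet, copy it ("fork") so the older delta keeps a stable image.
 * Returns NULL if the copy cannot be allocated.
 */
struct block *block_fork(struct block *b, unsigned current)
{
    assert(b != NULL);
    if (!b->dirty || b->delta == current) {
        b->delta = current;
        b->dirty = 1;
        return b;                   /* no fork needed, update in place */
    }
    struct block *copy = malloc(sizeof *copy);
    if (!copy)
        return NULL;
    memcpy(copy->data, b->data, BLOCK_SIZE);
    copy->delta = current;
    copy->dirty = 1;
    return copy;                    /* the old copy stays frozen for its delta */
}
```

A caller would modify only the returned block and leave the original untouched until its delta has been written out.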
2008-11-06 16:14 pointer to an inode as usual
2008-11-06 16:15 the inode does not have to have an inode number, as you can see from the fact that ramfs doesn't need it
2008-11-06 16:15 in the case of nfs, an inode number is required in order to resolve a filehandle
2008-11-06 16:16 actually unique number, not inode number?
2008-11-06 16:16 it has to be globally stable across reboots
2008-11-06 16:16 ah, yes
2008-11-06 16:16 so the only convenient way to get that is make it be the inode number
2008-11-06 16:17 anyway, any nfs filehandle operation has to wait for the "flush" to be completed before it can resolve the file handle
2008-11-06 16:17 but that is not a problem
2008-11-06 16:17 because nfs already has to wait for a sync when it creates a new file
2008-11-06 16:18 I'm not sure about "async" option, however maybe yes
2008-11-06 16:18 ok, so the way tux3 remembers what file it is supposed to create is to keep a pointer to the dentry
2008-11-06 16:19 indeed, async is an interesting question and I considered it... but don't immediately remember the answer except that I thought it would work fine ;)
2008-11-06 16:20 well, I'd like to think about this more today
2008-11-06 16:20 yes, needs a lot of thought
2008-11-06 16:20 I have given it a lot of thought over the last week
2008-11-06 16:21 current issue is only this?
2008-11-06 16:21 yes
2008-11-06 16:21 great
2008-11-06 16:23 sys_open(CREATE) needs to do three things: check the name doesn't already exist in a dirent block; remember the dentry for later dirent creation; be sure that the later dirent creation will succeed
2008-11-06 16:23 if sys_open(CREATE) is deferred, then sys_unlink must also be deferred
2008-11-06 16:25 ah, I misspoke above, sys_open doesn't have to remember the dentry, but the inode
2008-11-06 16:25 very slight distinction, easier to implement
2008-11-06 16:25 hrm, no I was right the first time
2008-11-06 16:25 has to remember the dentry
2008-11-06 16:26 because it needs to know the new name linking to the inode
2008-11-06 16:26 sys_unlink similarly remembers the dentry
2008-11-06 16:26 what happens to readdir?
2008-11-06 16:26 oh yes
2008-11-06 16:26 thanks for reminding me
2008-11-06 16:27 looks like readdir is most complex
2008-11-06 16:27 we have to flush any pending creates and deletes before starting the readdir
2008-11-06 16:27 it's not a big problem I think
2008-11-06 16:28 just a "flush" (really really need better terminology) at the beginning of the readdir
2008-11-06 16:28 I think the result of doing this namespace deferring will be pretty nice
2008-11-06 16:29 eh, "flush" is not "write out"?
2008-11-06 16:29 no, it is setting up the blocks for writeout
2008-11-06 16:29 ah, it's like ->write_inode?
2008-11-06 16:30 assignment of physical extent locations, move of cached attributes into inode table blocks, and now adding cached namespace operations to dirent blocks and inode table blocks
2008-11-06 16:30 not even like a write_inode
2008-11-06 16:30 it moves things from one place to another in cache
2008-11-06 16:31 for any given block, it has to be completely set up in cache before we submit it
2008-11-06 16:32 i see
2008-11-06 16:33 I will go for my skate, and think up a new name for "flush" now
2008-11-06 16:33 thanks for following along with this, very accurately as usual
2008-11-06 16:33 as for the entry without the inode number, is it possible to handle it like you do with forward logging?
2008-11-06 16:34 tim_dimm, except that we _must_ assign an inode number before we can deal with an nfs handle
2008-11-06 16:34 a solution I considered, to be sure
2008-11-06 16:34 so pre-assign it
2008-11-06 16:35 have the next inode number always ready
2008-11-06 16:35 tim_dimm, yes
2008-11-06 16:35 that works
2008-11-06 16:35 except what I'm proposing will be nicer
2008-11-06 16:36 the inode number gets assigned quite quickly, which will be nice for nfs
2008-11-06 16:36 and the top end filesystem operation returns to user very quickly, because the only real work it has to do is check to see the file doesn't exist in the case of create
2008-11-06 16:38 ok, I better get out for my skate
2008-11-06 16:38 enjoy
2008-11-06 16:38 gets dark really soon/fast these days
2008-11-06 16:38 see you
2008-11-06 16:38 see you
2008-11-06 18:04 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3
2008-11-06 19:50 -!- RalucaM(~ral@pool-151-196-126-202.balt.east.verizon.net) has joined #tux3
2008-11-06 19:50 hi
2008-11-06 19:59 hi
2008-11-06 20:00 hi
2008-11-06 20:00 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3
2008-11-06 20:00 hirofumi, still here?
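
The deferred namespace operations proposed in the afternoon discussion can be illustrated with a toy model: create and unlink only check the name and record the operation, and readdir applies ("flushes") the pending operations before listing. This is a sketch under assumptions, not the Tux3 design: a flat in-memory array stands in for dirent blocks, names stand in for dentries, inode numbers are left out entirely, and every identifier (`toy_dir`, `toy_create`, `toy_flush`, and so on) is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define MAX_NAMES 16
#define NAME_LEN  32

enum op_kind { OP_CREATE, OP_UNLINK };

struct pending_op {
    enum op_kind kind;
    char name[NAME_LEN];            /* stands in for the remembered dentry */
};

/* Toy directory: names already in "dirent blocks" plus deferred operations. */
struct toy_dir {
    char dirents[MAX_NAMES][NAME_LEN];
    int ndirents;
    struct pending_op pending[MAX_NAMES];
    int npending;
};

static int find_dirent(const struct toy_dir *d, const char *name)
{
    for (int i = 0; i < d->ndirents; i++)
        if (strcmp(d->dirents[i], name) == 0)
            return i;
    return -1;
}

/* Does the name exist once the deferred operations are taken into account? */
static int name_exists(const struct toy_dir *d, const char *name)
{
    int exists = find_dirent(d, name) >= 0;
    for (int i = 0; i < d->npending; i++)
        if (strcmp(d->pending[i].name, name) == 0)
            exists = (d->pending[i].kind == OP_CREATE);
    return exists;
}

static int defer(struct toy_dir *d, enum op_kind kind, const char *name)
{
    if (strlen(name) >= NAME_LEN)
        return -ENAMETOOLONG;
    if (d->npending == MAX_NAMES)
        return -ENOSPC;
    d->pending[d->npending].kind = kind;
    strcpy(d->pending[d->npending].name, name);
    d->npending++;
    return 0;
}

/* Check the name, make sure the later dirent creation can succeed, defer. */
int toy_create(struct toy_dir *d, const char *name)
{
    int creates = 0;
    if (name_exists(d, name))
        return -EEXIST;
    for (int i = 0; i < d->npending; i++)
        creates += (d->pending[i].kind == OP_CREATE);
    if (d->ndirents + creates >= MAX_NAMES)
        return -ENOSPC;             /* conservative space reservation */
    return defer(d, OP_CREATE, name);
}

int toy_unlink(struct toy_dir *d, const char *name)
{
    if (!name_exists(d, name))
        return -ENOENT;
    return defer(d, OP_UNLINK, name);
}

/* Apply the deferred operations to the dirents, in the order they arrived. */
void toy_flush(struct toy_dir *d)
{
    for (int i = 0; i < d->npending; i++) {
        const struct pending_op *op = &d->pending[i];
        int at = find_dirent(d, op->name);
        if (op->kind == OP_CREATE && at < 0) {
            assert(d->ndirents < MAX_NAMES);
            strcpy(d->dirents[d->ndirents++], op->name);
        } else if (op->kind == OP_UNLINK && at >= 0) {
            d->ndirents--;          /* move the last dirent into the hole */
            if (at != d->ndirents)
                strcpy(d->dirents[at], d->dirents[d->ndirents]);
        }
    }
    d->npending = 0;
}

/* readdir flushes pending creates and deletes before listing. */
int toy_readdir_count(struct toy_dir *d)
{
    toy_flush(d);
    return d->ndirents;
}
```

In this model the only work done at create time is the existence check and the reservation, which matches the point made above that the top end operation can return to the user quickly.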
2008-11-06 20:00 hi
2008-11-06 20:01 oh, right
2008-11-06 20:01 ok, first thing is to wrap up a little bit on tuesday's iget investigation
2008-11-06 20:02 first a confession: I was not functioning very well that day, as you may have noticed... just getting over a pretty severe stomach flu
2008-11-06 20:03 anyway, first thing I want to do is clear up the question of why iget5_ sometimes returns locked, sometimes not
2008-11-06 20:03 it returns locked only in the case where it just added the inode to the hash
2008-11-06 20:03 yes
2008-11-06 20:03 and that is just so that filesystems can proceed to fill in the inode with attributes from backing store
2008-11-06 20:03 that's the conclusion we came to, right?
2008-11-06 20:04 if so, good
2008-11-06 20:04 otherwise, the inode is returned unlocked, but with a reference count on it
2008-11-06 20:04 this is a mem in Linux kernel
2008-11-06 20:05 yes
2008-11-06 20:05 the object can't be operated on safely when it is not locked, but because of the reference count, the caller knows the object will not suddenly disappear
2008-11-06 20:05 so you can't modify it, but you can read it
2008-11-06 20:05 at that point, only two operations can be performed on the object: 1) lock it or 2) drop the reference count
2008-11-06 20:06 you can't read it either
2008-11-06 20:06 because somebody else might be modifying it
2008-11-06 20:06 well, depends on the locking semantics being used
2008-11-06 20:06 yes
2008-11-06 20:06 the situation I described is typical in linux
2008-11-06 20:06 you will find it used for a number of different kinds of objects
2008-11-06 20:07 you might say, there are not a lot of alternatives to this
2008-11-06 20:07 care to give an alternative example?
2008-11-06 20:08 an alternative to the refcount + lock object strategy?
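
The refcount-plus-lock pattern described above can be sketched with a small single-threaded model: a lookup always returns the object with a reference held, and only a freshly inserted object comes back in the "new" state so the caller can fill it in first. This is an assumption-laden toy, not the kernel's code: `toy_iget`, `toy_unlock_new`, `toy_iput`, and the `is_new` field are hypothetical, and the real mechanism (inode state flags, atomic counts, the inode hash locks, `unlock_new_inode()`) is not modeled.

```c
#include <assert.h>
#include <stddef.h>

#define TABLE_SIZE 8

/* Hypothetical cached object, loosely mirroring the inode pattern. */
struct toy_inode {
    unsigned long ino;
    int refcount;   /* keeps the object from disappearing */
    int is_new;     /* set while the creator still holds it to fill it in */
    int in_use;
    int mode;       /* an attribute that would be loaded from backing store */
};

static struct toy_inode table[TABLE_SIZE];

/*
 * Look up `ino`, taking a reference. An existing object comes back with
 * is_new == 0; a freshly inserted one comes back with is_new == 1 so the
 * caller can fill it in before calling toy_unlock_new().
 * Returns NULL if the toy table is full.
 */
struct toy_inode *toy_iget(unsigned long ino)
{
    struct toy_inode *free_slot = NULL;
    for (size_t i = 0; i < TABLE_SIZE; i++) {
        if (table[i].in_use && table[i].ino == ino) {
            table[i].refcount++;
            return &table[i];
        }
        if (!table[i].in_use && !free_slot)
            free_slot = &table[i];
    }
    if (!free_slot)
        return NULL;
    free_slot->in_use = 1;
    free_slot->ino = ino;
    free_slot->refcount = 1;
    free_slot->is_new = 1;
    free_slot->mode = 0;
    return free_slot;
}

/* The creator has finished filling in the object. */
void toy_unlock_new(struct toy_inode *inode)
{
    assert(inode->is_new);
    inode->is_new = 0;
}

/* Drop a reference; the slot is reclaimed when the last one goes away. */
void toy_iput(struct toy_inode *inode)
{
    assert(inode->refcount > 0);
    if (--inode->refcount == 0)
        inode->in_use = 0;
}
```

The point of the pattern is visible even in the toy: a holder of a reference that does not hold the lock can do only two safe things with the object, lock it or drop the reference.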
2008-11-06 20:08 yes
2008-11-06 20:08 a typical alternative is garbage collection, which makes writing new code easier, but isn't suitable for kernel
2008-11-06 20:09 is less efficient and sometimes has long lags
2008-11-06 20:09 ok, I was thinking something else usable in kernel ;-)
2008-11-06 20:10 the refcount could be done away with, and you have a special lock state instead, where the owner of the lock uses its own knowledge to determine if the object can be discarded
2008-11-06 20:10 anyway, the refcount strategy is pretty general, which tends to allow for unforeseen new applications
2008-11-06 20:10 and I'm going to take a look at one of those right now
2008-11-06 20:11 this is in the area I've been talking about today
2008-11-06 20:11 I'll call it deferred namespace operations for now
2008-11-06 20:12 the proposal is to handle sys_open(..., CREATE) as a deferred operation
2008-11-06 20:12 the file will just be created in dentry cache as with ramfs
2008-11-06 20:13 not actually placed in a dirent block
2008-11-06 20:13 so today I'd like to walk through sys_create and do a reality check on that
2008-11-06 20:13 so, here's a question to begin with: from POSIX semantics, when must such operations actually make it to disk? do they have to be ordered correctly with regard to other operations?
2008-11-06 20:14 answer: only on sync
2008-11-06 20:14 fsync of the fd?
2008-11-06 20:14 sync can be caused in a number of ways
2008-11-06 20:14 fsync, sync command, umount, O_SYNC
2008-11-06 20:15 what about file close?
2008-11-06 20:15 http://lxr.linux.no/linux+v2.6.27/fs/namei.c#L1503 <- vfs_create(
2008-11-06 20:15 not file close
2008-11-06 20:16 in fact, it is possible for a file to be created and deleted without ever touching disk
2008-11-06 20:16 and what about stuff like mkdir?
2008-11-06 20:16 ext* can't do that
2008-11-06 20:16 but it's possible
2008-11-06 20:16 also never needs to touch disk if it's immediately unlinked
2008-11-06 20:17 what we _must_ do to satisfy Posix is return the right error codes from sys_open
2008-11-06 20:17 EEXIST in particular
2008-11-06 20:17 so who takes care of the appropriate sync semantics?
2008-11-06 20:17 the vfs?
2008-11-06 20:17 how does it tell us when and what needs to be synced/flushed?
2008-11-06 20:17 vfs only takes care of it for dumb filesystems like Ext2 and VFAT
2008-11-06 20:18 you will see oddities in there like ->assoc_buffers
2008-11-06 20:18 a field in the inode that points at metadata associated with a particular inode, in ext2 those are the dirty index blocks
2008-11-06 20:19 this mechanism is pretty much useless for anything but a filesystem as dumb as ext2
2008-11-06 20:19 there is a handler for fsync, iirc
2008-11-06 20:19 yes
2008-11-06 20:19 fsync will go and do a series of steps that all filesystems need
2008-11-06 20:20 sometimes more than some filesystems need
2008-11-06 20:20 well
2008-11-06 20:20 sync_super and functions like that
2008-11-06 20:20 we can go look there now instead of rename if you like
2008-11-06 20:20 let's take a detour
2008-11-06 20:21 I'd like to understand how this affects other operations - how do I cause a mkdir to get flushed to disk?
2008-11-06 20:21 must I sync the entire fs?
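
For the mkdir question above, the usual user-space approach is to fsync the parent directory rather than sync the whole filesystem. A minimal sketch using standard POSIX calls (`open`, `mkdirat`, `fsync`); the helper name `mkdir_durable` is hypothetical, and whether an fsync on a directory really makes the new entry durable is ultimately up to the filesystem underneath.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create directory `name` inside `parent`, then fsync the parent so the new
 * directory entry is pushed toward stable storage.
 * Returns 0 on success, -1 with errno set on failure.
 */
int mkdir_durable(const char *parent, const char *name)
{
    assert(parent != NULL && name != NULL);

    int dirfd = open(parent, O_RDONLY | O_DIRECTORY);
    if (dirfd < 0)
        return -1;

    int rc = mkdirat(dirfd, name, 0755);
    if (rc == 0)
        rc = fsync(dirfd);          /* sync the directory holding the entry */

    int saved = errno;              /* close() must not clobber the error */
    close(dirfd);
    errno = saved;
    return rc;
}
```

A second call with the same name fails with `EEXIST`, which is the error-code behaviour the discussion says must be preserved regardless of when the entry reaches disk.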
2008-11-06 20:21 ACTION oops, I have to reboot
2008-11-06 20:21 fsync on the parent directory
2008-11-06 20:22 http://lxr.linux.no/linux+v2.6.27/fs/super.c#L465 ->sync_fs is a per-filesystem method
2008-11-06 20:23 a look, this is interesting
2008-11-06 20:24 249 void __fsync_super(struct super_block *sb)
2008-11-06 20:24 258 sb->s_op->sync_fs(sb, 1);
2008-11-06 20:24 so you can sync the whole filesystem by calling fsync on the superblock
2008-11-06 20:25 I wonder if that's exported to userspace somehow
2008-11-06 20:26 __fsync_super starts off by writing out the superblock
2008-11-06 20:26 not intuitive
2008-11-06 20:26 you'd expect that to be the last thing it does
2008-11-06 20:26 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3
2008-11-06 20:27 a look, this is interesting
2008-11-06 20:27 249 void __fsync_super(struct super_block *sb)
2008-11-06 20:27 258 sb->s_op->sync_fs(sb, 1);
2008-11-06 20:27 so you can sync the whole filesystem by calling fsync on the superblock
2008-11-06 20:27 I wonder if that's exported to userspace somehow
2008-11-06 20:27 __fsync_super starts off by writing out the superblock
2008-11-06 20:27 not intuitive
2008-11-06 20:27 I don't think at the end you're guaranteed to have a no-dirty-buffers in memory situation
2008-11-06 20:27 you'd expect that to be the last thing it does
2008-11-06 20:27 (for hirofumi's benefit)
2008-11-06 20:27 no
2008-11-06 20:27 (08:21:10 PM) ***hirofumi oops, I have to reboot
2008-11-06 20:27 (08:21:25 PM) hirofumi left the room (quit: Remote host closed the connection).
2008-11-06 20:27 (08:21:33 PM) flips: fsync on the parent directory
2008-11-06 20:27 (08:22:56 PM) flips: http://lxr.linux.no/linux+v2.6.27/fs/super.c#L465 ->sync_fs is a per-filesystem method
2008-11-06 20:27 you're only guaranteed that everything that was dirty at the time of the call gets written
2008-11-06 20:27 thanks
2008-11-06 20:28 thanks
2008-11-06 20:28 brb
2008-11-06 20:28 246 * device. Takes the superblock lock.
Requires a second blkdev
2008-11-06 20:28 247 * flush by the caller to complete the operation.
2008-11-06 20:28 syncing has always been pretty messed up in Linux
2008-11-06 20:29 it's not obvious why two blkdev flushes should be required
2008-11-06 20:29 http://lxr.linux.no/linux+v2.6.27/fs/super.c#L249
2008-11-06 20:31 ok, let's determine if __fsync_super is a filesystem library call or whether it is called directly from vfs
2008-11-06 20:32 sigh, it's too bad I can't trust lxr to return all the uses, I wonder what happened to it
2008-11-06 20:33 surprisingly, __fsync_super and fsync_super are hardly used at all
2008-11-06 20:33 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3
2008-11-06 20:34 http://lxr.linux.no/linux+v2.6.27/fs/sync.c#L24?
2008-11-06 20:34 yes, the main use
2008-11-06 20:35 this particular nest of functions always leaves me with a headache
2008-11-06 20:37 ah, in fact it isn't necessary for fsync to write the superblock last the way an atomic committing filesystem has to
2008-11-06 20:37 because the only thing that is promised is that everything dirty is written
2008-11-06 20:37 not that the result is consistent
2008-11-06 20:38 well
2008-11-06 20:38 that's not quite the right statement
2008-11-06 20:39 it's not promised that the filesystem is clean
2008-11-06 20:40 only if it does not have journal or something?
2008-11-06 20:40 ;-)
2008-11-06 20:41 it's umount that will mark the filesystem clean
2008-11-06 20:52 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3
2008-11-06 20:52 -!- RalucaM(~ral@pool-151-196-126-202.balt.east.verizon.net) has joined #tux3
2008-11-06 20:52 -!- pgquiles__(~pgquiles@228.Red-81-35-100.dynamicIP.rima-tde.net) has joined #tux3
2008-11-06 20:52 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3
2008-11-06 20:52 -!- bushman(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3
2008-11-06 20:52 -!- pranith(ca4bcee2@webchat.mibbit.com) has joined #tux3
2008-11-06 20:52 -!- mlankhorst_(~m@fw1.astro.rug.nl) has joined #tux3
2008-11-06 20:52 -!- konrad(~konrad@sfr.cs.washington.edu) has joined #tux3
2008-11-06 20:52 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3
2008-11-06 20:52 -!- data(~data@echo489.server4you.de) has joined #tux3
2008-11-06 20:52 -!- vcgomes[away](~vcgomes@li17-238.members.linode.com) has joined #tux3
2008-11-06 20:52 -!- flips(~phillips@phunq.net) has joined #tux3
2008-11-06 20:52 262 * The whole writeout design is quite complex and fragile. <- there, somebody agrees with me
2008-11-06 20:53 heh
2008-11-06 20:53 hirofumi, what was the last line you saw?
2008-11-06 20:54 315 * how much sense this makes. Presumably I had a good
2008-11-06 20:54 316 * reasons for doing it this way, and I'd rather not
2008-11-06 20:54 317 * muck with it at present.
2008-11-06 20:54 318 */
2008-11-06 20:55 which is last line?
2008-11-06 20:56 you split off the network a while ago
2008-11-06 20:56 yes
2008-11-06 20:56 not much to worry about
2008-11-06 20:56 http://lxr.linux.no/linux+v2.6.27/fs/fs-writeback.c#L285
2008-11-06 20:56 (08:41:28 PM) flips: it's umount that will mark the filesystem clean
2008-11-06 20:56 (08:42:03 PM) flips: anyway...
sync_filesystems first calls a per-filesystem sync_fs method
2008-11-06 20:56 (08:42:20 PM) flips: then goes and syncs each dirty inode
2008-11-06 20:56 (08:42:35 PM) flips: a modern filesystem should do the whole job in its sync_fs method
2008-11-06 20:56 (08:42:49 PM) MaZe: everything that was dirty before the sync will be written out, but the end-state (even assuming no-one else mucks around) can be dirty
2008-11-06 20:56 (08:43:06 PM) flips: so let's take a look and see if the functionality provided by sync_inodes is optional
2008-11-06 20:56 (08:43:49 PM) flips: dirty or even inconsistent in the case of ext2, if there is parallel writing going on
2008-11-06 20:56 (08:44:34 PM) flips: http://lxr.linux.no/linux+v2.6.27/fs/fs-writeback.c#L648
2008-11-06 20:56 (08:44:45 PM) flips: __sync_inodes
2008-11-06 20:56 (08:45:26 PM) flips: if (sb->s_root) <- I wonder what that's for
2008-11-06 20:56 (08:46:04 PM) flips: this function does the blockdev sync on behalf of the filesystem
2008-11-06 20:56 (08:46:20 PM) flips: no choice about that, though I am not sure why a filesystem would want a choice
2008-11-06 20:56 (08:47:47 PM) hirofumi left the room (quit: charon.oftc.net larich.oftc.net).
2008-11-06 20:56 (08:47:47 PM) tux3bot left the room (quit: charon.oftc.net larich.oftc.net).
2008-11-06 20:56 (08:47:47 PM) shapor left the room (quit: charon.oftc.net larich.oftc.net).
2008-11-06 20:56 (08:47:47 PM) ceatinge left the room (quit: charon.oftc.net larich.oftc.net).
2008-11-06 20:56 (08:47:56 PM) flips: http://lxr.linux.no/linux+v2.6.27/fs/fs-writeback.c#L531 sync_sb_inodes
2008-11-06 20:56 (08:48:09 PM) flips: 442 void generic_sync_sb_inodes(struct super_block *sb,
2008-11-06 20:56 (08:49:16 PM) flips: is generic_sync_sb_inodes interesting at all?
2008-11-06 20:56 (08:49:33 PM) flips: I'm mainly interested in determining if there's a way to avoid using it at all
2008-11-06 20:56 (08:50:07 PM) MaZe: lol
2008-11-06 20:56 (08:51:20 PM) flips: 376 __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
2008-11-06 20:56 (08:52:27 PM) flips: 269 __sync_single_inode(struct inode *inode, struct writeback_control *wbc)
2008-11-06 20:56 (08:52:53 PM) hirofumi [~hirofumi@210.171.168.39] entered the room.
2008-11-06 20:56 (08:52:53 PM) shapor [~shapor@yzf.shapor.com] entered the room.
2008-11-06 20:56 (08:52:53 PM) tux3bot [~tux3bot@yzf.shapor.com] entered the room.
2008-11-06 20:56 (08:52:53 PM) ceatinge [~ceatinge@veryclever.net] entered the room.
2008-11-06 20:56 (08:52:57 PM) flips: 262 * The whole writeout design is quite complex and fragile. <- there, somebody agrees with me
2008-11-06 20:56 (08:53:06 PM) MaZe: heh
2008-11-06 20:56 (08:53:14 PM) flips: hirofumi, what was the last line you saw?
2008-11-06 20:56 (08:54:19 PM) flips: 315 * how much sense this makes. Presumably I had a good
2008-11-06 20:56 (08:54:19 PM) flips: 316 * reasons for doing it this way, and I'd rather not
2008-11-06 20:56 (08:54:19 PM) flips: 317 * muck with it at present.
2008-11-06 20:56 (08:54:19 PM) flips: 318 */
2008-11-06 20:56 (08:55:22 PM) hirofumi: which is last line?
2008-11-06 20:56 (08:56:00 PM) flips: you split off the network a while ago
2008-11-06 20:56 (08:56:12 PM) hirofumi: yes
2008-11-06 20:56 (08:56:13 PM) flips: not much to worry about
2008-11-06 20:56 (08:56:23 PM) flips: http://lxr.linux.no/linux+v2.6.27/fs/fs-writeback.c#L285
2008-11-06 20:56 for tux3 irc logger
2008-11-06 20:57 since that's apparently the part it missed
2008-11-06 20:57 ah
2008-11-06 20:57 ok, into do_writepages
2008-11-06 20:58 at least we have the option of providing our own there instead of generic_writepages()
2008-11-06 20:58 so that answers my question, sort of
2008-11-06 20:59 "can we avoid this whole massive mess of a half thought through sync mechanism"
2008-11-06 20:59 I'm assuming if we need to override more stuff we can add additional records to the filesystems struct (or elsewhere)?
2008-11-06 20:59 the answer is: we can avoid the real work, which is the writepages
2008-11-06 20:59 filesystems struct?
2008-11-06 20:59 filesystem_operations or whatever it's called
2008-11-06 21:00 I didn't see a lot of overrides as we drilled down
2008-11-06 21:00 was looking for them
2008-11-06 21:00 but I could have missed something
2008-11-06 21:00 let's see what generic_writepages does
2008-11-06 21:01 that's what I meant by 'add'
2008-11-06 21:01 then see whether any journalling filesystems use it
2008-11-06 21:01 as in add-in the possibility for new overrides
2008-11-06 21:01 you mean fix the vfs?
2008-11-06 21:01 sure, can take a long time to convince akpm to take stuff like that
2008-11-06 21:02 if there's a way to make the old crap work, if only by just putting up with the unneeded things it does, that is generally the preferred route
2008-11-06 21:02 well, don't think I meant fix, more like add in one-or-two more hooks(hacks)
2008-11-06 21:02 still is an uphill battle, but one can always try
2008-11-06 21:02 although I guess that also leads to spaghetti code
2008-11-06 21:03 provide a core kernel patch, plus benchmarks that show how it makes your fs run better
2008-11-06 21:03 then get ready for everybody to challenge you to prove it also makes ext3 run better
2008-11-06 21:03 lol
2008-11-06 21:04 I would understand not wanting to make existing stuff run slower... but why require it to make existing stuff run faster?
2008-11-06 21:04 994 /* deal with chardevs and other special file */ <- gross
2008-11-06 21:05 which file?
2008-11-06 21:05 866 int write_cache_pages(struct address_space *mapping, <- ok I'm here
2008-11-06 21:05 852 * write_cache_pages - walk the list of dirty pages of the given address space and write all of them.
2008-11-06 21:05 just a sec
2008-11-06 21:05 http://lxr.linux.no/linux+v2.6.27/mm/page-writeback.c#L852
2008-11-06 21:06 this calls ->writepage on each dirty page cache page
2008-11-06 21:07 I strongly suspect that we do not want to do things this way in tux3
2008-11-06 21:07 hey flips
2008-11-06 21:07 but we want to do our own page cache walk, like I do it in tux3/user/filemap.c
2008-11-06 21:07 ACTION scans the backlog
2008-11-06 21:08 hope there's enough for you ;)
2008-11-06 21:09 so, a quick scan through generic_writepages shows that most of what it worries about is irrelevant to tux3
2008-11-06 21:09 things like the possibility that the fs is actually tmpfs
2008-11-06 21:09 the write congestion model is also very suspect
2008-11-06 21:10 still, if we had to, we could let this function drive tux3 writeout
2008-11-06 21:10 yeah, it's always good to get a core person talking about stuff with clarity
2008-11-06 21:10 you can learn the most from folks like that
2008-11-06 21:10 basically by not really writing pages inside our tux3 ->writepage
2008-11-06 21:11 bh, to be honest, nobody can talk about this stuff with clarity, except to say it's kind of unnatural
2008-11-06 21:11 but, those will trigger some delta transition?
2008-11-06 21:12 hirofumi, I was just wondering about that ;)
2008-11-06 21:12 let's see if we can answer that, and wind up for tonight
2008-11-06 21:14 there's two types of syncs: those done for data integrity reasons (someone called fsync) and those done to free memory (clear out dirty pages/buffers)
2008-11-06 21:14 there actually behaviour could potentially be vastly different
2008-11-06 21:14 umm...
those have to block the process to avoid filling all memory with dirty pages
2008-11-06 21:14 their :-(
2008-11-06 21:14 yes
2008-11-06 21:14 maze, vm should not require a big difference so long as some cache is cleaned
2008-11-06 21:15 free memory - write out biggest block of successive dirty pages to one region of disk that you can (potentially many such blobs)
2008-11-06 21:15 data integrity, needs to write out specific data in a specific order, potentially writing out additional data if it's convenient
2008-11-06 21:19 well, it looks like sync_inodes_sb is the last thing __fsync_super does, and generic_writepages is in turn the last thing that happens
2008-11-06 21:19 there is no final ->finish_up_the_sync method
2008-11-06 21:19 which is probably wrong
2008-11-06 21:19 so maze, yes, I think you need to save the world by adding a new method
2008-11-06 21:20 oh wait
2008-11-06 21:20 let me see
2008-11-06 21:20 can we wrap this entire thing
2008-11-06 21:20 ah, don't think so
2008-11-06 21:22 oh wait again, at least in sync_filesystems we can avoid the whole mess
2008-11-06 21:22 with our own ->sync_fs function
2008-11-06 21:23 ext3 provides its own
2008-11-06 21:25 flips: yeah, but you know the code so you have some kind of mental understanding pictured in your head. That's very helpful for folks that need pieces explained
2008-11-06 21:26 do dirty pages/buffers, etc, have some sort of time-of-first-and/or-last-dirtying, so that you can ask for a flush of everything dirtied before time x?
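
The question just above, together with the earlier remark that a sync only guarantees that everything dirty at the time of the call gets written, can be illustrated with a toy model in which each page records a logical time of first dirtying and a sync writes only pages dirtied at or before a cutoff. This is purely a sketch under assumptions: `toy_cache`, `toy_dirty`, and `toy_sync_before` are hypothetical names, the "clock" is a simple counter, and nothing here reflects how the kernel actually tracks dirty pages.

```c
#include <assert.h>

#define NPAGES 8

/* Toy page cache: each dirty page remembers when it was first dirtied. */
struct toy_cache {
    unsigned long now;                  /* logical clock, bumped per dirtying */
    unsigned long dirtied_at[NPAGES];   /* 0 means the page is clean */
};

/* Mark a page dirty, keeping the time of its first dirtying. */
void toy_dirty(struct toy_cache *c, int page)
{
    assert(page >= 0 && page < NPAGES);
    c->now++;
    if (!c->dirtied_at[page])
        c->dirtied_at[page] = c->now;
}

/*
 * "Write out" every page that was already dirty at `cutoff` and mark it
 * clean. Pages first dirtied after the cutoff are left alone.
 * Returns the number of pages written.
 */
int toy_sync_before(struct toy_cache *c, unsigned long cutoff)
{
    int written = 0;
    for (int i = 0; i < NPAGES; i++) {
        if (c->dirtied_at[i] && c->dirtied_at[i] <= cutoff) {
            c->dirtied_at[i] = 0;
            written++;
        }
    }
    return written;
}
```

A sync that samples `now` on entry and passes it as the cutoff writes exactly the pages that were dirty when it was called, and is free to ignore anything dirtied while it runs.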
2008-11-06 21:28 buffers used to be handled that way
2008-11-06 21:29 it wasn't too useful, plus hard to get right
2008-11-06 21:29 what did we call that, everybody used it
2008-11-06 21:30 35 if (unlikely(laptop_mode))
2008-11-06 21:30 36 laptop_sync_completion(); <- this strange hook does allow for a cleanup at the end of do_sync
2008-11-06 21:31 but I think the conclusion is, do_sync is still pretty awful
2008-11-06 21:31 tries to do the job the same way for everybody
2008-11-06 21:33 I think what tux3 will do, is try to do the entire job inside sync_fs, and then just not have any dirty pages for do_sync to come along and bother about it later
2008-11-06 21:33 heh
2008-11-06 21:33 of course, nothing prevents new pages from being dirtied after the sync_fs
2008-11-06 21:33 but I think they can be ignored
2008-11-06 21:34 because if the ->sync_fs writes out everything dirty _at the time it was called_ then the semantics are satisfied
2008-11-06 21:34 am I correct in assuming, that - provided no other fs writes to us are happening, after tux3 we should have on-disk state and on-memory state be equivalent (although not necessarily equal)?
2008-11-06 21:34 not correct
2008-11-06 21:35 because of the promise/rollup arrangement
2008-11-06 21:35 thus making ->sync_fs be a synchronization point, and not just a write all dirty data
2008-11-06 21:35 well
2008-11-06 21:35 ok, correct
2008-11-06 21:35 didn't mean after tux3, meant after sync_fs
2008-11-06 21:35 ie.
rip out drive after sync_fs completes, or poweroff machine, and you don't lose any modifications or any other state
2008-11-06 21:36 the filesystem will appear dirty on remount
2008-11-06 21:36 (I think this is stronger than the current required semantics of sync_fs, since it only requires write-out of dirty stuff, but not of the stuff that that write-out will/may dirty)
2008-11-06 21:36 tux3 will always appear 'dirty' though
2008-11-06 21:37 since there's very little difference between clean and dirty (basically a bit, right?)
2008-11-06 21:37 that is of course a big problem with this generic sync
2008-11-06 21:37 that flushing pages can dirty metadata
2008-11-06 21:37 so the only thing it doesn't do is clear the dirty bit in the superblock (although it could actually do that as well, but then we'd have to dirty it again, so kind of pointless)
2008-11-06 21:38 that's about it
2008-11-06 21:38 [dirty it again, for any further modification to the fs, which are likely to happen]
2008-11-06 21:38 so basically sync_fs and umount should be pretty much equivalent - except one does and the other doesn't clear the fs is dirty superblock bit
2008-11-06 21:39 one wonders if ext3 will avoid a journal replay if shut down immediately after do_sync
2008-11-06 21:39 [not talking about tearing down all the in memory structures of course, which are effectively just caches]
2008-11-06 21:40 32 sync_inodes(wait); /* Mappings, inodes and blockdevs, again. */ <- this comment written without blushing, impressive
2008-11-06 21:41 I think the current sync is an implementation of 'sync;sync;sync' - there everything should be synced
2008-11-06 21:41 geez, we actually sync the filesystem about four times
2008-11-06 21:41 in do_sync
2008-11-06 21:42 this thing has always been deeply messed, and I suppose it is a kind of comfort that it still is ;)
2008-11-06 21:42 maze, I was thinking precisely the same thing
2008-11-06 21:43 ok,.
family time for me
2008-11-06 21:43 we didn't get to the rename stuff I wanted to do
2008-11-06 21:43 it's a common opinion among sysadmins, that you need to run sync thrice to sync the system ;-)
2008-11-06 21:43 that is for next tuesday now
2008-11-06 21:44 yes, familiar with that one
2008-11-06 21:44 and linux is a new advanced os that does sync sync sync automatically for you
2008-11-06 21:44 indeed
2008-11-06 21:44 so by following the full prescription, you do 9 of those
2008-11-06 21:45 I mean some of this is understandable, since you can have multiple layers of fs/loop/blockdevs/raid/lvm/etc
2008-11-06 21:45 and you should sync bottom up
2008-11-06 21:45 but no-one actually implements the sync that way
2008-11-06 21:45 but it doesn't
2008-11-06 21:45 it syncs top down
2008-11-06 21:45 already deeply suspect
2008-11-06 21:45 I don't think it's top down
2008-11-06 21:45 it's more like random
2008-11-06 21:46 notice that we just sync fs's in linear order
2008-11-06 21:47 I think with the current system to be truly safe you have to run sync on the order of O(# mounted filesystems) + O(# of used block devices)
2008-11-06 21:47 times
2008-11-06 21:47 in parallel actually
2008-11-06 21:47 it's supposed to
2008-11-06 21:47 29 sync_supers(); /* Write the superblocks */
2008-11-06 21:47 30 sync_filesystems(0); /* Start syncing the filesystems */
2008-11-06 21:47 31 sync_filesystems(wait); /* Waitingly sync the filesystems */
2008-11-06 21:47 32 sync_inodes(wait); /* Mappings, inodes and blockdevs, again. */ <- top down
2008-11-06 21:48 that just feels wrong
2008-11-06 21:48 and hope there's no loops in the dependency structure (you can get loops, but you'd have to be crazy to want to: ie. put a file-backed-loop device as a spare member of raid device storing that filesystem)
2008-11-06 21:49 no notion of dependency for that matter
2008-11-06 21:49 if you have a filesystem loopback mounted on a filesystem loopback mounted on...
2008-11-06 21:49 yup, it's duct tape and wishful thinking
2008-11-06 21:49 then there is no code here to make sure that the sync is bottom up
2008-11-06 21:50 ok, family time for real
2008-11-06 21:51 tuesday: we go look at sys_rename again, this time concentrating on object lifetimes, locking etc
2008-11-06 21:52 ok, see you
2008-11-06 23:40 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3