2008-11-07 01:25 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3
2008-11-07 01:55 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3
2008-11-07 05:04 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3
2008-11-07 07:00 -!- mingming(~mingming@c-71-193-163-244.hsd1.or.comcast.net) has joined #tux3
2008-11-07 08:27 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3
2008-11-07 09:10 -!- mingming(~mingming@32.97.110.51) has joined #tux3
2008-11-07 11:07 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3
2008-11-07 12:28 -!- ajonat(~ajonat@190.48.97.229) has joined #tux3
2008-11-07 12:47 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3
2008-11-07 13:33 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3
2008-11-07 13:50 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3
2008-11-07 15:12 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3
2008-11-07 15:50 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3
2008-11-07 16:25 hi
2008-11-07 16:26 just a question:
2008-11-07 16:26 the "flush" operation at a delta transition (used to be called phase transition) also changes the inode table block
2008-11-07 16:26 the question is: what happens if another file create comes in that wants to change the same inode table block?
2008-11-07 16:26 we can't just "fork" the inode table block in cache
2008-11-07 16:27 new file creation is a new delta?
2008-11-07 16:27 in here
2008-11-07 16:29 If new creation was same delta, can't we modify "inode table block" buffer?
2008-11-07 17:39 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3
2008-11-07 18:18 hirofumi, yes, new file in a new delta
2008-11-07 18:19 if it's new delta, before starting new delta, we have to "flush"?
2008-11-07 18:19 yes
2008-11-07 18:20 if it is in the same delta (that is, the "active" delta) then the change can be made entirely in the dcache
2008-11-07 18:21 in the original model I had in mind, the change would also be made in a cached directory entry block
2008-11-07 18:21 but now I just want file operations to operate only on the dcache
2008-11-07 18:22 and the state of the dcache at the time of a delta transition is then put into directory entry blocks
2008-11-07 18:22 this will have the side effect of making the case where a file is created then immediately deleted a lot more efficient, for what that is worth
2008-11-07 18:22 in other words, temporary files get really cheap
2008-11-07 18:24 if new delta did "flush", we already have the latest state of buffers?
2008-11-07 18:25 can you clarify?
2008-11-07 18:26 maybe I don't understand what is the issue of that
2008-11-07 18:26 I'm thinking that issue is
2008-11-07 18:26 after the "flush" the state of the namespace is then represented in the cached directory entry blocks
2008-11-07 18:27 however, the top end may have gone on and made more changes
2008-11-07 18:27 so the only time it will exactly match is when there are no new namespace operations taking place
2008-11-07 18:29 um...
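The idea above (file operations touch only the dcache, and the dcache state is copied into directory entry blocks at a delta transition) can be sketched roughly as follows. This is a minimal single-threaded model, not tux3 code: `namespace_model`, `model_create`, `model_unlink`, `model_flush` and the fixed-size table are all hypothetical names invented for illustration, and the model tracks names only, not inodes. It mainly shows why a file that is created and deleted within one delta costs no directory block update.

```c
#include <string.h>

/* Hypothetical name-only model of deferred namespace operations. */
#define MAX_NAMES 16
#define NAME_LEN  32

struct name_entry {
    char name[NAME_LEN];
    int used;      /* slot is occupied */
    int in_cache;  /* name exists as seen by the top end (dcache stand-in) */
    int on_disk;   /* name exists in the directory entry blocks */
};

struct namespace_model {
    struct name_entry entries[MAX_NAMES];
    int dirent_writes;  /* total directory block updates made by flushes */
};

struct name_entry *model_lookup(struct namespace_model *ns, const char *name)
{
    for (int i = 0; i < MAX_NAMES; i++)
        if (ns->entries[i].used && strcmp(ns->entries[i].name, name) == 0)
            return &ns->entries[i];
    return NULL;
}

/* Top end: create touches only the cached state. Returns 0 or -1. */
int model_create(struct namespace_model *ns, const char *name)
{
    struct name_entry *e = model_lookup(ns, name);

    if (e) {
        if (e->in_cache)
            return -1;      /* name already exists */
        e->in_cache = 1;    /* re-created before its delete was flushed */
        return 0;
    }
    if (strlen(name) >= NAME_LEN)
        return -1;
    for (int i = 0; i < MAX_NAMES; i++) {
        e = &ns->entries[i];
        if (!e->used) {
            memset(e, 0, sizeof *e);
            strcpy(e->name, name);
            e->used = 1;
            e->in_cache = 1;
            return 0;
        }
    }
    return -1;              /* table full */
}

/* Top end: delete also touches only the cached state. */
int model_unlink(struct namespace_model *ns, const char *name)
{
    struct name_entry *e = model_lookup(ns, name);

    if (!e || !e->in_cache)
        return -1;
    e->in_cache = 0;
    return 0;
}

/* Back end, at a delta transition: reconcile the directory blocks with
 * the cached state. Returns the number of directory updates needed. */
int model_flush(struct namespace_model *ns)
{
    int writes = 0;

    for (int i = 0; i < MAX_NAMES; i++) {
        struct name_entry *e = &ns->entries[i];

        if (!e->used)
            continue;
        if (e->in_cache != e->on_disk) {
            e->on_disk = e->in_cache;
            writes++;
        }
        if (!e->in_cache)
            e->used = 0;    /* gone from both cache and disk */
    }
    ns->dirent_writes += writes;
    return writes;
}
```

In this model, creating and then unlinking a temporary name before the next flush leaves nothing for the flush to write, while a name that survives until the flush costs one update.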
2008-11-07 18:31 so if new delta is blocked for "flush" to start new delta, I think there is no new namespace operations
2008-11-07 18:32 all namespace operations would have to block too
2008-11-07 18:32 yes
2008-11-07 18:32 that is what I don't like
2008-11-07 18:33 ah, i see
2008-11-07 18:33 it doesn't really make sense to block top end operations just because the cache layer is busy preparing to _begin_ a writeout phase
2008-11-07 18:33 I thought all operations wait to start new delta in that case
2008-11-07 18:33 the only time it should block is when it can't get new memory for cache
2008-11-07 18:33 i see
2008-11-07 18:33 starting a new delta would not solve the problem
2008-11-07 18:34 yes
2008-11-07 18:34 so the approach I have in mind is for the top end file operations not to modify dirent or inode table blocks at all
2008-11-07 18:34 is it optimization thing?
2008-11-07 18:34 yes
2008-11-07 18:34 ah, i see
2008-11-07 18:35 just about avoiding stalls
2008-11-07 18:35 I thought those were "fix"
2008-11-07 18:35 also a correctness thing, if the top end and the back end are both modifying the same blocks then synchronization is required
2008-11-07 18:36 yes
2008-11-07 18:36 by rearranging things so that the top end does not modify any cached blocks except cached data blocks, that eliminates the need to do the synchronization
2008-11-07 18:36 so I think the implementation complexity is about the same either way
2008-11-07 18:37 but the "deferred namespace operations" approach will give better behavior
2008-11-07 18:38 maybe blocking everything is easy, but much slower though
2008-11-07 18:38 true
2008-11-07 18:39 we could have a rw semaphore that the back end takes when it starts to prepare a delta for writeout
2008-11-07 18:39 btw, inode number is assigned while deferring
2008-11-07 18:39 and file operations take a read lock when they want to create or delete a file
2008-11-07 18:39 that would work, and it would obviously be a bottleneck
2008-11-07 18:39 yes, inode number assignment will be deferred
2008-11-07 18:40 which means that nfs file handle resolution must wait until that is completed
2008-11-07 18:40 so that at least has to be synchronized
2008-11-07 18:40 and the other thing you noticed is that directory listing also has to wait until names are flushed into the directory entry blocks
2008-11-07 18:41 yes
2008-11-07 18:41 what happens to fstat()?
2008-11-07 18:41 it needs to return ino
2008-11-07 18:41 gets the attributes out of the cached inode
2008-11-07 18:41 ah
2008-11-07 18:41 yes
2008-11-07 18:42 too bad about that
2008-11-07 18:42 that also requires "flush"?
2008-11-07 18:42 I seem to recall that you're not supposed to use the inode field
2008-11-07 18:42 let me check
2008-11-07 18:43 btw, ramfs assigns ino (i.e. last_ino)
2008-11-07 18:46 have you got a file/line number?
2008-11-07 18:46 for fstat?
2008-11-07 18:46 for the ramfs ino assignment
2008-11-07 18:47 it's in new_inode()
2008-11-07 18:47 http://lxr.linux.no/linux+v2.6.27.5/fs/inode.c#L549
2008-11-07 18:48 I think assignment is
2008-11-07 18:48 http://lxr.linux.no/linux+v2.6.27.5/fs/inode.c#L567
2008-11-07 18:48 that's new
2008-11-07 18:49 oh
2008-11-07 18:49 seems like a bad hack
2008-11-07 18:49 I wonder what the justification was
2008-11-07 18:49 need to go crawling through git commits to find out
2008-11-07 18:50 might not take long actually
2008-11-07 18:50 justification of last_ino?
2008-11-07 18:50 for ramfs having inode numbers
2008-11-07 18:50 I bet it's connected to some other questionable design direction
2008-11-07 18:51 at least, iirc, hash needs i_ino
2008-11-07 18:52 which hash?
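The deferred inode number assignment discussed above, where fstat and NFS file handle resolution have to wait until the back end has assigned a number, might look something like this in outline. It is a single-threaded sketch under stated assumptions: `model_inode`, `model_backend`, `model_assign_ino` and `model_stat_ino` are hypothetical names, and "waiting" is represented simply by a -1 return telling the caller that the number is not available yet.

```c
#include <stdint.h>

/* Hypothetical model of an inode whose number is assigned lazily. */
struct model_inode {
    uint64_t ino;      /* meaningful only when ino_assigned is set */
    int ino_assigned;
};

/* Hypothetical back end state: the next free inode number. */
struct model_backend {
    uint64_t next_ino;
};

/* Top end: a new inode starts out with no inode number at all. */
void model_inode_init(struct model_inode *inode)
{
    inode->ino = 0;
    inode->ino_assigned = 0;
}

/* Back end, while preparing a delta: assign a number exactly once. */
void model_assign_ino(struct model_backend *be, struct model_inode *inode)
{
    if (!inode->ino_assigned) {
        inode->ino = be->next_ino++;
        inode->ino_assigned = 1;
    }
}

/* fstat-like query: returns 0 and stores the number if it is known,
 * or -1 if the caller would have to wait for the back end first. */
int model_stat_ino(const struct model_inode *inode, uint64_t *ino_out)
{
    if (!inode->ino_assigned)
        return -1;
    *ino_out = inode->ino;
    return 0;
}
```

A real implementation would presumably block or trigger the assignment instead of returning an error; the return code here only marks the point where that synchronization would be needed.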
2008-11-07 18:52 inode_hash
2008-11-07 18:53 that's a per-fs operation
2008-11-07 18:53 http://lxr.linux.no/linux+v2.6.27.5/include/linux/fs.h#L1843
2008-11-07 18:55 insert_inode_hash is not called by the vfs
2008-11-07 18:55 http://lxr.linux.no/linux+v2.6.27/+ident=18000161
2008-11-07 18:56 yes
2008-11-07 18:56 hmm, it's taking kernel.org forever to generate a history for ramfs/inode.c
2008-11-07 18:57 maybe ramfs wants to use generic stuff and libraries
2008-11-07 18:57 it worked fine for years with no inode numbers
2008-11-07 18:57 um...
2008-11-07 18:59 in 2.4.0, new_inode was called get_empty_inode
2008-11-07 18:59 and get_empty_inode() uses last_ino in the same way
2008-11-07 19:00 2.6.26.5 uses new_inode but not last_ino
2008-11-07 19:01 ok, I see what you mean
2008-11-07 19:01 you're right
2008-11-07 19:02 will/can tux3 work without i_ino assignment?
2008-11-07 19:02 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3
2008-11-07 19:03 good question
2008-11-07 19:03 I think so
2008-11-07 19:04 at least, inode number assignments can be deferred for a while
2008-11-07 19:04 it might be necessary to make fstat wait until inode number has been assigned
2008-11-07 19:04 i see
2008-11-07 19:05 this is still better than making file create wait until a delta has been prepared for writeout
2008-11-07 19:05 yes
2008-11-07 19:06 btw, we need to handle all physical remapping stuff like this?
2008-11-07 19:06 556 static unsigned int last_ino; <- ohmygod, barf
2008-11-07 19:06 that is disgusting
2008-11-07 19:07 which physical remapping?
2008-11-07 19:07 you mean, moving file blocks around?
2008-11-07 19:08 I'm not sure, maybe pointer to dtree root?
2008-11-07 19:08 and atree root
2008-11-07 19:09 ok, now obviously the inode number generated in new_inode is replaced later by any filesystem that actually has inode numbers
2008-11-07 19:09 so... there is something deeply broken here
2008-11-07 19:09 an inode number that suddenly changes is worse than no inode number at all
2008-11-07 19:10 yes, it's good old broken behavior of no actual inode number
2008-11-07 19:10 re dtree roots etc, that is handled entirely by the back end, we don't have a messy collision like with namespace ops
2008-11-07 19:11 great
2008-11-07 19:11 ok, I will be busy investigating inode number issues for a little while, thanks for bringing up the issue
2008-11-07 19:11 thanks for explaining those
2008-11-07 19:12 I need to get this post posted and get things moving
2008-11-07 19:12 by now, you know everything that is in the post
2008-11-07 19:12 but you may be the only one, so I have to share it with the rest
2008-11-07 19:13 yes, that's very good
2008-11-07 19:14 I want to read it too though :)
2008-11-07 19:16 one thing you _can_ always do before assigning an inode number is write to a file
2008-11-07 19:16 the case of create+write is very common, obviously
2008-11-07 19:17 yes
2008-11-07 19:17 maybe i_ino is needed at a few places
2008-11-07 19:18 anything that could access it before the filesystem makes the real assignment is broken
2008-11-07 19:18 but we should look for specific cases
2008-11-07 19:19 last_ino is a private static, that means it starts again at zero each time the kernel boots
2008-11-07 19:19 and is shared by all filesystems, it probably only matters to about three of them
2008-11-07 19:20 yes
2008-11-07 19:20 and because it is limited to 2^32, there is a real danger it can wrap
2008-11-07 19:20 should not be hard to write a test case
2008-11-07 19:20 it's about the crappiest thing I've seen in kernel, ever
2008-11-07 19:21 2^32 is workaround for lfs (ia32e mode in x86_64)
2008-11-07 19:21 I noticed the comment
2008-11-07 19:24 um.. security stuff may use i_ino...
2008-11-07 19:34 http://lxr.linux.no/linux+v2.6.27.5/kernel/auditsc.c#L1792
2008-11-07 19:37 um.. audit seems to use i_ino to check whether it's an interesting inode...
2008-11-07 20:22 -!- RazvanM(~RazvanM@pool-151-196-126-202.balt.east.verizon.net) has joined #tux3
2008-11-07 20:32 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3
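The wraparound concern raised above for a shared 32-bit `last_ino` counter can be illustrated with a small user-space model. This is not the kernel's code: `model_last_ino`, `model_next_ino` and `model_skip_allocations` are hypothetical names, and the model assumes that each allocation simply pre-increments a 32-bit counter that starts at zero. Under that assumption, the number handed out first comes back again after 2^32 further allocations.

```c
#include <stdint.h>

/* Hypothetical stand-in for a counter shared by all filesystems.
 * It starts at zero, as a static would after every boot. */
static uint32_t model_last_ino;

/* One allocation: pre-increment and return, wrapping modulo 2^32. */
uint32_t model_next_ino(void)
{
    return ++model_last_ino;
}

/* Advance the counter as if `count` allocations had happened, without
 * looping billions of times. Unsigned arithmetic wraps modulo 2^32. */
void model_skip_allocations(uint64_t count)
{
    model_last_ino += (uint32_t)count;
}
```

Skipping `UINT32_MAX` allocations after the first one brings the counter back to zero, so the next allocation returns the same value as the first while the original inode may still be in use.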