2008-09-02 00:06 alright I'm heading to bed 2008-09-02 00:13 sorry, didn't notice your chat 2008-09-02 00:13 if you type "flips" then the tab lights up 2008-09-02 00:15 konrad, your writeup is accurate 2008-09-02 00:16 konrad, if leaf->count is zero, dict[-1] does not exist either 2008-09-02 00:17 valgrind will complain if you try to pretend it exists ;-) 2008-09-02 00:17 ACTION loves valgrind 2008-09-02 06:27 flips: right 2008-09-02 06:42 sorry for the exploded code 2008-09-02 09:30 it was nice code 2008-09-02 09:30 now it's compressed code ;-) 2008-09-02 09:31 yay 2008-09-02 09:32 now what? 2008-09-02 09:49 more checks? 2008-09-02 09:49 let me see 2008-09-02 09:49 what else can be checked about ileaf 2008-09-02 09:50 could check for unknown attributes 2008-09-02 09:50 or could tackle dleaf, much harder 2008-09-02 09:50 I'll do the former then start on the latter, I guess. Sound good? 2008-09-02 09:51 sounds good 2008-09-02 09:51 excellent 2008-09-02 09:51 attr_check 2008-09-02 09:52 I wonder why mailman fails to post some list posts 2008-09-02 09:53 I see your answer to masoud, but not masoud's post 2008-09-02 09:53 ah 2008-09-02 09:53 ACTION looks for logs 2008-09-02 09:53 he didn't write to list 2008-09-02 09:53 just replied to me 2008-09-02 09:53 ah right 2008-09-02 09:53 so I CC'd the list 2008-09-02 09:53 replying back to list is good 2008-09-02 09:59 ok time for me to start another hack 2008-09-02 09:59 truncate I think it was 2008-09-02 10:04 yep 2008-09-02 10:17 hm, how do I set up an hg username? 2008-09-02 10:18 (and anything else it needs) 2008-09-02 10:20 flips: http://pastie.caboo.se/264630 <-- look ok? 
2008-09-02 10:25 konrad, also need to check that the attribute list ends exactly at the size limit 2008-09-02 10:25 and do that without accessing out of bounds 2008-09-02 10:25 slightly tricky 2008-09-02 10:25 the neat thing about an rcs like hg is you don't have to ask permission or have a user name 2008-09-02 10:26 it makes commits to the local repo nicer 2008-09-02 10:26 what do you mean by the list ends at the size limit? 2008-09-02 10:27 the attributes are all variable sizes 2008-09-02 10:28 so you need to do that attr = decode(attr...) thing 2008-09-02 10:28 checking that the resulting pointer is not out of range 2008-09-02 10:28 why not just look up the size and check that? 2008-09-02 10:29 sure 2008-09-02 10:29 which is exactly what decode* does 2008-09-02 10:29 ah 2008-09-02 10:29 the magic numbers 6 and 10 should be replaced by constants, you can add those constants to the enum 2008-09-02 10:30 flips: did you see the post about performance on the zumastor list? 2008-09-02 10:30 k 2008-09-02 10:30 mornin' all 2008-09-02 10:30 have not yet 2008-09-02 10:30 hiyah 2008-09-02 10:30 konrad, which post is that 2008-09-02 10:31 konrad: good work, welcome :) 2008-09-02 10:31 flips: which post is what? 2008-09-02 10:31 shapor: thanks 2008-09-02 10:31 oh 2008-09-02 10:31 I'm not on the sumastor list 2008-09-02 10:31 shapor said it 2008-09-02 10:31 :D 2008-09-02 10:31 let me see 2008-09-02 10:32 flips: Subject: Re: RHEL5 2.6.18 support? 2008-09-02 10:33 yes 2008-09-02 10:33 good post 2008-09-02 10:33 and we have the answer: tux3 + backport to zumastor 2008-09-02 10:34 flips: should attr_check fail if the size of an attr is less than 2, or is that allowed? 
2008-09-02 10:34 allowed I think, but there is no attribute with that size 2008-09-02 10:35 right 2008-09-02 10:35 I mean 2 including the header 2008-09-02 10:35 that's a bug 2008-09-02 10:35 which is itself 2 bytes 2008-09-02 10:35 ok, I'll fail if that happens 2008-09-02 10:35 headers are never less than 2 bytes, I don't see changing that 2008-09-02 10:35 we're not quite that insane about compression 2008-09-02 10:35 ok 2008-09-02 10:38 "The long and short of truncate" -- new post coming 2008-09-02 10:39 flips: http://pastie.caboo.se/264644 2008-09-02 10:41 konrad, there can be multiple attributes per leaf entry 2008-09-02 10:41 attr_check should not know about dictionary format at all 2008-09-02 10:41 just take (base, size) 2008-09-02 10:42 hm? 2008-09-02 10:42 to set up a unit test, you need to actually encode some attributes, so this function belongs in iattr.c rather than ileaf.c 2008-09-02 10:42 ah 2008-09-02 10:44 attr_check(void *attrs, unsigned size)? 2008-09-02 10:45 right 2008-09-02 10:45 k 2008-09-02 10:45 would return yes/no I think 2008-09-02 10:45 and the caller would complain 2008-09-02 10:45 maybe 2008-09-02 10:53 hm 2008-09-02 10:53 in encode_attrs() in iattr.c 2008-09-02 10:53 the for loop runs kind from 0 to 32 2008-09-02 10:53 when kind only gets 4 bits on disk 2008-09-02 10:54 yes, sloppy 2008-09-02 10:54 ;-) 2008-09-02 10:54 :) 2008-09-02 10:54 feel free to improve 2008-09-02 10:54 the reason the lowest attr kind is not zero is, it catches more bugs if it isn't 2008-09-02 10:54 right, I saw that earlier 2008-09-02 10:55 I think attr kind zero will only get used when all 15 others are used 2008-09-02 10:55 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-09-02 10:55 and then it will likely mean "just pad this" 2008-09-02 10:55 heh 2008-09-02 10:55 we might declare at some point that attributes are always padded to even numbers of bytes 2008-09-02 10:56 or we might allow odd numbers 2008-09-02 10:56 then I'd have to recant the 
above statement about no attr kind less than 2 bytes 2008-09-02 10:56 we'd introduce at least one one byte attr 2008-09-02 10:56 "noop" 2008-09-02 10:56 or pad 2008-09-02 10:57 so that we can update attrs in some cases without moving everything in the leaf 2008-09-02 10:57 future optimization 2008-09-02 10:57 anyway, just to say the design might evolve a little some weeks down the road 2008-09-02 10:58 for now it is a two-byte granularity 2008-09-02 10:58 that means when immediate data attributes get added, they need to be padded out 2008-09-02 10:59 hey maze 2008-09-02 11:01 flips: more like this? http://pastie.caboo.se/264662 2008-09-02 11:01 exactly like that I think 2008-09-02 11:02 hello 2008-09-02 11:02 hey 2008-09-02 11:02 ready for your vfs tutorial? 2008-09-02 11:04 <- maze 2008-09-02 11:04 right now? no, sorry, I'm in the middle of a big turnup which is already behind schedule 2008-09-02 11:04 konrad, you can use the new enum you just declared for both the lower and upper limit of the encode loop 2008-09-02 11:04 no I meant in general 2008-09-02 11:05 it's going to be a long tutorial ;-) 2008-09-02 11:05 ah ok 2008-09-02 11:05 period of 3 weeks I'd think 2008-09-02 11:05 in general? haven't really had the time :-( to do much - as in almost none. 2008-09-02 11:05 I need a vacation... 2008-09-02 11:05 at the end you get the "phillips certificate of vfs competency" 2008-09-02 11:05 and the right to flame newbies on lkml 2008-09-02 11:05 well worth having 2008-09-02 11:06 cool ;-) I'd love to. 2008-09-02 11:06 ACTION listens carefully and hears the sound of google data centers burning down 2008-09-02 11:06 should be able to run a class of two, shapor is about ready for this 2008-09-02 11:07 konrad too I think 2008-09-02 11:07 I'll listen in certainly 2008-09-02 11:07 they're not burning too quickly at least ;-) 2008-09-02 11:08 maze, did you notice your comments on that fat key space were highly relevant? 
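The attr_check(void *attrs, unsigned size) idea from earlier in the log can be sketched roughly as below. This is only an illustration under assumptions: the 2-byte big-endian header with the kind in the top 4 bits, the MIN_ATTR/MAX_ATTR bounds (borrowed from the "magic numbers 6 and 10" remark), and the payload size table are all hypothetical and are not taken from tux3's iattr.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bounds on attribute kinds, standing in for the
 * "magic numbers 6 and 10" mentioned in the chat. */
enum { MIN_ATTR = 6, MAX_ATTR = 10, ATTR_HEAD = 2 };

/* Hypothetical payload sizes in bytes, indexed by kind (assumption,
 * not the real tux3 table). All sizes are even to keep the
 * two-byte granularity discussed in the chat. */
static const unsigned char attr_payload[16] = {
	[6] = 12, [7] = 8, [8] = 14, [9] = 4,
};

/* Return 1 if [attrs, attrs + size) holds only known attributes and
 * the last one ends exactly at the size limit, else 0. The walk never
 * reads a byte at or beyond attrs + size. */
int attr_check(const void *attrs, unsigned size)
{
	const unsigned char *p = attrs, *limit = p + size;

	assert(attrs || !size);
	while (p < limit) {
		if (limit - p < ATTR_HEAD)
			return 0;	/* too short to hold even a header */
		unsigned head = p[0] << 8 | p[1];	/* assumed big-endian header */
		unsigned kind = head >> 12;	/* assumed: kind in top 4 bits */
		if (kind < MIN_ATTR || kind >= MAX_ATTR)
			return 0;	/* unknown attribute kind */
		unsigned total = ATTR_HEAD + attr_payload[kind];
		if ((unsigned)(limit - p) < total)
			return 0;	/* attribute would run past the limit */
		p += total;
	}
	return 1;	/* loop exits only with p == limit */
}
```

The function takes only (base, size) and knows nothing about the leaf dictionary, so a caller in ileaf.c can decide how to complain when it returns 0.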
2008-09-02 11:08 hammer essentially implements what you suggested 2008-09-02 11:08 so the idea is far from useless 2008-09-02 11:08 I'm not sure what you're referring to ;-) fat key space? 2008-09-02 11:09 your "beautiful idea" you had after the initial tux3 whiteboarding 2008-09-02 11:09 to incorporate the file offset in the btree key 2008-09-02 11:09 hammer does that 2008-09-02 11:09 ok, right that one 2008-09-02 11:09 just like you imagined 2008-09-02 11:09 I did rather like that one 2008-09-02 11:09 tux3 does not, that is the main difference between them, and the allocation method 2008-09-02 11:10 it makes a beautifully simple design 2008-09-02 11:10 exactly... so why doesn't tux3 use it? 2008-09-02 11:10 but my guess is, tux3 will end up faster as it is more cache efficient to have a two level tree 2008-09-02 11:10 the number of probes is the same 2008-09-02 11:11 hmm, I really think it should work better with just one 2008-09-02 11:11 not probes, but btree compares 2008-09-02 11:11 I ran the numbers in detail 2008-09-02 11:11 hmm, really? interesting. 2008-09-02 11:11 having a single tree means a deeper tree 2008-09-02 11:11 it works out exactly 2008-09-02 11:11 true 2008-09-02 11:11 log(something) either way 2008-09-02 11:11 and it also probably means less children per node because of larger keys... 2008-09-02 11:12 it does 2008-09-02 11:12 yes, but... 2008-09-02 11:12 it should spread out better over the entire filesystem 2008-09-02 11:12 hammer: 64 tux3: 256 or 512 2008-09-02 11:12 so instead of access being log(# files) + log(size of file) 2008-09-02 11:12 tux3: sometimes 384 2008-09-02 11:12 you have access being something like (log used disk space) 2008-09-02 11:13 - extents 2008-09-02 11:13 in tux3? 
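The "log(# files) + log(size of file)" versus "log(used disk space)" comparison can be put into rough numbers with a small helper. This is a back-of-envelope sketch, not a measurement: the fanouts 64 and 256 are the ones quoted in the chat, while the file and extent counts passed in are whatever the caller assumes, and real trees are rarely packed full.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest number of levels d such that fanout^d >= keys,
 * i.e. the depth of a fully packed btree holding that many keys. */
unsigned btree_levels(uint64_t keys, unsigned fanout)
{
	assert(fanout >= 2);
	unsigned levels = 1;
	uint64_t reach = fanout;

	while (reach < keys) {
		if (reach > UINT64_MAX / fanout)
			return levels + 1;	/* next level covers any 64-bit count */
		reach *= fanout;
		levels++;
	}
	return levels;
}

/* One fat tree keyed by (file, offset): a single lookup over every extent.
 * Assumes files * extents_per_file does not overflow 64 bits. */
unsigned fat_tree_levels(uint64_t files, uint64_t extents_per_file, unsigned fanout)
{
	return btree_levels(files * extents_per_file, fanout);
}

/* Two-level scheme: walk the inode table, then the per-file tree. */
unsigned two_level_levels(uint64_t files, uint64_t extents_per_file, unsigned fanout)
{
	return btree_levels(files, fanout) + btree_levels(extents_per_file, fanout);
}
```

With, say, 2^20 files of 2^10 extents each, both functions can be evaluated for fanout 64 and fanout 256 to see how close the level counts come out; the cache argument made in the chat (the inode table levels are mostly cached) is not captured by this arithmetic.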
2008-09-02 11:13 no comparison of fat vs thin btrees 2008-09-02 11:13 it's mainly log(inode table size) in tux3 2008-09-02 11:14 and the inodes are cached 2008-09-02 11:14 so that disappears mostly 2008-09-02 11:14 leaving the nice little per-file btrees 2008-09-02 11:14 so I guess the two level approach will significantly outperform, just a constant factor but a big one 2008-09-02 11:14 I just want the metadata on a different disk/ram/flash-backed/etc ;-) 2008-09-02 11:15 ah, that is coming 2008-09-02 11:15 my answer to zfs's mess 2008-09-02 11:15 is a rather nice hack 2008-09-02 11:15 I really really like forward logging 2008-09-02 11:15 involving having tux3 work together with lvm3 2008-09-02 11:15 although I hate the fact there's that 0.0001% chance of it breaking 2008-09-02 11:15 the forward logging thing is working out design wise, I should incorporate it into the userspace prototype now 2008-09-02 11:16 the chance is nowhere near that big 2008-09-02 11:16 it's completely under our control 2008-09-02 11:16 and we will have an option to disable it completely, just for the ultraparanoid 2008-09-02 11:16 the real trick there is to use a sufficiently paranoid checksumming signature 2008-09-02 11:16 with the "phase" commit philosophy, it will still be efficient even without relying on a hash 2008-09-02 11:17 right 2008-09-02 11:17 and even then - it can only fail on non-clean remount 2008-09-02 11:17 and the checksum can be avoided completely in significant cases 2008-09-02 11:17 right again 2008-09-02 11:17 which should be rare... 
so the chance of failure should be as close to '0' as can be while still theoretically possible 2008-09-02 11:18 so, the really nice thing is, when you have a whole bunch of transactions ready to commit, the forward logging can be done without any hash: wait for transaction completions, then mark complete in a known location 2008-09-02 11:18 that will be the ultra paranoid option 2008-09-02 11:19 I would think a 64 bit decent hash would get close to 0 error chance 2008-09-02 11:19 calculating the hashes should be cheap 2008-09-02 11:19 maybe make that configurable 2008-09-02 11:19 yes 2008-09-02 11:19 so long as its not m5 2008-09-02 11:19 we're not talking about calculating the hash from a large amount of data 2008-09-02 11:19 md5 2008-09-02 11:19 or something like that 2008-09-02 11:20 even if it's md5, it's still fast, because it's much-much faster than a disk seek 2008-09-02 11:20 zfs and btrfs find its a significant cost if they checksum everything 2008-09-02 11:20 and you'd be hashing something like 256 bytes or so 2008-09-02 11:20 about the biggest bottleneck in fact 2008-09-02 11:20 its really important to have an efficient hash 2008-09-02 11:21 oh, no, I thought of it as literally a block signature for the superblock 2008-09-02 11:21 not for everything else 2008-09-02 11:32 right 2008-09-02 11:33 I think, just checksum all the _used_ data in the commit block and part of the data blocks 2008-09-02 11:33 right, that'd be nice - use something like a crc32 (cpu support) for that - maybe two crc32's in parallel (or a crc64 if sse will support that) 2008-09-02 11:34 it can be an option whether we rely on the checksum to know that the data part of the transaction got onto media, or wait for completion on data before submitting the commit block 2008-09-02 11:34 cpu support for crs32? 
2008-09-02 11:34 but I was thinkning of each block in the forward log and the superblock having a sort of tail signature which would look kind of like 2008-09-02 11:34 I don't know that instruction ;-) 2008-09-02 11:34 crc32 - yeap, coming in sse4.1 or so 2008-09-02 11:34 oh, that's too bad 2008-09-02 11:34 should be out in nehalem or even earlier 2008-09-02 11:34 crc32 sucks 2008-09-02 11:34 for hashing 2008-09-02 11:34 well.... 2008-09-02 11:35 you should be able to do a crc32*4 easily enough 2008-09-02 11:35 I hope it's not crc32 specific 2008-09-02 11:35 it is 2008-09-02 11:35 still not good 2008-09-02 11:35 crc32 has funnels 2008-09-02 11:35 lots of them 2008-09-02 11:35 yes, well... 2008-09-02 11:35 bleah 2008-09-02 11:35 ACTION hates intel 2008-09-02 11:35 I wish they supported md5/sha1 and aes in the cpu 2008-09-02 11:35 it's a powerful argument for using a substandard hash 2008-09-02 11:36 double bleah 2008-09-02 11:36 SSE4.2 Instruction Description CRC32 Accumulate CRC32C value using the polynomial 0x11EDC6F41 (or, without the high order bit, 0x1EDC6F41).[5] 2008-09-02 11:37 Nehalem and on, so next year 2008-09-02 11:37 I'll ping a mathematician to analyze it 2008-09-02 11:38 see if we can make something useful out of that turd 2008-09-02 11:38 I'd hate to incorporate crc32 into tux3 on-disk format just because intel farted 2008-09-02 11:38 we'll see what amd comes up with 2008-09-02 11:38 right 2008-09-02 11:38 in fact 2008-09-02 11:38 I know who to talk to about that 2008-09-02 11:39 amd is about to go intel one better 2008-09-02 11:39 lol, how? 
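For reference, the CRC32C named in the quoted SSE4.2 description can be computed in plain software, which makes it easy to experiment with before any hardware support exists. The sketch below is a bit-at-a-time version of the standard CRC-32C (Castagnoli) checksum; 0x82F63B78 is the bit-reflected form of the polynomial 0x1EDC6F41 quoted above. It says nothing about which checksum tux3 ends up using; the chat leaves that open.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32C (Castagnoli): reflected polynomial 0x82F63B78,
 * initial value and final XOR of 0xFFFFFFFF. Slow but table-free. */
uint32_t crc32c(const void *data, size_t len)
{
	const unsigned char *p = data;
	uint32_t crc = 0xFFFFFFFFu;

	assert(p || !len);
	while (len--) {
		crc ^= *p++;
		for (int bit = 0; bit < 8; bit++)
			/* shift right; fold in the polynomial if a 1 bit fell out */
			crc = (crc >> 1) ^ (0x82F63B78u & -(crc & 1));
	}
	return ~crc;
}
```

The usual check value for this CRC is 0xE3069283 for the nine ASCII bytes "123456789", which is a convenient sanity test for any faster table-driven or hardware variant.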
2008-09-02 11:39 and I'd be happy to run a few cycles slower on intel just to force intel to do it right 2008-09-02 11:39 heh 2008-09-02 11:39 sekrit 2008-09-02 11:39 ok I need to get my mathematical ducks in a row for this 2008-09-02 11:40 AMD claims SSE5 will provide dramatic performance improvements, particularly in high performance computing (HPC), multimedia and computer security applications, including a 5x performance gain for Advanced Encryption Standard (AES) encryption and a 30% performance gain for discrete cosine transform (DCT) used to process video streams.[1] 2008-09-02 11:40 that's more like it 2008-09-02 11:40 I'll go get into the nda loop there 2008-09-02 11:40 AMD's) SSE5 does not include all (Intel's) SSE4 instructions. In other words, it is not a superset of SSE4 but a competitor to it. Likewise, Intels pre-Nehalem cores contain only a partial implementation of SSE4, called SSE4.1. This poses some difficulty and extra work for compilers and assembly-level hand tuning of code 2008-09-02 11:40 make sure amd is a tux-ready machine 2008-09-02 11:43 SSE5 includes: 2008-09-02 11:43 Fused multiply-accumulate (FMACxx) instructions Integer multiply-accumulate (IMAC, IMADC) instructions Permutation (PPERM, PERMPx) and conditional move (PCMOV) instructions Precision control, rounding, and conversion instructions 2008-09-02 11:44 note the permutation stuff 2008-09-02 11:44 probably what gives the aes boost 2008-09-02 11:44 should be useable for hash/crypt stuff as well 2008-09-02 11:45 noted 2008-09-02 11:45 that's the right way to do it 2008-09-02 11:45 it's perfect 2008-09-02 11:45 amd rulez 2008-09-02 11:45 intel suckorz 2008-09-02 11:45 sukzorz 2008-09-02 11:55 the fused multiply will also mean a huge amount for flops freaks everywhere ;-) 2008-09-02 11:56 ie. 
anybody doing anything high-precision 2008-09-02 11:58 :-) 2008-09-02 11:58 ACTION is a flops freak 2008-09-02 12:06 oh, weird, wonder how I managed to do that 2008-09-02 12:07 yes, odd 2008-09-02 12:07 indeed 2008-09-02 12:08 edit without compile most probably 2008-09-02 12:08 thought I did compile though 2008-09-02 12:08 odd 2008-09-02 12:12 scamjet time 2008-09-02 12:16 konrad, I tghi 2008-09-02 12:16 konrad, I think you have ileaf under control 2008-09-02 12:16 dleaf is 10x harder ;-) 2008-09-02 12:16 maybe 100x 2008-09-02 12:17 heh 2008-09-02 12:17 I suggest shapor for code review on that 2008-09-02 12:17 ok 2008-09-02 12:18 ACTION runs and hides 2008-09-02 12:19 not quick enough 2008-09-02 12:25 tux3 is the... 6th google result for tux3 2008-09-02 12:26 I get first 10 2008-09-02 12:27 bbl 2008-09-02 12:28 'tux 3' 2008-09-02 12:33 interesting 2008-09-02 12:33 http://pastie.caboo.se/264727 <-- building tux3 on my ppc machine 2008-09-02 12:42 comes from trace.h 2008-09-02 12:43 flips: ping 2008-09-02 13:58 hey 2008-09-02 14:32 konrad, pong 2008-09-02 14:32 why is there an asm("int3") in trace.h? 2008-09-02 14:37 it generates a trap into gcc on assert failure 2008-09-02 14:37 really useful 2008-09-02 14:37 sorry 2008-09-02 14:37 doesn't work on non-x86 2008-09-02 14:37 :( 2008-09-02 14:37 trap into gdb 2008-09-02 14:37 just comment it out 2008-09-02 14:38 did 2008-09-02 14:38 and hunt around for something that does work 2008-09-02 14:38 it's really useful 2008-09-02 14:38 you can put "b break" into your gdb .rc 2008-09-02 14:38 and void break(void) { } 2008-09-02 14:38 called from assert 2008-09-02 14:43 konrad, what non-x86 do you run on? 2008-09-02 14:43 ppc 2008-09-02 14:43 mac? 
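A portable stand-in for the asm("int3") trap might look like the sketch below. Everything here is hypothetical naming rather than what trace.h contains: soft_assert, assert_break and TRAP_ON_FAIL are made up for illustration. Note that a function literally called break, as typed in the chat, cannot compile because break is a C keyword, so the hook needs another name. raise(SIGTRAP) comes from POSIX signal.h and is assumed to be available on the ppc machine.

```c
#include <assert.h>
#include <signal.h>
#include <stdio.h>

static unsigned failed_checks;

/* Empty hook to hang a debugger breakpoint on, e.g. "b assert_break"
 * in a gdb init file (hypothetical name). */
void assert_break(void)
{
}

#ifdef TRAP_ON_FAIL
/* Stop in the debugger: int3 on x86, SIGTRAP elsewhere (POSIX). */
static void debug_trap(void)
{
#if defined(__i386__) || defined(__x86_64__)
	__asm__ volatile("int3");
#else
	raise(SIGTRAP);
#endif
}
#endif

static void report_failure(const char *expr, const char *file, int line)
{
	fprintf(stderr, "%s:%d: failed assertion: %s\n", file, line, expr);
	failed_checks++;
	assert_break();
#ifdef TRAP_ON_FAIL
	debug_trap();
#endif
}

/* Evaluates to 1 if expr holds, otherwise reports and evaluates to 0. */
#define soft_assert(expr) \
	((expr) ? 1 : (report_failure(#expr, __FILE__, __LINE__), 0))
```

Built without -DTRAP_ON_FAIL the macro only reports and calls the hook, so it behaves the same on x86 and ppc; with the flag defined it also stops in the debugger, and outside a debugger the trap will normally terminate the process.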
2008-09-02 14:44 ibook 2008-09-02 14:44 cool 2008-09-02 14:44 perfect for checking endian issues 2008-09-02 14:44 and wordsize 2008-09-02 14:44 yep 2008-09-02 14:44 all of ileaf and dleaf have to be converted for endian at some point 2008-09-02 14:44 not right away 2008-09-02 14:53 ACTION is back from Burning Man 2008-09-02 14:53 I feel great 2008-09-02 15:00 me too 2008-09-02 15:01 by the way, what is it that makes you feel great? (only the legal part please) 2008-09-02 15:03 I love this uniden phone system 2008-09-02 15:03 got the 8 series corded base station about 4 years ago 2008-09-02 15:03 its still the best home phone system on the planet 2008-09-02 15:04 just got two new handsets for it, the upgraded 905 series work fine 2008-09-02 15:04 and they're better than the original handsets 2008-09-02 15:04 almost like cell phones 2008-09-02 15:08 flips: I don't do drugs as a rule 2008-09-02 15:08 never really did 2008-09-02 15:08 hard to explain, it's just the overall intensity of the experience 2008-09-02 15:08 like a rage? 2008-09-02 15:09 having such community orientied people really disarms the typical resistence you'd might have dealing with people in a city 2008-09-02 15:09 ah, people not being aholes 2008-09-02 15:09 I get it 2008-09-02 15:09 that's a medium for other things, art, partying, etc... 2008-09-02 15:09 even aholes pretending not to be 2008-09-02 15:09 you'd like it 2008-09-02 15:09 I know I would 2008-09-02 15:09 it's like everthing wrong with US society reversed. 2008-09-02 15:09 kids not compatible I'd think 2008-09-02 15:10 no, folks bring their kids 2008-09-02 15:10 ah 2008-09-02 15:10 then next year for sure 2008-09-02 15:10 it's not a big deal, just avoid certain camps and you're set 2008-09-02 15:10 certain camps where... what? is happening 2008-09-02 15:10 they aren't exhibiting that stuff openly anyways, so it's no big deal 2008-09-02 15:10 death yoga? 
2008-09-02 15:10 porn & eggs 2008-09-02 15:10 spike's 2008-09-02 15:10 stuff like that 2008-09-02 15:10 ic 2008-09-02 15:11 right 2008-09-02 15:12 not any worse than a goth festival I'd think 2008-09-02 15:13 you'd like that 2008-09-02 15:13 german version 2008-09-02 15:13 not really 2008-09-02 15:13 I guarantee it 2008-09-02 15:13 for one thing, there's a high concentration of ubergeeks 2008-09-02 15:14 yeah, your infrastructure engineering group is out there for sure, Tim Hockin 2008-09-02 15:14 the death guild camp is full f nerds as well 2008-09-02 15:14 larry & sergey even 2008-09-02 15:17 handset #4 now online, my home pbx is good for another 2 years 2008-09-02 15:18 going to celebrate with some french roast 2008-09-02 15:20 how's tux3 going ? 2008-09-02 15:20 any of my suggestions been thought about futher ? 2008-09-02 15:20 further ? 2008-09-02 15:20 oh yes 2008-09-02 15:21 I'm getting ready to set up a nice environment for you to develop the locking ;-) 2008-09-02 15:21 you'll see growth of the project with more folks joining when you get more stuff working 2008-09-02 15:21 oh shit 2008-09-02 15:21 that's true 2008-09-02 15:21 it's already happening 2008-09-02 15:21 good 2008-09-02 15:21 major stuff now works, see tux3.c 2008-09-02 15:21 yeah, because I don't have faith in Linux file systems after seeing a bunch of NetApp code 2008-09-02 15:21 can create and read/write a tux3 volume from shell commands now 2008-09-02 15:22 nice 2008-09-02 15:22 really did make a 64 petabyte file in an 8k volume image 2008-09-02 15:22 that's with 4K spare for the boot loader 2008-09-02 15:23 decided to make the tux3 superblock 1K just to have that work out ;-) 2008-09-02 15:23 that leaves 12 256 byte blocks for the filesystem structure, root directory, bitmaps, inode table 2008-09-02 15:23 is this all you're doing at Google right now ? 
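The block count quoted above follows from simple arithmetic: 8K of volume, minus 4K reserved for the boot loader, minus the 1K superblock, leaves 3K, which is twelve 256-byte blocks. A tiny sketch of that calculation, with names made up for illustration:

```c
#include <assert.h>

/* Figures quoted in the chat: 8K image, 4K boot area,
 * 1K superblock, 256-byte blocks. */
enum { VOLUME_BYTES = 8192, BOOT_BYTES = 4096, SUPER_BYTES = 1024, BLOCK_BYTES = 256 };

/* Number of whole blocks left for filesystem structures. */
unsigned fs_blocks(unsigned volume, unsigned boot, unsigned super, unsigned blocksize)
{
	assert(blocksize && volume >= boot + super);
	return (volume - boot - super) / blocksize;
}
```

Calling fs_blocks(VOLUME_BYTES, BOOT_BYTES, SUPER_BYTES, BLOCK_BYTES) reproduces the twelve blocks that have to hold the root directory, bitmaps and inode table.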
2008-09-02 15:23 you could say that 2008-09-02 15:24 but its actually part time 2008-09-02 15:24 you should see me when I work ;-) 2008-09-02 15:29 "I don't like the flashing red light in the upper left hand corner of each handset. This is a charge indicator that lets you know the phone is charged and ready to go. There is nothing wrong with letting consumers know this, but to have a light that continuously flashes can be a tremendous distraction." -- amazon idiot who doesn't know he owns a digital answering machine 2008-09-02 15:36 maybe I will pthread tux3 before doing delete 2008-09-02 15:36 just for bh 2008-09-02 15:40 flips: how fine-grained are you planning on going with locking? 2008-09-02 15:40 very 2008-09-02 15:40 ask bh ;-) 2008-09-02 15:40 leaf? 2008-09-02 15:40 yes 2008-09-02 15:40 hrm will you do that in the generic btree code? 2008-09-02 15:40 yes 2008-09-02 15:40 with the help of pthreads 2008-09-02 15:40 and futexes 2008-09-02 15:41 where are you planning on storing the locks? 2008-09-02 15:41 bh is going to have fun with it ;-) 2008-09-02 15:41 in the buffer heads 2008-09-02 15:41 or in a hash 2008-09-02 15:41 it's in flux 2008-09-02 15:41 either would work in kernel 2008-09-02 15:41 so i'm guessing locks in the intermediate nodes as well? 
2008-09-02 15:42 for merge/split 2008-09-02 15:42 yes 2008-09-02 15:42 careful about deadlocks there 2008-09-02 15:42 all down the chain 2008-09-02 15:42 always 2008-09-02 15:42 anybody who thinks abba is a swedish pop group is not touching the locking code 2008-09-02 15:42 lol 2008-09-02 15:43 bh knows that stuff I'm pretty sure 2008-09-02 15:43 didn't ask, but what he talks about is beyond that 2008-09-02 15:45 hrm how about transactional stuff 2008-09-02 15:46 like where you have to create an inode, then reference from a directory 2008-09-02 15:46 which involves more than one tree 2008-09-02 15:47 I'll write it up in a few days 2008-09-02 15:47 it's pretty much all there in the hammer thread 2008-09-02 15:47 we track every time a buffer gets dirty 2008-09-02 15:47 i still haven't had the time to digest that whole brain dump 2008-09-02 15:48 then either add it to the current transaction phase or cow the buffer 2008-09-02 15:48 it's basically the phase part of phase tree, the part that netapp never tried to own 2008-09-02 15:50 cowing the buffer is a simple matter of setting its index to some other physical block 2008-09-02 15:50 or in the case of file blocks, changing the pointer in its parent 2008-09-02 15:50 index block 2008-09-02 15:50 which is done only in cache 2008-09-02 15:50 not on disk 2008-09-02 15:51 so you have one view of the vs on disk, and another, current one that the vfs sees, in memory 2008-09-02 15:51 of the fs I mean 2008-09-02 15:51 when you get the aha on that it's going to be fun 2008-09-02 15:52 I think I'll use the term "fork" instead of cow 2008-09-02 15:53 it's much more descriptive of what happens 2008-09-02 15:53 so tux3's transaction model is to fork any buffer written to after a phase is closed 2008-09-02 15:53 if the phase is still open, just write to it normally 2008-09-02 15:54 unspeakably efficient 2008-09-02 15:54 tux3 has exactly two ways of getting info onto media 1) write to a buffer 2) save the superblock 2008-09-02 
15:54 there will eventually be 3) directio 2008-09-02 15:55 which will require more fiddling 2008-09-02 15:56 I wonder if it would be worth the very minor regularity improvement to hold the superblock in a buffer 2008-09-02 15:56 well 2008-09-02 15:56 kind of dumb 2008-09-02 15:56 you don't know the block size for the superblock 2008-09-02 15:56 or 2008-09-02 15:56 more accurately, the blocksize of the superblock may not match the buffer cache blocksize 2008-09-02 15:57 or the filesystem blocksize 2008-09-02 15:57 both making it unnatural to force the sb into a buffer 2008-09-02 15:58 I think we may be studly and do the initial sb load and later saves directly via the bio interface 2008-09-02 15:58 which means we need to handle completion, get the interrupt back into foreground 2008-09-02 15:58 interrupt completion that is 2008-09-02 15:59 which we need to do anyway if we want to avoid the decrepit old block io library 2008-09-02 16:06 http://interviews.slashdot.org/comments.pl?sid=950917&cid=24845533 2008-09-02 16:20 all the remaining conditional exprs in ileaf.c involve leaf->count, there has to be a way to make a macro 2008-09-02 16:20 macroizing those will be a big help in easing the pain of endian conversion 2008-09-02 16:25 ACTION picks up konrad's cute negative for loop for dleaf_trunc 2008-09-02 16:25 I think I grabbed it from somewhere in ileaf.c 2008-09-02 16:25 really? 2008-09-02 16:25 or maybe that was my imagination 2008-09-02 16:25 yeah 2008-09-02 16:25 looks original 2008-09-02 16:26 I had something remotely like it 2008-09-02 16:26 but yours is actually readable 2008-09-02 16:26 ileaf->dump 2008-09-02 16:26 er 2008-09-02 16:26 ileaf_dump 2008-09-02 16:26 same thing 2008-09-02 16:26 roughly 2008-09-02 16:26 oh heh 2008-09-02 16:26 I forget some of the stuff I write :-) 2008-09-02 16:26 :D 2008-09-02 16:27 yours is better 2008-09-02 16:27 it's how I should have written it 2008-09-02 16:27 I'll change ileaf_dump to match, or do you want to do that? 
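The "fork" rule from the transaction discussion a little earlier (write normally while the phase is open, fork a buffer that was dirtied in a phase that has since closed) can be sketched as follows. This is a toy model under stated assumptions, not tux3 code: struct buffer, struct phases, the bump allocator in next_free and the handling of clean buffers (treated as writable in place) are all invented for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum { BLOCK_SIZE = 256 };	/* hypothetical block size */

struct buffer {
	unsigned long block;	/* physical block this copy will be written to */
	unsigned phase;		/* phase that dirtied it, 0 means clean */
	unsigned char data[BLOCK_SIZE];
};

struct phases {
	unsigned open;			/* number of the currently open phase */
	unsigned long next_free;	/* stand-in for a real block allocator */
};

/* Return the buffer a writer should modify in the open phase.
 * A buffer dirtied in an earlier, closed phase is left untouched so
 * that phase can still reach media; the writer gets a forked copy
 * redirected to a different physical block. Returns NULL on
 * allocation failure. */
struct buffer *buffer_for_write(struct phases *ph, struct buffer *buf)
{
	assert(ph && buf && ph->open);
	if (buf->phase == 0 || buf->phase == ph->open) {
		buf->phase = ph->open;	/* join the open phase, write in place */
		return buf;
	}
	struct buffer *fork = malloc(sizeof *fork);
	if (!fork)
		return NULL;
	memcpy(fork->data, buf->data, BLOCK_SIZE);
	fork->block = ph->next_free++;	/* "set its index to some other physical block" */
	fork->phase = ph->open;
	return fork;
}

/* Close the current phase and open the next one. */
void close_phase(struct phases *ph)
{
	ph->open++;
}
```

In this model the closed phase's copy and the forked copy coexist in cache, which matches the chat's picture of one view of the filesystem on disk and a newer one in memory; updating the parent pointer for a forked file block is left out.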
2008-09-02 16:27 go ahead 2008-09-02 16:27 I'm attempting to wrap my head around dleaf 2008-09-02 16:27 good 2008-09-02 16:27 don't bother with ileaf :-) 2008-09-02 16:27 dleaf is pure braindamage, ask shapor 2008-09-02 16:28 of the good kind 2008-09-02 16:28 it will make your head hurt 2008-09-02 16:28 heh 2008-09-02 16:32 u16 *gdict = (void *)leaf + btree->sb->blocksize; 2008-09-02 16:32 u16 *edict = (void *)(gdict - leaf->groups); 2008-09-02 16:32 more regular form 2008-09-02 16:32 plus a cute varname 2008-09-02 16:34 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-02 16:35 hey tim_dimm 2008-09-02 16:35 hey flips 2008-09-02 16:35 you can shell in any time ;-) 2008-09-02 16:35 no time to be stealthy, huh 2008-09-02 16:35 that's stealthy 2008-09-02 16:35 yup 2008-09-02 16:36 flips: shouldn't those be u32? 2008-09-02 16:36 dcc still doesn't work 2008-09-02 16:36 nat issue 2008-09-02 16:36 konrad, which? 2008-09-02 16:36 [16:32:04] u16 *gdict = (void *)leaf + btree->sb->blocksize; 2008-09-02 16:36 [16:32:04] u16 *edict = (void *)(gdict - leaf->groups); 2008-09-02 16:36 um 2008-09-02 16:36 oh yes 2008-09-02 16:37 shows I cut n pasted 2008-09-02 16:37 without engaging brain 2008-09-02 16:37 heh 2008-09-02 16:37 I'm gonna borrow those 2008-09-02 16:37 they are actually struct something * 2008-09-02 16:37 well yeah 2008-09-02 16:37 but u32 is the same size as said struct 2008-09-02 16:38 struct entry and struct group I think 2008-09-02 16:38 right, a dangerous coincidence 2008-09-02 16:38 but safe in this case 2008-09-02 16:39 the cutest thing about dleaf is the way the entry offset is incremented in the lookup loop 2008-09-02 16:39 that is where the brain hurt gets serious 2008-09-02 16:40 the for loops using struct pointers are gratuitous 2008-09-02 16:40 flips: remember to think about the allocator so that related bits of metadata are located closely to each other, this is very important for online disk checking 2008-09-02 16:40 
it's more clear using array indices 2008-09-02 16:40 and the compiler can optimize to the same thing in theory 2008-09-02 16:40 practice is different of course ;-) 2008-09-02 16:41 the ext3 paper, OLS 2007 ?, might be of interest here, they made a modification to ext3 so that fsck would run much faster 2008-09-02 16:41 bh, you haven't been reading the recent posts ;-) 2008-09-02 16:41 talked about that very thing 2008-09-02 16:41 gcc -O9999999 linux.c 2008-09-02 16:41 did you get to read the paper btw ? 2008-09-02 16:41 :-) 2008-09-02 16:41 bh, I'm one of the stars in it ;-) 2008-09-02 16:41 yeah, I'm overloaded with -rt work right now, first day back 2008-09-02 16:41 downloaded it last week 2008-09-02 16:41 folks are hitting me up for stuff already 2008-09-02 16:42 oh really ? 2008-09-02 16:42 url ? 2008-09-02 16:42 yep 2008-09-02 16:42 um 2008-09-02 16:42 http://ext2.sourceforge.net/2005-ols/2005-ext3-paper.pdf 2008-09-02 16:43 getting close to sk8 oclock 2008-09-02 16:44 ACTION does another piece of 75% cacao chocolate 2008-09-02 16:44 tim_dimm might have a skate left in him 2008-09-02 16:45 went through 30 wheels in 4 days at Maryhill 2008-09-02 16:45 I'll be on the strand by 5:30 2008-09-02 16:45 about 2008-09-02 16:45 wow 2008-09-02 16:45 http://www.silverfishlongboarding.com/option,com_gallery2/Itemid,53/?g2_itemId=237609/ 2008-09-02 16:45 "vintage plantation" <- I highly recommend this chocolate 2008-09-02 16:46 i'm the one *not* in the rubber suit 2008-09-02 16:46 flips: there might be a newer paper on the matter from IBM 2008-09-02 16:46 beautiful 2008-09-02 16:46 bh, link? 
2008-09-02 16:46 OLS 2007 or something like that 2008-09-02 16:46 I'll need a better hint 2008-09-02 16:47 tim, you're the one who looks cool 2008-09-02 16:47 except you need a mirrored helment 2008-09-02 16:47 helmet 2008-09-02 16:48 if you kept your elbows in I bet you woulda won 2008-09-02 16:48 and spray some pam on that jacket 2008-09-02 16:49 ACTION isn't very keen on online disk fragmention either in ext4 2008-09-02 16:49 seems kind of like bottom scrapping to me 2008-09-02 16:50 I was trying to grab some air at that point. Those guys just passed me, and I knew they were about to slam on the brakes. 2008-09-02 16:51 variable metadata is useful for homogenous file types like media files, hmmm, interesting 2008-09-02 17:05 bh, you really need to read my musings 2008-09-02 17:05 let me see if I can find a subject line 2008-09-02 17:06 scott on the right? 2008-09-02 17:06 in blue, yes 2008-09-02 17:06 how'd I guess ;) 2008-09-02 17:06 f'n magic 2008-09-02 17:06 shinyness 2008-09-02 17:07 serious about the pam 2008-09-02 17:07 slickness 2008-09-02 17:07 on the list ? 2008-09-02 17:07 should be 2008-09-02 17:07 yes 2008-09-02 17:07 I look at it a bit, but I didn't see very much 2008-09-02 17:07 or lkml ? 2008-09-02 17:07 the list 2008-09-02 17:08 tux3 ? 2008-09-02 17:08 "Spacial correlation between directory entries, inodes and file data" 2008-09-02 17:08 you have to read between the lines 2008-09-02 17:08 all I see is stuff about patches 2008-09-02 17:08 I have a followup post in the works 2008-09-02 17:08 but there is stuff ahead of it 2008-09-02 17:08 in the queue 2008-09-02 17:08 flips: how does the magic zero entry worth with the dleaf dicts? 2008-09-02 17:08 or is it present? 
2008-09-02 17:09 konrad, same way 2008-09-02 17:09 0th entry is implied 2008-09-02 17:09 dict should be positioned one past the top of the list 2008-09-02 17:09 flips: you should make online disk checking the default mechanism for your file system, create a common fsck library to be shared between the online checker and offline 2008-09-02 17:09 that is violated in dleaf.c sometimes for no good reason 2008-09-02 17:09 just because we were figuring out how to do it at the time 2008-09-02 17:09 offline checking would be used only in a dev situation 2008-09-02 17:09 bh, planned 2008-09-02 17:09 indeed 2008-09-02 17:09 good 2008-09-02 17:10 need to write a tech note 2008-09-02 17:10 ah ok 2008-09-02 17:10 the tux3 userspace implementation is in fact the base of the online tools 2008-09-02 17:10 because until we get reverse pointers and supporting stuff for file systems that's the only thing that's going to work 2008-09-02 17:10 including defrag 2008-09-02 17:10 online and offline 2008-09-02 17:10 volumes are getting so large that .... you know... 2008-09-02 17:11 reverse pointers is planned, tech note needed 2008-09-02 17:11 I've mentioned some details from time to time 2008-09-02 17:11 I know 2008-09-02 17:11 it's already broken 2008-09-02 17:11 broke years ago 2008-09-02 17:11 tux3 is going to be allocation groups as well 2008-09-02 17:11 and maybe... not sure about it... 
relative pointers 2008-09-02 17:12 maybe that is tux3.1 2008-09-02 17:12 don't know, it's too experimental 2008-09-02 17:12 right 2008-09-02 17:12 scary 2008-09-02 17:12 get the basics as much as you can first, format changes are another matter 2008-09-02 17:12 http://pastie.caboo.se/264894 2008-09-02 17:12 like that 2008-09-02 17:12 my dumper "from scratch" if you will 2008-09-02 17:12 that's the plan 2008-09-02 17:12 so I think I'm doing something right 2008-09-02 17:13 konrad, kool 2008-09-02 17:13 oh yes 2008-09-02 17:14 if I go out for a skate, your new dumper will be finished when I get back and I can use it 2008-09-02 17:14 hm? 2008-09-02 17:14 it's sort of redundant to the existing dleaf_dump 2008-09-02 17:14 I just wanted to be sure I understand how to loop through the groups 2008-09-02 17:14 er, entries 2008-09-02 17:14 and groups 2008-09-02 17:15 yours is going to be better, I like to backport like that 2008-09-02 17:15 it's called evolution 2008-09-02 17:17 should I make the output look like the old one? 2008-09-02 17:17 good place to start 2008-09-02 17:17 k time to get rolling 2008-09-02 17:19 what's the purpose of (struct entry*)foo->limit ? 2008-09-02 17:19 flips: tying up some loose ends. I'll be out by 6 2008-09-02 17:21 ok, I'll slow down a little 2008-09-02 17:21 see you at the skate park? 2008-09-02 17:21 sure 2008-09-02 17:22 I'll do slaloms at the pier for a while ;-)_ 2008-09-02 17:22 more fun than slowing down 2008-09-02 17:28 flips: stuff posted today on lkml ? 
2008-09-02 17:28 bh, not today 2008-09-02 17:28 soon 2008-09-02 17:28 oh ok, so you haven't posted this yet then 2008-09-02 17:29 mainly just on the current state of the disk format 2008-09-02 17:29 ok 2008-09-02 17:29 "Spacial correlation between directory entries, inodes and file data" 2008-09-02 17:29 (read between the lines) 2008-09-02 17:29 it's working out well as far as it goes 2008-09-02 17:29 there's a lot more detail coming on that 2008-09-02 17:30 read the hint about generating functions 2008-09-02 17:30 spatial 2008-09-02 17:30 I've blabbed about that to you personally, but I don't know if it registered yet 2008-09-02 17:30 right 2008-09-02 17:30 spacial is my new word ;-) 2008-09-02 17:30 I googled for that and got nothing useful 2008-09-02 17:30 it's on the tux3 list 2008-09-02 17:31 I totally don't see it 2008-09-02 17:31 it's just patch discussion that I'm seeing 2008-09-02 17:31 you're right 2008-09-02 17:31 google is damaged or mailman 2008-09-02 17:33 http://tux3.org/pipermail/tux3/2008-August/000083.html 2008-09-02 17:33 google is braindamaged 2008-09-02 17:33 :-p 2008-09-02 17:33 later...
2008-09-02 17:44 ACTION reading 2008-09-02 17:53 ok, dumper2 works 2008-09-02 18:45 -!- caoliver(~oliver@75-134-208-20.dhcp.trcy.mi.charter.com) has joined #tux3 2008-09-02 18:45 -!- caoliver(~oliver@75-134-208-20.dhcp.trcy.mi.charter.com) has left #tux3 2008-09-02 19:35 -!- caoliver(~oliver@75-134-208-20.dhcp.trcy.mi.charter.com) has joined #tux3 2008-09-02 19:36 -!- caoliver(~oliver@75-134-208-20.dhcp.trcy.mi.charter.com) has left #tux3 2008-09-02 19:50 konrad, kool 2008-09-02 19:50 I'm not sure it's any more clear than the original 2008-09-02 19:51 it's 3 lines shorter, but that doesn't mean clearer 2008-09-02 20:01 it would be hard to be less clear than the original 2008-09-02 20:03 eh, it's not an easy process 2008-09-02 20:04 -easy +simple 2008-09-02 20:07 it's going to get even less easy when we add in versioned extents 2008-09-02 20:07 to the same code 2008-09-02 20:07 so it has to be clean 2008-09-02 20:12 mhm 2008-09-02 20:12 well, mine uses less pointer arithmetic and more array notation 2008-09-02 20:13 *(a + b) vs a[b] 2008-09-02 20:15 I think that's better 2008-09-02 20:15 for the dumper 2008-09-02 20:15 easier to read 2008-09-02 20:15 can save the pointer tricks for something that matters 2008-09-02 20:15 assuming the compiler can't optimize that well 2008-09-02 20:16 which is not a safe assumption 2008-09-02 20:18 flips: http://pastie.caboo.se/264975 there's a first poke at it 2008-09-02 20:19 not sure I like the ent == -1 logic 2008-09-02 20:20 otherwise looks pretty good 2008-09-02 20:20 offset -= doesn't look right 2008-09-02 20:20 should be += 2008-09-02 20:20 wow, dleaf_isinorder looks nice 2008-09-02 20:21 ent == -1 is the same as the check against the magical zero 2008-09-02 20:21 right 2008-09-02 20:21 ent < -1 ? 2008-09-02 20:21 something seems off by one 2008-09-02 20:21 hm? 2008-09-02 20:22 as in, for every entry but the first entry, check against the previous entry 2008-09-02 20:22 no? 2008-09-02 20:22 how can ent be < -1 ?
2008-09-02 20:22 ent grows smaller? 2008-09-02 20:22 -1, -2, -3 2008-09-02 20:22 oh right :-) 2008-09-02 20:22 (-3 < -1) 2008-09-02 20:22 upside down 2008-09-02 20:22 => true 2008-09-02 20:22 yeah :) 2008-09-02 20:22 braindamage 2008-09-02 20:22 sorry 2008-09-02 20:22 same with offset -= 2008-09-02 20:22 offset is negative 2008-09-02 20:22 grows smaller 2008-09-02 20:22 should be ent < 0 though 2008-09-02 20:22 let's see what the skew is 2008-09-02 20:23 hmm 2008-09-02 20:23 maybe it's my brain 2008-09-02 20:23 ah 2008-09-02 20:23 I see 2008-09-02 20:23 I'm assuming the -1th element is greater than zero 2008-09-02 20:23 which isn't good 2008-09-02 20:23 true 2008-09-02 20:23 I think you want a structure where you assign a variable once outside the loop 2008-09-02 20:24 like you had in ileaf 2008-09-02 20:24 alright 2008-09-02 20:24 it's not bad 2008-09-02 20:24 but given how important it is, it should be crystalline 2008-09-02 20:24 yes 2008-09-02 20:25 ok, I can read it now 2008-09-02 20:25 you're right 2008-09-02 20:25 you need to get the correctness of the first value 2008-09-02 20:25 then induce correctness from there 2008-09-02 20:26 and all the little bits have to join together 2008-09-02 20:26 so you need to init your first value outside both loops 2008-09-02 20:26 you can safely init it to zeor 2008-09-02 20:27 zero 2008-09-02 20:27 since keys are unsigned 2008-09-02 20:27 flips: http://pastie.caboo.se/264977 2008-09-02 20:27 oh, keys are unsigned?
2008-09-02 20:28 not crystalline yet ;-) 2008-09-02 20:28 keys are 2008-09-02 20:28 u64 2008-09-02 20:28 see tuxkey_t 2008-09-02 20:28 wait 2008-09-02 20:28 but I'm testing limits 2008-09-02 20:28 not keys 2008-09-02 20:28 they're keys 2008-09-02 20:28 limits are u8 2008-09-02 20:28 that's what dleaf is 2008-09-02 20:28 a key dict 2008-09-02 20:28 right 2008-09-02 20:29 you have to expand those u8s to keys 2008-09-02 20:29 that's the clever thing here 2008-09-02 20:29 ah, I'm just doing what I did in ileaf_isinorder 2008-09-02 20:29 making sure the limits are non-descending 2008-09-02 20:29 when you get the aha it's going to be a big one ;-) 2008-09-02 20:29 which is still important 2008-09-02 20:29 we have two 48 bit fields we combine to make a key 2008-09-02 20:29 two 24 bit fields 2008-09-02 20:29 and the 8 bit fields are just indexes to allow us to do that 2008-09-02 20:29 to make a 48 bit key 2008-09-02 20:29 right 2008-09-02 20:30 sorry 2008-09-02 20:30 so you need to assemble the 48 bit key at each step and compare to prevkey 2008-09-02 20:30 ah, and it should be greater always? 2008-09-02 20:30 yes 2008-09-02 20:30 non-descending or ascending? 2008-09-02 20:30 the 8 bit fields within groups also ascend 2008-09-02 20:30 can two keys be the same? 2008-09-02 20:30 which is what your code checks 2008-09-02 20:30 right 2008-09-02 20:30 which is also good 2008-09-02 20:31 alright 2008-09-02 20:31 two keys can be the same 2008-09-02 20:31 that's going to be critically important 2008-09-02 20:31 nondescending 2008-09-02 20:32 assembling the 48 bit key is pretty easy 2008-09-02 20:32 it's just 24 bits from the entry and the other 24 bits from the group that owns it 2008-09-02 20:32 computing the offset is a little trickier 2008-09-02 20:32 offset into data 2008-09-02 20:35 http://pastie.caboo.se/264980 checking offsets within groups and non-descending keys now 2008-09-02 20:36 it triggers 3 times running ./dleaf 2008-09-02 20:36 triggers?
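[Editor's sketch, not part of the log.] The key check described above (24 bits from the group, 24 bits from the entry, compared against a prevkey that starts at zero) might look roughly like this. The struct layouts, the field names keyhi and keylo, the helper names, and the flat ascending arrays are all assumptions made for illustration; the real dleaf.c packs these fields differently and its dict grows downward.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layouts for illustration only; not copied from dleaf.c. */
struct group { uint32_t count, keyhi; }; /* entries owned, upper 24 key bits */
struct entry { uint32_t limit, keylo; }; /* upper bound, lower 24 key bits */

/* Combine the group's 24 bits and the entry's 24 bits into a 48 bit key. */
static uint64_t make_key(uint32_t keyhi, uint32_t keylo)
{
	return ((uint64_t)(keyhi & 0xffffff) << 24) | (keylo & 0xffffff);
}

/*
 * Walk every entry of every group in order, assembling the full key at
 * each step. Returns 1 if the keys never descend (equal keys are
 * allowed), 0 otherwise.
 */
static int keys_nondescending(const struct group *groups, size_t ngroups,
			      const struct entry *entries)
{
	assert(ngroups == 0 || (groups && entries));
	uint64_t prevkey = 0; /* safe starting value because keys are unsigned */
	size_t e = 0;

	for (size_t g = 0; g < ngroups; g++) {
		for (uint32_t i = 0; i < groups[g].count; i++, e++) {
			uint64_t key = make_key(groups[g].keyhi, entries[e].keylo);
			if (key < prevkey)
				return 0; /* out of order */
			prevkey = key;
		}
	}
	return 1;
}
```

Initialising prevkey once, outside both loops, is what lets the first real entry be checked by the same comparison as every later one.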
2008-09-02 20:37 dleaf_check returns negative with "dleaf entries out of order!" as the error message 2008-09-02 20:37 probably not because of a bug in dleaf.c 2008-09-02 20:37 or should I say, "possibly" 2008-09-02 20:38 I still don't much like the inits to -1 2008-09-02 20:38 well 2008-09-02 20:38 I think I see 2008-09-02 20:38 you want a do { } while (cond) structure 2008-09-02 20:38 probably 2008-09-02 20:39 so the loop iterates over the final n-1 elements 2008-09-02 20:39 it is never allowed to have zero iterations 2008-09-02 20:39 so return false if you find that before entering the do loop 2008-09-02 20:40 so groups aren't allowed to have zero entries? 2008-09-02 20:40 right 2008-09-02 20:40 I should write the definition 2008-09-02 20:40 and post it 2008-09-02 20:41 about time 2008-09-02 20:41 the comment is a little lame 2008-09-02 20:41 in editing a dleaf, and group that drops to zero has to be deleted immediately 2008-09-02 20:42 s/and/any/ 2008-09-02 20:42 what sort of formatting do you prefer for do/while loops? 2008-09-02 20:42 hmm 2008-09-02 20:43 lindent 2008-09-02 20:43 that's with the first curly on the same line as the do 2008-09-02 20:43 I don't like it, but linus does 2008-09-02 20:43 used to write them like you 2008-09-02 20:43 and the second curly on the same or different line as the while? 2008-09-02 20:44 but in the end there is no way to make c pretty ;-) 2008-09-02 20:44 heh 2008-09-02 20:44 different line 2008-09-02 20:44 ok 2008-09-02 20:47 hm, the implied zero entry, what loglo does it have? zero?
2008-09-02 20:54 um 2008-09-02 20:54 ACTION thinks 2008-09-02 20:54 it's not actually there 2008-09-02 20:54 that's where the aha happens 2008-09-02 20:55 only the nonzero entries are actually there, and they encode the upper bound from the key, rather than the usual offset 2008-09-02 20:55 we start one entry away and have an implied zero because we are picking up a pair of entries at each step 2008-09-02 20:56 the current entry and the one above in a sense 2008-09-02 20:56 or maybe better to think of it as the current entry and the one below 2008-09-02 20:56 hm 2008-09-02 20:56 start at -1, and compare to -2, and so on? 2008-09-02 20:56 where you can always directly look at the limit 2008-09-02 20:56 but have to use a clever trick to look at the offset 2008-09-02 20:57 hmm 2008-09-02 20:57 yes 2008-09-02 20:57 well 2008-09-02 20:58 first set offset to zero 2008-09-02 20:58 mhm 2008-09-02 20:58 then enter the loop at i = 0, and look up dict[i - 1]->limit 2008-09-02 20:59 it's a matter of taste 2008-09-02 20:59 and mine was not good when I wrote the original ;-) 2008-09-02 20:59 the loop should always execute exactly n iterations 2008-09-02 20:59 and it should start from zero, but never access dict[0] 2008-09-02 21:00 I think 2008-09-02 21:00 even if it fails early?
2008-09-02 21:00 fail means bail 2008-09-02 21:00 zero tolerance of errors 2008-09-02 21:00 so no, except when it fals 2008-09-02 21:00 fails 2008-09-02 21:01 so what I meant was, the loop should not execute n-1 times 2008-09-02 21:01 but n times 2008-09-02 21:01 yeah 2008-09-02 21:01 and let the actual index i not be used in the loop, but i - 1 instead 2008-09-02 21:01 that's like documentation 2008-09-02 21:01 because you can arrange the loop to be able to use i directly, but that makes it harder to understand 2008-09-02 21:01 the optimizer can easily do that on its own 2008-09-02 21:02 anyway, I'm talking about what I _should_ have thought about when I wrote the original 2008-09-02 21:02 was in kind of a hurry to get something running 2008-09-02 21:03 wow, genuine uniden replacement batteries cost almost as much as a new handset
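[Editor's sketch, not part of the log.] The loop shape discussed above (offset set to zero before the loop, exactly n iterations, dict positioned one past the top so dict[0] is never read, bail on the first failure) could be written along these lines. The one-field struct entry, the function name check_limits, and the indexing dict[-1 - i] are assumptions chosen to match the "-1, -2, -3" description; this is one reading of the conversation, not the code from the pastie links.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry holding only the upper bound; illustration only. */
struct entry { uint8_t limit; };

/*
 * dict points one past the top of the entry list, so entry i lives at
 * dict[-1 - i] and dict[0] itself is never read. Each entry stores only
 * its upper limit; its starting offset is the previous entry's limit,
 * with an implied zero before the first entry.
 *
 * Returns the final limit, or -1 if the group is empty or the limits
 * ever descend.
 */
static int check_limits(const struct entry *dict, int count)
{
	assert(dict != NULL);
	if (count <= 0)
		return -1; /* a group with zero entries is not allowed */

	int offset = 0; /* the implied zero entry, set once outside the loop */
	for (int i = 0; i < count; i++) {
		int limit = dict[-1 - i].limit;
		if (limit < offset)
			return -1; /* fail means bail */
		/* entry i covers the range [offset, limit) */
		offset = limit;
	}
	return offset;
}
```

Because the previous limit is carried in offset, the loop body reads only one stored entry per step and still sees the pair (offset, limit) that the discussion refers to.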