
Just this goy...

Tuesday, January 02, 2007


[Table: sizes in KB of the individual .djvu files on Commons, totaling 133,194 KB.]

Okay, I don't believe you. In fact, I think you're a loony when it comes to this particular subject. And yeah, it's seriously harmed your credibility with me.

I'm talking about someone's flat-out insistence to me that the Wikimedia Foundation's multimedia storage amounts to some 365 GB.

That number seemed way off. I mean, there are well over 1,000,000 media files on Commons alone. Even if Commons accounted for every single media file in the Wikimedia Foundation, each file would average about 358 KB. And that's not counting the fact that MediaWiki stores a resized copy of every image that has ever been resized. Last I'd heard, Wikipedia's own media file collection was a touch larger than the one on Commons.
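Running that arithmetic as a quick sketch (the 1.02 million file count is my assumption for "well over 1,000,000"; it's roughly the count that reproduces the 358 KB figure):

    # Sanity check of the claimed 365 GB against the Commons file count.
    # Assumption (mine): "well over 1,000,000" read as about 1.02 million
    # files, the count that reproduces the 358 KB average quoted above.
    claimed_storage_kb = 365 * 1000 * 1000    # 365 GB expressed in KB
    commons_files = 1_020_000

    avg_kb = claimed_storage_kb / commons_files
    print(f"Implied average file size: {avg_kb:.0f} KB")   # -> 358 KB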

On Commons alone there are more than 133,194 KB of just the .djvu media type. While this file type may seem a bit fat to use as a fair comparison (that number accounts for only 41 files, an average size of 3,248 KB), it's actually quite lean compared with .ogg song files. Reasonable estimates are difficult there, since there are several thousand word-pronunciation files on Commons, but it's not unreasonable to expect the average is around 3 MB. And Google finds nearly 90,000 hits for .ogg [1].
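Putting those per-type figures together (the file counts and averages are the ones quoted above; the .ogg total is my extrapolation from them):

    # Per-type figures quoted above, plus an extrapolated .ogg volume.
    djvu_total_kb = 133_194
    djvu_file_count = 41
    print(f".djvu average: {djvu_total_kb // djvu_file_count:,} KB")  # -> 3,248 KB

    ogg_file_count = 90_000   # "nearly 90,000 hits" from the Google search [1]
    ogg_avg_mb = 3            # assumed 3 MB average, as argued above
    total_ogg_gb = ogg_file_count * ogg_avg_mb / 1000
    print(f".ogg estimate: {total_ogg_gb:,.0f} GB")   # -> 270 GB

If that extrapolation is even roughly right, .ogg files alone would account for the bulk of the claimed 365 GB.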

So, do I think that the average file size, including all the image resizes, video, and music files, comes out to less than 179 KB? That's what 365 GB works out to once Wikipedia's collection is counted alongside Commons, spreading the total over roughly two million files. It's possible, but I don't think so, no. But storage volume aside, what about bandwidth?

A number tossed at me said 50% of bandwidth use is HTML text. Oddly enough, I actually believe that. Relatively speaking, the wikis are extremely light on graphics. They don't have a billion and one tiny images, rollover images, Flash interfaces, etc. Even with gzip compression, this undoubtedly means that text is going to be the largest share of the bandwidth.

But the other number tossed at me was that en.wikipedia images account for 44% of the remaining bandwidth. Again, this is unbelievable bullshit. They may be images in en.wikipedia pages, but that doesn't mean they are held exclusively on that website. That 44% includes all the images uploaded to commons.wikimedia, which is a senseless waste of a wiki.

Think about it this way: with further coding, all media files from any WMF project could be made available to every other project, without requiring a separate bureaucracy to "manage" images for them. It should not be excessively difficult to externalize the binary namespace from the wikis and make all media pages available on all the projects simultaneously.
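As a minimal sketch of that idea (all the names here are hypothetical, not existing MediaWiki code): every project resolves media titles against one shared store, so no host wiki sits in between.

    # Hypothetical sketch of an externalized binary namespace: every
    # project consults the same shared store, so no separate "commons"
    # wiki is needed. None of these names are real MediaWiki APIs.

    class SharedMediaStore:
        """One storage backend shared by every WMF project."""

        def __init__(self):
            self._files = {}   # title -> binary payload

        def put(self, title, data):
            self._files[title] = data

        def get(self, title):
            return self._files.get(title)

    class WikiProject:
        """A wiki that serves [[Image:...]] pages straight from the shared store."""

        def __init__(self, name, store):
            self.name = name
            self.store = store

        def resolve_media(self, title):
            data = self.store.get(title)
            if data is None:
                raise FileNotFoundError(f"{title} not found in shared store")
            return data

    store = SharedMediaStore()
    store.put("Example.djvu", b"...binary payload...")

    # Both projects see the same file, with no inter-wiki bureaucracy between.
    for project in (WikiProject("en.wikipedia", store), WikiProject("en.wikinews", store)):
        assert project.resolve_media("Example.djvu")

Deletion control and policy would then live with each project, rather than with a separate wiki's administrators.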

Instead, the only system available to all the wikis is Commons, which externalizes image-deletion control, as well as the policy-making for every element of binary management. And instead of improving the software and creating greater opportunity for sharing, Commons divisively engages in turf wars and writes software that increases the projects' dependency on it rather than attempting to reduce that dependency.

And it hasn't worked in the past. Historically, Commons has enforced its own rules arbitrarily, abusing its contributors on some occasions and abusing its mission on others. Its administrators, accountable neither to the projects nor to their own community, have repeatedly engaged in hostile arguments with representatives of other projects and with people attempting to join its efforts. Whether it's the culture that has developed there or its unique position of de facto authority over the content of the WMF projects, it is harmful to the development of the projects it is supposed to support.

Since there is no real need for Commons, effort should be put not into improving it further but into making it redundant, simultaneously reducing the actual number of redundant binary files. And a WMF-wide policy should be implemented requiring that binaries actually be used in projects in order to justify their storage.
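A sketch of how such a policy could be enforced (the usage index and every name here are hypothetical): given an index mapping each stored file to the pages that embed it, flag any binary no project uses.

    # Hypothetical "must be in use" check: flag stored binaries that no
    # project embeds. The usage index here is invented for illustration.

    def unused_binaries(stored_titles, usage_index):
        """Yield titles whose usage list is empty or missing."""
        for title in stored_titles:
            if not usage_index.get(title):
                yield title

    stored = ["Example.djvu", "Pronunciation-word.ogg", "Orphan.png"]
    usage = {
        "Example.djvu": ["en.wikipedia:Some_article"],
        "Pronunciation-word.ogg": ["en.wiktionary:word"],
    }   # Orphan.png has no recorded usage anywhere

    print(list(unused_binaries(stored, usage)))   # -> ['Orphan.png']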

Of course, that only applies if there is a desire to reduce the number of redundant binary files, reduce the complexity and hostility involved in contributing media files, eliminate a layer of bureaucracy, and/or simplify rather than complicate. In short, this entire essay was a waste of time.

Update: Having spoken with the person in charge of the Wikimedia Foundation's network systems, I can report that media file transfers account for more than half of bandwidth usage, roughly 60% media vs. 40% text. He also says the estimated storage is about 1 TB, but he doesn't want to risk slowing down the servers to get a precise figure. (Actually, he thinks it's more like 1.5 TB, since it was 1.3 TB a couple of months ago.)

Which pretty much makes my case, in some respects.
