From: Christophe.Wolfhugel@grasp.insa-lyon.fr (Christophe Wolfhugel)
Newsgroups: news.lists
Subject: Usenet article size statistics -- October 92
Summary: statistical study of usenet articles
Keywords: article, statistics, size
Date: 7 Nov 92 16:00:38 GMT
Followup-To: news.admin.misc
Organization: INSA Informatique (Grasp), Lyon, France
NNTP-Posting-Host: grasp1.univ-lyon1.fr

The following statistics were computed from all 131959 articles that
passed through my node during October 1992. UUNET also publishes a
bi-weekly article size distribution study in news.lists, but not yet
with per-group details.

The sampled feed is far from a full feed. The only binary groups
received are in vmsnet: no pictures and no other binaries. The sampled
feed contains news, comp (!sources,!binaries), misc.jobs, bionet and
vmsnet, plus some other groups and some regional hierarchies (about
300 groups in total).

Number of articles received:   131959 articles
Smallest article received:        256 bytes
Largest article received:      343787 bytes
Average article size:            2134 bytes
Standard deviation:              4665 bytes

Article size          #       %       T#      T%
------------------------------------------------
0-499              1731    1.31     1731    1.31
500-999           31797   24.10    33528   25.41
1000-1999         66451   50.36    99979   75.77
2000-4999         26785   20.30   126764   96.07
5000-9999          2897    2.20   129661   98.27
10000-24999        1260    0.95   130921   99.22
25000-49999         683    0.52   131604   99.74
50000+              355    0.26   131959  100.00

T# and T% are the cumulative count and cumulative percentage.

Following is an estimate of the "contents" of all sampled articles.

Number of articles sampled: 131959 articles

Field                 Kb    %hdr  %nohdr
----------------------------------------
Kbytes:        274953.72  100.00  ---.--
Headers:       102851.02   37.41  ---.--
Body:          172102.70   62.59  100.00
Useful:        148267.35   53.92   86.15
Signatures:      9144.75    3.33    5.31
Quoting:        14690.59    5.34    8.54

Total top level domains: 150 (so many badly configured sites)

Sorted by size of articles posted...

     User address     #msg        #kb   %sig  %quote
001. edu   (1)       48339  105294.25    4.3     6.9
002. com   (2)       32794   67603.92    5.8    10.3
003. ca    (3)        6293   12877.18    4.8     8.2
004. uucp  (4)        5681   10301.84    6.0    10.9
005. org   (5)        4556   10280.66    6.2     6.7
006. au    (7)        4744    9438.14    5.3    13.4
007. de    (6)        4131    8343.79    6.4    12.4
008. uk    (8)        4417    7859.74    6.1     9.2
009. gov   (9)        2325    5068.23    8.2     6.7
010. nl    (-)        1795    3865.08    7.8    14.2
[...]

#msg   = number of messages
#kb    = size in kilobytes
%sig   = percentage of (recognized) signatures
%quote = percentage of (recognized) quoting

The number in parentheses, when present, indicates the rank of the
domain last month.

The complete set of data (one file per group and per hierarchy or
sub-hierarchy received here) is available by anonymous ftp, by
listserv, and by gopher (server: gopher.univ-lyon1.fr, port 70).

By anonymous ftp to grasp1.univ-lyon1.fr in /pub/usenet-stats: the
file 'all' contains the global summary, 'news' covers news.*,
'news.answers' covers news.answers, ...

By listserv, send mail to listserv@grasp1.univ-lyon1.fr with commands
in the BODY of the message (not the subject!). Commands can be:

    index usenet-stats         (approx. 30 Kb long message)
    get usenet-stats group

Examples:

    get usenet-stats news
    get usenet-stats news.answers

The following algorithm has been used. It is far from being a good
one, but the results seem acceptable:

- headers are counted up to the first empty line.
- signatures are counted from the line '-- ', so exotic signatures
  are not recognized.
- the default quote character is '>'; if "In article" is encountered
  at the start of a line, the quote character is set to one of
  <>:|%!.+~ on the first line where one of them is encountered.
- top-level domains are determined from the part after the last dot
  in the address. This is an interesting part, as it reveals some
  misconfigured sites; bang path addresses also generate strange
  output.

Some non-standard lines have already been deleted by the statistics
software.

Disk occupation statistics
--------------------------

These data indicate the disk usage of the articles on several systems,
depending on the block size. All articles received in October are
included in this sample.

Sampled volume: 281539087 bytes (268.5 Mb)

Block size   #kb used   #kb lost      %   System name
-------------------------------------------------------
       512     308067      33126   10.8   [ Put here
      1024     339277      64336   19.0     your
      2048     397204     122263   30.8     favorite
      4096     610312     335371   55.0     system name ]

Conclusion: use small block sizes to save disk!

-- 
Christophe Wolfhugel | Email: Christophe.Wolfhugel@grasp.insa-lyon.fr
"No keyboard, press F1 to continue"
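
The counting algorithm described in the post could be sketched in
Python roughly as follows. This is only an illustration, not the
statistics software actually used. The names measure_article and
top_level_domain are hypothetical. The sketch assumes that the headers
end at the first empty line, that sizes are counted in characters
(equal to bytes for plain ASCII articles), and that the "In article"
line itself counts as useful text.

```python
QUOTE_CANDIDATES = "<>:|%!.+~"  # candidate quote characters listed in the post


def measure_article(text):
    """Split one article into header, signature, quoting and useful sizes."""
    # Assumption: the headers end at the first empty line.
    head, sep, body = text.partition("\n\n")
    sizes = {"headers": len(head) + len(sep),
             "signature": 0, "quoting": 0, "useful": 0}
    quote_char = ">"          # default quote character
    pick_quote_char = False   # becomes True once "In article" has been seen
    in_signature = False
    for line in body.splitlines(keepends=True):
        # Everything from the '-- ' line onwards is counted as signature.
        if in_signature or line.rstrip("\r\n") == "-- ":
            in_signature = True
            sizes["signature"] += len(line)
            continue
        if line.startswith("In article"):
            # Assumption: the attribution line itself is useful text.
            pick_quote_char = True
            sizes["useful"] += len(line)
            continue
        # After "In article", the first line starting with a candidate
        # character decides which quote character is used from then on.
        if pick_quote_char and line[0] in QUOTE_CANDIDATES:
            quote_char = line[0]
            pick_quote_char = False
        if line.startswith(quote_char):
            sizes["quoting"] += len(line)
        else:
            sizes["useful"] += len(line)
    return sizes


def top_level_domain(address):
    """Return whatever follows the last dot in the address."""
    return address.rsplit(".", 1)[-1]
```

With this sketch, top_level_domain("user@site.edu") gives "edu", while
a bang path without any dot, such as "uunet!host!user", comes back
unchanged, which is the kind of strange output mentioned in the post.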
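
One plausible way to obtain figures like those in the disk occupation
table is to round every article up to a whole number of blocks. The
sketch below assumes exactly that and ignores inodes, directories and
filesystem fragments, so real systems may behave differently. The
helper name disk_usage is hypothetical and the sizes in the usage note
are illustrative only. The percentage is lost space relative to used
space, which is consistent with the table (33126 / 308067 is about
10.8%).

```python
def disk_usage(sizes, block_size):
    """Return (bytes used, bytes lost, percent lost) for a list of article sizes."""
    # Each article is assumed to occupy ceil(size / block_size) whole blocks.
    used = sum(-(-size // block_size) * block_size for size in sizes)
    lost = used - sum(sizes)
    # Percentage of the allocated space that holds no article data.
    percent = 100.0 * lost / used if used else 0.0
    return used, lost, percent
```

For three illustrative articles of 256, 2134 and 343787 bytes,
disk_usage([256, 2134, 343787], 512) reports 347136 bytes used and 959
bytes lost; calling it again with 4096 shows how the loss grows with
the block size.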