The Rough Idea

I cannot divulge too many details here because of the Hawthorne effect. Suffice it to say that I am investigating the cause-and-effect relationship between motivations and knowledge contributions. The 'cause' (motivations) is measured in the survey, whereas the 'effect' (contributions) is measured in the archival data (the database dump).

Sample Selection

A brief explanation of how I selected the target sample for the survey:

I did a database download in Dec 2006. The data file contained the edit metadata for the most recent edit made to every article in Knowledge. Now it has become useful again, because we are going to pick our sample from this database.

Note: While this is certainly not the best way to get our sample, it is our next best alternative. It is better than going to Knowledge and fishing around for 600 names (in fact, that approach is not even random). The Dec database captures a snapshot of editing activity in early Nov 2006, when the database dump began.

The sampling frame consists of all the editors whose usernames appear in the Dec database. From here, I ran two sampling stages:

1. Keep users who are not 'bot' (robot) accounts.
2. Keep users who are not anonymous IP addresses.

The result of the above two sampling stages is Sampling Unit A. Sampling Unit A has 278,423 users.
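The two filtering stages can be sketched as follows. This is a toy illustration, not the actual queries: the username list is invented, and the suffix-based bot check is a stand-in heuristic (the real database flags bot accounts explicitly).

```python
import ipaddress

def is_anonymous_ip(username):
    """Anonymous editors are recorded under their IP address."""
    try:
        ipaddress.ip_address(username)
        return True
    except ValueError:
        return False

def is_bot(username):
    # Hypothetical heuristic: treat accounts ending in 'bot' as robot accounts.
    return username.lower().endswith("bot")

# Hypothetical stand-in for the sampling frame drawn from the Dec database.
frame = ["AliceEditor", "192.168.1.7", "CleanupBot", "BobWrites", "2001:db8::1"]

# Stages 1 and 2: keep users who are neither bots nor anonymous IPs.
sampling_unit_a = [u for u in frame if not is_bot(u) and not is_anonymous_ip(u)]
print(sampling_unit_a)  # ['AliceEditor', 'BobWrites']
```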
Next, I want to stratify Sampling Unit A into two groups: privileged and non-privileged users. Privileged users are those who have administrative powers; the Dec database records who all* the privileged users in the English Knowledge were at that point in time. Non-privileged users are registered users who are not anonymous IP addresses. This stratification can be used as a control variable later on. I am doing non-probabilistic sampling, so that I can hear the opinions from both sides.

Firstly, let's look at all the non-privileged users in Sampling Unit A. After filtering out all the privileged users from Sampling Unit A, we have 277,350 users. This gives us Sampling Unit B. I then called a pseudorandom function in the database to shuffle Sampling Unit B like a deck of cards and draw the first 300 names. These 300 non-privileged users are passed on to Sampling Unit C.

Secondly, we want to find out how many privileged users exist in Sampling Unit A. The query shows that 1,073 privileged users exist in Sampling Unit A. Again, I called the pseudorandom function to shuffle these 1,073 names and draw the first 300. These 300 privileged users are passed on to Sampling Unit C.

Our final result: Sampling Unit C has a total of 600 usernames, divided equally between privileged and non-privileged users (300 in each group). Direct invitations are being extended to Sampling Unit C over the weekend of Mar 3-4, 2007.
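The shuffle-and-draw step for both strata can be reproduced with a database's pseudorandom function. Below is a minimal sketch using SQLite from Python, with an invented table layout and a scaled-down draw of 5 names per stratum instead of 300; `ORDER BY RANDOM()` plays the role of shuffling the deck and dealing the first names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT, privileged INTEGER)")

# Hypothetical stand-in for Sampling Unit A (privileged = 1 for admins).
rows = [(f"user{i}", 1 if i < 10 else 0) for i in range(40)]
conn.executemany("INSERT INTO users VALUES (?, ?)", rows)

def draw(privileged, n):
    # Shuffle the stratum like a deck of cards and deal the first n names.
    cur = conn.execute(
        "SELECT username FROM users WHERE privileged = ? "
        "ORDER BY RANDOM() LIMIT ?", (privileged, n))
    return [r[0] for r in cur]

# 5 privileged + 5 non-privileged names -> the combined invitation list
# (300 + 300 = 600 in the real study).
sampling_unit_c = draw(1, 5) + draw(0, 5)
print(len(sampling_unit_c))  # 10
```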
Technicalities

The problem is, as I have stated on my user page, that the data import is taking a very long time. Specifically, I am importing the data from the stub dump, a file of 3.1 GB. In contrast to the earlier file mentioned in the section 'Sample Selection', this data file contains the entire history of all edits made prior to Nov 2006. This explains why I require survey participants to have registered their accounts before Jan 2006, so that I can track the edits they made in the run-up to Nov 2006.

After uncompressing, the file balloons to 20+ GB; after running it through the mwdumper tool, it shrinks to 10+ GB and is finally ready for insertion into the database. Please do not be concerned by the variance in file size -- no data is lost. This is expected when you convert a .xml file (a structured, human-readable format that carries more redundant data) into a .sql file. The data import has been ongoing since Jan 12, 2007, which means it has been running for some 50 days now (as at Mar 4).
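As a sanity check on the elapsed time (plain date arithmetic, not part of the original workflow): Jan 12 to Mar 4, 2007 is 51 days by calendar subtraction, in line with the roughly 50 days quoted above.

```python
from datetime import date

# Days from the start of the import (Jan 12) to the time of writing (Mar 4).
elapsed = (date(2007, 3, 4) - date(2007, 1, 12)).days
print(elapsed)  # 51
```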