For quite a long time I have been following the Polish site "bigpicture.pl". Its recently published articles are a good example of simple but very productive and scientifically meaningful "mini-research":
What are the Wikipedia traffic patterns? How do they reflect general usage of the Internet? A simple comparison of Google Trends patterns with Wikipedia traffic can help answer these questions.
We randomly selected a sample of 10,000 Wikipedia queries. The only selection criterion was the absence of punctuation (mostly commas and parentheses), because Google Trends ignores punctuation. About 2,000 queries did not have enough traffic to be shown on Google Trends; for the remaining 8,000 queries we calculated correlation coefficients. The distribution of the coefficients is shown on the graph below. The average correlation coefficient is 0.45.
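As a sketch of how such a comparison works: given two aligned time series for the same query, the Pearson correlation coefficient can be computed directly. The numbers below are made up for illustration, not the study's data.

```python
import numpy as np

# Hypothetical weekly pageview counts for one query on Wikipedia and
# the matching Google Trends index (illustrative numbers only).
wiki_views = np.array([120, 95, 180, 210, 160, 140, 300, 280], dtype=float)
trends_idx = np.array([40, 35, 55, 70, 50, 45, 95, 90], dtype=float)

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    x, y = x - x.mean(), y - y.mean()
    return float((x * y).sum() / np.sqrt((x * x).sum() * (y * y).sum()))

r = pearson(wiki_views, trends_idx)  # close to 1 when the series track each other
```

Repeating this over all 8,000 query pairs gives the coefficient distribution shown on the graph.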
Based on this graph, we can draw an important conclusion: Wikipedia traffic patterns reflect the general search patterns seen on Google Trends.
Next, I compared the periodic calendar queries "January 1" .. "December 31" and "January" .. "December" that I studied in one of my previous posts. The results were impressive: the average correlation coefficient was 0.891, which means the compared curves were practically identical!
Later, I decided to liven up the rather dull calendar pages of Wikipedia (like "January 1" .. "December 31") and proposed including a section "Other pages read frequently on this day". However, my proposal met opposition from the well-known mathematician Arthur Rubin. His position was that the proposal was "Not suitable for Wikipedia, being self-referential." He was referring to WP:SELFREF: "Mentioning that the article is being read on Wikipedia … should be avoided where possible."
This is why I had to prove that the calendar is a global phenomenon: that there are certain topics people are most interested in on a certain day of every year. To do so, I had to compare all ~3,000 Wikipedia articles used to build the calendar with Google Trends. It was not easy. First, about half of the titles do not have enough data to be shown on Google Trends. Others show only monthly averages there, which is not enough for a statistically valid comparison. Nevertheless, I was able to compare 577 articles. The average correlation coefficient is 0.63, which is much stronger than for a random Wikipedia article. The low coefficients were caused mostly by disambiguation: for example, "1447" means the year 1447 AD on Wikipedia, but just a number on Google, which is why the correlation coefficient for this query is -0.07. Overall, however, the correlation between calendar traffic patterns on Google and Wikipedia is high: most of the compared patterns (55%) have a coefficient above 0.8. Here is a graph of the coefficient distribution.
If you find this interesting, please support my proposal to include “Other pages read frequently on this day” on Wikipedia talk:WikiProject Days of the year.
I was wondering how different historical periods are represented in Wikipedia.
Hence, I pulled out the Wikipedia pages dedicated to individual years from 0 AD to 1800 and counted the number of distinct events that took place in each year. The results are shown below.
The increasing number of known events was expected, but the growth was not linear. You can observe a significant relative drop in events (below the trendline) from the VIII century to the XI (or even the XV) century, which corresponds to the period originally called the "Dark Ages".
To normalize the data, I calculated a moving average over each 5-year period and looked at the deviations from these averages. The next graph shows the results. To make it more readable, I included only years whose number of events was at least twice the corresponding average, or at least 60% below it.
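The normalization step can be sketched like this, assuming each year's count is compared with the mean of its surrounding 5-year window (including the year itself, which is one possible design choice; the counts below are illustrative):

```python
# Flag years whose event count is at least `high` times the mean of the
# surrounding 5-year window, or at most `low` times it (i.e. 60% below).
def flag_outliers(counts, window=5, high=2.0, low=0.4):
    half = window // 2
    flagged = []
    for i in range(half, len(counts) - half):
        avg = sum(counts[i - half:i + half + 1]) / window
        if avg > 0 and (counts[i] >= high * avg or counts[i] <= low * avg):
            flagged.append(i)
    return flagged

counts = [10, 11, 9, 40, 10, 12, 3, 11, 10]   # toy per-year event counts
spikes = flag_outliers(counts)                # indices of unusual years
```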
Year 1752 stands out because many famous people were born that year. 1118 was full of events around the world, from Japan and China to Scandinavia, and many famous people died that year. But most of these extremes are difficult to explain.
Please notice the amazingly long period of stable event flow from 1246 until 1613.
I speculated that it would be easier to explain such irregularities in more recent history, so I made a graph for the 20th century (below). Surprisingly, I could not find explanations for the irregularities in modern history either. I can explain the 30% peak in 1914, when World War I began, but I certainly cannot justify the drop in notable events the next year, 1915, nor understand why 1974 was so "uneventful".
From a general point of view, all years should have an equal, or at least nearly equal, number of events. But they certainly do not. Is this just a probability game, or are some factors unknown to me at play?
I always wondered why articles like "Lists of deaths by year" were in the top 10 most popular Wikipedia pages. (This year the popularity of this article fell dramatically, which is another puzzle.)
Maybe because thoughts of death and eternity visit each of us? Maybe Wikipedia data can show the probability of my sudden death today? (Or, more seriously, can Wikipedia data be used for population statistics?)
So I decided to do a little research. I pulled out over 200,000 Wikipedia person profiles with death dates between 1950 and 2014 using a DBpedia SPARQL query.
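The exact query is not reproduced here, but a DBpedia query of this kind might look roughly like the following sketch (property names follow the DBpedia ontology; the `dbo:` and `xsd:` prefixes are predefined on the public endpoint):

```python
# Sketch of a DBpedia query for persons with death dates in 1950-2014.
# Not the author's actual query; property names follow the DBpedia ontology.
query = """
SELECT ?person ?deathDate WHERE {
  ?person a dbo:Person ;
          dbo:deathDate ?deathDate .
  FILTER (?deathDate >= "1950-01-01"^^xsd:date &&
          ?deathDate <= "2014-12-31"^^xsd:date)
}
"""
# In practice this would be sent to the endpoint at
# http://dbpedia.org/sparql (e.g. with the SPARQLWrapper library),
# paginating with LIMIT/OFFSET because the endpoint caps result sizes.
```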
Clearly, Wikipedia's coverage of famous persons (and their death dates) increases over the years.
There is no significant correlation between weekday and death, but some correlation between day of year and mortality does exist.
As you can see, the probability of dying is highest on New Year's Eve and New Year's Day (a well-known fact from population statistics). The next high-risk dates are January 28, February 2 and September 11, followed by bad days in November (25 and 29) and December (14 and 22).
Mortality in summer is significantly lower than in winter. The safest month is August (especially August 4, 31, 29 and 7).
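The day-of-year aggregation itself is straightforward once the death dates are extracted; here is a minimal sketch, with a toy sample standing in for the real 200,000+ profiles:

```python
from collections import Counter
from datetime import date

# Toy sample of extracted death dates (the real data set has 200,000+).
death_dates = [
    date(2001, 1, 1), date(2003, 1, 1), date(2004, 1, 1),
    date(2005, 12, 31), date(2002, 8, 4),
]

# Count deaths per (month, day), ignoring the year.
by_day = Counter((d.month, d.day) for d in death_dates)
peak_day, peak_count = by_day.most_common(1)[0]
```

The same `Counter` over `d.weekday()` gives the weekday comparison mentioned above.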
I would like to compare these data with official mortality rates, but my search was unsuccessful. I would appreciate any advice.
So, thank God, today is not New Year's and I am not famous.
If you have ever seen one of the best comedies of all time, "The Gods Must Be Crazy", you surely remember the phrase in the title.
When I was studying regular patterns in enwiki traffic, I encountered "an interesting psychological phenomenon".
Why do users tend to look at the page "Monday" or "Sunday" much more on Mondays and Sundays?
Why do pages for days of the year (like "March 14") show peaks of traffic on those days? This could partially be because a link to the current day's page appears on the Main Page each day, but why does Google show the same pattern?
Google Trends data show very similar tendencies.
There are other queries that demonstrate the same patterns, for example the names of the months, in sequence.
Activity usually starts about two periods (i.e., months or days) before the event itself (e.g., the first day of the month) and lasts until about two periods after.
Can somebody point me to any research that explains this “psychological phenomenon”?
I should end with another quotation from the same movie: "I know you think I'm an idiot, but normally I'm quite normal."
Looking at the recent Zeitgeist data with the top 100 articles and the Wikipedia Hall of Fame: Zeitgeist 2008–2013 with the top 1000 articles, I noticed that the number of cases where an article's rank increases over the years exceeds the number of cases where it decreases by a factor of about 1.5.
(Table: number of cases considered, the sequence of increase/decrease cases from 2008 to 2013, and the percentage of all cases.)
Does this mean that more and more Wikipedia users read fewer and fewer articles? Is that so, or is it just an aberration of using the top 1000 articles?
Recently I read an article by Greg Tkacz (Associate Professor, St. Francis Xavier University, and Associate Fellow, CIRANO) titled "Predicting Recessions in Real-Time: Mining Google Trends and Electronic Payments Data for Clues," published as No. 387 of the C. D. Howe Institute Commentary. You can read it here.
One of the author's conclusions is that a generic term such as "recession" turned out "to be correlated with the onset of the 2008 recession, which preceded the release of official GDP data from Statistics Canada by at least two months".
The article strikes me as a good demonstration of how easy it is to use time series to "predict" anything, and as a perfect example of this blog's motto: "Lies, damned lies, and statistics".
The article inspired me to conduct my own 5-minute research, "Predicting Recessions in Real-Time: Mining Wikipedia Trends". I chose practically the same queries ("Recession", "Gross domestic product", "Unemployment benefits" and "Unemployment") on WikipediaTrends.com. The resulting graph is here (best viewed normalized and on a logarithmic scale).
Anyone can see periodic drops in popularity at the end of each July and August, and for about two weeks around each New Year. I also checked the Wikipedia article "List of recessions in the United States" and found that since the 1960s there have been no recessions starting at the end of July or around New Year!
It is that easy! Go ahead and start your own path to a professorship!
Or read a good critical article, "The Parable of Google Flu: Traps in Big Data Analysis".
(Some answers from Wiki-research-l Digest, Vol 103, Issue 17-19)
…there were several papers on how to classify articles by topic purely through statistical analysis of the words they contain:
<http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4624691>; but also
<http://dl.acm.org/citation.cfm?id=1620887> (I have only read the abstracts).
The most widely adopted techniques for modeling the topics of documents are LDA and LSI. Under these techniques, a document is viewed as a mixture of topics, while a topic is a mixture of words. Both methods are well implemented in different languages, for example in gensim for Python. But these methods are relatively expensive.
Last year a word vector model, word2vec, was introduced by Google. By combining it with a topic catalog, we can easily decide which topic an article belongs to. The topic catalog is just a list of topics, where each topic is a list of related words.
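A minimal sketch of this idea, with toy 2-dimensional vectors standing in for a trained word2vec model (in practice the vectors would come from, e.g., gensim's Word2Vec, and the catalog would be hand-built):

```python
import numpy as np

# Toy word vectors standing in for a trained word2vec model.
vectors = {
    "goal": np.array([0.9, 0.1]), "match": np.array([0.8, 0.2]),
    "vote": np.array([0.1, 0.9]), "party": np.array([0.2, 0.8]),
    "striker": np.array([0.85, 0.15]),
}
# Hypothetical topic catalog: each topic is a list of related words.
catalog = {"sports": ["goal", "match"], "politics": ["vote", "party"]}

def topic_of(words):
    """Assign the topic whose averaged catalog vector is closest
    (by cosine similarity) to the averaged vector of the words."""
    doc = np.mean([vectors[w] for w in words if w in vectors], axis=0)
    best, best_sim = None, -2.0
    for topic, topic_words in catalog.items():
        t = np.mean([vectors[w] for w in topic_words], axis=0)
        sim = float(doc @ t / (np.linalg.norm(doc) * np.linalg.norm(t)))
        if sim > best_sim:
            best, best_sim = topic, sim
    return best
```

With real word2vec vectors, `topic_of` would be applied to the words of an article to pick its topic from the catalog.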
We have released one open-source project in this direction,
and have another project planned on the topic catalog.
We will update the catalog in the coming weeks and give more details.
Two years ago, we utilised Wikipedia categories to analyse the distribution of articles over a set of main topics. We used the 24 direct subcategories of "Category:Main topic classifications" as main topics. For further information, see Section 4.2 in this paper:
You may take a look at the DBpedia ontology and the instance-type mapping.
I guess that Wikidata employs a similar classification.
If you're not interested in actual topic extraction, a good heuristic for identifying high-level topic areas is to rely on WikiProjects on the English Wikipedia and then use language links from Wikidata to apply them to other languages. That won't immediately cover articles that exist in only one language, but it's the most effective heuristic I can think of for your use case.
Sometimes a Wikipedia user gets to the desired page not directly, but through special pages called redirects.
To analyze pageview statistics we have to consider the fact that Wikipedia users sometimes go through redirects much more often than directly. We made the first steps toward addressing this problem by analyzing pageviews for articles and the redirects to them in February 2014. Here are some interesting facts from this study. Please keep in mind that all results are preliminary and additional research is needed.
There are over 66,000 redirects, or about 1% of all redirects, that are used more often than the direct links. If we consider all redirects to a given article together, this ratio is much higher.
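The comparison itself reduces to joining two mappings: a redirect-to-target table and a pageview table. A sketch with illustrative numbers (not the February 2014 data):

```python
# redirect -> target article (toy examples, not real titles)
redirects = {"Foo river": "Foo River", "Bar (disambiguation)": "Bar"}
# monthly pageview counts for every title, redirects included
views = {"Foo river": 9000, "Foo River": 4,
         "Bar (disambiguation)": 50, "Bar": 500}

# Redirects viewed more often than the articles they point to,
# with the redirect-to-article pageview ratio.
popular_redirects = [
    (r, t, views[r] / views[t])
    for r, t in redirects.items()
    if views.get(r, 0) > views.get(t, 0) > 0
]
```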
In our opinion, some articles should be renamed according to user preferences. There is no visible reason, and no Wikipedia policy, for keeping the unpopular article names.
Here are some examples:
| Redirect name | Avg. monthly pageviews (Feb 2014) | Article name | Avg. monthly pageviews (Feb 2014) | Ratio |
|---|---|---|---|---|
| List of civilian casualties in the War in Afghanistan (2001–2006) | 13 | List of civilian casualties in the War in Afghanistan (2001–06) | 1.3 | 10 |
| Hariabhanga river | 9053 | Hariabhanga River | 3.75 | 2414 |
| Forest degradation | 9058 | Secondary forest | 84.1 | 10 |
| Operation Buckshot Yankee | 3462 | 2008 cyberattack on United States | 44.7 | 77 |
| David F. Cargo | 146 | David Cargo | 7.6 | 19 |
But contrary to the above examples, a simple grep search over article titles shows that 'river' is used 63 times versus 18,090 for 'River'. A similar search for 'island' vs. 'Island' produces 43 vs. 5,836.
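The check can be reproduced in a few lines over a list of titles; the titles below are toy examples, while a real run would stream the enwiki all-titles dump file:

```python
import re

# Toy title list; a real check would read the all-titles dump.
titles = ["Hariabhanga River", "Mississippi River",
          "river delta", "Amazon River"]

# Case-sensitive whole-word counts of 'river' vs. 'River'.
lower = sum(1 for t in titles if re.search(r"\briver\b", t))
upper = sum(1 for t in titles if re.search(r"\bRiver\b", t))
```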
Some article names use letters from foreign alphabets. Sometimes this follows tradition, but very often it contradicts the policy. Here are some examples:
| Redirect name | Avg. monthly pageviews (Feb 2014) | Article name | Avg. monthly pageviews (Feb 2014) | Ratio |
|---|---|---|---|---|
| Boris Tadic | 9066.2 | Boris Tadić | 26.3 | 343.9 |
| Lars Lokke Rasmussen | 9051 | Lars Løkke Rasmussen | 30 | 306 |
Most often, different spellings occur in personal names, which may simply be caused by the prevalence of English keyboards.
But the majority of differences are related to disambiguation. Some popular redirects are named '<something> (disambiguation)' but point to an article with the same name without the word 'disambiguation', like 'Shaun Campbell (disambiguation)' vs. 'Shaun Campbell', 'Lyon Township (disambiguation)' vs. 'Lyon Township', or 'Lyon Mountain (disambiguation)' vs. 'Lyon Mountain'. Two things surprise me here: first, why do users prefer the redirect with '(disambiguation)'? And second, all the article titles the user is redirected to are in fact disambiguation pages! What is the logic?
Sometimes the difference between a popular redirect and an unpopular article title is just the difference between a hyphen and a dash, which are visually almost indistinguishable. Examples: 'Obsessive-compulsive disorder' vs. 'Obsessive–compulsive disorder', 'Mexican-American War' vs. 'Mexican–American War', 'American Soccer League (1933-1983)' vs. 'American Soccer League (1933–1983)'. Or an apostrophe: 'The Devil's Harvest' (straight apostrophe) vs. 'The Devil’s Harvest' (typographic apostrophe).
It is obvious to me that Wikipedia (Wikipedians) should develop clearer policies for naming articles, and some bots to check for the most common mistakes. However, I believe that users' preferences should prevail when developing such policies.