Pantheon 1.0, a manually verified dataset of globally famous biographies
Background & Summary
In this paper, we present a dataset of the biographies of globally famous individuals that can be used to study the production and diffusion of the types of human-generated information expressed in biographical data, by linguistic group, geographic location, and time period.
Biographies allow us to capture people who have either produced a creative oeuvre, such as William Shakespeare and Leonardo da Vinci, or contributed to well-known historical events, such as George Washington, a key military general in the American Revolution, or Diego Maradona, a key player in Argentina's 1986 World Cup championship.
The Pantheon 1.0 dataset connects occupations, places of birth, and dates, helping us create reproducible quantitative measures of the popularity of biographical records that we can use to proxy historical information.
These data, therefore, enable researchers to explore the role of polyglots in the global dissemination of human-generated information1, the gender inequality and biases present in online historical information2, the occupations associated with the producers of historical information, and the breaks produced by communication technologies in the production and dissemination of information by humans3,4.
Previous efforts
Past efforts to quantify historical information include Charles Murray’s Human Accomplishment book, which contains an inventory of 3,869 significant individuals within the domains of arts and sciences5; the digitized text study self-branded by its authors as Culturomics6; and efforts focused on structuring Wikipedia data7 and quantifying the impact of individuals in a more diverse set of occupations8,9.
Most efforts, however, have looked only at the popularity of individuals in a few languages (predominantly English) and lack a classification of the domains of contribution that can be used to categorize the areas of historical impact of an individual. This categorization is an essential contribution of our dataset, since without it, it is not possible to understand the types of information generated at different time periods and in different geographies.
Table 1 provides a non-exhaustive comparison between several datasets constructed to quantify historical information and the Pantheon 1.0 dataset.
Methods
Data collection
Ideally, we would want to quantify historical information by using data that summarizes information produced by people in all languages and that includes all forms of historical information, from biographical data, to the characters created by authors in works of fiction (e.g., Mickey Mouse), to the artifacts and constructions that people produce.
Since no such dataset exists, we create a simpler dataset focused only on biographical information by using data from Freebase and 277 language editions of Wikipedia. Both Freebase and Wikipedia are open-source, collaborative, multilingual knowledge bases freely available online to the general public.
We note that previous efforts have produced structured datasets on biographical records based on Wikipedia7,8, but they did not consider all language editions (using only English), and they have not manually verified time periods and geographies, or introduced a controlled taxonomy of the occupations associated with each biography.
While there are certainly considerable limitations to Wikipedia and Freebase, they are currently the largest available domain-independent repositories of collaboratively edited human knowledge, and past research has demonstrated the reliability of these collaborative knowledge bases10,11.
We note that we also evaluated Wikidata, the repository for structured data associated with Wikimedia projects, as a possible data source. However, at the time of data collection (2012–2013), this project was in its first stages of development, and had not yet accumulated a database as robust as what was available within Freebase.
Figure 1 summarizes the main components of the workflow used to create the Pantheon dataset.
We derive our dataset of historical biographical information from Freebase’s entity knowledge graph and add metadata from Wikipedia accessible through its API. Freebase organizes information as uniquely identified entities with associated types and properties defined by a structured, but uncontrolled, data ontology.
Therefore, to identify globally known biographies, we first determined a set of 2,394,169 individuals through Freebase’s database of all entities classified as Persons. Next, we linked individuals to their English Wikipedia page using their unique Wikipedia article id, and from there we obtained information about additional language editions using the Wikipedia API as of May 2013, narrowing the set to the 997,276 individuals that had a presence in Wikipedia.
We then supplemented the data with monthly page view data for all language editions, taken from the Wikipedia page view data dumps for each individual from Jan. 2008 through Dec. 2013.
The Pantheon 1.0 dataset is restricted to the 11,341 biographies with a presence in more than 25 different languages in Wikipedia (L>25).
The choice of the L>25 threshold is guided by a combination of criteria, based on the structure of the data and the limits of manual data verification. Figure 2 shows the cumulative distribution of biographies on a semi-log plot, as a function of the number of languages in which each of these biographies has a presence.
Most of the 997,276 biographies surveyed have a presence in only a few languages, such that the L>25 threshold is a high bar that can help filter the most visible of these biographies. For example, a sampling of the individuals above the L>25 threshold includes globally known figures such as Charles Darwin, Che Guevara, and Nefertiti.
Below the threshold, we find individuals who are locally famous, such as Heather Fargo, the former Mayor of Sacramento, California. Also, 95% of individuals passing this threshold have an article in at least 6 of the top 10 spoken languages worldwide (top 10 spoken languages by number of speakers worldwide: Chinese, English, Hindi, Spanish, Russian, Arabic, Portuguese, Bengali, French, Bahasa), demonstrating that the Pantheon dataset has good coverage of non-Western languages.
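The selection rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the language sets are made up, the function names are hypothetical, and the two-letter codes chosen for the ten most spoken languages (in particular `id` for Bahasa) are assumptions.

```python
# Assumed Wikipedia language codes for the ten most spoken languages listed in
# the text (Chinese, English, Hindi, Spanish, Russian, Arabic, Portuguese,
# Bengali, French, Bahasa). The choice of "id" for Bahasa is an assumption.
TOP10_CODES = {"zh", "en", "hi", "es", "ru", "ar", "pt", "bn", "fr", "id"}


def passes_threshold(langs, threshold=25):
    """True when a biography is present in more than `threshold` editions (L > 25)."""
    return len(set(langs)) > threshold


def top10_coverage(langs):
    """Number of the ten most spoken languages in which the biography has an article."""
    return len(TOP10_CODES & set(langs))


def select_biographies(editions_by_person, threshold=25):
    """Keep only the biographies whose number of editions exceeds the threshold."""
    return {
        name: langs
        for name, langs in editions_by_person.items()
        if passes_threshold(langs, threshold)
    }


# Made-up example data: the language sets below are placeholders, not real counts.
filler = {f"x{i:02d}" for i in range(20)}
editions = {
    "global figure": TOP10_CODES | filler,                  # 30 editions
    "local figure": {"en", "es"},                           # 2 editions
    "borderline figure": {f"y{i:02d}" for i in range(25)},  # exactly 25 editions
}

selected = select_biographies(editions)
print(sorted(selected))
print(top10_coverage(editions["global figure"]))
```

Note that the inequality is strict, so a biography with exactly 25 editions is excluded.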
Taxonomy design
Since no universally standardized classification system currently exists to classify biographies to occupations, we introduce a new taxonomy connecting biographies to occupations.
Following best practices of taxonomy construction from information science12, we derive a controlled vocabulary from the raw data and design a classification hierarchy allowing three levels of aggregation. For simplicity we call the most disaggregate level of the taxonomy ‘occupations,’ and its aggregations in increasing order of coarseness ‘industries’ and ‘domains.’ We note that we use these terms simply to facilitate the communication of the level of aggregation we are referring to.
The design and classification process was led by the authors, with support from a multidisciplinary research team with domain expertise in a wide range of fields, including economics, computer science, physics, design, history, and geography. Figure 3 shows the entire occupation taxonomy, with details on all three levels of the classification hierarchy.
To create this taxonomy, we use raw data on individual occupations from Freebase to create a normalized listing of occupations; for example, we map ‘Entrepreneur’, ‘Business magnate’, and ‘Business development’ to the normalized occupation of ‘Businessperson’.
We grouped normalized occupations into a second-tier classification (called industries) and into top-level domains. We associated individuals within the dataset with a single occupation, namely the occupation that best encompasses their primary area of contribution. Thus, we explicitly choose to create a taxonomy (a hierarchical classification that maps each biography to a single category) rather than an ontology (a network that connects biographies to multiple categories), for both technical and historical reasons.
The assignment of a biography to multiple categories is troublesome because it requires weighing the multiple classifications that are associated with a biography and defining a threshold for when to stop counting. In some cases the weighting is relatively straightforward. For example, Shaquille O’Neal should be classified as a basketball player first, and then as an actor, given his 22 actor credits on IMDb.
But should we also consider O’Neal a singer (he has released several rap albums), a producer, or a director (he directed a single episode of a little-known TV series named Cousin Skeeter)? On a similar note, should we classify Angela Merkel as a physicist and Margaret Thatcher as a chemist (as their respective diplomas indicate), although their historical impact definitely comes from their work in politics?
Because of the difficulties involved in defining which categories to consider when assigning an individual to multiple categories, we used a more pragmatic approach and assigned each individual to the category that corresponds to his or her claim to fame: hence, we assign O’Neal to the basketball player category, and assign Merkel and Thatcher to the politician category.
By normalizing the data to a controlled vocabulary and using a nested classification system, we provide a consistent mapping of individuals to occupations across time, and enable users of this data to perform analysis at several levels of aggregation while avoiding double counting. Yet we understand that we also introduce the limitation of restricting the contribution of polymaths to one singular domain.
The challenge of fairly distributing the historical impact of polymaths will be left for future consideration.
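The nested, single-category design described above can be illustrated with two lookup tables. This is a minimal sketch under stated assumptions: the three raw labels mapping to ‘Businessperson’ and the Photographer / Fine Arts / Arts chain are mentioned in the text, while the industry and domain shown for ‘Businessperson’ are placeholders, and all function names are hypothetical.

```python
from collections import Counter

# Raw Freebase-style label -> normalized occupation.
# The three 'Businessperson' labels are the example given in the text.
RAW_TO_OCCUPATION = {
    "Entrepreneur": "Businessperson",
    "Business magnate": "Businessperson",
    "Business development": "Businessperson",
    "Photographer": "Photographer",
}

# Normalized occupation -> (industry, domain).
# The Photographer row follows the text; the Businessperson row is a placeholder.
OCCUPATION_TO_PARENTS = {
    "Photographer": ("Fine Arts", "Arts"),
    "Businessperson": ("<industry placeholder>", "<domain placeholder>"),
}


def classify(raw_label):
    """Map one raw label to exactly one occupation, industry, and domain."""
    occupation = RAW_TO_OCCUPATION[raw_label]
    industry, domain = OCCUPATION_TO_PARENTS[occupation]
    return {"occupation": occupation, "industry": industry, "domain": domain}


def count_by_level(raw_labels, level):
    """Count individuals at one aggregation level ('occupation', 'industry', 'domain')."""
    return Counter(classify(label)[level] for label in raw_labels)


# One raw label per individual, so each person is counted once at every level.
people = ["Entrepreneur", "Business magnate", "Photographer"]
by_occupation = count_by_level(people, "occupation")
by_domain = count_by_level(people, "domain")
print(by_occupation)
print(by_domain)
```

Because every individual carries a single occupation and every occupation has a single parent, the counts at each level sum to the number of individuals, which is the "no double counting" property the taxonomy is meant to provide.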
In terms of location assignment, we attribute individuals to a place of birth by country, based on current political boundaries. We use present-day political boundaries because of the lack of a historical geocoding API that attributes geographic boundaries using latitude, longitude, and time.
Birthplaces were obtained by scraping both Freebase and Wikipedia, and further refined by using fuzzy location matching and geocoding within the Yahoo Placemaker and Google Maps geocoding APIs, and through manual verification. The dataset includes the raw data on individual birthplaces, as well as the cleaned country, which is derived using various APIs that allow us to attribute locations by modern-day country boundaries.
To map birthplaces to countries, we normalize the raw data from Freebase indicating the city of birth to a latitude and longitude, using the fuzzy location matching available within the geocoding APIs. Using the coordinates obtained through the APIs, individuals are then mapped to countries based on present-day political boundaries using a reverse geocoding API. For instance, individuals born in Moscow during the Soviet Union era are associated with Russia.
Using present-day boundaries provides a consistent basis for matching individuals to countries, and mitigates the technical limitation posed by the lack of historical geocoding APIs for attributing geographic boundaries using latitude, longitude, and time. Historically, birthplace is a fairly suitable way of associating individuals with countries; however, given the increase of human mobility over time9 and the net migration gains experienced by developed regions13, future refinement of the dataset may include improving the attribution of individuals to the geographies they inhabited across their lives.
Visibility metrics
We introduce metrics of popularity that help us capture the relative visibility of each biography in our dataset.
The fame, or visibility, of historical characters is estimated using two measures. The simpler of the two measures, denoted as L, is the number of different Wikipedia language editions that have an article about a historical character. The documentation of an individual in multiple languages is a good first approximation for their global fame because it points to individuals associated with accomplishments or events that have been documented globally.
The use of languages as a criterion for inclusion in our dataset helps us differentiate between biographies that are globally famous and those that are locally famous.
We also introduce the Historical Popularity Index (HPI), a more nuanced metric for global historical impact that takes into account the following: the individual’s age in the dataset (A), or the time elapsed since his/her birth, calculated as 2013 minus birthyear; an L* measure that adjusts L by accounting for the concentration of pageviews among different languages (to discount characters with pageviews mostly in a few languages, see equation (1)); the coefficient of variation (CV) in pageviews across time (to discount characters that have short periods of popularity); and the number of non-English Wikipedia pageviews (vNE) to further reduce any English bias.
In addition, to dampen the recency bias of the data, HPI is adjusted for individuals known for less than 70 years. Equation (4) provides the full formula for HPI. There we use log base 4 for the age variable in the aggregation to avoid age becoming the dominant factor in HPI (as it would if we had used the natural log).
For each biography i, we define:
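$L_i$ = number of different language editions of Wikipedia for biography $i$
$L^*_i$ = effective number of language editions for biography $i$:
$$L^*_i = \exp(H_i) \qquad (1)$$
where $H_i$ is the entropy in terms of page views:
$$H_i = -\sum_j \frac{v_{ij}}{\sum_k v_{ik}} \ln\left(\frac{v_{ij}}{\sum_k v_{ik}}\right) \qquad (2)$$
and $v_{ij}$ = total page views of individual $i$ in language $j$
$A_i = 2013 - \text{year of birth}$
$CV_i$ = coefficient of variation in page views:
$$CV_i = \frac{\sigma_i}{\mu_i} \qquad (3)$$
$\sigma_i$ = s.d. in pageviews across all languages
$\mu_i$ = average monthly pageviews
$v^{NE}_i$ = total pageviews in non-English editions of Wikipedia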
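The quantities defined above can be sketched numerically. This is a minimal illustration that assumes L* is the exponential of the Shannon entropy of the page-view shares across languages and that CV is the standard deviation of monthly page views divided by their mean. The use of the population standard deviation, the function names, and the page-view numbers are all assumptions made for the example.

```python
import math
from statistics import mean, pstdev


def effective_languages(views_by_lang):
    """Effective number of language editions: exp of the entropy of page-view shares."""
    total = sum(views_by_lang.values())
    entropy = 0.0
    for views in views_by_lang.values():
        if views > 0:  # a language with zero views contributes nothing to the entropy
            share = views / total
            entropy -= share * math.log(share)
    return math.exp(entropy)


def coefficient_of_variation(monthly_views):
    """Standard deviation of monthly page views divided by their mean."""
    return pstdev(monthly_views) / mean(monthly_views)


# Made-up page-view counts, for illustration only.
even = {"en": 100, "es": 100, "fr": 100, "ru": 100}   # views spread evenly
concentrated = {"en": 400, "es": 0, "fr": 0, "ru": 0}  # views in one language only

print(effective_languages(even))          # close to 4: all four editions count fully
print(effective_languages(concentrated))  # 1.0: effectively a single edition
print(coefficient_of_variation([10, 10, 10]))  # 0.0: perfectly steady attention
```

Both biographies have L = 4, but the second has L* = 1, which is the discount for concentrated page views that the L* adjustment is meant to capture.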
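Using the above, the Historical Popularity Index (HPI) of an individual, i, is defined as:
$$\mathrm{HPI}_i = \begin{cases} \ln(L_i) + \ln(L^*_i) + \log_4(A_i) + \ln(v^{NE}_i) - \ln(CV_i), & A_i \ge 70 \\ \ln(L_i) + \ln(L^*_i) + \log_4(A_i) + \ln(v^{NE}_i) - \ln(CV_i) - \dfrac{70 - A_i}{7}, & A_i < 70 \end{cases} \qquad (4)$$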
Table 2 (available online only) shows the ten people with the highest L and HPI, respectively, for a few select periods.
An individual is assigned to a period according to his or her date of birth. Here we see that the most notable biographies in each period are associated mainly with well-known historical characters.
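To make the aggregation concrete, the sketch below combines the HPI components in Python. It is an illustration under assumptions rather than the authors' code: it assumes HPI is a sum of logarithms of the components (natural log for L, L*, and non-English page views, log base 4 for age, and CV entering negatively), and that individuals younger than 70 receive a linear penalty of (70 − A)/7. The function and parameter names are hypothetical.

```python
import math


def hpi(L, L_star, age, v_ne, cv, young_cutoff=70, penalty_scale=7):
    """Sketch of the Historical Popularity Index under the assumptions stated above.

    L       number of language editions
    L_star  effective number of language editions
    age     2013 minus birth year
    v_ne    total page views in non-English editions
    cv      coefficient of variation of page views across time
    """
    score = (
        math.log(L)
        + math.log(L_star)
        + math.log(age, 4)   # log base 4 keeps age from dominating the sum
        + math.log(v_ne)
        - math.log(cv)       # bursty attention (high CV) lowers the score
    )
    if age < young_cutoff:
        # Assumed recency adjustment for individuals known for less than 70 years.
        score -= (young_cutoff - age) / penalty_scale
    return score


# With every other component set to 1, only the age terms remain.
print(hpi(L=1, L_star=1, age=256, v_ne=1, cv=1))  # log4(256) = 4
print(hpi(L=1, L_star=1, age=64, v_ne=1, cv=1))   # log4(64) - 6/7
```

For comparison, the natural log of an age of 256 years is about 5.5 whereas log base 4 gives 4, which illustrates why the base-4 logarithm gives age less weight in the sum.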
Biases & limitations
As with all large data collection efforts, Pantheon is coupled with limitations and biases, which should be considered carefully when interpreting the dataset.
This dataset should be interpreted narrowly, as a view of historical information that emerges from the multilingual expression of historical figures in Wikipedia as of May 2013. The main biases and limitations of the dataset come from:
1. The use of Wikipedia as a data source.
2. The use of place of birth to assign locations.
3. The use of biographies as proxies for historical information.
4. Other technical limitations.
1. The use of Wikipedia as a data source
The data is limited by the set of people who contribute to Wikipedia. Wikipedia editors are not considered to be a representative sample of the world population, but a sample of publicly-minded knowledge specialists who are willing and able to dedicate time and effort to contribute to the online documentation of knowledge.
Wikipedia editors have an English bias, a Western bias, and a gender bias towards males, and they tend to be highly educated and technically inclined. They are also more prevalent in developed countries with Internet access. Wikipedia also has a considerable bias in the inclusion of people from different categories.
This bias could be the result of the differences in the notability criteria in Wikipedia for biographies from different domains, or of systematic biases within the Wikipedia community14. Finally, Wikipedia also has a recency bias, since current events and contemporary individuals typically have greater prominence in the minds of Wikipedia contributors than events from the past15,16.
By using data from all Wikipedia language editions we effectively reduce a bias that would favor information that is locally famous among English speakers.
As an example, we note that there is only one American Football Player in the dataset: O.J. Simpson. Certainly, his global notoriety is not simply from his football career, showing that the use of many languages reduces the English bias of the dataset (famous American Football players, such as Peyton Manning, Tom Brady and Joe Montana all have a large presence in the English Wikipedia, but fail to meet the L>25 threshold).
In comparison, the dataset contains over 1,000 soccer players, showing that soccer is a sport that is globally popular.
2. The use of place of birth to assign locations
Individuals were assigned to geographic locations using their place of birth, based on present-day political boundaries.
Country assignments were complemented with geocoding APIs for normalization and with manual verification (to correct for API errors and for completeness). Place of birth is one way of assigning a location to an individual that allows us to assign locations in a comprehensive and consistent manner. Yet, there are biases and limitations that need to be considered when using this location assignment method.
An important limitation is the inability to account for individuals who became globally known after immigrating to another country.
Would Neruda, Picasso or Hemingway be as famous if they had not participated in the Parisian art scene? The place where an individual was born may differ from the place where that individual made his or her more significant contributions.
In some cases, the contributions are made in a number of different places, and the use of birthplace is unable to capture where the contributions were made. This is particularly true for athletes who migrate to the world’s most competitive leagues, or artists who move to the artistic centers of their time.
In this dataset, such migration is not captured, since programmatically geocoding birthplaces is more consistent than documenting the place where each individual made his or her most significant contribution, which can only be found through the unstructured data buried in historical narratives.
3. Limitations in the use of biographies as proxies for historical information
The use of biographies to proxy historical information allows us to connect information with a specific linguistic group, geographic location, occupation, and time period. Some biographies involve people who produced a creative oeuvre directly, like Mozart or Michelangelo, but others reflect important historical events.
So, biographies help us capture historical information in a broad sense because they are not limited only to the biographies of those who produced an oeuvre, but also include individuals who have inspired documentation by having participated in events that mark the history of our species.
The use of biographies as proxies for historical information, however, has important shortcomings.
Biographies may fail to capture information on works or events where the contribution of groups trumps that of individuals. For example, consider collaborative enterprises where the accomplishments are the results of teams and not isolated individuals. Examples of accomplishments that are likely to be excluded include the works of music bands or orchestras, or the products produced by a firm, where the accolades collected from accomplishments are associated with a firm, or team, rather than with an individual.
Also, biographies are a proxy of historical information that is biased against works and achievements that did not result in the widespread fame of their main actors or creators. Moreover, the global popularity of biographies is known to be biased towards the languages that are more central in the global network of translations1, biasing the estimates of historical information derived from biographical data towards the information produced by the speakers of the world’s most connected languages.
4. Other technical limitations
Other biases and limitations include the volatility of Wikipedia and other online resources, which makes the results presented here imperfectly reproducible. For example, the Yahoo Placemaker API, which was used for mapping individuals to countries by birthplace, has been deprecated and is no longer publicly available.
In addition, Freebase will be retired as of June 2015, and while there are plans to transfer the data to Wikidata, at the time of writing the future availability of Freebase data is undetermined. Finally, the set of individuals included in the Pantheon 1.0 dataset is static and does not reflect events after early 2013; as such, individuals who only recently rose to global prominence, including Pope Francis and Narendra Modi, are excluded from this dataset.
Data Records
The Pantheon dataset is publicly available on the Harvard Dataverse Network. The dataset is also presented through a data visualization engine that allows users to dynamically explore it through interactive visualizations.
The data consists surrounding three files—, , and pageviews_2008– (Data Citation 1).
The first file is a flattened tab-delimited table, where each row of the table represents a single biography.
Each row contains the following variable fields:
name—name of glory historical character (in English)
en_curid—unique identifier for each individual biography, mapping to the pageid from Wikipedia. To map to an individual’s biography in Wikipedia, use the en_curid field as an input parameter to the following URL: ?curid=[en_curid].
We use the English curid as the unique identifier in the Pantheon dataset; we confirmed that all biographies with L>25 as of May 2013 had an entry in the English Wikipedia.
countryCode- ISO 3166-1 alpha2 (based on present-day political boundaries)
countryCode3- ISO 3166-1 alpha3 country code (based on present-day political boundaries)
countryName—commonly accepted name of country
continentName—name of continent
birthyear—birthyear of individual
birthcity—given birthcity of individual
occupation—occupation of the individual
industry—category based on an aggregation of related occupations
domain—category based on an aggregation of related industries
gender—male or female
TotalPageViews—total pageviews across all Wikipedia language editions (January 2008 through December 2013)
L_star—adjusted L (see Appendix for calculation)
numlangs—number of Wikipedia language editions that each biography has a presence in (as of May 2013)
StdPageViews—s.d. of pageviews across time (January 2008 through December 2013)
PageViewsEnglish—total pageviews in the English Wikipedia (January 2008 through December 2013)
PageViewsNonEnglish—total pageviews in all Wikipedias except English (January 2008 through December 2013)
AverageViews—Average pageviews per language (January 2008 through December 2013)
HPI—Historical Popularity Index (see equation (4))
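A table with the fields listed above can be read with Python's standard `csv` module. The sketch below is illustrative only: it parses an in-memory sample rather than the real file, it uses a subset of the documented column names, and every value in the sample (names, ids, scores, and the "X" categories) is made up.

```python
import csv
import io
from collections import defaultdict

# Made-up rows using a subset of the documented columns; values are placeholders.
SAMPLE = (
    "name\ten_curid\toccupation\tindustry\tdomain\tnumlangs\tHPI\n"
    "Person A\t101\tPhotographer\tFine Arts\tArts\t30\t20.5\n"
    "Person B\t102\tPhotographer\tFine Arts\tArts\t40\t22.5\n"
    "Person C\t103\tOccupation X\tIndustry X\tDomain X\t55\t25.0\n"
)


def load_biographies(handle):
    """Read a tab-delimited table, converting the numeric columns used below."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        row["numlangs"] = int(row["numlangs"])
        row["HPI"] = float(row["HPI"])
        rows.append(row)
    return rows


def mean_hpi_by(rows, level="domain"):
    """Average HPI at one taxonomy level ('occupation', 'industry', or 'domain')."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[level]].append(row["HPI"])
    return {key: sum(values) / len(values) for key, values in groups.items()}


rows = load_biographies(io.StringIO(SAMPLE))
print(mean_hpi_by(rows, "domain"))
```

To run this against the actual file, the `io.StringIO(SAMPLE)` handle would be swapped for an opened file object; the remaining numeric columns would need the same kind of conversion.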
The second file is a tab-delimited table of all the different Wikipedia language editions that each biography has a presence in.
Each row of the table contains the following variables:
en_curid—unique identifier for each individual biography
lang—Wikipedia language code
name—name in the language specified.
To link to the different editions of Wikipedia, use the lang and name parameters in the following URL: http://[lang][name]
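The two URL templates mentioned in this section (one keyed on en_curid, one on lang and name) can be wrapped in small helper functions. The URL templates in the text are truncated, so the host names and paths used here are assumptions based on how Wikipedia links are commonly formed, not values taken from the dataset documentation. The function names and default parameters are likewise hypothetical.

```python
from urllib.parse import quote


def curid_url(en_curid, base="https://en.wikipedia.org/"):
    """Link to an English Wikipedia article by page id.

    The `base` default is an assumption; only the `?curid=` query comes from the text.
    """
    return f"{base}?curid={int(en_curid)}"


def edition_url(lang, name, host_suffix=".wikipedia.org", path="/wiki/"):
    """Link to an article in a given language edition.

    `host_suffix` and `path` are assumptions; the text only says that the
    `lang` and `name` fields are combined into the URL.
    """
    # Article titles use underscores for spaces; other characters are percent-encoded.
    title = quote(name.replace(" ", "_"), safe="")
    return f"https://{lang}{host_suffix}{path}{title}"


# The id below is an arbitrary example value, not a real page id.
print(curid_url(12345))
print(edition_url("es", "Charles Darwin"))
```

Keeping the host and path as parameters makes it easy to adjust the helpers if the actual templates differ from these assumptions.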
The third file, pageviews_2008–, contains the monthly pageview data for each individual, for all the Wikipedia language editions in which they have a presence.
Each row of this table includes the following variables:
en_curid—unique identifier for each individual biography
lang—Wikipedia language code
name—English name
numlangs—total number of Wikipedia language editions
countryCode3- ISO 3166-1 alpha3 country code (based on present-day political boundaries)
birthyear—birthyear of individual
birthcity—given birthcity of individual
occupation—occupation of the individual
industry—category based on an aggregation of related occupations
domain—category based on an aggregation of related industries
gender—male or female
2008-01 through 2013-12—total pageviews for the given month (denoted by the column header)
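Since this table has one row per individual and language edition, with one column per month, totals such as English and non-English page views can be derived by summing the month columns. The sketch below shows one way to do this. The rows are made up and truncated to two months, the function names are hypothetical, and the assumption that month columns are exactly those whose header matches `YYYY-MM` follows the column description above.

```python
import re
from collections import defaultdict

# Month columns are assumed to be the ones whose header looks like "2008-01".
MONTH = re.compile(r"^\d{4}-\d{2}$")

# Made-up rows (two months only); real rows span 2008-01 through 2013-12.
rows = [
    {"en_curid": "101", "lang": "en", "name": "Person A", "2008-01": "100", "2008-02": "300"},
    {"en_curid": "101", "lang": "es", "name": "Person A", "2008-01": "50", "2008-02": "150"},
    {"en_curid": "101", "lang": "fr", "name": "Person A", "2008-01": "25", "2008-02": "75"},
    {"en_curid": "102", "lang": "en", "name": "Person B", "2008-01": "10", "2008-02": "20"},
]


def summarize(rows):
    """Total English and non-English page views per individual."""
    summary = defaultdict(lambda: {"english": 0, "non_english": 0})
    for row in rows:
        views = sum(int(value or 0) for key, value in row.items() if MONTH.match(key))
        bucket = "english" if row["lang"] == "en" else "non_english"
        summary[row["en_curid"]][bucket] += views
    return dict(summary)


def monthly_totals(rows, en_curid):
    """Page views per month for one individual, summed over all language editions."""
    totals = defaultdict(int)
    for row in rows:
        if row["en_curid"] == en_curid:
            for key, value in row.items():
                if MONTH.match(key):
                    totals[key] += int(value or 0)
    return dict(totals)


print(summarize(rows))
print(monthly_totals(rows, "101"))
```

The per-month totals returned by `monthly_totals` are the kind of series from which a mean, a standard deviation, and hence a coefficient of variation across time could be computed.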
Technical Validation
Comparison with human accomplishments dataset
We compare the Pantheon 1.0 dataset with the Human Accomplishment (HA) dataset, an independent compilation of 3,869 notable people in the arts and sciences from 800BC to AD 1950 (ref. 5). Unlike Pantheon, HA is based on printed encyclopedias and not online sources, but like Pantheon, HA values the presence of a biography in resources in multiple languages. Since HA is limited to the arts and sciences domains, it does not include politicians like Julius Caesar, religious figures like Jesus, racecar drivers like Ayrton Senna or chess grandmasters like Garry Kasparov.
Nevertheless, we find that our data overlaps significantly with the Human Accomplishment data. The Pantheon dataset contains 1,570 (40%) of the entries available in the Human Accomplishment dataset. The HA dataset is more regionally focused than Pantheon, and we find that many of the individuals in the HA dataset are more locally impactful in their respective geographies, and hence have a presence in fewer languages in Wikipedia.
If we lower the threshold of the Pantheon dataset to include biographies existing in 10 or more languages (L≥10) we would find an overlap of 2,878 biographies, or 74% of the HA dataset.
We also compare the assignment of individuals to their respective occupations within the Pantheon 1.0 dataset with the inventories within the HA dataset.
We note that the HA dataset is based on five inventories (art, science, literature, philosophy, and music). Of these inventories, only the science inventory is disaggregated into additional fields (Chemistry, Biology, Mathematics, Technology, Astronomy, Medicine, Earth Sciences, Physics, and Science, the last for scientists that do not fit in any of the other fields).
The smaller number of categories in HA vis-à-vis Pantheon (13 versus 88) means that we cannot create a one-to-one mapping between both categorization systems. Nevertheless, we map each of the HA categories to its most appropriate counterpart in the Pantheon taxonomy. For instance, we map the individuals in the ‘Medicine’ field in the HA dataset to the ‘Medicine’ industry from the Pantheon 1.0 taxonomy, the individuals in the ‘Chemistry’ field from HA to the ‘Chemist’ occupation in the Pantheon taxonomy, and the individuals in the ‘Literature’ inventory to the ‘Language’ industry in the Pantheon 1.0 taxonomy.
We find that there is an 84% agreement when comparing the assignment of occupations and industries, and a 95% overall agreement between the datasets when we consider the coarser domains. Some examples of discrepancies involve Vladimir Lenin, who is categorized as a philosopher in HA but as a politician in Pantheon, and the photographer Ansel Adams, who is categorized as a scientist in HA but is categorized as a Photographer in Pantheon (within the Fine Arts industry and the Arts domain).
Moreover, we find a positive and significant correlation between the measures of historical impact advanced in both of these datasets.
HA gives individuals a relative score that summarizes their impact on their respective domain. Figure 4 shows the correlation between the measures of historical impact in both Pantheon and HA. The historical impact measures in HA correlate with the number of language editions in Wikipedia (L) with an R2=18% (P-value<2×10^−70) and with the HPI index with an R2=12% (P-value=1.6×10^−44).
Note that unlike Pantheon, HA may classify an individual into multiple domains, with a different score for each one: e.g., Galileo Galilei is classified as an astronomer (with a score of 100 that puts him as the most important astronomer of all time) as well as a physicist (ranking 5th with a score of 83).
Comparison with external measures of accomplishment
Following an approach similar to the one used in Human Accomplishment5 we also compare measures of external accomplishments with the Pantheon dataset.
Unfortunately, many occupations are not characterized by external metrics of accomplishment that we can link to individuals, so we restrict our comparison to occupations where measures of individual accomplishment are available, namely, individual sports. The achievements of individual sportsmen and women can be quantitatively expressed in measures such as the number of championship titles won or points scored.
Here, we focus on Formula-1 drivers, tennis players, swimmers and chess players as case studies that we can use to compare our metrics of global visibility.
1. Formula One racecar drivers
First we examine the subset of the dataset containing the top 56 Formula-1 drivers, according to the number of languages in which they have a presence in Wikipedia.
For each of these drivers we created an additional dataset containing the number of Grand Prix Wins, Championships Won, Podiums (number of times in the top 3), Starts, and a dummy variable for Killed in Action (dummy variables are variables used as statistical controls that take values of zero and one).
These variables are used to construct a statistical model explaining the multilingual presence of each driver within Wikipedia as well as each driver’s Historical Popularity Index. Since Grand Prix Wins, Championships and Podiums are highly collinear, and hence not statistically significant when used together, only Podiums is used in the final model.
Since neither L nor HPI can be negative, we link the fame of biographies to the aforementioned variables using an exponential function of the form:
where x1 is the number of podiums, x2 is the number of starts, and x3 is an indicator for whether the individual was killed in action.
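An exponential model of this kind can be fitted by ordinary least squares after taking the logarithm of the response. The sketch below assumes the form y = exp(b0 + b1*x1 + b2*x2 + b3*x3); the coefficient names, the log-linear fitting procedure, and the synthetic data are all assumptions for illustration, and the fitting method actually used for the reported models may differ.

```python
import math


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_exponential(rows, y):
    """Fit y = exp(b0 + b1*x1 + ...) by least squares on log(y).

    Returns the coefficients [b0, b1, ...] and R^2 computed on the log scale.
    """
    design = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    log_y = [math.log(v) for v in y]
    k = len(design[0])
    # Normal equations: (X^T X) b = X^T log(y)
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * ly for d, ly in zip(design, log_y)) for i in range(k)]
    coefs = solve_linear_system(xtx, xty)
    fitted = [sum(c * v for c, v in zip(coefs, d)) for d in design]
    mean_ly = sum(log_y) / len(log_y)
    ss_res = sum((ly - f) ** 2 for ly, f in zip(log_y, fitted))
    ss_tot = sum((ly - mean_ly) ** 2 for ly in log_y)
    return coefs, 1.0 - ss_res / ss_tot


# Synthetic example: (podiums, starts, killed_in_action) for six hypothetical drivers.
rows = [(10, 50, 0), (32, 120, 0), (5, 30, 1), (80, 200, 0), (20, 60, 1), (106, 300, 0)]
true_b = [3.0, 0.01, -0.002, 0.3]  # assumed coefficients used to generate y
y = [math.exp(true_b[0] + sum(b * x for b, x in zip(true_b[1:], r))) for r in rows]

coefs, r_squared = fit_exponential(rows, y)
print([round(c, 4) for c in coefs], round(r_squared, 4))
```

Because the synthetic response is generated without noise, the fit recovers the assumed coefficients; with real data the share of variance explained would be lower, and an R^2 computed on the log scale is not directly comparable to one computed on the original scale.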
The first model in Fig. 5a explains 54% of the variance in the number of languages in which each Formula-1 driver has a presence in Wikipedia, showing that for Formula-1 drivers the number of languages in Wikipedia accurately tracks accomplishments discounted by time. In contrast, when analyzing the same variables with the Historical Popularity Index, we find a model (see Fig. 5b) that explains 68% of the variance in the Historical Popularity Index for each Formula-1 driver. The improved fit suggests that the corrections introduced by HPI enhance the L metric and contribute an improved characterization of accomplishment for this sample of individuals.
2. Tennis players
Next, we conduct a similar analysis for tennis players. The tennis player subset focuses on the top 52 tennis players according to the number of languages in Wikipedia, augmented with additional data on each individual: the number of weeks he or she spent as number one in the ATP or WTA, the number of Grand Slam wins, the top rank ever obtained, and the player's gender (Female=1, Male=0).
We link the fame of biographies for tennis players to the aforementioned variables using an exponential function of the form:
where x1 is the number of weeks at number one, x2 is the number of Grand Slam wins, x3 is the highest rank attained, and x4 is the dummy for gender.
For the number of language presences in Wikipedia (L), we construct a model which explains 34% of the variance in the multilingual presence of each of these individuals in Wikipedia (Fig. 6a). This shows that once again, the number of languages in Wikipedia is a good proxy for individual accomplishments. When we use HPI, we find an improved model that explains 63% of the variation in HPI (Fig. 6b). This further supports the use of HPI as an appropriate proxy for accomplishment, because HPI tracks the degree of achievement for tennis players better than L.
3. Swimmers
We also conduct a similar analysis considering Olympic swimmers born after 1950 (n=19).
In this case, the model uses the total number of gold medals and gender. We link the fame of swimmers to these variables using an exponential function of the form:
where x1 is the number of gold medals, and x2 indicates gender.
In Fig. 7a, the model explains 74% of the variance observed in the total number of languages in which a swimmer has a presence in Wikipedia, demonstrating that this measure is a good proxy for measuring accomplishment for swimmers.
When we perform the analysis for the Historical Popularity Index, we find that the model explains 50% of the variance observed in the HPI for swimmers. Figure 7b shows this alternative model, which indicates that HPI is also an appropriate proxy for quantifying accomplishment for swimmers, although in this case, HPI is not superior to L.
4. Chess players
Finally, we perform another analysis using all of the 30 individuals classified as chess players in the Pantheon dataset.
In this case, we use data on each individual's highest ELO ranking attained, gender, total games played, and percentage of wins, losses, and draws. We link the fame of chess players to these variables using an exponential function of the form:
where x1 is the highest ELO ranking attained, x2 indicates gender, x3 is the total games played, x4 is the percentage of wins, x5 is the percentage of losses, and x6 is the percentage of draws.
For the number of language presences in Wikipedia (L), we construct a model that explains 37% of the variance in the multilingual presence of each of these individuals in Wikipedia (Fig. 8a). This further supports using the number of languages in Wikipedia as a proxy for individual accomplishments. Using HPI (Fig. 8b), we find a model that explains 53% of the variance in HPI, demonstrating that HPI is an appropriate proxy for accomplishment, with an improved fit for tracking an individual's achievements.
Discussion
We introduced a dataset on historical impact that can be used to study spatial and temporal variations in historical information based on biographies that have a presence in more than 25 language editions of Wikipedia.
This manually verified dataset allowed us to link historical works and events to places and time. To distinguish between biographies with different levels of visibility we introduce two measures of historical impact: the number of languages in which an individual has a presence in Wikipedia (L), and the Historical Popularity Index (HPI).
We validated the Pantheon dataset by comparing it against the Human Accomplishment dataset and also compared our measures of global fame and visibility to external data on the accomplishments of Formula One racecar drivers, tennis players, swimmers, and chess players. In all these cases we find a good match between L, HPI, and the external measures of accomplishment, demonstrating that the measures developed in Pantheon correlate, for these occupations, with historical accomplishments.
While these case studies are not exhaustive across all occupations, they show that the measures introduced are effective metrics for characterizing historical information across diverse sets of domains, time, and geography. Consider a Formula One racecar driver. Certainly, for a Formula One racer the number of Grand Prix won, or Championships, would be a better measure of accomplishment than the number of languages in Wikipedia.
Yet, since Grand Prix won is a metric that applies only to Formula-1 drivers, it cannot be used for basketball players, swimmers, musicians or scientists. While imperfect, the measures based on the online presence of individuals in diverse languages are useful proxies for accomplishment and provide metrics that we can use to compare individuals from different occupations.
Usage Notes
The Pantheon 1.0 dataset enables quantitative analysis of historical information and has already demonstrated use in testing hypotheses related to the role of polyglots in the global dissemination of information1, and the extent of online gender inequality and biases2.
For future analysis, this dataset can motivate a number of potential areas of research investigating the dynamics of historical information across temporal and spatial dimensions. For example, the data can be used in connection with other datasets to empirically assess the connections between economic flourishing and historical information, the dynamics of fame across different domains and geographies, and the dynamics of our species' collective memory.
The data is provided as flat files in tab-separated format, and no additional pre-processing is necessary for users to import the data into a scientific computing environment.
A wide variety of software tools for data visualization and numerical analysis can be used to explore the dataset, including MATLAB, R, the SciPy stack, d3, d3plus, etc. In addition, the data includes a number of fields that can be linked to external datasets, such as standardized country and language codes, and unique individual ids from Wikipedia.
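As an illustration of importing a tab-separated file, the sketch below reads a small in-memory sample with Python's standard csv module and groups records by country code. The column names (name, country_code, occupation, L, HPI) and the sample rows are hypothetical placeholders; the actual field names should be taken from the dataset documentation.

```python
import csv
import io

# Hypothetical sample in tab-separated format; names and values are placeholders.
sample_tsv = (
    "name\tcountry_code\toccupation\tL\tHPI\n"
    "Person A\tIT\tASTRONOMER\t120\t29.5\n"
    "Person B\tAR\tSOCCER PLAYER\t95\t24.1\n"
    "Person C\tIT\tPAINTER\t140\t30.2\n"
)


def load_rows(handle):
    """Read tab-separated records and convert the numeric fields."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = []
    for rec in reader:
        rec["L"] = int(rec["L"])
        rec["HPI"] = float(rec["HPI"])
        rows.append(rec)
    return rows


rows = load_rows(io.StringIO(sample_tsv))

# Group names by country code, e.g. as a first step before joining external data.
by_country = {}
for rec in rows:
    by_country.setdefault(rec["country_code"], []).append(rec["name"])
print(by_country)
```

With a file on disk, the same function can be called on a handle opened with `open(path, newline="", encoding="utf-8")`, where `path` points to the downloaded file.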
We emphasize that future results should be interpreted within the narrow context of the dataset documented, and that analyses of the dataset should include consideration of its biases and limitations.
Additional Information
Table 2 is only available in the online version of this paper.
How to cite this article: Yu, A. Z. et al. Pantheon 1.0, a manually verified dataset of globally famous biographies. Sci. Data 3:150075 doi: 10.1038/sdata.2015.75 (2016).
References
Ronen, S. et al. Links that speak: The global language network and its association with global fame. Proceedings of the National Academy of Sciences 111, E5616–E5622 (2014).
Wagner, C., Garcia, D., Jadidi, M. & Strohmaier, M. It's a Man's Wikipedia? Assessing Gender Inequality in an Online Encyclopedia. arXiv:1501.06307 (2015).
McLuhan, M. Understanding Media: The Extensions of Man (MIT Press, 1964).
Eisenstein, E. L. The printing press as an agent of change (Cambridge University Press, 1979).
Murray, C. Human Accomplishment (Harper Collins, 2003).
Michel, J.-B. et al. Quantitative Analysis of Culture Using Millions of Digitized Books. Science 331, 176–182 (2011).
Popescu, A. & Grefenstette, G. Spatiotemporal mapping of Wikipedia concepts. Proceedings of the 10th annual joint conference on Digital libraries 129–138 (2010).
Skiena, S. & Ward, C. Who's Bigger? Where Historical Figures Really Rank (Cambridge University Press, 2013).
Schich, M. et al. A network framework of cultural history. Science 345, 558–562 (2014).
Giles, J. Internet encyclopedias go head to head. Nature 438, 900–901 (2005).
Spinellis, D. & Louridas, P. The collaborative organization of knowledge. Communications of the ACM 51, 68–73 (2008).
Hedden, H. Taxonomies and controlled vocabularies best practices for metadata. Journal of Digital Asset Management 6, 279–284 (2010).
Abel, G. & Sander, N. Quantifying Global International Migration Flows. Science 343, 1520–1522 (2014).
Ford, H. in Critical Point of View: A Wikipedia Reader, 258–268 (Institute of Network Cultures, 2011).
Brown, A. R. Wikipedia as a Data Source for Political Scientists: Accuracy and Completeness of Coverage. PS: Political Science and Politics 44, 339–343 (2011).
Royal, C. & Kapila, D. What's on Wikipedia, and What's Not...? Social Science Computer Review 27, 138–148 (2009).
UNESCO. The 2009 UNESCO Framework for Cultural Statistics (UNESCO Institute for Statistics, 2009).
Data Citations
Yu, A. Z., Ronen, S., Hu, K., Lu, T., & Hidalgo, C. Harvard Dataverse (2014)
Acknowledgements
We are grateful to Defne Gurel, Francine Loza, Daniel Smilkov, and Deepak Jagdish for their contributions and feedback during the data verification and cleaning process. We also want to thank Ethan Zuckerman and Alex Lex for their feedback and comments over the course of this project.
This research is supported by funding from the MIT Media Lab Consortia and the Metaknowledge network at the University of Chicago.
Author information
Authors and Affiliations
Macro Connections, MIT Media Lab, Cambridge, 02139, Massachusetts, USA
Amy Zhao Yu, Shahar Ronen, Kevin Hu, Tiffany Lu & César A. Hidalgo
Contributions
A.Z.Y. collected, verified, and compared the data, and wrote the manuscript. S.R. contributed to data collection, cleaning, and comparisons, and edited the manuscript. K.H. contributed to data cleaning, and supplemented the dataset with pageviews data. T.L. contributed to the initial scripts for scraping Freebase and Wikipedia.
C.A.H. conceived the study, verified and compared the data, and wrote the manuscript.
Corresponding authors
Correspondence to Amy Zhao Yu or César A. Hidalgo.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
ISA-Tab metadata
Rights and permissions
This work is licensed under a Creative Commons Attribution 4.0 International License.
The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material.
To view a copy of this license, visit
Metadata associated with this Data Descriptor is available at and is released under the CC0 waiver to maximize reuse.