Cleaning up Countries

Countries on EN DBpedia

PREFIX dbo: <http://dbpedia.org/ontology/>
select (count(*) as ?n) {
  ?country a dbo:Country}

returns 1694 countries, far more than the roughly 200 that actually exist. Obviously this won't do.
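To repeat the count from a script, something like the following Python sketch could be used. It assumes the public endpoint at https://dbpedia.org/sparql (not named on this page) and the standard SPARQL JSON results format. The names parse_count and count_countries are made up for illustration, and the live number will likely differ from 1694 as DBpedia is updated.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://dbpedia.org/sparql"  # assumption: public DBpedia endpoint

COUNT_QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT (COUNT(*) AS ?n) { ?country a dbo:Country }
"""


def parse_count(results_json: str, var: str = "n") -> int:
    """Extract a single integer binding from a SPARQL JSON results document."""
    bindings = json.loads(results_json)["results"]["bindings"]
    return int(bindings[0][var]["value"])


def count_countries(endpoint: str = ENDPOINT) -> int:
    """Send COUNT_QUERY to the endpoint (needs network access)."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": COUNT_QUERY})
    request = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"})
    with urllib.request.urlopen(request) as response:
        return parse_count(response.read().decode("utf-8"))
```

Only count_countries touches the network; parse_count can be checked offline against a saved response.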

Let's analyze the reasons for this pollution and try to fix it.

Countries on EN Wikipedia

There are 638 Infobox_country instances, most of which are international organizations (free trade zones, unions, etc.). On the other hand, the template's transclusion count is 1137.

The following templates redirect to Template:Infobox country:

  • Template:Infobox Countries
  • Template:Infobox Country
  • Template:Infobox Country or territory
  • Template:Infobox Geopolitical organisation
  • Template:Infobox Geopolitical organization
  • Template:Infobox Micronation
  • Template:Infobox geopolitical organisation
  • Template:Infobox geopolitical organization
  • Template:Infobox micronation
  • Template:Infobox nation

DBpedia works only with the target template (Infobox_country), so these redirects are not a problem.
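As an illustration of what resolving to the target template means, here is a minimal Python sketch built from the redirect list above. It is not DBpedia's actual code. The functions normalize and resolve are hypothetical, and the normalization rules (underscore equals space, first letter case-insensitive) are an assumption about how Wikipedia treats template titles.

```python
TARGET = "Infobox country"

# Redirects listed above, all pointing at Template:Infobox country.
REDIRECTS = {
    "Infobox Countries",
    "Infobox Country",
    "Infobox Country or territory",
    "Infobox Geopolitical organisation",
    "Infobox Geopolitical organization",
    "Infobox Micronation",
    "Infobox geopolitical organisation",
    "Infobox geopolitical organization",
    "Infobox micronation",
    "Infobox nation",
}


def normalize(name: str) -> str:
    """Treat '_' as ' ', drop a 'Template:' prefix, uppercase the first letter."""
    name = name.strip().replace("_", " ")
    if name.lower().startswith("template:"):
        name = name[len("template:"):].strip()
    return name[:1].upper() + name[1:]


def resolve(name: str) -> str:
    """Return the redirect target, or the normalized name if it is not a redirect."""
    name = normalize(name)
    return TARGET if name in REDIRECTS else name
```

With this, resolve("Template:Infobox Micronation") and resolve("Infobox_country") both give "Infobox country", while a template outside the list, such as "Infobox national basketball team", is returned unchanged.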

Analysis Needed

  • Why are there more dbo:Country instances in DBpedia than uses of "Infobox country" in Wikipedia?
  • Why does the "transclusion count" show more than the number of "Infobox country" uses?
  • Many pages have a "type" that allows us to map to a better class (eg https://en.wikipedia.org/wiki/Eurasian_Economic_Space has Type "Single market"). Eg #296 "Why Infobox_Geopolitical_organization (eg United_Nations) is mapped to Country" is resolved in this way.
  • But many other instances remain, eg United_Nations_Transitional_Authority_in_Cambodia: analyze them and see if there's some discriminator other than "type".
  • How to filter out the sports organizations, eg Cricket_Samoa, IBA_Asia, Great_Britain_men%27s_national_basketball_team?
  • How about country-related articles, eg Radio_in_the_United_States, Human_trafficking_in_the_United_States, History_of_the_Jews_in_20th-century_Poland?
  • How about articles that are not even country-related, eg Comic_book_collecting, Record_collecting?
  • How about admin locations (eg cities) that are not countries, eg Russian_Dalian?
  • How about non-administrative locations, eg Reñaca_beach?
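One possible starting point for these questions is to tally which templates sit behind the pages typed dbo:Country. The Python sketch below assumes a hand-built mapping from page title to the templates it uses (after redirect resolution). The SAMPLE data only repeats two examples from this page, and the function name is made up.

```python
from collections import Counter

# Assumption: title -> templates used on that page, collected by hand.
SAMPLE = {
    # uses Infobox Geopolitical organization, a redirect to Infobox country
    "United_Nations": ["Infobox country"],
    "Great_Britain_men%27s_national_basketball_team": [
        "Infobox national basketball team"],
}


def templates_behind_type(pages):
    """Count, per template, how many of the given pages use it."""
    counts = Counter()
    for templates in pages.values():
        counts.update(set(templates))  # count each template once per page
    return counts
```

Templates other than "Infobox country" that show up with a high count would be the first candidates to investigate.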

Infobox national basketball team

Great_Britain_men%27s_national_basketball_team uses the template Infobox national basketball team, which is not mapped in the wiki. So why does it come out as dbo:Country?

The extraction sample doesn't show any type: http://mappings.dbpedia.org/server/extraction/en/extract?title=Great_Britain_men%27s_national_basketball_team&format=turtle-triples
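To check other suspicious pages the same way, the extraction URL can be built from a page title. This is a small Python sketch that reuses the URL pattern shown above; the function name extraction_url is made up, and whether the server still answers at that address is not verified here.

```python
from urllib.parse import urlencode

# URL pattern taken from the extraction sample link above.
EXTRACT_BASE = "http://mappings.dbpedia.org/server/extraction/en/extract"


def extraction_url(title: str, fmt: str = "turtle-triples") -> str:
    """Build the extraction-sample URL for a page title (percent-encoded)."""
    return EXTRACT_BASE + "?" + urlencode({"title": title, "format": fmt})
```

For example, extraction_url("Great_Britain_men's_national_basketball_team") encodes the apostrophe as %27 and reproduces the link above.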