Cleaning up Countries
Countries on EN DBpedia
select (count(*) as ?count) { ?country a dbo:Country }
returns 1694 countries, far more than the roughly 200 countries in the world. Obviously this won't do.
Let's analyze the reasons for this pollution and try to fix it.
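As a first step, it may help to see which templates the pages behind these dbo:Country instances actually use. The query below is a minimal sketch, not a tested recipe: it assumes the endpoint exposes the dbp:wikiPageUsesTemplate property, which not every DBpedia release or extraction includes.

```sparql
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>

# Count dbo:Country instances per template used on the source page.
# Assumption: dbp:wikiPageUsesTemplate is present in the loaded dataset.
SELECT ?template (COUNT(DISTINCT ?country) AS ?count) {
  ?country a dbo:Country ;
           dbp:wikiPageUsesTemplate ?template .
}
GROUP BY ?template
ORDER BY DESC(?count)
```

Templates other than Infobox_country that appear near the top of the result are candidates for a wrong or missing mapping.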
Countries on EN Wikipedia
There are 638 Infobox_country instances, most of which are international organizations (free trade zones, unions, etc). The transclusion count, on the other hand, says 1137.
The following templates redirect to Template:Infobox country:
- Template:Infobox Countries
- Template:Infobox Country
- Template:Infobox Country or territory
- Template:Infobox Geopolitical organisation
- Template:Infobox Geopolitical organization
- Template:Infobox Micronation
- Template:Infobox geopolitical organisation
- Template:Infobox geopolitical organization
- Template:Infobox micronation
- Template:Infobox nation
DBpedia works only with the target template (Infobox_country), so these redirects are not a problem.
Analysis Needed
- Why are there more dbo:Country instances in DBpedia than "Infobox country" uses in Wikipedia?
- Why does the "transclusion count" (1137) exceed the number of "Infobox country" instances (638)?
- Many pages have a "type" field that lets us map to a better class (eg https://en.wikipedia.org/wiki/Eurasian_Economic_Space has Type "Single market"). Eg #296 "Why Infobox_Geopolitical_organization (eg United_Nations) is mapped to Country" is resolved in this way.
- But many other instances remain, eg United_Nations_Transitional_Authority_in_Cambodia: analyze them and see whether there is some discriminator other than "type".
- How to filter out the sports organizations, eg Cricket_Samoa, IBA_Asia, Great_Britain_men%27s_national_basketball_team?
- How about country-related articles, eg Radio_in_the_United_States, Human_trafficking_in_the_United_States, History_of_the_Jews_in_20th-century_Poland?
- How about articles that are not even country-related, eg Comic_book_collecting, Record_collecting?
- How about administrative locations (eg cities) that are not countries, eg Russian_Dalian?
- How about non-administrative locations, eg Reñaca_beach?
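Several of the questions above come down to finding dbo:Country instances whose page does not use Infobox country at all. A hedged sketch of such a query follows. It assumes that dbp:wikiPageUsesTemplate is available and that templates live under the http://dbpedia.org/resource/Template: namespace (the dbt: prefix); both may differ between DBpedia releases.

```sparql
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
PREFIX dbt: <http://dbpedia.org/resource/Template:>

# List dbo:Country instances whose page does not use Infobox_country.
# Assumption: template usage is recorded with dbp:wikiPageUsesTemplate
# and always points at the redirect target (Infobox_country).
SELECT ?country {
  ?country a dbo:Country .
  FILTER NOT EXISTS {
    ?country dbp:wikiPageUsesTemplate dbt:Infobox_country
  }
}
ORDER BY ?country
```

Entries such as sports teams, "X in country Y" articles and beaches would be expected to show up in this list if they got their type from some other template.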
Infobox national basketball team
Great_Britain_men%27s_national_basketball_team uses the template Infobox national basketball team, which is not mapped in the wiki. So why does it come out as dbo:Country?
The extraction sample doesn't have any type... http://mappings.dbpedia.org/server/extraction/en/extract?title=Great_Britain_men%27s_national_basketball_team&format=turtle-triples
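To check whether this affects one page or every page using the template, one could count the ontology types assigned to all resources that use it. This is only a sketch under the same assumptions as before: dbp:wikiPageUsesTemplate must be available, and the dbt: template namespace is assumed rather than confirmed for the release in question.

```sparql
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
PREFIX dbt: <http://dbpedia.org/resource/Template:>

# Count DBpedia ontology types of pages using the basketball team infobox.
# Assumption: the template IRI follows the dbt: naming convention.
SELECT ?type (COUNT(DISTINCT ?team) AS ?count) {
  ?team dbp:wikiPageUsesTemplate dbt:Infobox_national_basketball_team ;
        a ?type .
  # Keep only types from the DBpedia ontology namespace.
  FILTER strstarts(str(?type), str(dbo:))
}
GROUP BY ?type
ORDER BY DESC(?count)
```

If dbo:Country appears for only a few of these pages, the type probably comes from a second template on those pages rather than from the basketball infobox itself.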