Rewriting templateProperty: Difference between revisions

(14 intermediate revisions by the same user not shown)
Line 1: Line 1:
'''WARNING: Work in progress. The content of this page is largely incorrect.'''
TODO: Make an EN example using Infobox Politician
--[[User:VladimirAlexiev|VladimirAlexiev]] 12:12, 25 February 2015 (UTC)
Line 4: Line 6:
== Intro ==
The basic way the extractor works is like this:
* data is extracted from template props
* these are emitted as language-specific '''raw''' props, eg
** http://dbpedia.org/property/parent for EN (usual prefix [http://prefix.cc/dbp dbp:])
** http://bg.dbpedia.org/property/родител for BG (usual prefix [http://prefix.cc/bgdbp bgdbp:])
<s>* the raw data is passed through mappings templateProperty -> ontologyProperty</s>
 
<s>You'd think that templateProperty is the same as the raw prop name. Yeah but not always.</s>
 
The last part (''data is passed through mappings'') is wrong. The mapping based extractor processes the Wikitext source, '''not''' the output of the InfoboxExtractor. A pipeline architecture would make a lot of sense, but that's not how DBpedia works. [[User:Chrisahn|Chrisahn]] 17:54, 25 February 2015 (UTC)
 
Here's what actually happens:
 
* Wikitext is parsed into an AST (abstract syntax tree)
* The AST is passed to several different extractors according to the configuration
* Each extractor processes the AST and produces triples
* The triples are not used as input for any other extractors.
 
Here's what the [http://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/mappings/InfoboxExtractor.scala InfoboxExtractor] does:
 
* data is extracted from template props in the AST
* these are emitted as language-specific '''raw''' props, eg
** http://dbpedia.org/property/parent for EN (usual prefix [http://prefix.cc/dbp dbp:])
** http://bg.dbpedia.org/property/родител for BG (usual prefix [http://prefix.cc/bgdbp bgdbp:])


You'd think that templateProperty is the same as the raw prop name. Yeah but not always.
Here's what the [http://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/mappings/MappingExtractor.scala MappingExtractor] does:
 
* data is extracted from template props in the AST and passed through mappings templateProperty -> ontologyProperty
* these are emitted as generic mapping-based props, eg
** http://dbpedia.org/ontology/parent for EN, BG and any other language (usual prefix dbo:)


== Wikipedia Prop Structures ==
Line 67: Line 92:
</pre>
</pre>
What the extractor '''really''' does is:
: No it doesn't. See above. [[User:Chrisahn|Chrisahn]] 18:06, 25 February 2015 (UTC)
* Takes data from the templateProperty provided (as expected)
* Strips parasitic prefixes & suffixes from the templateProperty (maybe unexpected) and converts to camelCase

Latest revision as of 21:44, 25 February 2015

WARNING: Work in progress. The content of this page is largely incorrect.

TODO: Make an EN example using Infobox Politician --VladimirAlexiev 12:12, 25 February 2015 (UTC)

Intro

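The basic way the extractor works is like this:

  • data is extracted from template props
  • these are emitted as language-specific raw props, eg
      ◦ http://dbpedia.org/property/parent for EN (usual prefix dbp:)
      ◦ http://bg.dbpedia.org/property/родител for BG (usual prefix bgdbp:)
  • the raw data is passed through mappings templateProperty -> ontologyProperty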

You'd think that templateProperty is the same as the raw prop name. Yeah but not always.

The last part (data is passed through mappings) is wrong. The mapping based extractor processes the Wikitext source, not the output of the InfoboxExtractor. A pipeline architecture would make a lot of sense, but that's not how DBpedia works. Chrisahn 17:54, 25 February 2015 (UTC)

Here's what actually happens:

  • Wikitext is parsed into an AST (abstract syntax tree)
  • The AST is passed to several different extractors according to the configuration
  • Each extractor processes the AST and produces triples
  • The triples are not used as input for any other extractors.
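
The fan-out described in this list can be sketched as a toy model in Python. This is only an illustration: the real framework is written in Scala, and every name here (`raw_extractor`, `mapping_extractor`, `run`, the dict standing in for the AST, the one-entry mapping table) is hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]
# Hypothetical stand-in for the AST: template property name -> value.
Ast = Dict[str, str]
Extractor = Callable[[str, Ast], List[Triple]]


def raw_extractor(subject: str, ast: Ast) -> List[Triple]:
    # Toy version: emit one "raw" triple per template property.
    return [(subject, "dbp:" + prop, value) for prop, value in ast.items()]


def mapping_extractor(subject: str, ast: Ast) -> List[Triple]:
    # Reads the same AST and applies a (hypothetical) mapping table.
    mappings = {"parent": "dbo:parent"}
    return [(subject, mappings[prop], value)
            for prop, value in ast.items() if prop in mappings]


def run(subject: str, ast: Ast, extractors: List[Extractor]) -> List[Triple]:
    triples: List[Triple] = []
    for extract in extractors:
        # Every extractor receives the AST, never another extractor's triples.
        triples.extend(extract(subject, ast))
    return triples
```

The point of the sketch is that `mapping_extractor` never sees the output of `raw_extractor`; both read the same parsed input independently.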

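Here's what the InfoboxExtractor does:

  • data is extracted from template props in the AST
  • these are emitted as language-specific raw props, eg
      ◦ http://dbpedia.org/property/parent for EN (usual prefix dbp:)
      ◦ http://bg.dbpedia.org/property/родител for BG (usual prefix bgdbp:)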

Here's what the MappingExtractor does:

  • data is extracted from template props in the AST and passed through mappings templateProperty -> ontologyProperty
  • these are emitted as generic mapping-based props, eg
      ◦ http://dbpedia.org/ontology/parent for EN, BG and any other language (usual prefix dbo:)

Wikipedia Prop Structures

Many Wikipedia templates allow creating several instances of something. Eg Listen allows a Wikipedia editor to attach up to 11 soundRecording to the subject, using "parallel" arrays of properties:

  • filename, filename1... filename10
  • title, title1... title10
  • description, description1... description10

The parallelism is reflected in a numeric suffix.

Good maps take care of this, by grouping the "parallel props" in separate IntermediateNodeMappings or a similar structure that can produce an "array". Eg mapping Listen has this 11 times:

  {{IntermediateNodeMapping | nodeClass = Sound | correspondingProperty = soundRecording | mappings =
    {{ PropertyMapping | templateProperty = type          | ontologyProperty = dc:type }}
    {{ PropertyMapping | templateProperty = filename1     | ontologyProperty = filename }}
    {{ PropertyMapping | templateProperty = title1        | ontologyProperty = title }}
    {{ PropertyMapping | templateProperty = description1  | ontologyProperty = description }}
  }}
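
To make the "parallel arrays" idea concrete, here is a small Python sketch that groups template props by their numeric suffix, which is roughly what the eleven hand-written IntermediateNodeMappings do for Listen. The function name `group_by_suffix` and the sample values are made up for illustration; this is not code from the extraction framework.

```python
import re
from collections import defaultdict
from typing import Dict


def group_by_suffix(props: Dict[str, str]) -> Dict[str, Dict[str, str]]:
    """Group props such as filename1/title1 under their shared numeric suffix."""
    groups: Dict[str, Dict[str, str]] = defaultdict(dict)
    for name, value in props.items():
        # Lazy base + greedy trailing digits: "filename10" -> ("filename", "10").
        match = re.fullmatch(r"(.*?)(\d*)", name)
        if match is None:  # only possible if the name contains a newline
            continue
        base, suffix = match.group(1), match.group(2)
        groups[suffix][base] = value
    return dict(groups)
```

Props without a suffix land in the group keyed by the empty string, so `filename` and `title` stay together just like `filename1` and `title1` do.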

Now consider Politicians. They may hold several Positions, each over several Mandates (they are nasty that way). For each Position>Mandate (say 5*3=15), there's a bunch of props such as party, predecessor, successor, colleagues (eg vicePresident, governor...), years the subject came to that position, years the colleagues came to their respective positions, etc.

Eg see prop names of Държавник_инфо, but that's not the complete story: there's also трети_мандат_* ("third mandate" fields) etc.

  • If the 2D data arrays below the photos of Rosen Plevneliev and Angela Merkel don't strike your fancy, check out one of them Socialists that ruled for 40 years: Тодор_Живков

See a full list of props and an incomplete attempt to group them all at Mapping Държавник_инфо.

Wikipedia editors were at a loss to create meaningful two-dimensional parallel arrays of names, so they created parasitic prefixes & suffixes that are not so easy to match up. Eg there are 10 props "предшестванОт", all mapped to "predecessor" but in different groups:

 предшестван от
 предшестван от2
 предшестван от3
 втори_мандат_предшестван от
 втори_мандат_предшестван от2
 втори_мандат_предшестван от3
 трети_мандат_предшестван от
 ...
  • The prefixes may have any form
  • The suffixes are digits, optionally followed by letters
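
Because the prefixes are free-form, a prop name can only be split reliably if the base name is already known. The Python sketch below assumes exactly that: the caller supplies the base, and the suffix is taken to be "digits, optionally followed by letters" as stated above. `split_prop` and `SUFFIX` are hypothetical names used only for this illustration.

```python
import re
from typing import Optional, Tuple

# Digits, optionally followed by (Unicode) letters, at the end of the name.
SUFFIX = re.compile(r"\d+[^\W\d_]*$")


def split_prop(name: str, base: str) -> Optional[Tuple[str, str, str]]:
    """Split a template prop into (prefix, base, suffix), or None if base is absent."""
    suffix_match = SUFFIX.search(name)
    suffix = suffix_match.group(0) if suffix_match else ""
    stem = name[: len(name) - len(suffix)]
    if not stem.endswith(base):
        return None
    # Whatever precedes the base is treated as the free-form prefix.
    return stem[: len(stem) - len(base)], base, suffix
```

For example, `split_prop("втори_мандат_предшестван от3", "предшестван от")` separates the "second mandate" prefix from the base and the suffix `3`.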

Rewriting templateProperty

The parasitic prefixes/suffixes encode important info about the grouping of props, but that info is not transmitted in any clear way.

Assume a mapping fragment like this, extracting data for resource bgdbr:Тодор_Живков

{{IntermediateNodeMapping | nodeClass = CareerStation | correspondingProperty = careerStation | mappings =
    {{ PropertyMapping | templateProperty = втори_мандат_предшестван от3 | ontologyProperty = predecessor }}
}}

What the extractor really does is:

No it doesn't. See above. Chrisahn 18:06, 25 February 2015 (UTC)
  • Takes data from the templateProperty provided (as expected)
  • Strips parasitic prefixes & suffixes from the templateProperty (maybe unexpected) and converts to camelCase
  • Emits the data using the original subject and this rewritten templateProperty, eg:
     bgdbr:Тодор_Живков bgdbp:предшестванОт 
  • Makes an IntermediateNode and connects it with correspondingProperty (as expected), eg:
     bgdbr:Тодор_Живков dbo:careerStation bgdbr:Тодор_Живков__1
  • Emits the data using the IntermediateNode and the ontologyProperty as provided (as expected), eg:
     bgdbr:Тодор_Живков__1 dbo:predecessor 

This achieves several goals:

  • the general semantics of the raw property is preserved, but not its grouping
  • the grouping is preserved by the creation of IntermediateNodes that use mapped properties (if the mapping is good)

This allows you to make queries such as:

  • all predecessors of Тодор_Живков lumped together (regardless of the position). This works even if these raw props are not mapped!
     select * {bgdbr:Тодор_Живков bgdbp:предшестванОт ?pred}
  • all predecessors of Тодор_Живков, paired with successors, and the corresponding position name (office). (Note: you may want to throw in some OPTIONALs)
     select * {bgdbr:Тодор_Живков dbo:careerStation
       [dbo:predecessor ?pred; dbo:successor ?succ; dbo:office ?office]}

Neat!

NOTE: Currently only purely numeric parasitic suffixes are stripped. Prefixes and alphanumeric suffixes would be stripped after issue #317 is implemented.
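
As a rough illustration of the rewriting this section describes (and which is disputed in the comments on this page), here is a Python sketch that strips a purely numeric suffix and converts the remainder to camelCase. It is an assumption-laden toy, not the framework's code: the function name `rewrite_raw_name` is invented, and splitting words on whitespace and underscores is a guess at what "converts to camelCase" means here.

```python
import re


def rewrite_raw_name(template_property: str) -> str:
    """Strip a trailing run of digits, then camelCase the remaining words."""
    # Only purely numeric suffixes are removed; prefixes are left alone.
    stripped = re.sub(r"\d+$", "", template_property)
    # Assumption: words are separated by whitespace or underscores.
    words = [w for w in re.split(r"[\s_]+", stripped) if w]
    if not words:
        return ""
    return words[0] + "".join(w[0].upper() + w[1:] for w in words[1:])
```

Under these assumptions `предшестван от3` becomes `предшестванОт`, while `втори_мандат_предшестван от3` keeps its prefix and becomes `вториМандатПредшестванОт`.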