Wikidata:Pywikibot - Python 3 Tutorial/Big Data
This chapter introduces the concept of gathering data from more than one Wikidata item. As you can probably guess, iterating over more than 10 million items is extremely inefficient. We therefore need a way to pre-select a subset of all items.
Introduction
To follow along with the next few examples you should understand generators. A generator acts much like a list in a for-loop: instead of iterating over the items of a list, the for-loop iterates over each item that the generator returns.
The examples on this page can easily be combined with the examples from the previous chapters, for instance to query certain statements inside the for-loop or to write functions that save the data to disk.
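As a quick refresher, here is a minimal sketch of a generator that has nothing to do with Pywikibot yet. The function name count_up_to is made up for this illustration; it only shows that a for-loop (or a list comprehension) consumes a generator the same way it would consume a list.

```python
def count_up_to(limit):
    """Yield the integers 1..limit one at a time instead of building a list."""
    number = 1
    while number <= limit:
        yield number
        number += 1


# A for-loop consumes a generator exactly like it would consume a list.
squares = [n * n for n in count_up_to(5)]
print(squares)
```

The values are produced one by one as the loop asks for them, which is why generators work well for result sets that are too large to load at once.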
Selecting Items by Template Usage
One way to iterate over a subset of items is to select them by the usage of a template on Wikipedia. To do this, we write a generator that returns each page for us to iterate over. The example looks at the usage of Template:Infobox meteorite (Q6037522) on en-wiki, but you can replace the string with another template. It is difficult to separate the parts of the example, so read it, run it, and then we will discuss some of the new things:
import pywikibot
from pywikibot import pagegenerators as pg

def list_template_usage(site_obj, tmpl_name):
    """
    Takes Site object and template name and returns a generator.

    The function expects a Site object (pywikibot.Site()) and
    a template name (String). It creates a list of all
    pages using that template and returns them as a generator.
    The generator will load 50 pages at a time for iteration.
    """
    name = "{}:{}".format(site_obj.namespace(10), tmpl_name)
    tmpl_page = pywikibot.Page(site_obj, name)
    ref_gen = tmpl_page.getReferences(follow_redirects=False)
    filter_gen = pg.NamespaceFilterPageGenerator(ref_gen, namespaces=[0])
    generator = site_obj.preloadpages(filter_gen, pageprops=True)
    return generator

site = pywikibot.Site("en", 'wikipedia')
tmpl_gen = list_template_usage(site, "Infobox meteorite")
for page in tmpl_gen:
    item = pywikibot.ItemPage.fromPage(page)
    print(page.title(), item.getID())
The first line that is executed gets the Site object of the English Wikipedia. The second already calls the function that returns the generator. The function takes two arguments and is therefore sufficiently flexible to handle any language of Wikipedia and different templates. Notice that we don't write "Template:Infobox meteorite". The namespace is added in the function itself.
Within the list_template_usage() function we first construct the string consisting of namespace + template-name. The namespace is queried from the site object (site_obj.namespace(10) returns "Template"). Next we need to get the Page object of the template page, passing the Site object and the template name.
Once we have the template Page object we get the referring pages generator (returns a PageGenerator object). We then pass this to the NamespaceFilterPageGenerator (namespace 0 is the main namespace, whose name is "", an empty string, and in which Wikipedia articles reside) and finally to the preloadpages generator, which is returned by the function. These lines are more advanced; to find out more about them, read the source in pywikibot/site.py and pywikibot/pagegenerators.py.
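The docstring above mentions that pages are loaded 50 at a time. The following is a minimal sketch of that batching idea in plain Python. It is not how Pywikibot implements preloadpages; the helper name in_batches and the made-up page titles are assumptions chosen only to show how a generator can hand out fixed-size groups until its input runs out.

```python
from itertools import islice


def in_batches(iterable, batch_size=50):
    """Yield lists of up to batch_size items until the iterable is exhausted."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# 120 hypothetical page titles stand in for the pages returned by the wiki.
titles = ["Page {}".format(i) for i in range(120)]
batch_sizes = [len(batch) for batch in in_batches(titles)]
print(batch_sizes)
```

Fetching in groups keeps the number of requests low while still never holding the complete result set in memory.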
Finally we use the tmpl_gen variable that stores the generator to start a for-loop. The for-loop will get 50 pages at a time, iterate over them and then ask the generator for the next batch of pages until the generator yields no more pages. The print statement we put in the for-loop will output the following:
Retrieving 50 pages from wikipedia:en.
Wold Cottage (meteorite) Q4053207
Allan Hills 84001 Q47580
Campo del Cielo Q1031478
Sayh al Uhaymir 169 Q2228546
Sikhote-Alin meteorite Q652204
...
... total of 50 objects ...
Retrieving 50 pages from wikipedia:en.
Gao–Guenie meteorite Q176241
Pallasovka (meteorite) Q7127754
We can see that this is a powerful and easy way to preselect items for querying Wikidata.
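Instead of printing each title and ID, you may want to keep them. Here is a minimal sketch of a function that saves such pairs to disk, assuming a CSV file is an acceptable format. The function name save_rows, the column names and the file name meteorites.csv are made up for this illustration; the two sample rows are copied from the output shown above.

```python
import csv
import os
import tempfile


def save_rows(path, rows):
    """Write (title, QID) pairs to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(["title", "qid"])
        writer.writerows(rows)


# Two pairs copied from the sample output of the template example.
rows = [("Allan Hills 84001", "Q47580"), ("Campo del Cielo", "Q1031478")]
out_path = os.path.join(tempfile.gettempdir(), "meteorites.csv")
save_rows(out_path, rows)

# Read the file back to confirm what was written.
with open(out_path, newline="", encoding="utf-8") as csv_file:
    saved = list(csv.reader(csv_file))
print(saved)
```

In a real bot you would collect (page.title(), item.getID()) inside the for-loop and pass the resulting list to the function once the loop has finished.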
Selecting Item by Wikidata Statement
Pywikibot also lets you select items by statement. This can be done using a SPARQL query. As an example, we will look at all the items that have a pKa (P1117) value set.
First of all, we need to build the query and check if we get the correct results. I created this query:
#Items that have a pKa value set
SELECT ?item ?value
WHERE
{
?item wdt:P1117 ?value .
}
Attention: When you build your own query, note that ?item is currently the only variable allowed by Pywikibot for selecting items. Likewise, ?itemLabel is the only allowed variable to select labels.
This currently yields around 200 results.
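Because the generator relies on the ?item variable, it can help to check a query before handing it to Pywikibot. The following is a rough sketch of such a check. The helper selects_item_variable is hypothetical and not part of Pywikibot, and it only inspects the text of the SELECT clause; it is not a SPARQL parser.

```python
import re

PKA_QUERY = """#Items that have a pKa value set
SELECT ?item ?value
WHERE
{
  ?item wdt:P1117 ?value .
}
"""


def selects_item_variable(query):
    """Return True if the SELECT clause of a SPARQL query mentions ?item."""
    # Drop comment lines so text after '#' is not mistaken for the query.
    lines = [ln for ln in query.splitlines() if not ln.lstrip().startswith("#")]
    text = " ".join(lines)
    # Capture everything between SELECT and the start of the WHERE block.
    match = re.search(r"SELECT\b(.*?)(?:\bWHERE\b|\{)", text, flags=re.IGNORECASE)
    if match is None:
        return False
    variables = re.findall(r"\?(\w+)", match.group(1))
    return "item" in variables


print(selects_item_variable(PKA_QUERY))
```

A query that selects only a differently named variable, such as ?compound, would fail this check and should be rewritten to use ?item.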
In the next step, we can copy the query into a file named pka-query.rq in our project directory (.rq is the file extension for SPARQL queries).
Loading the query in the script is straightforward, and the following snippet shows how to call the generator and iterate over the items:
#!/usr/bin/python3
import pywikibot
from pywikibot import pagegenerators as pg

with open('pka-query.rq', 'r') as query_file:
    QUERY = query_file.read()

wikidata_site = pywikibot.Site("wikidata", "wikidata")
generator = pg.WikidataSPARQLPageGenerator(QUERY, site=wikidata_site)

for item in generator:
    print(item)
This will output a list of each item selected as ?item.
Conclusion
The goal of this chapter was to teach you how to iterate over Wikidata in a more intelligent way than going from Universe (Q1) all the way up to the most recent item. Try to keep this selection logic in a separate function, so that you can adapt your bot to a different use case without changing too many lines of code.
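One possible way to structure this separation is sketched below. The function names select_by_sparql and run_bot are made up for this illustration; only the Pywikibot calls inside select_by_sparql are taken from the SPARQL example on this page. The imports sit inside the function so that the dry run at the bottom works without touching the network.

```python
def select_by_sparql(query_path):
    """Return a generator of items matching the SPARQL query in query_path."""
    # Imported lazily so the dry run below does not need Pywikibot.
    import pywikibot
    from pywikibot import pagegenerators as pg

    with open(query_path, "r") as query_file:
        query = query_file.read()
    site = pywikibot.Site("wikidata", "wikidata")
    return pg.WikidataSPARQLPageGenerator(query, site=site)


def run_bot(items, handle_item):
    """Apply handle_item to every selected item and return the number processed."""
    processed = 0
    for item in items:
        handle_item(item)
        processed += 1
    return processed


# Dry run with a stand-in selection; swap in select_by_sparql("pka-query.rq")
# to process the real query results.
seen = []
count = run_bot(["Q1", "Q2", "Q3"], seen.append)
print(count, seen)
```

Switching the bot from a SPARQL selection to a template-based one then only means passing a different generator to run_bot, while the handling code stays untouched.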