The DBLP Graph

This folder contains scripts that download DBLP and build its full graph using NetworkDisk.

A link to a recent version of the graph can be found here.

Manipulating the graph

Basic Manipulation

To load the graph, assuming the downloaded version is at the location examples/test/dblp.db:

>>> import networkx as nx
>>> import networkdisk as nd
>>> dblp_path = "examples/test/dblp.db"
>>> dblp = nd.sqlite.Graph(db=dblp_path)
>>> Knuth = dblp.find_one_node("name", name="Donald E. Knuth")
>>> # The positional argument "name" means: the node must have an attribute "name".
>>> # The keyword argument name="Donald E. Knuth" means: if the node has an attribute "name", its value must be "Donald E. Knuth".
>>> print(Knuth) # nodes are just indices
280664
>>> dblp.nodes[Knuth]
{name: 'Donald E. Knuth'}
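
The two kinds of conditions described in the comments above can be modelled on plain dictionaries. The sketch below is only one reading of that description, not NetworkDisk's implementation; the helper name `matches` is hypothetical.

```python
def matches(data, *required_keys, **expected_values):
    """Hypothetical model of the lookup conditions described above.

    Positional keys must be present in the node data.
    Keyword conditions only constrain keys that are present.
    """
    has_keys = all(key in data for key in required_keys)
    values_ok = all(
        data[key] == value
        for key, value in expected_values.items()
        if key in data
    )
    return has_keys and values_ok


author = {"name": "Donald E. Knuth"}
article = {"title": "Overlapping Pfaffians."}

print(matches(author, "name", name="Donald E. Knuth"))   # True
print(matches(article, "name", name="Donald E. Knuth"))  # False: no "name" key
print(matches(article, name="Donald E. Knuth"))          # True: condition is vacuous
```

Under this reading, combining the positional and keyword forms is what restricts the search to nodes that both have a name and carry the expected value.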

The graph has two types of nodes: publications and authors. It is a bipartite graph: there are no edges between two authors and no edges between two publications.

>>> knuth_articles = dblp[Knuth]
>>> min_article = min(knuth_articles) # fetch one article
>>> Kdata = dblp.nodes[min_article].fold() # fold() forces all node data to be fetched
>>> Kdata_expected = {'_attrib': {'mdate': '2020-07-09', 'key': 'journals/combinatorics/Knuth96'}, 'ee': {'_attrib': {'type': 'oa'}, '_text': 'http://www.combinatorics.org/Volume_3/Abstracts/v3i2r5.html'}, 'journal': 'Electron. J. Comb.', 'number': '2', 'title': 'Overlapping Pfaffians.', 'url': 'db/journals/combinatorics/combinatorics3.html#Knuth96', 'volume': '3', 'year': 1996}
>>> Kdata == Kdata_expected
True

Path finding

The graph can be passed directly to NetworkX algorithms, for instance to compute shortest paths.

>>> Shannon = dblp.find_one_node("name", name="Claude E. Shannon")
>>> nx.shortest_path_length(dblp, Shannon, Knuth)
8

Or to print a description of the path using the node data:

>>> for i, e in enumerate(nx.shortest_path(dblp, Shannon, Knuth)):
...     if i % 2:
...         print(">Title:", dblp.nodes[e]["title"])
...     else:
...         print("Name:", dblp.nodes[e]["name"])
Name: Claude E. Shannon
>Title: Where the Action Is and Was in Information Science.
Name: Gerard Salton
>Title: ACM TODS Publication Policy.
Name: Philip A. Bernstein
>Title: The Concurrency Control Mechanism of SDD-1: A System for Distributed Databases (The Fully Redundant Case).
Name: Christos H. Papadimitriou
>Title: An Algorithmic View of the Universe.
Name: Donald E. Knuth
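
Because the graph is bipartite, a path between two authors alternates between authors and publications, so the path of length 8 above corresponds to 4 co-authorship hops. The following minimal sketch shows this on a toy in-memory NetworkX graph; the node labels are made up for illustration and are not DBLP data.

```python
import networkx as nx

# Toy stand-in for dblp: every edge links an author to a publication.
toy = nx.Graph()
toy.add_edges_from([
    ("author:A", "pub:1"), ("author:B", "pub:1"),
    ("author:B", "pub:2"), ("author:C", "pub:2"),
])

path = nx.shortest_path(toy, "author:A", "author:C")

# Authors sit at even positions, publications at odd positions.
authors_on_path = path[::2]
publications_on_path = path[1::2]

# Each co-authorship hop uses two edges (author, publication, author).
collaboration_distance = (len(path) - 1) // 2

print(authors_on_path)         # ['author:A', 'author:B', 'author:C']
print(publications_on_path)    # ['pub:1', 'pub:2']
print(collaboration_distance)  # 2
```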

Building SubGraphs

It is possible to build subgraphs of the dblp graph. A subgraph is not actually materialized: it is a query-rewriting technique based on the conditions provided.

For instance:

>>> Neighbors_2 = nx.algorithms.descendants_at_distance(dblp, Knuth, 2).union(dblp[Knuth], [Knuth])

Neighbors_2 contains the set of all nodes at distance at most 2 from Knuth, Knuth included.

>>> dblp_knuth = dblp.subgraph(Neighbors_2)
>>> len(dblp_knuth.nodes())
246
>>> len(dblp_knuth.edges())
278

This subgraph can be used to find all co-authors of Knuth:

>>> coauthors = list(dblp_knuth.find_all_nodes("name"))

The variable coauthors contains all nodes at distance at most 2 from Knuth that have the field "name" set. Since Knuth himself is part of Neighbors_2 and has a name, he is included in this list.

>>> len(coauthors)
73
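
The construction above can be reproduced on a small in-memory NetworkX graph to see what ends up in the result. This is a sketch on made-up nodes, using plain NetworkX attribute filtering in place of `find_all_nodes`; it is not a trace of the DBLP data.

```python
import networkx as nx

# Toy bipartite graph: authors carry "name", publications carry "title".
toy = nx.Graph()
for author in ("k", "a1", "a2", "far"):
    toy.add_node(author, name=f"Author {author}")
for pub in ("p1", "p2", "p3"):
    toy.add_node(pub, title=f"Paper {pub}")
toy.add_edges_from([
    ("k", "p1"), ("a1", "p1"),
    ("k", "p2"), ("a2", "p2"),
    ("a2", "p3"), ("far", "p3"),
])

center = "k"
# Same pattern as Neighbors_2: distance exactly 2, plus neighbours, plus the center.
ball = nx.descendants_at_distance(toy, center, 2).union(toy[center], [center])
sub = toy.subgraph(ball)

# Nodes of the subgraph that have a "name" attribute (authors only).
named = {node for node, data in sub.nodes(data=True) if "name" in data}
coauthors_only = named - {center}

print(sorted(named))           # ['a1', 'a2', 'k']
print(sorted(coauthors_only))  # ['a1', 'a2']
```

The center node matches the "name" filter too, so it has to be removed explicitly when only co-authors are wanted.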

NetworkDisk graphs are not well suited to intensive operations over small graphs (NetworkX is much faster there). However, subgraphs can be extracted quickly and copied into small NetworkX graphs.

>>> nx_dblp_knuth = dblp_knuth.copy_to_networkx(edge_data=False)

TODO: Why does edge_data=False improve performance?
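
Once a small in-memory copy exists, standard NetworkX tools apply to it. As an illustration, the sketch below projects a toy author/publication graph onto its authors to obtain a weighted co-authorship graph. The graph is a hand-built stand-in for a copy such as nx_dblp_knuth, and the "author:" label prefix is an assumption of this example, not a DBLP convention.

```python
import networkx as nx
from networkx.algorithms import bipartite

# Toy stand-in for a small extracted graph (not real DBLP data).
small = nx.Graph()
small.add_edges_from([
    ("author:A", "pub:1"), ("author:B", "pub:1"),
    ("author:A", "pub:2"), ("author:B", "pub:2"),
    ("author:B", "pub:3"), ("author:C", "pub:3"),
])

# Assumption of this sketch: authors are recognizable by their label prefix.
authors = [node for node in small if node.startswith("author:")]

# Edge weight = number of publications shared by the two authors.
coauthorship = bipartite.weighted_projected_graph(small, authors)

print(coauthorship["author:A"]["author:B"]["weight"])  # 2
print(coauthorship.has_edge("author:A", "author:C"))   # False
```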

It is also possible to build subgraphs based on data values. For instance, we can restrict the dblp graph to articles from recent years.

TODO: This does not work well yet: conditions are not pushed forward, and materialization makes it impossible to use.

Building the graph

To build the DBLP graph, it is necessary to have the lxml module installed:

$ python3 -m pip install lxml

Then, from the repository, run the bash script:

$ bash generate_graph.sh foo

It will download and build the graph in the foo subdirectory. If no subdirectory is provided, it will download and build the graph in a folder named after the current date, for instance 210311 for 11 March 2021.

If the folder already exists and already contains the file dblp.xml, the file will not be downloaded again unless the flag -f is set.
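
The directory and download rules described above can be summarized in a few lines of Python. This is only a sketch of the documented behaviour: generate_graph.sh itself is a bash script, and the helper names `target_folder` and `needs_download` are hypothetical.

```python
from datetime import date
from pathlib import Path
from typing import Optional


def target_folder(name: Optional[str], today: date) -> Path:
    """Use the given folder name, or a YYMMDD folder when none is given."""
    return Path(name) if name else Path(today.strftime("%y%m%d"))


def needs_download(folder: Path, force: bool = False) -> bool:
    """Download unless dblp.xml is already present; force mirrors the -f flag."""
    return force or not (folder / "dblp.xml").is_file()


print(target_folder(None, date(2021, 3, 11)))   # 210311
print(target_folder("foo", date(2021, 3, 11)))  # foo
```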

The graph construction takes time: around half an hour on a desktop computer.

TODO

Improve XML parsing.
Improve logging and timing evaluation.
Improve target directory selection (broken?).