The DBLP Graph

This folder contains scripts that download and build the full graph of dblp using NetworkDisk.

A link to a recent version of the graph can be found here.

Manipulating the graph

Basic Manipulation

To load the graph, assuming the downloaded version is at the location examples/test/dblp.db:

>>> import networkx as nx
>>> import networkdisk as nd
>>> dblp_path = "examples/test/dblp.db"
>>> dblp = nd.sqlite.Graph(db=dblp_path)
>>> # The positional argument "name" requires the node to have an attribute "name".
>>> # The keyword argument name="Donald E. Knuth" requires that, if the node has
>>> # an attribute "name", the associated value is "Donald E. Knuth".
>>> Knuth = dblp.find_one_node("name", name="Donald E. Knuth")
>>> print(Knuth) # nodes are just indices
280664
>>> dblp.nodes[Knuth]
{name: 'Donald E. Knuth'}
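
The difference between the positional and keyword conditions can be modelled in plain Python. This is only a sketch of the rule as described in the comments above, not NetworkDisk's implementation; the function name matches and the two sample dictionaries are hypothetical.

```python
def matches(data, *required_keys, **expected_values):
    """Model of the matching rule described above (not NetworkDisk code)."""
    # Positional arguments: the attribute must be present.
    has_required = all(key in data for key in required_keys)
    # Keyword arguments: constrain the value only when the attribute is present.
    values_agree = all(
        data[key] == value
        for key, value in expected_values.items()
        if key in data
    )
    return has_required and values_agree


author = {"name": "Donald E. Knuth"}
article = {"title": "Overlapping Pfaffians."}

print(matches(author, "name", name="Donald E. Knuth"))   # True
print(matches(article, name="Donald E. Knuth"))          # True: no "name" to contradict
print(matches(article, "name", name="Donald E. Knuth"))  # False: "name" is required
```

Under this reading, passing both the positional and the keyword condition is what restricts the search to nodes that actually carry the requested name.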

The graph has two types of nodes: publications and authors. It is a bipartite graph: there are no edges between two authors and no edges between two publications.

>>> knuth_articles = dblp[Knuth]
>>> min_article = min(knuth_articles) # fetch one article
>>> Kdata = dblp.nodes[min_article].fold() # fold() forces all node data to be fetched
>>> Kdata_expected = {'_attrib': {'mdate': '2020-07-09', 'key': 'journals/combinatorics/Knuth96'}, 'ee': {'_attrib': {'type': 'oa'}, '_text': 'http://www.combinatorics.org/Volume_3/Abstracts/v3i2r5.html'}, 'journal': 'Electron. J. Comb.', 'number': '2', 'title': 'Overlapping Pfaffians.', 'url': 'db/journals/combinatorics/combinatorics3.html#Knuth96', 'volume': '3', 'year': 1996}
>>> Kdata == Kdata_expected
True

Path finding

The graph can easily be manipulated, for instance to compute paths.

>>> Shannon = dblp.find_one_node("name", name="Claude E. Shannon")
>>> nx.shortest_path_length(dblp, Shannon, Knuth)
8

Or to print a description of the path along with the node data:

>>> for i, e in enumerate(nx.shortest_path(dblp, Shannon, Knuth)):
...     if i % 2:
...         print(">Title:", dblp.nodes[e]["title"])
...     else:
...         print("Name:", dblp.nodes[e]["name"])
Name: Claude E. Shannon
>Title: Where the Action Is and Was in Information Science.
Name: Gerard Salton
>Title: ACM TODS Publication Policy.
Name: Philip A. Bernstein
>Title: The Concurrency Control Mechanism of SDD-1: A System for Distributed Databases (The Fully Redundant Case).
Name: Christos H. Papadimitriou
>Title: An Algorithmic View of the Universe.
Name: Donald E. Knuth
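
Because the graph is bipartite, authors and publications alternate along any path, so a path of length 8 between two authors corresponds to 4 co-authorship steps. The sketch below illustrates this on a toy graph with a plain breadth-first search; the graph, the node labels, and the helper bfs_path are all invented for illustration and do not use NetworkDisk.

```python
from collections import deque

# Toy bipartite graph; every identifier here is made up for illustration.
toy = {
    "author:A": ["pub:1"],
    "pub:1": ["author:A", "author:B"],
    "author:B": ["pub:1", "pub:2"],
    "pub:2": ["author:B", "author:C"],
    "author:C": ["pub:2"],
}


def bfs_path(graph, source, target):
    """Return one shortest path from source to target, or None."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the parent links back to the source.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in graph[node]:
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


path = bfs_path(toy, "author:A", "author:C")
authors_on_path = path[::2]                    # even positions are authors
collaboration_distance = (len(path) - 1) // 2  # edges divided by two

print(authors_on_path)         # ['author:A', 'author:B', 'author:C']
print(collaboration_distance)  # 2
```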

Building SubGraphs

It is possible to build subgraphs based on the dblp graph. The subgraph is not actually materialized: it is simply a query rewriting technique based on the conditions provided.

For instance:

>>> Neighbors_2 = nx.algorithms.descendants_at_distance(dblp, Knuth, 2).union(dblp[Knuth], [Knuth])

Neighbors_2 contains the set of all nodes at distance at most 2 from Knuth.

>>> dblp_knuth = dblp.subgraph(Neighbors_2)
>>> len(dblp_knuth.nodes())
246
>>> len(dblp_knuth.edges())
278

It is possible to find all co-authors of Knuth with this subgraph:

>>> coauthors = list(dblp_knuth.find_all_nodes("name"))

The variable coauthors contains all nodes at distance at most 2 from Knuth that have the field “name” set.

>>> len(coauthors)
73
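
The reason this works is that, in a bipartite author/publication graph, the authors within two hops of an author are exactly the people who share a publication with them, plus the author themselves. A minimal in-memory sketch of that idea follows; the toy graph and the helper coauthors_of are hypothetical, and whether the starting author should be excluded is an assumption made here.

```python
# Toy stand-in for the DBLP graph: adjacency sets in a plain dict.
# All node labels below are invented for illustration.
toy = {
    "author:A": {"pub:1", "pub:2"},
    "author:B": {"pub:1"},
    "author:C": {"pub:2"},
    "author:D": {"pub:3"},
    "pub:1": {"author:A", "author:B"},
    "pub:2": {"author:A", "author:C"},
    "pub:3": {"author:D"},
}


def coauthors_of(graph, author):
    """Authors sharing at least one publication with `author`."""
    found = set()
    for publication in graph[author]:  # first hop: publications
        found |= graph[publication]    # second hop: their authors
    found.discard(author)              # assumption: exclude the start node
    return found


print(sorted(coauthors_of(toy, "author:A")))  # ['author:B', 'author:C']
```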

NetworkDisk graphs are not well suited to intensive operations over small graphs (NetworkX is much faster). However, we can extract subgraphs quickly and build small NetworkX graphs from them.

>>> nx_dblp_knuth = dblp_knuth.copy_to_networkx(edge_data=False)

TODO: Why does edge_data=False improve performance?

It is also possible to build subgraphs based on data values. For instance, we can restrict the dblp graph to articles from recent years.

TODO: This does not work well yet: conditions are not pushed forward, and materialization makes it impossible to use.

Building the graph

To build the DBLP graph, it is necessary to have the lxml module installed:

$ python3 -m pip install lxml

Then, from the repository, run the bash script:

$ bash generate_graph.sh foo

It will download and build the graph in the foo subdirectory. If no subdirectory is provided, it will download and build the graph in a folder named after the current date, for instance 210311 for 11 March 2021.
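
The example folder name 210311 suggests a two-digit year, month, and day. The Python sketch below reproduces that naming; the function default_folder_name is hypothetical, and the actual bash script may compute the name differently.

```python
from datetime import date


def default_folder_name(day):
    """Format a date as YYMMDD, matching the 210311 example above."""
    return day.strftime("%y%m%d")


print(default_folder_name(date(2021, 3, 11)))  # 210311
```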

If the folder already exists and already contains the file dblp.xml, the file will not be downloaded again unless the -f flag is set.
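
The download rule can be summarised as a small predicate. This is a Python sketch of the behaviour described above, not the logic of generate_graph.sh itself; the function needs_download and its force parameter (standing in for the -f flag) are assumptions.

```python
import tempfile
from pathlib import Path


def needs_download(folder, force=False):
    """Return True when dblp.xml should be (re-)downloaded into `folder`."""
    return force or not (Path(folder) / "dblp.xml").is_file()


with tempfile.TemporaryDirectory() as folder:
    first_run = needs_download(folder)                 # no dblp.xml yet
    (Path(folder) / "dblp.xml").write_text("<dblp/>")  # simulate a download
    second_run = needs_download(folder)                # file is reused
    forced_run = needs_download(folder, force=True)    # -f overrides

print(first_run, second_run, forced_run)  # True False True
```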

The graph construction takes time: around half an hour on a desktop computer.

TODO

  • Improve XML parsing.

  • Improve log and timing evaluation.

  • Improve target directory selection (broken?)