Graph transformation and manipulations

One of the advantages of accessing graphs through a database is the possibility to manipulate the graph implicitly without materializing it. Here, we provide several examples of such manipulation.

We also provide some of the queries finally generated to gives some hints on the type of operations performed by the database.

Some of these queries will be inefficient on SQLite because of the limited capacity of its query planner. We hope for better performances on full-fledged SGBD.

Nodes SubGraphs

Let us first build a random graph.

>>> G = nd.sqlite.Graph(db="/tmp/nd.db", name="nodes_ex")
>>> G.add_nodes_from([(i, {"parity": i%2}) for i in range(100)])
>>> G.add_edges_from([(randint(0, 99), randint(0, 99)) for _ in range(1000)])

Small explict subgraphs

We can look at the graph restricting the set of nodes to some explicit value:

>>> SmallGraph = G.subgraph(range(10))
>>> 10 in SmallGraph
False
>>> 9 in SmallGraph
True
>>> set(SmallGraph.edges).issubset(set(G.edges))
True

Subgraphs defined by conditions

We can select all even nodes using a property as follows:

>>> Even = G.find_all_nodes(parity=0)
>>> EvenGraph = G.subgraph(Even)

We can easily verify that its nodes are indeed even numbers:

>>> sorted(EvenGraph) == list(range(0, 100, 2))
True

The graph EvenGraph is actually a view on the initial graph:

>>> G.add_node(100)
>>> sorted(EvenGraph) == list(range(0, 101, 2))
True

Under the hood

It is possible to look at the query to access some node data.

For instance, the following instruction:

>>> sorted(SmallGraph[5])
[0, 1]

Is produced with the following query:

show/hide
WITH node_domain AS     (
        SELECT name FROM nodes1 WHERE name IN (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
)
SELECT DISTINCT key FROM
        nodes1
INNER JOIN
        symedges1
ON name = source AND (source IN node_domain AND target IN node_domain)
INNER JOIN
        edge_data1
ON symedges1.id = edge_data1.id
WHERE key IS NOT NULL AND (name = ? AND (target = ?)) LIMIT 4
--- ↖ ('0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '5', '1')

Similarly we can look at the queries generated by EvenGraph:

>>> sorted(EvenGraph[50])
[28, 48, 84, 86, 94]

It is produced with the following query:

show/hide
WITH node_domain AS (
        SELECT name FROM nodes1
        WHERE name IN (
                SELECT nodes1.name FROM nodes1
                LEFT JOIN node_data1 ON nodes1.name = node_data1.name AND key = ?
                WHERE nodes1.name IS NOT NULL AND (key IS NULL OR value = ?)
        )
)
SELECT DISTINCT target
FROM
        nodes1
INNER JOIN
        symedges1
ON name = source AND (source IN node_domain AND target IN node_domain)
WHERE target IS NOT NULL AND (name = ?)
---                      ↖ ('"parity"', '0', '50')

Unfortunately, the query optimizer of SQLite failed to optimize such a query. For performance, and in the case of static graphs, it is possible to materialize the subset of nodes to accelerate and simplify query evaluation.

>>> EvenGraphMat = G.subgraph(Even, temporary_table=True)

The Graph EvenGraphMat relies on a temporary table that will be alive as long as the connection will be open.

The simple query above is thus much more simple and efficient:

show/hide
SELECT DISTINCT target
FROM
        nodes1
INNER JOIN
        symedges1 ON name = source AND (source IN nd_node_domain0 AND target IN nd_node_domain0)
WHERE target IS NOT NULL AND (name = ?)

Remark that some complex queries are generated, for instance for degree computation:

show/hide
WITH node_domain AS (
        SELECT name FROM nodes1
        WHERE name IN (
                SELECT nodes1.name
                FROM
                        nodes1
                LEFT JOIN
                        node_data1
                ON nodes1.name = node_data1.name AND key = ?
                !WHERE nodes1.name IS NOT NULL AND (key IS NULL OR value = ?)
                )
        )
        SELECT name, SUM(degree)
        FROM (
                SELECT name, COUNT(target) AS degree
                FROM
                        nodes1
                INNER JOIN
                        symedges1 ON name = source AND (source IN node_domain AND target IN node_domain)
                WHERE target IS NOT NULL AND name != target
                GROUP BY name
                UNION ALL
                SELECT name, COUNT(target)*? AS degree
                FROM
                        nodes1
                INNER JOIN
                        symedges1 ON name = source AND (source IN node_domain AND target IN node_domain)
        WHERE name = target GROUP BY name) AS partial_aggregate GROUP BY name
---    ↖ ('"parity"', '0', 2)

Temporary Table weirdness

When the flag temporary table is set, the resulting object is neither a view nor a static copy of the graph. We call that a semi-view. Indeed, the view is restricted to the subgraph induced by the set of nodes, that has originally been selected. Thus, deletions and re-insertions of nodes that belong to this set are dynamically reported to the semi-view, and edge insertions and deletions between these nodes as well. However, inserting a node (or an edge with some endpoint) not belonging to this set, will not be reported, even if the node would have been selected if it were existent at the time of the subgraph creation. Thus, the temporary_table flag should be used with caution.

We illustrate here this weirdness with some examples:
>>> G.remove_nodes_from(G)
>>> G.add_nodes_from(range(0, 16, 3))
>>> G.add_edge(9, 6)
>>> H = G.subgraph(range(10), temporary_table=True)
>>> 0 in G, 0 in H
(True, True)

Here, 0 was in the graph G at the time of the creation of the subgraph H, so its existence in H will always be maintained.

>>> G.remove_node(0)
>>> 0 in G, 0 in H
(False, False)
>>> G.add_node(0)
>>> 0 in G, 0 in H
(True, True)
>>> G.remove_node(0)
>>> 0 in G, 0 in H
(False, False)
>>> G.add_edge(0, 12) #notice that 12 is a node of `G` but not of `H`
>>> 0 in G, 0 in H
(True, True)

In contrast, 1 was not in G at the time of creation of H, so it will never be in H, although it would have been if it originally existed.

>>> 1 in G, 1 in H
(False, False)
>>> G.add_node(1)
>>> 1 in G, 1 in H
(True, False)
>>> G.add_edge(0, 1)
>>> 1 in G, 1 in H
(True, False)

A pair (s, t) is an edge of H if and only if it is an actual edge of G, and both its endpoints are nodes of H. So, considering only edges between nodes that were in G at the time of its creation, the edges of H are correctly maintained.

>>> (3, 6) in G.edges, (3, 6) in H.edges
(False, False)
>>> G.add_edge(3, 6)
>>> (3, 6) in H.edges
True
>>> (9, 6) in G.edges, (9, 6) in H.edges
(True, True)
>>> G.remove_edge(9, 6)
>>> (9, 6) in H.edges
False
>>> G.add_edge(9, 6)
>>> (9, 6) in H.edges
True

Edges subgraphs

We can use the same construction to define subgraphs over edge properties.

>>> G = nd.sqlite.Graph(db="/tmp/nd.db", name="edge_ex")
>>> G.add_nodes_from(range(100))
>>> Edges = [(randint(0, 99), randint(0, 99)) for _ in range(4000)]
>>> G.add_edges_from([ (source, target, {"sum": source+target}) for source, target in Edges])

We can define a subgraph by restricting only to some edges. We can provide the edges explictly or use an SQL-condition.

>>> SmallEdges = G.find_all_edges(sum=lambda c: c.cast("INT").lt(50))
>>> SmallEdgesGraph = G.edge_subgraph(SmallEdges)

As in NetworkX, we only get nodes that are induced by the selected edges.

>>> len(SmallEdgesGraph)
49

As in the case of nodes, we can use a temporary table to improve performance, but losing the view aspect on the node side.

>>> SmallEdgesGraphTemp = G.edge_subgraph(SmallEdges, temporary_table=True)
>>> G.add_edge(-2, 9)
>>> -2 in SmallEdgesGraph
True
>>> -2 in SmallEdgesGraphTemp
False

But this behaves correctly on edges as long as the nodes were already in the graph when building the table:

>>> G.add_edge(0, 2, sum=2)
>>> (0, 2) in SmallEdgesGraph.edges
True
>>> (0, 2) in SmallEdgesGraphTemp.edges
True
>>> G.add_edge(-1, 1, sum=0)
>>> (-1, 1) in SmallEdgesGraph.edges
True
>>> (-1, 1) in SmallEdgesGraphTemp.edges
False

Another possibility to prevent the need of temporary_table performance fixes is to use the restrict_node_domain flag to keep all nodes of the graph without filtering the nodes.

>>> SmallEdgesAllN = G.edge_subgraph(SmallEdges, restrict_node_domain=False)
>>> len(SmallEdgesAllN) == len(G)
True
>>> sorted(SmallEdgesGraph.edges) == sorted(SmallEdgesAllN.edges)
True

Regular Path Queries

Nothing implemented yet. In the roadmap.