Introduction

The main way to interact with NetworkDisk is through graph classes that behave like the NetworkX graph classes (Graph and DiGraph only). A few specific methods are introduced to handle the specificities of interacting with a database.

This introduction provides some examples of manipulations of those classes, using the SQLite backend.
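
All examples below assume that NetworkX and NetworkDisk have been imported under their usual aliases:

>>> import networkx as nx
>>> import networkdisk as nd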

Graphs and DiGraphs initialization

To create an SQLite Graph, one can simply call the constructor, exactly as for a NetworkX graph:

>>> G = nd.sqlite.Graph()

Here, G is an nd.sqlite.Graph stored in an in-memory SQLite database. We can see this by observing its helper's dbpath, which is the special sqlite3 key ':memory:':

>>> G.helper.dbpath
':memory:'
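
Each in-memory graph lives in its own private database: by sqlite3's ':memory:' semantics, every connection gets a fresh database, so two such graphs share nothing (G2 below is purely illustrative):

>>> G2 = nd.sqlite.Graph()
>>> G.add_node("solo")
>>> "solo" in G2
False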

It is of course possible to provide a path to an SQLite file:

>>> K = nd.sqlite.Graph(db="/tmp/mydb.db")
>>> K.helper.dbpath
'/tmp/mydb.db'

This graph is persistent as long as the file /tmp/mydb.db remains on the file system. To access it again, we simply provide the same file:

>>> J = nd.sqlite.Graph(db="/tmp/mydb.db")

Now the graphs K and J point to the same DB-graph, although they are distinct nd.sqlite.Graph instances. This can be illustrated by inserting some data through one or the other of the graphs.

>>> K.add_node("foo")
>>> J.add_node("bar")
>>> sorted(J)
['bar', 'foo']
>>> sorted(K)
['bar', 'foo']

As we just observed, the db keyword parameter accepts a file path. It also accepts the special key ':memory:' (its default value, used to create the Graph G above) for storing the graph in main memory. (This key is actually provided by the sqlite3 module.) For ease of use, two other keys have been added in NetworkDisk, namely ':tempfile:' and ':autofile:'. Both are substituted by a file path, in which the graph will be written.

>>> T = nd.sqlite.Graph(db=':tempfile:')
>>> T.helper.dbpath 
'/tmp/networkdisk/graphs/Graph_-anonymous-_adjnaz_asn.db'
>>> A = nd.sqlite.Graph(db=':autofile:')
>>> A.helper.dbpath 
'$PWD/_networkdisk_graphs/Graph_-anonymous-.db'

In both cases, the graph is located in a subdirectory. With ':tempfile:', this subdirectory is in the /tmp/networkdisk directory, while with ':autofile:' it is the _networkdisk_graphs subdirectory of the current directory. (Subdirectories are automatically created if they do not exist.) There is another difference between the two keys: the first always yields a fresh file name, while the second deterministically chooses a file name, which may already exist.

In the above example, we can see that in both cases the file base name is prefixed by Graph_-anonymous-, where Graph is the type of graph for which the file was created (it would have been DiGraph for an nd.sqlite.DiGraph) and -anonymous- indicates that the graph was not named at the time of its creation (the graph name would appear there otherwise). A direct consequence of these two file-naming strategies is that ':tempfile:' always creates a new graph, while ':autofile:' makes it possible to load a graph from the deterministically generated path, if it already exists. Indeed:

>>> B = nd.sqlite.Graph(db=':autofile:')
>>> B.helper.dbpath 
'$PWD/_networkdisk_graphs/Graph_-anonymous-.db'
>>> B.helper.dbpath == A.helper.dbpath
True
>>> A.add_edge(0, 1, foo='bar')
>>> sorted(A.edges)
[(0, 1)]
>>> sorted(B.edges)
[(0, 1)]

Another possibility for the db keyword parameter is to feed it directly with an sqlite3 module connection:

>>> import sqlite3
>>> db = sqlite3.connect("/tmp/mydb.db")
>>> L = nd.sqlite.Graph(db=db)
>>> sorted(L)
['bar', 'foo']
>>> L.helper.dbpath is None # the graph no longer has access to the DB-file path
True

or with a NetworkDisk sqlite Helper (a kind of wrapper around sqlite3 connections):

>>> M = nd.sqlite.Graph(db=L.helper)
>>> sorted(M)
['bar', 'foo']
>>> M.helper.dbpath is None
True
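
Since the object passed as db in the first case above is a plain sqlite3 connection, it remains usable directly alongside NetworkDisk; for instance, the tables NetworkDisk created can be listed with a standard query (output elided):

>>> db.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall()
[...]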

NetworkX graphs can be named. We use this opportunity to allow storing and loading many graphs in the same database file. Retrieval can use the optional graph name coupled with the graph type, or a unique id (as shown later).

>>> K.name = "my first graph"
>>> G = nd.sqlite.DiGraph(db="/tmp/mydb.db", name="my first digraph")

Now the variables K and G are respectively a nd.sqlite.Graph and a nd.sqlite.DiGraph in the same DB-file. We can interact with both without interference: they are distinct graphs:

>>> G.add_edge(0, 1)
>>> K.add_edge(1, 2)
>>> sorted(G.edges)
[(0, 1)]
>>> sorted(K.edges)
[(1, 2)]

To reload the appropriate graph, we can provide its name:

>>> S = nd.sqlite.DiGraph(db="/tmp/mydb.db", name="my first digraph")

If we provide no name, or a new name, another graph is created:

>>> V = nd.sqlite.Graph(db="/tmp/mydb.db")
>>> U = nd.sqlite.DiGraph(db="/tmp/mydb.db", name="my second digraph")

Here S is exactly the same graph as G, while U and V are two distinct new graphs with V being anonymous.

Wait! Earlier in this tutorial, we built a graph J pointing to an existing anonymous graph (the one created through K), by simply specifying the DB-file path, that is, with exactly the arguments used to create V above. So, why are J and V different?

That’s true. There is some clever behavior (which can be controlled by advanced parameters) in deciding what to do with the given parameters. When we created J, there was only one graph in the DB-file, so the constructor decided to use it, making J point to that DB-graph. By the time V was created, however, several graphs existed in the indicated DB-file, so none of them was chosen and a new one was created instead. This clever behavior aims to make the use of NetworkDisk easy and natural, without requiring much knowledge of its internals. The particular case of a single-graph DB-file thus enjoys simpler graph access.

>>> _G = nd.sqlite.Graph(db="/tmp/myotherdb.db")
>>> _H = nd.sqlite.Graph(db="/tmp/myotherdb.db") #_H points to the same DB-graph as _G
>>> _I = nd.sqlite.DiGraph(db="/tmp/myotherdb.db", name="other graph")
>>> _J = nd.sqlite.Graph(db="/tmp/myotherdb.db") #_J is a new anonymous graph
>>> _G.add_node(0)
>>> 0 in _G, 0 in _H, 0 in _J
(True, True, False)

Graph names are not necessarily unique in the database. However, each graph is given a unique id. It is more robust to retrieve a graph by its id, but it requires knowing the id.

>>> K.masterid
1
>>> I = nd.sqlite.Graph(db="/tmp/mydb.db", masterid=1)
>>> sorted(I.edges)
[(1, 2)]

To make it easier to juggle between graphs in the same DB file, it is possible to use the master table: a table created in each DB-file accessed by NetworkDisk (unless explicitly avoided) that contains the list of the graphs, by name and by id.

>>> U.name = "my first digraph" #names are not unique
>>> G.master
MasterGraphs<SQLite(/tmp/mydb.db)>

|id  |name               |type      |
|————|———————————————————|——————————|
|1   |my first graph     |Graph     |
|2   |my first digraph   |DiGraph   |
|3   |None               |Graph     |
|4   |my first digraph   |DiGraph   |

It is also possible to load/create graphs directly within a DB-file by using the master table class nd.sqlite.MasterGraphs.

>>> M = nd.sqlite.MasterGraphs("/tmp/master.db") # Empty DB file
>>> M.new_ungraph(name="A") #the created graph is returned
<networkdisk.sqlite.graph.Graph object at 0x...>
>>> _ = M.new_ungraph(name="B")
>>> _ = M.new_diGraph(name="C")
>>> _ = M.new_diGraph(name="D")
>>> _ = M.new_ungraph(name=1)
>>> _ = M.new_ungraph(name=2)
>>> M.pretty_print()
|id  |name|type      |
|————|————|——————————|
|1   |A   |Graph     |
|2   |B   |Graph     |
|3   |C   |DiGraph   |
|4   |D   |DiGraph   |
|5   |1   |Graph     |
|6   |2   |Graph     |

This class is simply an interface for manipulating the master table, a specific table stored in the DB that holds some meta-information on the graphs within the DB.

A MasterGraphs object is a mapping whose keys are either names or ids (ids taking priority over names). For instance, the graph "A" can be loaded by simply indexing with its name or its id:

>>> M["A"].name
'A'
>>> M[1].name
'A'

Transactionality

One of the advantages of using NetworkDisk for graph manipulation is its transaction management. It can be exploited to avoid keeping changes made by an algorithm that failed.

>>> with G.helper.transaction():
...     G.add_edge(-99, 99)
...     raise ValueError("Something wrong happened")
Traceback (most recent call last):
        ...
ValueError: Something wrong happened
>>> (-99, 99) in G.edges
False

The failure might be a raised exception, as in the above example, or a rollback fired manually:

>>> with G.helper.transaction:
...     G.add_edge(-99, 99)
...     print((-99, 99) in G.edges)
...     G.helper.transaction.rollback()
True
>>> (-99, 99) in G.edges
False
>>> with G.helper.transaction as t: #t is a shorthand for G.helper.transaction
...     G.add_edge(-99, 99)
...     print((-99, 99) in G.edges)
...     t.rollback()
True
>>> (-99, 99) in G.edges
False

Or, without context management:

>>> G.helper.transaction.begin()
>>> G.add_edge(-99, 99)
>>> (-99, 99) in G.edges
True
>>> G.helper.transaction.rollback()
>>> (-99, 99) in G.edges
False

Of course, if the algorithm ends without failing, changes are preserved.

>>> with G.helper.transaction:
...     G.add_edge(-99, 99)
>>> (-99, 99) in G.edges
True

Transactions can be nested (strictly speaking, they are not nested transactions, but they behave as such through an adequate use of SQL savepoints).

>>> with G.helper.transaction() as t:
...     G.remove_edge(-99, 99)
...     with t(oncontext=2):
...             G.remove_node(-99)
...             t.rollback(to_savepoint=True) # Here, True means "the most recently created savepoint"
>>> (-99, 99) in G.edges
False
>>> -99 in G
True
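
The same machinery supports simple all-or-nothing wrappers around arbitrary graph mutations. Below is a minimal sketch (apply_atomically is illustrative, not part of the NetworkDisk API); if the mutation raises, the context manager rolls the changes back, as in the first example above:

>>> def apply_atomically(graph, mutation):
...     # run the mutation inside a transaction; commit only on success
...     with graph.helper.transaction():
...         mutation(graph)
>>> apply_atomically(G, lambda g: g.add_edge(100, 101))
>>> (100, 101) in G.edges
True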

Under the hood

It is possible to see which kind of queries are sent to the database by activating the SQL logger. This can provide interesting information on how the graph is stored and how an algorithm behaves. Typically, if too many queries are sent to the database, it is likely that some function is performing inefficient operations. On the downside, the logger can greatly impact performance, even for a single query, since it intercepts iterators. It is thus advised to deactivate the logger when performance is an issue.

The logger can be activated or deactivated easily:

>>> G.helper.sql_logger.active = False # Deactivate
>>> G.helper.sql_logger.active = True # Reactivate
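
For instance, a noisy bulk operation can be kept quiet by wrapping it between the two toggles (a minimal sketch using only the switch shown above):

>>> G.helper.sql_logger.active = False  # silence logging around a hot loop
>>> G.add_nodes_from(range(10))
>>> G.helper.sql_logger.active = True   # restore logging afterwards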

It is also possible to activate it at graph instantiation. If the graph is created at that point, this logs the schema-creation queries, which is interesting for understanding how the graph is defined in the database:

>>> G = nd.sqlite.DiGraph(sql_logger=True)
PRAGMA foreign_keys = ON
SELECT name FROM sqlite_master WHERE type = ? AND name = ?
SELECT name FROM sqlite_master WHERE type = ? AND name = ?
BEGIN TRANSACTION
CREATE TABLE networkdisk_master (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
         name TEXT,
         type TEXT,
         schema BLOB,
         networkdisk_version TEXT,
         creation_date DATE DEFAULT (DATE('now')),
         last_access_date DATE DEFAULT (DATE('now')),
         last_alteration_date DATE DEFAULT (DATE('now')),
         info TEXT)
COMMIT TRANSACTION
BEGIN TRANSACTION
INSERT INTO networkdisk_master(type, schema, networkdisk_version) VALUES (?, ?, ?)
REPLACE INTO networkdisk_master(type, schema, id, networkdisk_version) VALUES (?, ?, ?, ?)
COMMIT TRANSACTION
--- Begin graph's schema creation
BEGIN TRANSACTION
CREATE TABLE nodes1 (name TEXT PRIMARY KEY ON CONFLICT IGNORE NOT NULL ON CONFLICT IGNORE)
CREATE TABLE node_data1 (
        name TEXT REFERENCES nodes1(name) ON DELETE CASCADE ON UPDATE CASCADE DEFERRABLE INITIALLY DEFERRED NOT NULL ON CONFLICT IGNORE,
        key TEXT NOT NULL ON CONFLICT IGNORE,
        value TEXT NOT NULL,
        UNIQUE (name, key) ON CONFLICT REPLACE)
CREATE INDEX index_node_data1_key_value ON node_data1(key, value)
CREATE TABLE edges1 (
        id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
        source TEXT REFERENCES nodes1(name) ON DELETE CASCADE ON UPDATE CASCADE DEFERRABLE INITIALLY DEFERRED NOT NULL,
        target TEXT REFERENCES nodes1(name) ON DELETE CASCADE ON UPDATE CASCADE DEFERRABLE INITIALLY DEFERRED NOT NULL ON CONFLICT IGNORE,
        UNIQUE (source, target) ON CONFLICT IGNORE)
CREATE INDEX index_edges1_target_source ON edges1(target, source)
CREATE TABLE edge_data1 (
        id INTEGER REFERENCES edges1(id) ON DELETE CASCADE ON UPDATE CASCADE NOT NULL,
        key TEXT NOT NULL,
        value TEXT NOT NULL,
        UNIQUE (id, key) ON CONFLICT REPLACE)
CREATE INDEX index_edge_data1_key_value ON edge_data1(key, value)
CREATE TABLE graph_info1 (
        key TEXT PRIMARY KEY ON CONFLICT REPLACE NOT NULL,
        value TEXT NOT NULL)
CREATE INDEX index_graph_info1_key_value ON graph_info1(key, value)
COMMIT TRANSACTION
--- End graph's schema creation

We can also look at the trace of the operations performed when inserting nodes and edges:

>>> G.add_node("foo")
BEGIN TRANSACTION
SELECT COUNT(?) FROM nodes1 WHERE name = ? LIMIT 1
---                             ↖ (1, '"foo"')
DELETE FROM nodes1 WHERE name = ?
---             ↖ ('"foo"',) … ('"foo"',)
INSERT INTO nodes1(name) VALUES (?)
---                     ↖ ('"foo"',)
INSERT INTO node_data1(name, key, value) SELECT name, ?, ? FROM nodes1 WHERE name = ?
---                             ↖ ('"bar"', '{"c": "d"}', '"foo"') … ('"bar"', '{"c": "d"}', '"foo"')
COMMIT TRANSACTION
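
Note the bound parameters in the trace above: the node name and its attribute appear as '"foo"' and '{"c": "d"}', which suggests that keys and values are serialized as JSON strings before being stored. The encoding itself is plain Python:

>>> import json
>>> json.dumps("foo"), json.dumps({"c": "d"})
('"foo"', '{"c": "d"}')
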
>>> G.add_edges_from([(0, 1), (1, 2, {"foo":"bar", "value":42})])
BEGIN TRANSACTION
INSERT INTO edges1(source, target) VALUES (?, ?)
---                             ↖ ('0', '1') … ('1', '2')
INSERT INTO nodes1(name) SELECT DISTINCT source FROM edges1 UNION SELECT DISTINCT target FROM edges1
INSERT INTO edge_data1(id, key, value) SELECT id, ?, ? FROM edges1 WHERE source = ? AND (target = ?)
---                             ↖ ('"foo"', '"bar"', '1', '2') … ('"value"', '42', '1', '2')
COMMIT TRANSACTION

Bulk Insertion

Some custom optimizations are performed when inserting many nodes at once. For instance, the following code produces as many transactions as there are nodes to insert, which is really slow.

>>> for i in range(3):
...     G.add_node(f"new_node_{i}")
BEGIN TRANSACTION
SELECT COUNT(?) FROM nodes1 WHERE name = ? LIMIT 1
---              ↖ (1, '"new_node_0"')
DELETE FROM nodes1 WHERE name = ?
---              ↖ ('"new_node_0"',) … ('"new_node_0"',)
INSERT INTO nodes1(name) VALUES (?)
---              ↖ ('"new_node_0"',) … ('"new_node_0"',)
COMMIT TRANSACTION
BEGIN TRANSACTION
SELECT COUNT(?) FROM nodes1 WHERE name = ? LIMIT 1
---              ↖ (1, '"new_node_1"')
DELETE FROM nodes1 WHERE name = ?
---              ↖ ('"new_node_1"',) … ('"new_node_1"',)
INSERT INTO nodes1(name) VALUES (?)
---              ↖ ('"new_node_1"',) … ('"new_node_1"',)
COMMIT TRANSACTION
BEGIN TRANSACTION
SELECT COUNT(?) FROM nodes1 WHERE name = ? LIMIT 1
---              ↖ (1, '"new_node_2"')
DELETE FROM nodes1 WHERE name = ?
---              ↖ ('"new_node_2"',) … ('"new_node_2"',)
INSERT INTO nodes1(name) VALUES (?)
---              ↖ ('"new_node_2"',) … ('"new_node_2"',)
COMMIT TRANSACTION

Instead, it is possible to insert them all at once in a very efficient way:

>>> G.add_nodes_from(list(f"new_node_{i}" for i in range(3, 1000)))
BEGIN TRANSACTION
INSERT INTO nodes1(name) VALUES (?)
---              ↖ ('"new_node_3"',) … ('"new_node_999"',)
COMMIT TRANSACTION

On an ordinary computer, inserting 1000 nodes one by one takes more than 2 seconds, while inserting them in bulk takes only a few milliseconds.
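
This is easy to measure with the standard library; here is a rough sketch (the graph W and the timing variable are illustrative, and no output is asserted since timings vary between machines):

>>> from time import perf_counter
>>> W = nd.sqlite.Graph()
>>> start = perf_counter()
>>> W.add_nodes_from([f"bulk_{i}" for i in range(1000)])  # bulk path: a list is reiterable
>>> bulk_seconds = perf_counter() - start  # expected: a few milliseconds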

During insertion, two distinct algorithms are used depending on whether the input data is a one-shot iterable or a reiterable object. With reiterable data, two passes are performed: one to insert the nodes, and one to insert their data.
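
The distinction is plain Python: a generator is exhausted after a single pass, while a list can be traversed any number of times:

>>> gen = (i for i in range(3))
>>> list(gen), list(gen)  # second pass over a generator yields nothing
([0, 1, 2], [])
>>> lst = [0, 1, 2]
>>> list(lst), list(lst)  # a list supports repeated passes
([0, 1, 2], [0, 1, 2])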

If the input data is a one-shot iterable, then nodes and their data must be inserted in a single pass, which degrades performance greatly.

>>> G.add_nodes_from( ((i, {"foo":"bar"}) for i in range(2)) )
BEGIN TRANSACTION
SELECT COUNT(?) FROM nodes1 WHERE name = ? LIMIT 1
---              ↖ (1, '0')
DELETE FROM nodes1 WHERE name = ?
---              ↖ ('0',) … ('0',)
INSERT INTO nodes1(name) VALUES (?)
---              ↖ ('0',)
INSERT INTO node_data1(name, key, value) SELECT name, ?, ? FROM nodes1 WHERE name = ?
---              ↖ ('"foo"', '"bar"', '0') … ('"foo"', '"bar"', '0')
SELECT COUNT(?) FROM nodes1 WHERE name = ? LIMIT 1
---              ↖ (1, '1')
DELETE FROM nodes1 WHERE name = ?
---              ↖ ('1',) … ('1',)
INSERT INTO nodes1(name) VALUES (?)
---              ↖ ('1',)
INSERT INTO node_data1(name, key, value) SELECT name, ?, ? FROM nodes1 WHERE name = ?
---              ↖ ('"foo"', '"bar"', '1') … ('"foo"', '"bar"', '1')
COMMIT TRANSACTION

In contrast:

>>> G.add_nodes_from(list((i, {"foo": "bar"}) for i in range(2000)))
BEGIN TRANSACTION
INSERT INTO nodes1(name) VALUES (?)
---              ↖ ('0',) … ('1999',)
INSERT INTO node_data1(name, key, value) SELECT name, ?, ? FROM nodes1 WHERE name = ?
---              ↖ ('"foo"', '"bar"', '0') … ('"foo"', '"bar"', '1999')
COMMIT TRANSACTION

Of course, bulk insertion is also available for edges. Moreover, the data of each edge (or node) can be given locally (as the mapping component of the tuple defining the edge or node) or globally, as a keyword argument of the adding method. Locally given data takes precedence over globally given data.

>>> G = nd.sqlite.Graph()
>>> G.add_nodes_from([(0, {'data': "local"}), (1,), (2, {'foo': "bar"})], data="global", generation=0)
>>> sorted(G.nodes(data=True))
[(0, {'data': 'local', 'generation': 0}), (1, {'data': 'global', 'generation': 0}), (2, {'data': 'global', 'foo': 'bar', 'generation': 0})]
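
This precedence rule behaves like an ordinary dict merge in which local keys win over global ones (the variable names below are purely illustrative):

>>> global_attrs = {"data": "global", "generation": 0}
>>> local_attrs = {"data": "local"}
>>> {**global_attrs, **local_attrs}  # local keys override global ones
{'data': 'local', 'generation': 0}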

Here is a more complex Graph with data on edges only.

>>> H = nd.sqlite.Graph(name="powers_and_modulo")
>>> H.add_edges_from(
...    [
...            (
...                    i, (i**3+i**2)%8,
...                    dict(
...                            { 'k':"v" } if i%3 else { 'p': "local" },
...                            gen=1, parity=i%2, foo="bar" if i**7%5<3 else "bor"
...                    )
...            )
...            for i in range(8)
...    ],
...    p="global", global_data="for all", gen=0
... )
>>> nx.is_connected(H)
True
>>> sorted(H.edges(data="k"))
[(0, 4, 'v'), (0, 7, 'v'), (1, 2, 'v'), (2, 4, 'v'), (5, 6, 'v')]
>>> sorted(H.edges(data="p"))
[(0, 0, 'local'), (0, 4, 'global'), (0, 7, 'global'), (1, 2, 'global'), (2, 4, 'global'), (3, 4, 'local'), (4, 6, 'local'), (5, 6, 'global')]
>>> sorted(H.edges(data="gen"))
[(0, 0, 1), (0, 4, 1), (0, 7, 1), (1, 2, 1), (2, 4, 1), (3, 4, 1), (4, 6, 1), (5, 6, 1)]
>>> sorted(H.edges(data="parity"))
[(0, 0, 0), (0, 4, 0), (0, 7, 1), (1, 2, 1), (2, 4, 0), (3, 4, 1), (4, 6, 0), (5, 6, 1)]
>>> sorted(H.edges(data="foo"))
[(0, 0, 'bar'), (0, 4, 'bor'), (0, 7, 'bor'), (1, 2, 'bar'), (2, 4, 'bor'), (3, 4, 'bar'), (4, 6, 'bar'), (5, 6, 'bar')]
>>> sorted(H.edges(data="global_data"))
[(0, 0, 'for all'), (0, 4, 'for all'), (0, 7, 'for all'), (1, 2, 'for all'), (2, 4, 'for all'), (3, 4, 'for all'), (4, 6, 'for all'), (5, 6, 'for all')]
>>> Even = H.edge_subgraph(H.find_all_edges(parity=0))
>>> sorted(Even.edges(data=False))
[(0, 0), (0, 4), (2, 4), (4, 6)]
>>> nx.is_connected(Even)
True
>>> Odd = H.edge_subgraph(H.find_all_edges(parity=1))
>>> sorted(Odd.edges(data=False))
[(0, 7), (1, 2), (3, 4), (5, 6)]
>>> nx.is_connected(Odd)
False

Some complex queries

Some graph information can be extracted through automatically constructed, fairly complex queries.

We first define a random Graph with some random data:

>>> from random import randint
>>> G = nd.sqlite.Graph(sql_logger=False)
>>> G.add_edges_from([(randint(0, 100), randint(0, 100), {"w":randint(0, 10)}) for _ in range(500)])

We can easily access the degree of each node of this graph:

>>> degree = list(G.degree)

This produces the following query:

SELECT name, SUM(degree) FROM (
        SELECT name, COUNT(target) AS degree
        FROM
                nodes1
        INNER JOIN
                symedges1
        ON name = source
        WHERE target IS NOT NULL AND name != target
        GROUP BY name
        UNION ALL
        SELECT name, COUNT(target)*2 AS degree
        FROM
                nodes1
        LEFT JOIN symedges1
                ON name = source
        WHERE name = target
        GROUP BY name)
AS partial_aggregate
GROUP BY name

Note that this query is generated automatically, independently of the chosen graph schema mapping.

It is also possible to compute degrees weighted by the random weights stored under the "w" key:

>>> weighted_degree = list(G.degree(weight="w"))

It produces the following query:

SELECT name, SUM(degree) FROM (
        SELECT
                name,
                SUM(IIF(target IS NOT NULL, CAST(IFNULL(value, 1.0) AS NUMERIC), 0)) AS degree
        FROM
                nodes1
        INNER JOIN
                symedges1
        ON name = source
        LEFT JOIN
                edge_data1
        ON symedges1.id = edge_data1.id AND key = '"w"'
        WHERE target IS NOT NULL AND name != target
        GROUP BY name
        UNION ALL
        SELECT
                name,
                SUM(IIF(target IS NOT NULL, CAST(IFNULL(value, 1.0) AS NUMERIC), 0))*2 AS degree
        FROM
                nodes1
        LEFT JOIN
                symedges1
        ON name = source
        LEFT JOIN edge_data1
        ON symedges1.id = edge_data1.id AND key = '"w"'
        WHERE name = target
        GROUP BY name)
AS partial_aggregate
GROUP BY name
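
Once materialized, both results are ordinary lists of (node, degree) pairs and can be post-processed with plain Python; for instance, a degree histogram (Counter maps each degree value to its number of occurrences):

>>> from collections import Counter
>>> histogram = Counter(d for _, d in degree)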