ispyb.sqlalchemy

SQLAlchemy is a Python SQL toolkit and Object Relational Mapper; this module uses it to provide access to the ISPyB database with the full power and flexibility of SQL. For a more general introduction to SQLAlchemy, see the official reference documentation and tutorials.

ORM models for all database tables have been automatically generated using sqlacodegen and are available inside the ispyb.sqlalchemy module.
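
For example, each generated class can be imported directly from the module; being standard SQLAlchemy declarative models, they know which table they map onto:

from ispyb.sqlalchemy import DataCollection

# Each generated class is a standard SQLAlchemy declarative model
# mapping onto the database table of the same name
print(DataCollection.__tablename__)  # prints "DataCollection"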

Connecting to the database

First obtain the database connection URL by calling ispyb.sqlalchemy.url(), passing either the path to a config file or a Python dictionary containing the database credentials. If credentials=None then the function will look for a credentials file pointed to by the ISPYB_CREDENTIALS environment variable.

Example credentials file:

[ispyb_sqlalchemy]
username = user
password = password
host = localhost
port = 3306
database = ispyb_build

Example credentials dictionary:

{
   "username": "user",
   "password": "password",
   "host": "localhost",
   "port": 3306,
   "database": "ispyb",
}

import ispyb.sqlalchemy

credentials = "/dls_sw/dasc/mariadb/credentials/ispyb.cfg"
url = ispyb.sqlalchemy.url(credentials)
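
A credentials dictionary can be passed in the same way, and calling url() with no argument falls back to the credentials file named by the ISPYB_CREDENTIALS environment variable:

# Alternatively (values shown are placeholders):
url = ispyb.sqlalchemy.url(
    {
        "username": "user",
        "password": "password",
        "host": "localhost",
        "port": 3306,
        "database": "ispyb",
    }
)

# Or rely on the ISPYB_CREDENTIALS environment variable
url = ispyb.sqlalchemy.url()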

We interact with the database via an SQLAlchemy Session, which is used to manage database transactions. Usually we should construct a new Session at the beginning of a logical operation involving database access. We can use the sessionmaker class to provide a factory for Session objects:

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

# an Engine, which the Session will use for connection
# resources, typically in module scope
engine = create_engine(url, connect_args={"use_pure": True})

# a sessionmaker(), also in the same scope as the engine
Session = sessionmaker(engine)

Making a basic query

First we import the relevant database table model from ispyb.sqlalchemy, in this case the DataCollection class, which represents the database table of the same name. Next we query the database to ask how many entries there are in this table:

from ispyb.sqlalchemy import DataCollection

# in production code it is typical to use Session as a
# context manager around a logical block of code
with Session() as session:
    query = session.query(DataCollection)
    print(query.count())

# for convenience in this tutorial we won't use it as a
# context manager
session = Session()
query = session.query(DataCollection)
print(query.count())
3081314

More interesting would be to query the database for the records for a specific data collection. In this case, calling .filter() on the original query will return the entries that match the conditions provided to the .filter() method:

query = query.filter(DataCollection.dataCollectionId == 6036518)
# This will return the first row of the results, or None if there are no results
dc = query.first()
# This will return exactly one row, raising an error if the query matches
# zero rows or more than one row
dc = query.one()
print(dc)
<ispyb.sqlalchemy._auto_db_schema.DataCollection object at 0x7f2041b05460>

We can use the resulting DataCollection object to print some basic information from the DataCollection table:

print(
    f"""\
dcid: {dc.dataCollectionId}
image template: {dc.imageDirectory}{dc.fileTemplate}
start time: {dc.startTime}\
"""
)
dcid: 6036518
image template: /dls/i04/data/2021/cm28182-1/20210302/TestInsulin/ZincInsulinB4/ZincInsulinB4_1_master.h5
start time: 2021-03-02 13:38:10

Extracting autoprocessing statistics

In this next example, we extract some autoprocessing statistics for a given visit, taking advantage of how easily the results of an SQLAlchemy query can be loaded into a pandas.DataFrame for further analysis. First, we create a query to select the relevant information from the database.

from ispyb.sqlalchemy import (
    AutoProcIntegration,
    AutoProc,
    AutoProcProgram,
    AutoProcScaling,
    AutoProcScalingStatistics,
)
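
The query below restricts the results to a single visit by filtering on DataCollection.SESSIONID, so it needs the BLSession record, bs, for the visit of interest. As a minimal sketch, assuming the standard ISPyB Proposal and BLSession columns (the visit name "cm28182-1" and its split into proposal code, proposal number and visit number are purely illustrative), bs could be looked up as follows:

from ispyb.sqlalchemy import BLSession, Proposal

# Hypothetical visit "cm28182-1": proposal code "cm",
# proposal number "28182", visit number 1
visit = "cm28182-1"
bs = (
    session.query(BLSession)
    .join(Proposal, BLSession.proposalId == Proposal.proposalId)
    .filter(
        Proposal.proposalCode == "cm",
        Proposal.proposalNumber == "28182",
        BLSession.visit_number == 1,
    )
    .one()
)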

query = (
    session.query(DataCollection, AutoProcProgram, AutoProcScalingStatistics)
    .join(
        AutoProcIntegration,
        AutoProcIntegration.dataCollectionId == DataCollection.dataCollectionId,
    )
    .join(
        AutoProcProgram,
        AutoProcIntegration.autoProcProgramId == AutoProcProgram.autoProcProgramId,
    )
    .join(AutoProc, AutoProcProgram.autoProcProgramId == AutoProc.autoProcProgramId)
    .join(AutoProcScaling, AutoProc.autoProcId == AutoProcScaling.autoProcId)
    .join(
        AutoProcScalingStatistics,
        AutoProcScalingStatistics.autoProcScalingId
        == AutoProcScaling.autoProcScalingId,
    )
    .filter(AutoProcScalingStatistics.scalingStatisticsType == "outerShell")
    .filter(DataCollection.SESSIONID == bs.sessionId)
)
print(query.count())
1221
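
The statement attribute exposes the SELECT statement that SQLAlchemy would execute for this query:

# Inspect the raw SQL underlying the query
print(query.statement)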

Thanks to being able to access the raw SQL statement that would be run by a query, we can use the pandas.read_sql() function to directly convert this query into a pandas.DataFrame:

import pandas as pd

df = pd.read_sql(query.statement, session.bind)
print(df.head())
   dataCollectionId  BLSAMPLEID  SESSIONID experimenttype
0           5791190   3224381.0   27446315           None
1           5791190   3224381.0   27446315           None
2           5791190   3224381.0   27446315           None
3           5791190   3224381.0   27446315           None
4           5791220   3224384.0   27446315           None

   dataCollectionNumber           startTime             endTime
0                     1 2021-01-14 13:47:30 2021-01-14 13:48:29
1                     1 2021-01-14 13:47:30 2021-01-14 13:48:29
2                     1 2021-01-14 13:47:30 2021-01-14 13:48:29
3                     1 2021-01-14 13:47:30 2021-01-14 13:48:29
4                     1 2021-01-14 13:51:01 2021-01-14 13:51:56

                   runStatus  axisStart  axisEnd  ...  meanIOverSigI
0  DataCollection Successful        0.0      0.1  ...       0.330998
1  DataCollection Successful        0.0      0.1  ...       0.588335
2  DataCollection Successful        0.0      0.1  ...       0.600000
3  DataCollection Successful        0.0      0.1  ...       1.400000
4  DataCollection Successful        0.0      0.1  ...       0.043246

   completeness  multiplicity  anomalousCompleteness  anomalousMultiplicity
0     99.900400       39.4367                99.7773                20.8294
1    100.000000       38.7516               100.0000                20.7655
2    100.000000       37.6000               100.0000                20.1000
3     69.100000       39.3000                70.3000                21.6000
4      0.053611        1.0000                 0.0000                 1.0000

    recordTimeStamp_1 anomalous    ccHalf ccAnomalous resIOverSigI2
0 2021-01-14 14:01:36      None  0.436763   -0.213406          None
1 2021-01-14 14:02:49      None  0.284781    0.020924          None
2 2021-01-14 14:09:14      None  0.366000   -0.026000          None
3 2021-01-14 14:09:15      None  0.728000   -0.008000          None
4 2021-01-14 16:17:06      None  0.000000    0.000000          None

[5 rows x 137 columns]

We can see that this query gave us a large number of columns (137 in fact!), most of which we're probably not interested in. There are even some duplicate column names, since some of the joined tables share column names, e.g. dataCollectionId. This will be problematic if we want to use the pandas groupby functionality on dataCollectionId.

print(df["dataCollectionId"].head())
0    5791190
1    5791190
2    5791190
3    5791190
4    5791220
Name: dataCollectionId, dtype: int64
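
One workaround, sketched here with plain pandas (this step is not part of the original workflow), is to drop the duplicated columns after the fact:

# Keep only the first occurrence of each duplicated column name
df = df.loc[:, ~df.columns.duplicated()]

A cleaner approach is to avoid fetching the unwanted columns in the first place.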

We can modify our query using the query.with_entities() method to select only those columns we’re interested in:

query = query.with_entities(
    DataCollection.dataCollectionId,
    AutoProcProgram.processingPrograms,
    AutoProcProgram.processingMessage,
    AutoProcProgram.processingStartTime,
    AutoProcProgram.processingEndTime,
    AutoProcScalingStatistics.ccHalf,
    AutoProcScalingStatistics.resolutionLimitHigh,
)
df = pd.read_sql(query.statement, session.bind)
print(df.head())
   dataCollectionId  processingPrograms      processingMessage
0           5791190          xia2 dials  processing successful
1           5791190           xia2 3dii  processing successful
2           5791190            autoPROC  processing successful
3           5791190  autoPROC+STARANISO  processing successful
4           5791220           xia2 3dii  processing successful

  processingStartTime   processingEndTime    ccHalf  resolutionLimitHigh
0 2021-01-14 13:48:34 2021-01-14 14:01:36  0.436763              2.05890
1 2021-01-14 13:48:35 2021-01-14 14:02:49  0.284781              2.26003
2 2021-01-14 13:48:41 2021-01-14 14:09:14  0.366000              2.32900
3 2021-01-14 14:09:15 2021-01-14 14:09:15  0.728000              2.29700
4 2021-01-14 13:52:04 2021-01-14 16:17:06  0.000000              1.06354

Next, we compute the runtime for each program, filter for only those programs that completed successfully (note that we could have done this filtering as part of the original database query), group by dataCollectionId and processingPrograms, and then plot a histogram of the runtimes:

# Calculate the processing time
df["processingDuration"] = df["processingEndTime"] - df["processingStartTime"]

# Select only those programs that successfully completed and group on
# dataCollectionId and processingPrograms
grouped = df[df["processingMessage"] == "processing successful"].groupby(
    ["dataCollectionId", "processingPrograms"]
)

# Calculate the minimum processing time for each group
processing_duration = grouped["processingDuration"].min().dt.total_seconds().unstack()

# Drop those columns that aren't strictly comparable
processing_duration.drop(
    ["autoPROC+STARANISO", "xia2 dials (multi)", "fast_dp", "xia2.multiplex"],
    inplace=True,
    axis=1,
)
print(processing_duration.median())

import math
import matplotlib.pyplot as plt

plt.style.use("ggplot")

# Plot processing duration as a histogram and boxplot
processing_duration.plot(kind="hist", bins=50, subplots=True)
plt.xlabel("Processing duration (seconds)")
plt.show()
processing_duration.plot(kind="box", vert=False)
plt.xlabel("Processing duration (seconds)")
plt.show()
processingPrograms
autoPROC             1876.0
xia2 3dii            1352.5
xia2 3dii (multi)    1472.0
xia2 dials           1316.0
dtype: float64
[Histogram and box plot figures of the processing durations]

To compare programs pairwise, we define a helper function that draws a scatter plot of each program against a reference program, adding a dotted identity line to each panel:
def comparison_scatter_plot(df, reference, compare):
    ncols = 2
    nrows = int(math.ceil(len(compare) / ncols))
    fig, axes = plt.subplots(nrows=nrows, ncols=ncols, sharex=True, sharey=True)
    for ax, comparison_prg in zip(axes.flatten(), compare):
        df.plot(kind="scatter", x=reference, y=comparison_prg, s=2, ax=ax)
        ax.set_title(comparison_prg)
    for ax in axes.flatten():
        vmin = min(ax.get_xlim()[0], ax.get_ylim()[0])
        vmax = max(ax.get_xlim()[1], ax.get_ylim()[1])
        ax.plot([vmin, vmax], [vmin, vmax], c="black", linestyle="dotted")
        ax.set_aspect("equal")
    return fig, axes


# Scatter plot for xia2 3dii and autoPROC vs xia2 dials
compare = ("xia2 3dii", "autoPROC")
fig, axes = comparison_scatter_plot(
    processing_duration, reference="xia2 dials", compare=compare
)
for ax, comparison_prg in zip(axes.flatten(), compare):
    ax.set_title(comparison_prg)
    ax.set_xlabel("xia2 dials runtime (seconds)")
    ax.set_ylabel(f"runtime (seconds)")
fig.suptitle(f"Runtime vs. xia2-dials for {visit}")
plt.show()
[Scatter plots of runtimes for xia2 3dii and autoPROC vs. xia2 dials]

We can reuse the same helper to compare the high-resolution limits reported by the different programs:
# Compare resolution limits
compare = ("fast_dp", "xia2 3dii", "autoPROC", "autoPROC+STARANISO")
fig, axes = comparison_scatter_plot(
    grouped["resolutionLimitHigh"].min().unstack(),
    reference="xia2 dials",
    compare=compare,
)
for ax, comparison_prg in zip(axes.flatten(), compare):
    ax.set_title(comparison_prg)
    ax.set_xlabel("xia2 dials d_min (Å^-1)")
    ax.set_ylabel(f"d_min (Å^-1)")
fig.suptitle(f"Resolution vs. xia2-dials for {visit}")
plt.show()
[Scatter plots of d_min for each program vs. xia2 dials]

Finally, we make the same comparison for the outer shell CC1/2 values:
# Compare outer shell CC1/2
compare = ("fast_dp", "xia2 3dii", "autoPROC", "autoPROC+STARANISO")
fig, axes = comparison_scatter_plot(
    grouped["ccHalf"].min().unstack(), reference="xia2 dials", compare=compare
)
for ax, comparison_prg in zip(axes.flatten(), compare):
    ax.set_title(comparison_prg)
    ax.set_xlabel("xia2 dials d_min (Å^-1)")
    ax.set_ylabel(f"d_min (Å^-1)")
fig.suptitle(f"Resolution vs. xia2-dials for {visit}")
plt.show()
[Scatter plots of outer shell CC1/2 for each program vs. xia2 dials]