ispyb.sqlalchemy
SQLAlchemy is a Python SQL toolkit and Object Relational Mapper that is used to provide access to the ISPyB database with the full power and flexibility of SQL. For a more general introduction to SQLAlchemy, see the official reference documentation and tutorials.
ORM models for all database tables have been automatically generated using sqlacodegen and are available inside the ispyb.sqlalchemy module.
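The generated models can be introspected like any other SQLAlchemy declarative class. As a minimal sketch using only standard SQLAlchemy table metadata, the columns defined on a model can be listed like this:
from ispyb.sqlalchemy import DataCollection

# Table name and the first few column names of the generated model
print(DataCollection.__tablename__)
print(list(DataCollection.__table__.columns.keys())[:5])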
Connecting to the database
First obtain the database connection URL by calling ispyb.sqlalchemy.url(), passing either the path to a config file or a Python dictionary containing the database credentials. If credentials=None, the function will instead look for a credentials file pointed to by the ISPYB_CREDENTIALS environment variable.
Example credentials file:
[ispyb_sqlalchemy]
username = user
password = password
host = localhost
port = 3306
database = ispyb_build
Example credentials dictionary:
{
"username": "user",
"password": "password",
"host": "localhost",
"port": 3306,
"database": "ispyb",
}
import ispyb.sqlalchemy
credentials = "/dls_sw/dasc/mariadb/credentials/ispyb.cfg"
url = ispyb.sqlalchemy.url(credentials)
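The same URL can be obtained from the other two credential sources described above. A short sketch, with placeholder values (and a placeholder config path) standing in for real credentials:
import os

import ispyb.sqlalchemy

# Passing a credentials dictionary directly (placeholder values)
credentials = {
    "username": "user",
    "password": "password",
    "host": "localhost",
    "port": 3306,
    "database": "ispyb",
}
url = ispyb.sqlalchemy.url(credentials)

# Relying on the ISPYB_CREDENTIALS environment variable instead
os.environ["ISPYB_CREDENTIALS"] = "/path/to/credentials.cfg"  # placeholder path
url = ispyb.sqlalchemy.url()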
We interact with the database via an SQLAlchemy Session, which is used to manage database transactions. Usually we should construct a new Session at the beginning of a logical operation involving database access. We can use the sessionmaker class to provide a factory for Session objects:
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
# an Engine, which the Session will use for connection
# resources, typically in module scope
engine = create_engine(url, connect_args={"use_pure": True})
# a sessionmaker(), also in the same scope as the engine
Session = sessionmaker(engine)
Making a basic query
First we import the relevant database table models from ispyb.sqlalchemy, in this case the DataCollection class, which represents the database table of the same name. Next we query the database to ask how many entries there are in this table:
from ispyb.sqlalchemy import DataCollection
# in production code it is typical to use Session as a
# context manager around a logical block of code
with Session() as session:
    query = session.query(DataCollection)
    print(query.count())
# for convenience in this tutorial we won't use it as a
# context manager
session = Session()
query = session.query(DataCollection)
3081314
More interesting would be to query the database for the records of a specific data collection. In this case, calling .filter() on the original query returns a new query restricted to the entries that match the conditions passed to .filter():
query = query.filter(DataCollection.dataCollectionId == 6036518)
# This will return the first row of the results, or None if there aren't any results
dc = query.first()
# This will return the single matching row, raising an error if the query matches no rows or more than one row
dc = query.one()
print(dc)
<ispyb.sqlalchemy._auto_db_schema.DataCollection object at 0x7f2041b05460>
We can use the resulting DataCollection object to print some basic information from the DataCollection table:
print(
f"""\
dcid: {dc.dataCollectionId}
image template: {dc.imageDirectory}{dc.fileTemplate}
start time: {dc.startTime}\
"""
)
dcid: 6036518
image template: /dls/i04/data/2021/cm28182-1/20210302/TestInsulin/ZincInsulinB4/ZincInsulinB4_1_master.h5
start time: 2021-03-02 13:38:10
Extracting autoprocessing statistics
In this next example, we extract some autoprocessing statistics for a given visit. We will take advantage of the fact that SQLAlchemy query results can easily be loaded into a pandas.DataFrame to facilitate further analysis. First, we create our query to select the relevant information from the database.
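The query below filters on DataCollection.SESSIONID == bs.sessionId, where bs is assumed to be the BLSession record for the visit of interest. One way of obtaining it is sketched here; the BLSession and Proposal models are exported by ispyb.sqlalchemy, but the decomposition of the visit name into proposal code, proposal number and visit number is an assumption about the naming convention in use:
from ispyb.sqlalchemy import BLSession, Proposal

# Hypothetical visit name: <proposalCode><proposalNumber>-<visit_number>,
# e.g. "cm28182-1" -> proposal code "cm", proposal number "28182", visit 1.
# The two-character proposal code split below is an assumption.
visit = "cm28182-1"
proposal_name, visit_number = visit.split("-")
bs = (
    session.query(BLSession)
    .join(Proposal, BLSession.proposalId == Proposal.proposalId)
    .filter(Proposal.proposalCode == proposal_name[:2])
    .filter(Proposal.proposalNumber == proposal_name[2:])
    .filter(BLSession.visit_number == int(visit_number))
    .one()
)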
from ispyb.sqlalchemy import (
    AutoProcIntegration,
    AutoProc,
    AutoProcProgram,
    AutoProcScaling,
    AutoProcScalingStatistics,
)
query = (
    session.query(DataCollection, AutoProcProgram, AutoProcScalingStatistics)
    .join(
        AutoProcIntegration,
        AutoProcIntegration.dataCollectionId == DataCollection.dataCollectionId,
    )
    .join(
        AutoProcProgram,
        AutoProcIntegration.autoProcProgramId == AutoProcProgram.autoProcProgramId,
    )
    .join(AutoProc, AutoProcProgram.autoProcProgramId == AutoProc.autoProcProgramId)
    .join(AutoProcScaling, AutoProc.autoProcId == AutoProcScaling.autoProcId)
    .join(
        AutoProcScalingStatistics,
        AutoProcScalingStatistics.autoProcScalingId
        == AutoProcScaling.autoProcScalingId,
    )
    .filter(AutoProcScalingStatistics.scalingStatisticsType == "outerShell")
    .filter(DataCollection.SESSIONID == bs.sessionId)
)
print(query.count())
1221
Because we can access the raw SQL statement that a query would run, we can use the pandas.read_sql() function to convert this query directly into a pandas.DataFrame:
import pandas as pd
df = pd.read_sql(query.statement, session.bind)
print(df.head())
   dataCollectionId  BLSAMPLEID  SESSIONID experimenttype
0           5791190   3224381.0   27446315           None
1           5791190   3224381.0   27446315           None
2           5791190   3224381.0   27446315           None
3           5791190   3224381.0   27446315           None
4           5791220   3224384.0   27446315           None

   dataCollectionNumber           startTime             endTime
0                     1 2021-01-14 13:47:30 2021-01-14 13:48:29
1                     1 2021-01-14 13:47:30 2021-01-14 13:48:29
2                     1 2021-01-14 13:47:30 2021-01-14 13:48:29
3                     1 2021-01-14 13:47:30 2021-01-14 13:48:29
4                     1 2021-01-14 13:51:01 2021-01-14 13:51:56

                   runStatus  axisStart  axisEnd  ...  meanIOverSigI
0  DataCollection Successful        0.0      0.1  ...       0.330998
1  DataCollection Successful        0.0      0.1  ...       0.588335
2  DataCollection Successful        0.0      0.1  ...       0.600000
3  DataCollection Successful        0.0      0.1  ...       1.400000
4  DataCollection Successful        0.0      0.1  ...       0.043246

   completeness  multiplicity  anomalousCompleteness  anomalousMultiplicity
0     99.900400       39.4367                99.7773                20.8294
1    100.000000       38.7516               100.0000                20.7655
2    100.000000       37.6000               100.0000                20.1000
3     69.100000       39.3000                70.3000                21.6000
4      0.053611        1.0000                 0.0000                 1.0000

    recordTimeStamp_1 anomalous    ccHalf  ccAnomalous resIOverSigI2
0 2021-01-14 14:01:36      None  0.436763    -0.213406          None
1 2021-01-14 14:02:49      None  0.284781     0.020924          None
2 2021-01-14 14:09:14      None  0.366000    -0.026000          None
3 2021-01-14 14:09:15      None  0.728000    -0.008000          None
4 2021-01-14 16:17:06      None  0.000000     0.000000          None

[5 rows x 137 columns]
We can see that this query gave us a large number of columns (137 in fact!), most of which we're probably not interested in. There are even some duplicate column names, since several of the joined tables share column names, e.g. dataCollectionId. This will be problematic if we want to use pandas groupby functionality on dataCollectionId.
print(df["dataCollectionId"].head())
0 5791190
1 5791190
2 5791190
3 5791190
4 5791220
Name: dataCollectionId, dtype: int64
We can modify our query using the query.with_entities() method to select only those columns we're interested in:
query = query.with_entities(
    DataCollection.dataCollectionId,
    AutoProcProgram.processingPrograms,
    AutoProcProgram.processingMessage,
    AutoProcProgram.processingStartTime,
    AutoProcProgram.processingEndTime,
    AutoProcScalingStatistics.ccHalf,
    AutoProcScalingStatistics.resolutionLimitHigh,
)
df = pd.read_sql(query.statement, session.bind)
print(df.head())
   dataCollectionId  processingPrograms      processingMessage
0           5791190          xia2 dials  processing successful
1           5791190           xia2 3dii  processing successful
2           5791190            autoPROC  processing successful
3           5791190  autoPROC+STARANISO  processing successful
4           5791220           xia2 3dii  processing successful

  processingStartTime   processingEndTime    ccHalf  resolutionLimitHigh
0 2021-01-14 13:48:34 2021-01-14 14:01:36  0.436763              2.05890
1 2021-01-14 13:48:35 2021-01-14 14:02:49  0.284781              2.26003
2 2021-01-14 13:48:41 2021-01-14 14:09:14  0.366000              2.32900
3 2021-01-14 14:09:15 2021-01-14 14:09:15  0.728000              2.29700
4 2021-01-14 13:52:04 2021-01-14 16:17:06  0.000000              1.06354
Next, we compute the runtime for each program, filtering only for programs that completed successfully (note that we could have done this filtering as part of the original database query, as shown in the sketch below), group by dataCollectionId and processingPrograms, and then plot a histogram of the runtimes:
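For reference, the database-side version of that filter is just another .filter() call on the query we built earlier, reusing the processingMessage value seen in the DataFrame (a sketch; the query_successful and df_successful names are ours):
# Equivalent filtering done in the database rather than in pandas
query_successful = query.filter(
    AutoProcProgram.processingMessage == "processing successful"
)
df_successful = pd.read_sql(query_successful.statement, session.bind)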
# Calculate the processing time
df["processingDuration"] = df["processingEndTime"] - df["processingStartTime"]
# Select only those programs that successfully completed and group on
# dataCollectionId and processingPrograms
grouped = df[df["processingMessage"] == "processing successful"].groupby(
    ["dataCollectionId", "processingPrograms"]
)
# Calculate the minimum processing time for each group
processing_duration = grouped["processingDuration"].min().dt.total_seconds().unstack()
# Drop those columns that aren't strictly comparable
processing_duration.drop(
    ["autoPROC+STARANISO", "xia2 dials (multi)", "fast_dp", "xia2.multiplex"],
    inplace=True,
    axis=1,
)
print(processing_duration.median())
import math
import matplotlib.pyplot as plt
plt.style.use("ggplot")
# Plot processing duration as a histogram and boxplot
processing_duration.plot(kind="hist", bins=50, subplots=True)
plt.xlabel("Processing duration (seconds)")
plt.show()
processing_duration.plot(kind="box", vert=False)
plt.xlabel("Processing duration (seconds)")
plt.show()
processingPrograms
autoPROC 1876.0
xia2 3dii 1352.5
xia2 3dii (multi) 1472.0
xia2 dials 1316.0
dtype: float64
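To make the remaining comparisons easier, we define a small helper function that scatter-plots each program against a chosen reference program, adding a dotted y = x guide line to each panel: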
def comparison_scatter_plot(df, reference, compare):
    ncols = 2
    nrows = int(math.ceil(len(compare) / ncols))
    fig, axes = plt.subplots(nrows=nrows, ncols=ncols, sharex=True, sharey=True)
    for ax, comparison_prg in zip(axes.flatten(), compare):
        df.plot(kind="scatter", x=reference, y=comparison_prg, s=2, ax=ax)
        ax.set_title(comparison_prg)
    for ax in axes.flatten():
        vmin = min(ax.get_xlim()[0], ax.get_ylim()[0])
        vmax = max(ax.get_xlim()[1], ax.get_ylim()[1])
        ax.plot([vmin, vmax], [vmin, vmax], c="black", linestyle="dotted")
        ax.set_aspect("equal")
    return fig, axes
# Scatter plot for xia2 3dii and autoPROC vs xia2 dials
compare = ("xia2 3dii", "autoPROC")
fig, axes = comparison_scatter_plot(
    processing_duration, reference="xia2 dials", compare=compare
)
for ax, comparison_prg in zip(axes.flatten(), compare):
    ax.set_title(comparison_prg)
    ax.set_xlabel("xia2 dials runtime (seconds)")
    ax.set_ylabel("runtime (seconds)")
fig.suptitle(f"Runtime vs. xia2-dials for {visit}")
plt.show()
# Compare resolution limits
compare = ("fast_dp", "xia2 3dii", "autoPROC", "autoPROC+STARANISO")
fig, axes = comparison_scatter_plot(
    grouped["resolutionLimitHigh"].min().unstack(),
    reference="xia2 dials",
    compare=compare,
)
for ax, comparison_prg in zip(axes.flatten(), compare):
    ax.set_title(comparison_prg)
    ax.set_xlabel("xia2 dials d_min (Å)")
    ax.set_ylabel("d_min (Å)")
fig.suptitle(f"Resolution vs. xia2-dials for {visit}")
plt.show()
# Compare outer shell CC1/2
compare = ("fast_dp", "xia2 3dii", "autoPROC", "autoPROC+STARANISO")
fig, axes = comparison_scatter_plot(
    grouped["ccHalf"].min().unstack(), reference="xia2 dials", compare=compare
)
for ax, comparison_prg in zip(axes.flatten(), compare):
    ax.set_title(comparison_prg)
    ax.set_xlabel("xia2 dials outer shell CC1/2")
    ax.set_ylabel("outer shell CC1/2")
fig.suptitle(f"Outer shell CC1/2 vs. xia2-dials for {visit}")
plt.show()