================
ispyb.sqlalchemy
================
SQLAlchemy_ is a Python SQL toolkit and Object Relational Mapper that is used
to provide access to the ISPyB database with the full power and flexibility of
SQL. For a more general introduction to SQLAlchemy, see the official
`reference documentation `_ and
`tutorials `_.
ORM models for all database tables have been automatically generated using
`sqlacodegen `_ and are available
inside the ``ispyb.sqlalchemy`` module.
.. _SQLAlchemy: https://www.sqlalchemy.org/
Connecting to the database
--------------------------
First obtain the database connection URL by calling
``ispyb.sqlalchemy.url()``, passing either the path to a config file or
a Python dictionary containing the database credentials. If
``credentials=None`` then the function will look for a credentials file
pointed to by the ``ISPYB_CREDENTIALS`` environment variable.
Example credentials file:
::
[ispyb_sqlalchemy]
username = user
password = password
host = localhost
port = 3306
database = ispyb_build
Example credentials dictionary:
::
{
"username": "user",
"password": "password",
"host": "localhost",
"port": 3306,
"database": "ispyb",
}
.. code:: ipython3
import ispyb.sqlalchemy
credentials = "/dls_sw/dasc/mariadb/credentials/ispyb.cfg"
url = ispyb.sqlalchemy.url(credentials)
We interact with the database via an SQLAlchemy ``Session``, which is
used to manage database transactions. Usually we should construct a new
``Session`` at the beginning of a logical operation involving database
access. We can use the ``sessionmaker`` class to provide a factory for
``Session`` objects:
.. code:: ipython3
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
# an Engine, which the Session will use for connection
# resources, typically in module scope
engine = create_engine(url, connect_args={"use_pure": True})
# a sessionmaker(), also in the same scope as the engine
Session = sessionmaker(engine)
Making a basic query
--------------------
First we import the relevant database table models from
``ispyb.sqlalchemy``, in this case, the ``DataCollection`` class, which
represents the database table of the same name. Next we query the
database to ask how many entries there are in this table:
.. code:: ipython3
from ispyb.sqlalchemy import DataCollection
# in production code it is typical to use Session as a
# context manager around a logical block of code
with Session() as session:
query = session.query(DataCollection)
print(query.count())
# for convenience in this tutorial we won't use it as a
# context manager
session = Session()
query = session.query(DataCollection)
.. parsed-literal::
3081314
More interesting would be to query the database for the records for a
specific data collection. In this case, calling ``.filter()`` on the
original query will return the entries that match the conditions
provided to the ``.filter()`` method:
.. code:: ipython3
query = query.filter(DataCollection.dataCollectionId == 6036518)
# This will return the first row of the results, or None if there aren't any results
dc = query.first()
# This will return the first row, and raise an error if no results match the query
dc = query.one()
print(dc)
.. parsed-literal::
We can use the resulting ``DataCollection`` object to print some basic
information from the ``DataCollection`` table:
.. code:: ipython3
print(
f"""\
dcid: {dc.dataCollectionId}
image template: {dc.imageDirectory}{dc.fileTemplate}
start time: {dc.startTime}\
"""
)
.. parsed-literal::
dcid: 6036518
image template: /dls/i04/data/2021/cm28182-1/20210302/TestInsulin/ZincInsulinB4/ZincInsulinB4_1_master.h5
start time: 2021-03-02 13:38:10
Querying related tables
-----------------------
We can easily traverse related objects (as defined in the database by
foreign key relationships), e.g.:
.. code:: ipython3
dcg = dc.DataCollectionGroup
sample = dcg.BLSample
container = sample.Container
crystal = sample.Crystal
protein = crystal.Protein
print(
f"""\
experiment type: {dcg.experimentType}
sample name: {sample.name}
container code: {container.code}
container type: {container.containerType}
container capacity: {container.capacity}
space group: {crystal.spaceGroup}
protein name: {protein.name}
protein acronym: {protein.acronym}\
"""
)
.. parsed-literal::
experiment type: SAD
sample name: ZincInsulinB4
container code: I04R-012
container type: Puck
container capacity: 16
space group:
protein name: TestInsulin
protein acronym: TestInsulin
An important point to note is that by default, relationships are loaded
lazily, i.e. only data for the ``DataCollection`` entry were returned by
the initial database query, and when we call (e.g.) the
``dc.DataCollectionGroup`` attribute of the resulting ``DataCollection``
object, another query is made to the database. This can be inefficient,
especially if you want to examine multiple rows across related tables
for several data collections.
SQLAlchemy provides a number of different ways to control how
relationships are loaded, included, lazy loading (this is the default
for most relationships), joined loading, subquery loading and select IN
loading. For more detailed information see the SQLAlchemy documentation
on `loading
relationships `__.
In the example below we use joined eager loading by calling the
``joinedload`` option:
.. code:: ipython3
from sqlalchemy.orm import joinedload
from ispyb.sqlalchemy import DataCollectionGroup, BLSample, Container, Crystal, Protein
query = (
session.query(DataCollection)
.options(
joinedload(DataCollection.DataCollectionGroup)
.joinedload(DataCollectionGroup.BLSample)
.options(
joinedload(BLSample.Container),
joinedload(BLSample.Crystal).joinedload(Crystal.Protein),
)
)
.filter(DataCollection.dataCollectionId == 6036518)
)
dc = query.first()
dcg = dc.DataCollectionGroup
sample = dcg.BLSample
container = sample.Container
crystal = sample.Crystal
protein = crystal.Protein
print(
f"""\
dcid: {dc.dataCollectionId}
image template: {dc.imageDirectory}{dc.fileTemplate}
start time: {dc.startTime}
experiment type: {dcg.experimentType}
sample name: {sample.name}
container code: {container.code}
container type: {container.containerType}
container capacity: {container.capacity}
space group: {crystal.spaceGroup}
protein name: {protein.name}
protein acronym: {protein.acronym}\
"""
)
.. parsed-literal::
dcid: 6036518
image template: /dls/i04/data/2021/cm28182-1/20210302/TestInsulin/ZincInsulinB4/ZincInsulinB4_1_master.h5
start time: 2021-03-02 13:38:10
experiment type: SAD
sample name: ZincInsulinB4
container code: I04R-012
container type: Puck
container capacity: 16
space group:
protein name: TestInsulin
protein acronym: TestInsulin
Alternatively we can use an SQL ``JOIN`` query. Here we indicate that
we’re only interested in the ``DataCollection`` and ``Protein`` entries,
but we must specify how the two tables are linked via intermediates:
.. code:: ipython3
query = (
session.query(DataCollection, Protein)
.join(DataCollectionGroup)
.join(BLSample)
.join(Crystal)
.join(Protein)
.filter(DataCollection.dataCollectionId == 6036518)
)
dc, protein = query.one()
print(dc.dataCollectionId, protein.name)
.. parsed-literal::
6036518 TestInsulin
Below we query the ``BLSession`` and ``Proposal`` tables to determine
the full visit name. The ``DataCollection`` and ``BLSession`` tables
aren’t explictly linked via a foreign key, so it is necessary to
explicitly link the tables in our query:
.. code:: ipython3
from ispyb.sqlalchemy import BLSample, BLSession, Proposal
query = (
session.query(BLSession, Proposal)
.join(DataCollection, DataCollection.SESSIONID == BLSession.sessionId)
.join(Proposal)
.filter(DataCollection.dataCollectionId == 6036455)
)
bs, p = query.first()
visit = f"visit: {p.proposalCode}{p.proposalNumber}-{bs.visit_number}"
print(visit)
.. parsed-literal::
visit: cm28182-1
Extracting autoprocessing statistics
------------------------------------
In this next example, we extract some autoprocessing statistics for a
given visit. We will take advantage that we can easily use SQLAlchemy to
extract data into a ``pandas.DataFrame`` to facilitate further analysis.
First, we create our query to select the relevant information from the
database.
.. code:: ipython3
from ispyb.sqlalchemy import (
AutoProcIntegration,
AutoProc,
AutoProcProgram,
AutoProcScaling,
AutoProcScalingStatistics,
)
query = (
session.query(DataCollection, AutoProcProgram, AutoProcScalingStatistics)
.join(
AutoProcIntegration,
AutoProcIntegration.dataCollectionId == DataCollection.dataCollectionId,
)
.join(
AutoProcProgram,
AutoProcIntegration.autoProcProgramId == AutoProcProgram.autoProcProgramId,
)
.join(AutoProc, AutoProcProgram.autoProcProgramId == AutoProc.autoProcProgramId)
.join(AutoProcScaling, AutoProc.autoProcId == AutoProcScaling.autoProcId)
.join(
AutoProcScalingStatistics,
AutoProcScalingStatistics.autoProcScalingId
== AutoProcScaling.autoProcScalingId,
)
.filter(AutoProcScalingStatistics.scalingStatisticsType == "outerShell")
.filter(DataCollection.SESSIONID == bs.sessionId)
)
print(query.count())
.. parsed-literal::
1221
Thanks to being able to access the raw SQL statement that would be run
by a query, we can use the ``pandas.read_sql()`` function to directly
convert this query into a ``pandas.DataFrame``:
.. code:: ipython3
import pandas as pd
df = pd.read_sql(query.statement, session.bind)
print(df.head())
.. parsed-literal::
dataCollectionId BLSAMPLEID SESSIONID experimenttype \
0 5791190 3224381.0 27446315 None
1 5791190 3224381.0 27446315 None
2 5791190 3224381.0 27446315 None
3 5791190 3224381.0 27446315 None
4 5791220 3224384.0 27446315 None
dataCollectionNumber startTime endTime \
0 1 2021-01-14 13:47:30 2021-01-14 13:48:29
1 1 2021-01-14 13:47:30 2021-01-14 13:48:29
2 1 2021-01-14 13:47:30 2021-01-14 13:48:29
3 1 2021-01-14 13:47:30 2021-01-14 13:48:29
4 1 2021-01-14 13:51:01 2021-01-14 13:51:56
runStatus axisStart axisEnd ... meanIOverSigI \
0 DataCollection Successful 0.0 0.1 ... 0.330998
1 DataCollection Successful 0.0 0.1 ... 0.588335
2 DataCollection Successful 0.0 0.1 ... 0.600000
3 DataCollection Successful 0.0 0.1 ... 1.400000
4 DataCollection Successful 0.0 0.1 ... 0.043246
completeness multiplicity anomalousCompleteness anomalousMultiplicity \
0 99.900400 39.4367 99.7773 20.8294
1 100.000000 38.7516 100.0000 20.7655
2 100.000000 37.6000 100.0000 20.1000
3 69.100000 39.3000 70.3000 21.6000
4 0.053611 1.0000 0.0000 1.0000
recordTimeStamp_1 anomalous ccHalf ccAnomalous resIOverSigI2
0 2021-01-14 14:01:36 None 0.436763 -0.213406 None
1 2021-01-14 14:02:49 None 0.284781 0.020924 None
2 2021-01-14 14:09:14 None 0.366000 -0.026000 None
3 2021-01-14 14:09:15 None 0.728000 -0.008000 None
4 2021-01-14 16:17:06 None 0.000000 0.000000 None
[5 rows x 137 columns]
We can see that this query gave us a large number of columns (136 in
fact!), most of which we’re probably not interested in. There are even
some duplicate column names across this as some of the tables share
column names, e.g. ``dataCollectionId``. This will be problematic if we
want to use pandas ``groupby`` functionality on ``dataCollectionId``.
.. code:: ipython3
print(df["dataCollectionId"].head())
.. parsed-literal::
0 5791190
1 5791190
2 5791190
3 5791190
4 5791220
Name: dataCollectionId, dtype: int64
We can modify our query using the ``query.with_entities()`` method to
select only those columns we’re interested in:
.. code:: ipython3
query = query.with_entities(
DataCollection.dataCollectionId,
AutoProcProgram.processingPrograms,
AutoProcProgram.processingMessage,
AutoProcProgram.processingStartTime,
AutoProcProgram.processingEndTime,
AutoProcScalingStatistics.ccHalf,
AutoProcScalingStatistics.resolutionLimitHigh,
)
df = pd.read_sql(query.statement, session.bind)
print(df.head())
.. parsed-literal::
dataCollectionId processingPrograms processingMessage \
0 5791190 xia2 dials processing successful
1 5791190 xia2 3dii processing successful
2 5791190 autoPROC processing successful
3 5791190 autoPROC+STARANISO processing successful
4 5791220 xia2 3dii processing successful
processingStartTime processingEndTime ccHalf resolutionLimitHigh
0 2021-01-14 13:48:34 2021-01-14 14:01:36 0.436763 2.05890
1 2021-01-14 13:48:35 2021-01-14 14:02:49 0.284781 2.26003
2 2021-01-14 13:48:41 2021-01-14 14:09:14 0.366000 2.32900
3 2021-01-14 14:09:15 2021-01-14 14:09:15 0.728000 2.29700
4 2021-01-14 13:52:04 2021-01-14 16:17:06 0.000000 1.06354
Next, compute the runtime for each program, filtering only for programs
that completed successfully (note we could have done this filtering as
part of the original database query), group by ``dataCollectionId`` and
``processingPrograms`` and then plot a histogram of the runtimes:
.. code:: ipython3
# Calculate the processing time
df["processingDuration"] = df["processingEndTime"] - df["processingStartTime"]
# Select only those programs that successfully completed and group on
# dataCollectionid and processingPrograms
grouped = df[df["processingMessage"] == "processing successful"].groupby(
["dataCollectionId", "processingPrograms"]
)
# Calculate the minimum processing time for each group
processing_duration = grouped["processingDuration"].min().dt.total_seconds().unstack()
# Drop those columns that aren't strictly comparable
processing_duration.drop(
["autoPROC+STARANISO", "xia2 dials (multi)", "fast_dp", "xia2.multiplex"],
inplace=True,
axis=1,
)
print(processing_duration.median())
import math
import matplotlib.pyplot as plt
plt.style.use("ggplot")
# Plot processing duration as a histogram and boxplot
processing_duration.plot(kind="hist", bins=50, subplots=True)
plt.xlabel("Processing duration (seconds)")
plt.show()
processing_duration.plot(kind="box", vert=False)
plt.xlabel("Processing duration (seconds)")
plt.show()
.. parsed-literal::
processingPrograms
autoPROC 1876.0
xia2 3dii 1352.5
xia2 3dii (multi) 1472.0
xia2 dials 1316.0
dtype: float64
.. parsed-literal::
.. raw:: html
.. parsed-literal::
.. raw:: html
.. code:: ipython3
def comparison_scatter_plot(df, reference, compare):
ncols = 2
nrows = int(math.ceil(len(compare) / ncols))
fig, axes = plt.subplots(nrows=nrows, ncols=ncols, sharex=True, sharey=True)
for ax, comparison_prg in zip(axes.flatten(), compare):
df.plot(kind="scatter", x=reference, y=comparison_prg, s=2, ax=ax)
ax.set_title(comparison_prg)
for ax in axes.flatten():
vmin = min(ax.get_xlim()[0], ax.get_ylim()[0])
vmax = max(ax.get_xlim()[1], ax.get_ylim()[1])
ax.plot([vmin, vmax], [vmin, vmax], c="black", linestyle="dotted")
ax.set_aspect("equal")
return fig, axes
# Scatter plot for xia2 3dii and autoPROC vs xia2 dials
compare = ("xia2 3dii", "autoPROC")
fig, axes = comparison_scatter_plot(
processing_duration, reference="xia2 dials", compare=compare
)
for ax, comparison_prg in zip(axes.flatten(), compare):
ax.set_title(comparison_prg)
ax.set_xlabel("xia2 dials runtime (seconds)")
ax.set_ylabel(f"runtime (seconds)")
fig.suptitle(f"Runtime vs. xia2-dials for {visit}")
plt.show()
.. parsed-literal::
.. raw:: html
.. code:: ipython3
# Compare resolution limits
compare = ("fast_dp", "xia2 3dii", "autoPROC", "autoPROC+STARANISO")
fig, axes = comparison_scatter_plot(
grouped["resolutionLimitHigh"].min().unstack(),
reference="xia2 dials",
compare=compare,
)
for ax, comparison_prg in zip(axes.flatten(), compare):
ax.set_title(comparison_prg)
ax.set_xlabel("xia2 dials d_min (Å^-1)")
ax.set_ylabel(f"d_min (Å^-1)")
fig.suptitle(f"Resolution vs. xia2-dials for {visit}")
plt.show()
.. parsed-literal::
.. raw:: html
.. code:: ipython3
# Compare outer shell CC1/2
compare = ("fast_dp", "xia2 3dii", "autoPROC", "autoPROC+STARANISO")
fig, axes = comparison_scatter_plot(
grouped["ccHalf"].min().unstack(), reference="xia2 dials", compare=compare
)
for ax, comparison_prg in zip(axes.flatten(), compare):
ax.set_title(comparison_prg)
ax.set_xlabel("xia2 dials d_min (Å^-1)")
ax.set_ylabel(f"d_min (Å^-1)")
fig.suptitle(f"Resolution vs. xia2-dials for {visit}")
plt.show()
.. parsed-literal::
.. raw:: html