================ ispyb.sqlalchemy ================ SQLAlchemy_ is a Python SQL toolkit and Object Relational Mapper that is used to provide access to the ISPyB database with the full power and flexibility of SQL. For a more general introduction to SQLAlchemy, see the official `reference documentation `_ and `tutorials `_. ORM models for all database tables have been automatically generated using `sqlacodegen `_ and are available inside the ``ispyb.sqlalchemy`` module. .. _SQLAlchemy: https://www.sqlalchemy.org/ Connecting to the database -------------------------- First obtain the database connection URL by calling ``ispyb.sqlalchemy.url()``, passing either the path to a config file or a Python dictionary containing the database credentials. If ``credentials=None`` then the function will look for a credentials file pointed to by the ``ISPYB_CREDENTIALS`` environment variable. Example credentials file: :: [ispyb_sqlalchemy] username = user password = password host = localhost port = 3306 database = ispyb_build Example credentials dictionary: :: { "username": "user", "password": "password", "host": "localhost", "port": 3306, "database": "ispyb", } .. code:: ipython3 import ispyb.sqlalchemy credentials = "/dls_sw/dasc/mariadb/credentials/ispyb.cfg" url = ispyb.sqlalchemy.url(credentials) We interact with the database via an SQLAlchemy ``Session``, which is used to manage database transactions. Usually we should construct a new ``Session`` at the beginning of a logical operation involving database access. We can use the ``sessionmaker`` class to provide a factory for ``Session`` objects: .. code:: ipython3 from sqlalchemy import create_engine from sqlalchemy.orm import sessionmaker # an Engine, which the Session will use for connection # resources, typically in module scope engine = create_engine(url, connect_args={"use_pure": True}) # a sessionmaker(), also in the same scope as the engine Session = sessionmaker(engine) Making a basic query -------------------- First we import the relevant database table models from ``ispyb.sqlalchemy``, in this case, the ``DataCollection`` class, which represents the database table of the same name. Next we query the database to ask how many entries there are in this table: .. code:: ipython3 from ispyb.sqlalchemy import DataCollection # in production code it is typical to use Session as a # context manager around a logical block of code with Session() as session: query = session.query(DataCollection) print(query.count()) # for convenience in this tutorial we won't use it as a # context manager session = Session() query = session.query(DataCollection) .. parsed-literal:: 3081314 More interesting would be to query the database for the records for a specific data collection. In this case, calling ``.filter()`` on the original query will return the entries that match the conditions provided to the ``.filter()`` method: .. code:: ipython3 query = query.filter(DataCollection.dataCollectionId == 6036518) # This will return the first row of the results, or None if there aren't any results dc = query.first() # This will return the first row, and raise an error if no results match the query dc = query.one() print(dc) .. parsed-literal:: We can use the resulting ``DataCollection`` object to print some basic information from the ``DataCollection`` table: .. code:: ipython3 print( f"""\ dcid: {dc.dataCollectionId} image template: {dc.imageDirectory}{dc.fileTemplate} start time: {dc.startTime}\ """ ) .. parsed-literal:: dcid: 6036518 image template: /dls/i04/data/2021/cm28182-1/20210302/TestInsulin/ZincInsulinB4/ZincInsulinB4_1_master.h5 start time: 2021-03-02 13:38:10 Querying related tables ----------------------- We can easily traverse related objects (as defined in the database by foreign key relationships), e.g.: .. code:: ipython3 dcg = dc.DataCollectionGroup sample = dcg.BLSample container = sample.Container crystal = sample.Crystal protein = crystal.Protein print( f"""\ experiment type: {dcg.experimentType} sample name: {sample.name} container code: {container.code} container type: {container.containerType} container capacity: {container.capacity} space group: {crystal.spaceGroup} protein name: {protein.name} protein acronym: {protein.acronym}\ """ ) .. parsed-literal:: experiment type: SAD sample name: ZincInsulinB4 container code: I04R-012 container type: Puck container capacity: 16 space group: protein name: TestInsulin protein acronym: TestInsulin An important point to note is that by default, relationships are loaded lazily, i.e. only data for the ``DataCollection`` entry were returned by the initial database query, and when we call (e.g.) the ``dc.DataCollectionGroup`` attribute of the resulting ``DataCollection`` object, another query is made to the database. This can be inefficient, especially if you want to examine multiple rows across related tables for several data collections. SQLAlchemy provides a number of different ways to control how relationships are loaded, included, lazy loading (this is the default for most relationships), joined loading, subquery loading and select IN loading. For more detailed information see the SQLAlchemy documentation on `loading relationships `__. In the example below we use joined eager loading by calling the ``joinedload`` option: .. code:: ipython3 from sqlalchemy.orm import joinedload from ispyb.sqlalchemy import DataCollectionGroup, BLSample, Container, Crystal, Protein query = ( session.query(DataCollection) .options( joinedload(DataCollection.DataCollectionGroup) .joinedload(DataCollectionGroup.BLSample) .options( joinedload(BLSample.Container), joinedload(BLSample.Crystal).joinedload(Crystal.Protein), ) ) .filter(DataCollection.dataCollectionId == 6036518) ) dc = query.first() dcg = dc.DataCollectionGroup sample = dcg.BLSample container = sample.Container crystal = sample.Crystal protein = crystal.Protein print( f"""\ dcid: {dc.dataCollectionId} image template: {dc.imageDirectory}{dc.fileTemplate} start time: {dc.startTime} experiment type: {dcg.experimentType} sample name: {sample.name} container code: {container.code} container type: {container.containerType} container capacity: {container.capacity} space group: {crystal.spaceGroup} protein name: {protein.name} protein acronym: {protein.acronym}\ """ ) .. parsed-literal:: dcid: 6036518 image template: /dls/i04/data/2021/cm28182-1/20210302/TestInsulin/ZincInsulinB4/ZincInsulinB4_1_master.h5 start time: 2021-03-02 13:38:10 experiment type: SAD sample name: ZincInsulinB4 container code: I04R-012 container type: Puck container capacity: 16 space group: protein name: TestInsulin protein acronym: TestInsulin Alternatively we can use an SQL ``JOIN`` query. Here we indicate that we’re only interested in the ``DataCollection`` and ``Protein`` entries, but we must specify how the two tables are linked via intermediates: .. code:: ipython3 query = ( session.query(DataCollection, Protein) .join(DataCollectionGroup) .join(BLSample) .join(Crystal) .join(Protein) .filter(DataCollection.dataCollectionId == 6036518) ) dc, protein = query.one() print(dc.dataCollectionId, protein.name) .. parsed-literal:: 6036518 TestInsulin Below we query the ``BLSession`` and ``Proposal`` tables to determine the full visit name. The ``DataCollection`` and ``BLSession`` tables aren’t explictly linked via a foreign key, so it is necessary to explicitly link the tables in our query: .. code:: ipython3 from ispyb.sqlalchemy import BLSample, BLSession, Proposal query = ( session.query(BLSession, Proposal) .join(DataCollection, DataCollection.SESSIONID == BLSession.sessionId) .join(Proposal) .filter(DataCollection.dataCollectionId == 6036455) ) bs, p = query.first() visit = f"visit: {p.proposalCode}{p.proposalNumber}-{bs.visit_number}" print(visit) .. parsed-literal:: visit: cm28182-1 Extracting autoprocessing statistics ------------------------------------ In this next example, we extract some autoprocessing statistics for a given visit. We will take advantage that we can easily use SQLAlchemy to extract data into a ``pandas.DataFrame`` to facilitate further analysis. First, we create our query to select the relevant information from the database. .. code:: ipython3 from ispyb.sqlalchemy import ( AutoProcIntegration, AutoProc, AutoProcProgram, AutoProcScaling, AutoProcScalingStatistics, ) query = ( session.query(DataCollection, AutoProcProgram, AutoProcScalingStatistics) .join( AutoProcIntegration, AutoProcIntegration.dataCollectionId == DataCollection.dataCollectionId, ) .join( AutoProcProgram, AutoProcIntegration.autoProcProgramId == AutoProcProgram.autoProcProgramId, ) .join(AutoProc, AutoProcProgram.autoProcProgramId == AutoProc.autoProcProgramId) .join(AutoProcScaling, AutoProc.autoProcId == AutoProcScaling.autoProcId) .join( AutoProcScalingStatistics, AutoProcScalingStatistics.autoProcScalingId == AutoProcScaling.autoProcScalingId, ) .filter(AutoProcScalingStatistics.scalingStatisticsType == "outerShell") .filter(DataCollection.SESSIONID == bs.sessionId) ) print(query.count()) .. parsed-literal:: 1221 Thanks to being able to access the raw SQL statement that would be run by a query, we can use the ``pandas.read_sql()`` function to directly convert this query into a ``pandas.DataFrame``: .. code:: ipython3 import pandas as pd df = pd.read_sql(query.statement, session.bind) print(df.head()) .. parsed-literal:: dataCollectionId BLSAMPLEID SESSIONID experimenttype \ 0 5791190 3224381.0 27446315 None 1 5791190 3224381.0 27446315 None 2 5791190 3224381.0 27446315 None 3 5791190 3224381.0 27446315 None 4 5791220 3224384.0 27446315 None dataCollectionNumber startTime endTime \ 0 1 2021-01-14 13:47:30 2021-01-14 13:48:29 1 1 2021-01-14 13:47:30 2021-01-14 13:48:29 2 1 2021-01-14 13:47:30 2021-01-14 13:48:29 3 1 2021-01-14 13:47:30 2021-01-14 13:48:29 4 1 2021-01-14 13:51:01 2021-01-14 13:51:56 runStatus axisStart axisEnd ... meanIOverSigI \ 0 DataCollection Successful 0.0 0.1 ... 0.330998 1 DataCollection Successful 0.0 0.1 ... 0.588335 2 DataCollection Successful 0.0 0.1 ... 0.600000 3 DataCollection Successful 0.0 0.1 ... 1.400000 4 DataCollection Successful 0.0 0.1 ... 0.043246 completeness multiplicity anomalousCompleteness anomalousMultiplicity \ 0 99.900400 39.4367 99.7773 20.8294 1 100.000000 38.7516 100.0000 20.7655 2 100.000000 37.6000 100.0000 20.1000 3 69.100000 39.3000 70.3000 21.6000 4 0.053611 1.0000 0.0000 1.0000 recordTimeStamp_1 anomalous ccHalf ccAnomalous resIOverSigI2 0 2021-01-14 14:01:36 None 0.436763 -0.213406 None 1 2021-01-14 14:02:49 None 0.284781 0.020924 None 2 2021-01-14 14:09:14 None 0.366000 -0.026000 None 3 2021-01-14 14:09:15 None 0.728000 -0.008000 None 4 2021-01-14 16:17:06 None 0.000000 0.000000 None [5 rows x 137 columns] We can see that this query gave us a large number of columns (136 in fact!), most of which we’re probably not interested in. There are even some duplicate column names across this as some of the tables share column names, e.g. ``dataCollectionId``. This will be problematic if we want to use pandas ``groupby`` functionality on ``dataCollectionId``. .. code:: ipython3 print(df["dataCollectionId"].head()) .. parsed-literal:: 0 5791190 1 5791190 2 5791190 3 5791190 4 5791220 Name: dataCollectionId, dtype: int64 We can modify our query using the ``query.with_entities()`` method to select only those columns we’re interested in: .. code:: ipython3 query = query.with_entities( DataCollection.dataCollectionId, AutoProcProgram.processingPrograms, AutoProcProgram.processingMessage, AutoProcProgram.processingStartTime, AutoProcProgram.processingEndTime, AutoProcScalingStatistics.ccHalf, AutoProcScalingStatistics.resolutionLimitHigh, ) df = pd.read_sql(query.statement, session.bind) print(df.head()) .. parsed-literal:: dataCollectionId processingPrograms processingMessage \ 0 5791190 xia2 dials processing successful 1 5791190 xia2 3dii processing successful 2 5791190 autoPROC processing successful 3 5791190 autoPROC+STARANISO processing successful 4 5791220 xia2 3dii processing successful processingStartTime processingEndTime ccHalf resolutionLimitHigh 0 2021-01-14 13:48:34 2021-01-14 14:01:36 0.436763 2.05890 1 2021-01-14 13:48:35 2021-01-14 14:02:49 0.284781 2.26003 2 2021-01-14 13:48:41 2021-01-14 14:09:14 0.366000 2.32900 3 2021-01-14 14:09:15 2021-01-14 14:09:15 0.728000 2.29700 4 2021-01-14 13:52:04 2021-01-14 16:17:06 0.000000 1.06354 Next, compute the runtime for each program, filtering only for programs that completed successfully (note we could have done this filtering as part of the original database query), group by ``dataCollectionId`` and ``processingPrograms`` and then plot a histogram of the runtimes: .. code:: ipython3 # Calculate the processing time df["processingDuration"] = df["processingEndTime"] - df["processingStartTime"] # Select only those programs that successfully completed and group on # dataCollectionid and processingPrograms grouped = df[df["processingMessage"] == "processing successful"].groupby( ["dataCollectionId", "processingPrograms"] ) # Calculate the minimum processing time for each group processing_duration = grouped["processingDuration"].min().dt.total_seconds().unstack() # Drop those columns that aren't strictly comparable processing_duration.drop( ["autoPROC+STARANISO", "xia2 dials (multi)", "fast_dp", "xia2.multiplex"], inplace=True, axis=1, ) print(processing_duration.median()) import math import matplotlib.pyplot as plt plt.style.use("ggplot") # Plot processing duration as a histogram and boxplot processing_duration.plot(kind="hist", bins=50, subplots=True) plt.xlabel("Processing duration (seconds)") plt.show() processing_duration.plot(kind="box", vert=False) plt.xlabel("Processing duration (seconds)") plt.show() .. parsed-literal:: processingPrograms autoPROC 1876.0 xia2 3dii 1352.5 xia2 3dii (multi) 1472.0 xia2 dials 1316.0 dtype: float64 .. parsed-literal:: .. raw:: html .. parsed-literal:: .. raw:: html .. code:: ipython3 def comparison_scatter_plot(df, reference, compare): ncols = 2 nrows = int(math.ceil(len(compare) / ncols)) fig, axes = plt.subplots(nrows=nrows, ncols=ncols, sharex=True, sharey=True) for ax, comparison_prg in zip(axes.flatten(), compare): df.plot(kind="scatter", x=reference, y=comparison_prg, s=2, ax=ax) ax.set_title(comparison_prg) for ax in axes.flatten(): vmin = min(ax.get_xlim()[0], ax.get_ylim()[0]) vmax = max(ax.get_xlim()[1], ax.get_ylim()[1]) ax.plot([vmin, vmax], [vmin, vmax], c="black", linestyle="dotted") ax.set_aspect("equal") return fig, axes # Scatter plot for xia2 3dii and autoPROC vs xia2 dials compare = ("xia2 3dii", "autoPROC") fig, axes = comparison_scatter_plot( processing_duration, reference="xia2 dials", compare=compare ) for ax, comparison_prg in zip(axes.flatten(), compare): ax.set_title(comparison_prg) ax.set_xlabel("xia2 dials runtime (seconds)") ax.set_ylabel(f"runtime (seconds)") fig.suptitle(f"Runtime vs. xia2-dials for {visit}") plt.show() .. parsed-literal:: .. raw:: html .. code:: ipython3 # Compare resolution limits compare = ("fast_dp", "xia2 3dii", "autoPROC", "autoPROC+STARANISO") fig, axes = comparison_scatter_plot( grouped["resolutionLimitHigh"].min().unstack(), reference="xia2 dials", compare=compare, ) for ax, comparison_prg in zip(axes.flatten(), compare): ax.set_title(comparison_prg) ax.set_xlabel("xia2 dials d_min (Å^-1)") ax.set_ylabel(f"d_min (Å^-1)") fig.suptitle(f"Resolution vs. xia2-dials for {visit}") plt.show() .. parsed-literal:: .. raw:: html .. code:: ipython3 # Compare outer shell CC1/2 compare = ("fast_dp", "xia2 3dii", "autoPROC", "autoPROC+STARANISO") fig, axes = comparison_scatter_plot( grouped["ccHalf"].min().unstack(), reference="xia2 dials", compare=compare ) for ax, comparison_prg in zip(axes.flatten(), compare): ax.set_title(comparison_prg) ax.set_xlabel("xia2 dials d_min (Å^-1)") ax.set_ylabel(f"d_min (Å^-1)") fig.suptitle(f"Resolution vs. xia2-dials for {visit}") plt.show() .. parsed-literal:: .. raw:: html