Statistical Process Control at Scale with Azure Databricks

At Datapao, we take pride in accompanying our clients through their digital transformation journeys and solving their problems along the way.

Along the way, we have grown used to tackling challenges where we eliminate pain points and support our clients' decision-making.

We often find ourselves working with large production facilities where Six Sigma and Total Productive Maintenance concepts are known only as a myth: never really applied at scale and in near-real-time.

In this article, we will give you an overview of such an application to show how this works.

In this project, we store and process years of data from tens of thousands of products, with an average of 25 attributes each. We need a unified analytics engine capable of processing big data, and this is where Databricks on Azure comes into the picture.

For the analytical component, we bring in Statistical Process Control (SPC), one of the concepts under the Six Sigma methodology, which relies on samples coming from production. The objective here is to inform the quality teams whether these measurements are within the expected range or whether there is a variation that could jeopardize the entire production batch.

We decided to use one of the SPC tools, the time-series Control Chart:

Figure 1: Sample Control Chart

There are several things happening on this chart. Let’s have a quick look:

  1. USL and LSL are the upper and lower specification limits. These are static, hardcoded numbers: measurements must fall within them to meet quality expectations, otherwise the product has no economic value.
  2. UCL and LCL are the upper and lower control limits. These numbers come from the process itself: we find them by adding and subtracting 3 standard deviations from the average, computed over the previous year of data, so they can change over time (see the sketch after this list). The grey lines are the sigma levels.
  3. We dynamically update the Process Capability Index (Cpk) for each attribute to see where it stands in comparison with the Six Sigma level, and we store it as a metric in the database. The Control Limits tell you whether the process is stable, while the Specification Limits tell you whether it is capable.
  4. We can see from the sampled measurements in our chart that trends are developing over time and the variation is increasing; if this is not taken care of, the measurements will breach the specification limits. At that point, the product loses its economic value and is destroyed.
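To make this concrete, here is a minimal PySpark sketch of how the control limits and Cpk could be computed over the previous year of data. The table and column names (measurements, attribute, value, measured_at) and the specification limits are illustrative placeholders, not our client's actual schema:

from pyspark.sql import functions as F

# Static specification limits; illustrative values only
USL, LSL = 10.5, 9.5

# `spark` is the SparkSession provided by the Databricks notebook
measurements = spark.table("measurements").where(
    F.col("measured_at") >= F.date_sub(F.current_date(), 365)
)

limits = (
    measurements.groupBy("attribute")
    .agg(F.mean("value").alias("mean"), F.stddev("value").alias("sigma"))
    .withColumn("ucl", F.col("mean") + 3 * F.col("sigma"))
    .withColumn("lcl", F.col("mean") - 3 * F.col("sigma"))
    # Cpk = min(USL - mean, mean - LSL) / (3 * sigma)
    .withColumn(
        "cpk",
        F.least(F.lit(USL) - F.col("mean"), F.col("mean") - F.lit(LSL))
        / (3 * F.col("sigma")),
    )
)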

Clearly, with an architecture like the one below running in the factory, we could raise an alert that something fishy is going on before such a breach happens:

Figure 2: Simplified Architecture

The daily data lands in an Azure SQL database via Azure Data Factory, and our scheduled ETL job with alerting capability runs on Azure Databricks, which raises the alerts and also writes some aggregated reporting tables.
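To give a rough idea of the plumbing, the scheduled job could read the daily data over JDBC and write aggregated tables back for reporting. The server, secret scope, and table names below are invented for the sketch:

jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>"

daily = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.daily_measurements")  # hypothetical table
    .option("user", dbutils.secrets.get(scope="etl", key="sql-user"))
    .option("password", dbutils.secrets.get(scope="etl", key="sql-password"))
    .load()
)

# ...compute control limits, evaluate rules, and score alerts here...

daily_summary = daily.groupBy("attribute").count()  # placeholder aggregate
daily_summary.write.mode("overwrite").saveAsTable("reporting.daily_summary")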

The engineers in the factory use Microsoft Power BI to view these Control Charts and additional reports, and to take corrective and/or preventive action.

In the beginning, it was obvious that whenever there was a specification or control limit breach, we had to alert immediately. But what about measurements that are still within quality standards: how do we tell whether the process is stable?

To capture this kind of creeping variation early, we used a set of rules of thumb called the Nelson Rules (the first two are sketched in code after the list):

  1. 7 or more consecutive points on one side of the average
  2. 7 consecutive points trending up or trending down
  3. 2 out of 3 consecutive points beyond the 2-sigma range
  4. 4 out of 5 consecutive points beyond the 1-sigma range
  5. 8 consecutive points with none inside the 1-sigma range
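To show what evaluating such a rule looks like, here is a minimal pandas sketch of the first two rules over a single attribute's time-ordered measurements. The function names and the rolling-window approach are ours, chosen for clarity rather than performance:

import pandas as pd

def rule_1(values: pd.Series, mean: float, n: int = 7) -> pd.Series:
    # True where the last n points all fall on the same side of the mean
    side = (values > mean).astype(int) - (values < mean).astype(int)
    hit = side.rolling(n).apply(lambda w: float(abs(w.sum()) == n))
    return hit.fillna(0).astype(bool)

def rule_2(values: pd.Series, n: int = 7) -> pd.Series:
    # n consecutive trending points means n - 1 same-sign differences
    sign = values.diff().apply(
        lambda d: 0 if pd.isna(d) or d == 0 else (1 if d > 0 else -1)
    )
    hit = sign.rolling(n - 1).apply(lambda w: float(abs(w.sum()) == n - 1))
    return hit.fillna(0).astype(bool)

Applied to one attribute, rule_1(series, series.mean()) yields a boolean flag per point; on Databricks the same per-attribute logic could run inside a grouped pandas UDF.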

The end result of these rules might be thousands of alerts every day, which creates a lot of noise. One way to deal with this is to learn importance weights for the rules and combine them into a single alert score. After talking to the business, we can agree on a score threshold and only alert when the score exceeds it.

Just to give a rough idea, a scoring formula can look something like this. Note that learning these weights may not be trivial:

alert_score = (5*rule_1 + 1*rule_2 + 3*rule_3 + 7*rule_4 + 2*rule_5) / cpk_quality_score
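Putting it together, scoring and thresholding could look like the snippet below. The weights, the Cpk-based quality score, and the threshold are all placeholders to be tuned with the business:

# Hypothetical weights per rule and an agreed alert threshold
WEIGHTS = {"rule_1": 5, "rule_2": 1, "rule_3": 3, "rule_4": 7, "rule_5": 2}
THRESHOLD = 4.0

def alert_score(flags: dict, cpk_quality_score: float) -> float:
    # flags maps rule name -> whether that rule fired for the attribute
    weighted = sum(WEIGHTS[name] * int(fired) for name, fired in flags.items())
    return weighted / cpk_quality_score

def should_alert(flags: dict, cpk_quality_score: float) -> bool:
    return alert_score(flags, cpk_quality_score) > THRESHOLD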

The advantage of the SPC approach revealed itself as soon as we started the project. By not jumping head-first into Machine Learning,

  • we learned more about the domain than we expected,
  • we could see our way forward more clearly as we got to know the data and the potential features with predictive power,
  • we could use the client's lingo, which made communication productive.

Until we have more case studies or interesting projects to share with you, stay cool and contact us if you are interested in more details. From all of us at Datapao, with love.