JAL 521Reputation points
About five years ago our company created an app to record the findings of a human evaluation.
A person examines a device, checks it against each criterion onscreen, and then marks it "Pass" or "Fail". Each criterion is one row in our DB.
The app hasn't changed much in the last five years. About 1.3 million evaluations done so far. About 50 million criterion-rows.
Typically it's a two-person process:
- Evaluator-1
- Evaluator-2 (a senior evaluator).
Evaluator-2 doesn't start fresh. Rather the rows of Evalutator-1 are first cloned (but marked as Evaluator-2).
After five years of success, suddenly we are seeing an occasional failure in the cloning process, where Evaluator-2 is mysteriously missing exactly one row. The cloning is done in a stored procedure that looks roughly like this:
Insert Into dbo.Workspace (PassOrFail, EvaluationID, Iteration)
Select PassOrFail, EvaluationID, 'Evaluator-2'
From dbo.Workspace where EvaluationID = @EvalID and Iteration = 'Evaluator-1'
-- The same SP ends with the following validation to verify that all rows got cloned:
Declare @count1 int = (Select count() From dbo.Workspace where EvaluationID = @EvalID and Iteration = 'Evaluation-1');
Declare @count2 int = (Select count() From dbo.Workspace where EvaluationID = @EvalID and Iteration = 'Evaluation-2');
If @count1 <> @count2 Throw 51000, N'Error in cloning', 1
- That last step (the validation-by-count) has been in place for five years, and has never thrown an exception - until now - and it is always exactly ONE ROW that is missing. If we delete the cloned rows and re-run the app, all rows clone properly.
We tried to brainstorm on this - maybe another app is deleting one cloned row? But we can't find anything in our system that would do this.
Frankly, we're fresh out of theories. Any ideas, anyone?
Note:
- The INSERT statement is not wrapped in a try-catch, nor in an explicit tran.
- Seems xact_abort is Off. I suppose we should turn it on?
Sign in to follow
0{count} votes
Viorel 82,476Reputation points
(Video) How to add Ranking, Likert, and Rating questions in Microsoft Forms2022-09-04T07:40:25.56+00:00 Do you mean that @count1 is 1 and @count2 is 0?
Dan Guzman 6,926Reputation points
2022-09-04T12:00:32.923+00:00 What transaction isolation level are you using?
CosmogHong-MSFT 6,616Reputation points • Microsoft Employee
2022-09-05T08:24:21.87+00:00 Does this happen every time, or does it happen occasionally?
Sign in to comment
6 answers
Sort by: Newest
Most helpful Newest Oldest
JAL 521Reputation points
2022-09-04T14:13:53.593+00:00 (Video) Insert Online Videos in Microsoft Word and Troubleshooting Tips"But if you are doing an INSERT and then a count, and you want to be sure that nothing changes, you need to have the either the serializable ro the snapshot isolation level. The former protects you against phantom reads, that is a row is inserted while the operation is running and before the SELECT COUNT runs. With SNAPSHOT you read from a version store to get the state of the database when the transaction started."
- A moment ago, I just posted some additional info regarding our numerous validations, our identity-column sequencing, and the interim period waiting for Evaluator-2 to assign the evaluation to himself. Given all these factors, I don't think isolation-level anomalies such as phantom rows are a factor here.
0No comments
Sign in to comment
JAL 521Reputation points
2022-09-04T14:06:26.553+00:00 When I said the provided code was "roughly" accurate, I was referring to
- renaming the columns
- omitting some columns irrelevant to this discussion
The code I gave you is actually very accurate.
0No comments
Sign in to comment
JAL 521Reputation points
2022-09-04T14:02:15.573+00:00 "We can't debug code we don't see."
Thanks for wanting to help me but I can't show company code. I had to handwrite this version, renaming all the columns, to comply with company policy. But the SP really is pretty simple - it's pretty much the code that I showed you.
I'm not expecting a necessarily "correct" answer - just hoping that experts like you would brainstorm for possibilities that I haven't thought of.
"But I like to add that beside that there is an error in the cloning code..."
If you're referring to any typos, that's probably because I hand-wrote the code for this post.
"... the issue could also be that one more row was added to the source while the operation is running, and that is why you get the error in validation."
(Video) Work with common questions across multiple forms using the Microsoft Forms APINot likely that a row got added late. The Evaluator-1 rows come from an original Criterion table and, in each case, the correct number of Evaluator-1 rows exist with Pass or Fail. If a row got added late, the Pass/Fail column on the new row would be null. And we have plenty of validations preventing Evaluator-1 from submitting his work if he hasn't marked all, say, 30 rows. Only when he has finished all 30 rows does the evaluation go into the senior queue where Evaluator-2 can find it and begin working on it. Evaluator-2 assigns the evaluation to himself, and then TRIES to open a web page to begin working on it. That's when the cloning happens. In these cases, Evaluator-2 has never been able to open the page, not even once, because the cloning-SP throws the exception, "Error in cloning". Instead of a page containing 30 items, all he sees is an error message.
Also, if a row got added late, we would see out-of-sequence numbers. The dbo.Workspace table has an integer identity column as primary key. We checked for numbers of out sequence but didn't find any,
As I now recall, it was in virtue of sequencing that we pretty much ruled out the idea that Evaluator 2 lost a row at some point. To summarize:
- Seems no row got added late.
- No row got deleted.
Therefore: Seems the cloning process simply failed to copy one of the rows.
Erland Sommarskog 67,721Reputation points • Microsoft MVP
2022-09-04T17:39:03.8+00:00 "But I like to add that beside that there is an error in the cloning code..."
If you're referring to any typos, that's probably because I hand-wrote the code for this post.
I might have left out a word or two there. What I meant to say that beside the possibility that there is a bug in the cloning code, I wanted to highlight another possibility. You seem to rule out that possibility, and obviously you know your system better than I do. (But, hey, I've been around long enough to more than once heard someone being dead sure that X cannot happen in their system, but it turned out that this was the case, after all.)
When it comes to the code not being the actual code, I understand that you don't want to post your actual code. But you will have to excuse me - I've been answering questions in SQL forums for many years, and it is not at all uncommon that the issue is with something the poster does not tell originally.
Anyway, in this particular case, we have absolutely found a smoking gun: SET ISOLATION LEVEL READ UNCOMMITTED. You will either have to change that to at least READ COMMITTED, or implement a retry when you get these count mismatches. With READ UNCOMMITTED all bets are off due to concurrent activity that causes pages splits etc. And note here that this may be operations with completely different evaluation IDS.
Erland Sommarskog 67,721Reputation points • Microsoft MVP
2022-09-04T17:39:23.697+00:00 By the way, the way these forum are intended to work is that you ask a question. People who think they have an answer, post an Answer. Sometimes we need clarification, and we post a Comment. The poster would then typically answer that question in a comment. (Or edit the original post and face the wrath of Ronen. :-) As the original poster you would post an Answer, if you find solution on your own, with or without help of other forum members.
And, yeah, it works so-so in practice. A straight line of posts would be simpler for everyone. But now you know how it is intended to work.
Sign in to comment
Erland Sommarskog 67,721Reputation points • Microsoft MVP
(Video) How to Create 'Frequently Asked Questions' Documents using Microsoft Word Custom Styles2022-09-04T13:33:02.047+00:00 Set Transaction Isolation Level Read Uncommitted;
Which basically means "I don't care if I get correct result or not".
It is not clear to me what you are doing, since you only gave rough outline of the code. But if you are doing an INSERT and then a count, and you want to be sure that nothing changes, you need to have the either the serializable ro the snapshot isolation level. The former protects you against phantom reads, that is a row is inserted while the operation is running and before the SELECT COUNT runs. With SNAPSHOT you read from a version store to get the state of the database when the transaction started.
With READ UNCOMMITTED all sorts of interesting thing can happen, like rows being read twice, or committed rows not being read at all, because there are concurrent operation occurring while the query is running.
I don't think I will be allowed to change it).
That sounds serious. Using READ UNCOMMITTED in production code without understanding the full implications is very careless. Even worse if you "you must change that". I would only agree to using that isolation level in production code with a special sign-off and a very good explanation of why it is needed, and a good analysis that there cannot be issues with it.
Well, at least you have implemented a test if the operation succeeded. Now you also need to implement a retry for when the operation processes the incorrect number of rows.
0No comments
Sign in to comment
JAL 521Reputation points
2022-09-04T13:18:43.797+00:00 Thanks guys for responding.
"What transaction isolation level are you using?"
At the top of the SP is this (not my decision and I don't think I will be allowed to change it).
Set Transaction Isolation Level Read Uncommitted;"Do you mean that @count1 is 1 and @count2 is 0?"
Typically Evaluator-1 has, say, 30 rows. But when the problem occurs, only 29 rows copied to the Evaluator-2 workspace. The least number of rows where this has happened is 9 rows for Evaluator-1, and then of course only 8 rows for Evaluator-2.
0No comments
Sign in to comment
Sign in to answer
Activity
Sign in to follow questions and users