Piero Georgio Verdini - PLC Review Comments

I am not 100% sure whether I am or not - so I have not prepared any document. However, I have worked on setting up several systems which are very close to the Pixel PLC, and the code written by Andromachi and me has been provided for the use of the Pixel community. My concerns can be summarized (in no particular order) as:

  1. There is a lot of code in the project just to handle hardware reconfiguration. Now, the Siemens software which is needed in order to program those PLC contains a tool that does exactly that, and I have no doubt that it works much better, in a safer and smoother way, and the diagnostics it generates for misconfiguration are much more understandable than anything that could be running in a PLC. I see no reason to keep any part of that functionality in the PLC code, rather I see it as an extremely dangerous opportunity for screwing up. There is no call for "dynamic" redefinition of PLC hardware address assignments, really, but there is a strong potential for potentially bad downloads. The S7 Driver which PVSS is using has some interesting quirks (which have forced us to build some layers of validation around all downloads), and the hardware configuration of the PLC is definitely not something you want to get a bad download for. Thus, my advice is to drop all that part of the code. It will also make understanding what happens easier for the future maintainers-to-be.
    >> There has been some misunderstanding about what the Pixel PLC code does and does not do:
    >> We do **not** attempt at all to set the hardware configuration in the code that we have written,
    >> but use the "HW-Config" tool provided by Siemens for that purpose (like everybody else).
    >> We believe this misunderstanding has been resolved now.
  2. The entire structure of the code is extremely farraginous (byzantine). There are almost more User Data Type definitions than there are Data Blocks (perhaps more?). Add to this that all data structures are defined in a low-level language (STL) and not in the higher level language used for the code (SCL), and you get a situation in which any changes to the data structures can lead to horrible difficulties. I teach Computer Science at the University of Perugia, and I always warn my students that if they only use one template (or structure, or UDT) once, they are much better off NOT defining it as a datatype but as a one-shot thing. I believe this is also the case here. Having 52 Data Blocks and 46 UDT is definitely NOT a Good Thing. By the way, are we sure that each and every value in the Data Blocks is initialized correctly, also in case of a Warm Restart? Those things tend to be retentive, and remember what was contained previously, and sometimes this is not quite what one desires.
    >> We believe that the issue whether or not to define the Data-Blocks in terms of user-defined data types (UDTs)
    >> or as "one-shot things" and whether to use the Siemens SCL editor or the Siemens Data-Block editor to declare them
    >> comes down to a matter of programming philosophy and personal taste.
    >> The Pixel PLC code actually **does** employ the fact that the Data-Blocks have been defined in terms of UDTs
    >> (the UDTs **are** actually used more than once), in order to simply the code: 
    >> by using SFC 20 ("BLKMOV") calls (which require the Data-Blocks have been defined in terms of UDTs)
    >> to transfer configuration parameters, process values and operator commands from one Data-Block to another,
    >> we reduce the code that would otherwise be needed for variable assignments.
    >> We believe the usage of "BLKMOV" calls makes the code actually also easier to maintain,
    >> as it does not require to add assignments to different places of the code in case additional variables are added to the UDTs/Data-Blocks.
  3. The same goes for the code. There are, if I counted correctly, 58 separate FC blocks, without considering FC6666 which actually has an illegal FC number for the specific PLC model in use. In the Strip Tracker code, we have 26, and they are already a lot, but the main logic of the OB1 cyclic execution takes less than one screenful. In the Pixel case, it starts getting hard to follow the flow of execution. Also, I just noticed that there is a call to synchronize the time with PVSS. That is A Very Bad Thing. Siemens PLC can and should get their time synchronization over NTP, from specialized time servers, not from some downloaded Data Block. This, together with the previous observations under point 1, seems to hint a modest understanding of the potentialities of Simatic Manager, the Siemens tool used to configure and program these PLC, and to betray a youthful enthusiasm which rushes to the keyboard rather than sitting a bit longer with a pad of paper, a pencil and a brain.
    >> FC 6666 is indeed an illegal FC number for the CPU 315 PLC model.
    >> We did not notice this issue when testing the code with the PLC simulator tool.
    >> The motivation for using such a high FC number was to clearly separate FC 6666,
    >> which is intended to be used for "stand-alone" (without connection to PVSS) commissioning,
    >> from the rest of the Pixel PLC code.
    >> The FC number has been corrected now (changed to FC 666).
    >> We have reduced the number of FC blocks present in the Pixel PLC code to 55 now. 
    >> To some extent, the different number of FC blocks present in the Pixel and Silicon Strip Tracker PLC codes,
    >> results from personal choices: personally, we prefer to separate functionality into a larger set of simple functions
    >> rather then a smaller set of more complex ones.
    >> We agree that the function for synchronizing the time between the PLC system and PVSS are unneccessary:
    >> We will use time servers for the synchronization of PLC time and PVSS time in the future
    >> and have removed the function FC 1240 from the Pixel PLC code.
    >> Thank you for pointing this out.
    >> (remark: the time servers are still to be setup by Frank Glege).
  4. Even though there is a lot of code to handle hardware errors (which can also be debugged more easily with the Siemens tool), the "diagnostic" Object Blocks, or if you wish, in a different language the Interrupt Service Routines devoted to handling hardware and software errors are totally absent. Perhaps this is a feature of the copy of the project that was distributed, but without them, all the code written to check for hardware errors and report them to PVSS will never be executed, period. As soon as an error pops out, be it hardware (oops, somebody stole my RTD module!) or software (maybe I shouldn't have addressed element 1000 of an 10-element array?), the PLC will attempt to execute the appropriate OB, and if it is missing it passes to the STOP state. Incidentally, with NO warranty whatsoever that the state of the outputs is inherently safe (i.e. if you want the PLC to open all its relays you have to program it in yourself. In the Strip Tracker, we do). I would also suggest dropping all/maybe only most of the error-checking stuff, program the appropriate Object Blocks to open all relays in case of error, and be aware that it is very, very difficult to substitute a hardware module from a remote location. Debug will have to happen in USC55.
    >> The diagnostic interrupt organization block OB 82 was indeed missing in the copy of the Step7 project that you got.
    >> Sorry.
    >> The organization blocks OB 80, OB 81, OB 82, OB 83, OB 84, OB 85, OB 86, OB 87, OB 88, OB 121 and OB 122 have now been added to the Pixel PLC code.
    >> Concerning the question whether the "STOP" state is inherently safe or not, the answer is that it **is** indeed inherently safe.
    >> This has been confirmed by CERN PLC expert Jeronimo Ortola and agrees with the experience we gathered during commissioning.
    >> Consequently, there is no need for writing code that opens the relays in case the PLC goes to the "STOP" state.
  5. Maybe this is not obvious, but the S7 driver used by PVSS has very little, if any, Access Control. This means that unless one declares a given Data Block inside a PLC as Read-Only, it will be possible to overwrite it. I do not see any mechanism preventing a runaway PVSS application from clobbering the System Configuration and the System Limits Data Blocks.
    >> The Data-Blocks DB 1 - DB 13 are foreseen to be marked as write-protected as soon as the Pixel PLC code goes to "production mode"
    >> (i.e. there is a "real" detector attached to the PLC system).
    >> Right now, the Data-Blocks are not write-protected as it simplifies the testing and commissioning of the Pixel PLC code by means 
    >> of VAT variable tables if they are not.
  6. Now, I have not scanned all the code (I don't find it readable, and when I tried to compile it on my machine I had two surprises: more than 100 warnings about Identifier Truncated, which is definitely not A Good Thing, since it may be hiding an "overlap" of two variables with similar names which only differ beyond the maximum significant variable name length, and that the code size grew to a level which was no longer donwloadable to the same PLC - I admit I am adding debug info to my objects. This second would become a non-problem, of course, if the unnecessary "functionalities" were dropped) but I believe it would be possible to impose "non-physical" limits on the alarm settings, e.g. a lower temperature threshold higher that the upper one, and possibly thresholds so low/high to delay the intervention of the interlocks indefinitely.
    >> We agree that the warnings are potentially dangerous.
    >> We have now shortened the names of variables so that names of variables cannot "overlap" and the code compiles without warnings.
    >> The case that the value for the "lower" alarm threshold might be above the value for the "upper" one 
    >> due to an error in the configuration parameters uploaded by PVSS was indeed not properly handled 
    >> in the version of the Pixel PLC code that has been subject of the review.
    >> This has been fixed now. 
    >> Thank you for bringing this issue to our attention. 
    >> (remark: that handling of this case was missing in the checks of the configuration parameters implemented in FC 410 and FC 411
    >>          has also noticed by Stefan Lueders)
    >> The other case that you mention, values for alarm delays that are so large that they practically inhibit the interlocks from ever triggering
    >> is properly handled by the functions FC 1xx which check the validity of configuration parameters uploaded by PVSS. 
  7. This is more of a philosophical item, but not really 100% so. As an old-time FORTH (not FORTRAN, well I did both...) programmer I strongly believe that floating point numbers have very little to do on microprocessors, microcontrollers and PLC, and should only be used when no alternative offers itself. I was therefore appalled to find out a large amount of floating point manipulation going on inside the PLC code. Now, I understand that there has been a bit of a problem with the way the Pt100(0) have been wired, but still, if you have a linear transformation function that tells you that T = a * Cts + b, you are not forced to convert your counts Cts to a floating-point temperature T and do all your alarming in FP, but you could (the old pad of paper, pencil and brain thing again!) work your way backwards, ask yourself "What is the threshold I want to put on this temperature measurement?", call it _T_, compute the equivalent number of ADC counts _Cts_ with the inverse of the previous transform Cts = a' * T + b' and confine yourself to integer manipulations on the PLC. That is what we do in the Strip tracker for our thermistors, and given that we have practicaly infinite space in the Database we can afford to store the thresholds in degrees and let Oracle compute the corresponding number of counts.
    >> The question of using floating point or integer arithmetic seems to come down to a matter of personal preferences again.
    >> We agree that floating point arithmetic uses more memory for storage of the process values in Data-Blocks 
    >> as well as more clock cycles to process them compared to integer comparissons.
    >> We find that the possibility to work with calibrated temperature and humidity values + common alarm thresholds 
    >> more than compensates the disadvantages in terms of memory consumption and processing time.
    >> Please keep in mind also that we actually use the calibrated temperature and humidity values in the dew-point computation,
    >> a computation which in our mind would be rather cumbersome if it would need to be implemented based on uncalibrated integer values.
  8. It's probably my old age and failing neurons, but I fail to see the usefulness of distinguishing between a "RESET" that clears the memory of alarms that have been active but no longer are and as a consequence also removes the interlock status from whichever relays were affected and an "ACKNOWLEDGE" that simply clears the memory of alarms that have been active but no longer are. Either the alarms are processed, and reflect upon the relay status, every OB1 execution, or the entire PLC system is useless. Thus, it seems a bit moot to differentiate between RESET and ACKNOWLEDGE since the alarm/relay relationships will be processed a few lines of code further. This is more of an "Operator" issue than a maintenance one, but still.
  9. Going to Interlock Groups, I see a lot of recalculations going on, such as recounting how many sensors belong to a given group at every OB1 cycle. Now, if you really trust your Data Blocks so little, perhaps other verifications are called for, such as the values for the thresholds being in a "safe range" and so on. Otherwise, I expect this kind of checking to be called for at download time only. The same consideration goes for the "active state" of sensors being checked (the manipulation of DB401 on the basis of DB101). Also, and I admit not having read all of the code, but I do find the variable naming scheme annoying, it seems to me that the error reporting is a bit weak. For one instance, in FC301() a system function is called to fill DB401 with "0" (which I already find not the most informative action, really) but nothing is done with the return value of the system function.
    >> We believe that the appreciation of the naming scheme that is used in the Pixel PLC code is coming down to a matter of personal preference:
    >> other reviewers remarked that they found the naming scheme quite helpful, actually.
    >> (remark: the naming scheme employed in the Pixel PLC code is actually based on the naming scheme of the Silicon Strip Tracker PLC code;
    >>          due to differences in functionality (e.g. calibration of temperature and humidity values, dew-point computation,...),
    >>          it was neccessary to **extent** (this is how we look at the naming scheme employed in the Pixel PLC code) 
    >>          the naming scheme defined by the Silicon Strip Tracker one.)  
    >> We have added handling of the return codes of the system functions SFC 20 ("BLKMOV") and SFC 21 ("FILL") now.
    >> Thank you for bringing this issue to our attention.
  10. I don't know the Pixel's mileage, but I tend to doubt any relay which has been switching more than 300,000 times. Now, if I look at OB35 (cyclic interrupt) I find (much to my dismay) that it is executed every 100 milliseconds (which means, according to numbers that have been floating around, more than five times for every OB1 execution!) and that it toggles the relay feeding an external WatchDog Timer with a period of 1.6 seconds. The math is easy: 1.6 seconds times 300,000 cycles is a bit more than 133 hours, or a bit more than five and a half days. Even assuming that one wouldn't need to replace the PLC module at once, but just switch over to another of the eight relays in the same module, that's one hardware replacement every 45 days of operation.
    >> The Relay is specified for 1 million switching cycles for a current of 0.5 A at a voltage of 24V DC.
    >> We expect the current to be significantly smaller in the CMS Pixel PLC system, 
    >> so the lifetime of 1 million switching cycles represents a lower limit.
    >> Assuming 1 million switching cycles as a conservative estimate of the Relay lifetime,
    >> the Relay might actually break after 2-3 weeks of operation when we toggle the heart-beat
    >> with a period of 1.6s, as originally intended.
    >> We agree that this is clearly not acceptable, 
    >> even if we switch channels in case the Relay breaks (each Relay module has 8 channels).
    >> We will change the heart-beat period to 2 minutes.
    >> (remark: this is the value used by the Silicon Strip Tracker.)
    >> This yields a lower limit for the lifetime of the Relay of 3-4 years,
    >> which we find acceptable.
    >> Thank you for bringing this issue to everyones attention.
  11. There seems to be no easily readable documentation really on "what if.." cases. For example, what is the Standard Corrective Action in case . A separate document outlining special cases might be useful. For example, what action is to be taken if the Cooling Plant signals a fault? If one or more of the racks housing CAEN Powers Supplies must be switched off? If the dry gas no longer is dry? And so on and so forth, including the very Pixel-specific instances I cannot think of.
    >> The reactions to failures of the cooling plant or the dry gas system
    >> are being discussed within the detector communities right now.
    >> The resultant requested actions will be implemented in the Tracker "Master" PLC, not the Pixel PLC.
    >> We intend to document these actions (prepare an action matrix) soon. They will likely change in the
    >> future as the machine becomes more stable and we gain more confidence in the system.
  12. I was also appalled when a message was circulated stating that a faster, more powerful CPU could be used. So what? This is exactly the attitude that has lead to brainless programming, and if you want to run the new version of XXX you'll just have to upgrade your computer... In any case, replacing the CPU will not solve all of the problems outlined.
    >> The PLC model (CPU 315) that we currently use is performant enough to process the code that has been subject to the review.
    >> In order to reduce the OB 1 cycle time, have more memory available for future extentions of the functionality/code,
    >> allow the information stored in Data-Blocks DB 1xx and DB 4xx and to be copied to "download" Data-Blocks DB 8xx
    >> (so that the PVSS native S7 driver does not need to connect to the "work" and "process value" Data-Blocks directly),
    >> and to speed-up the read-out rate of the S7 driver, the Pixel community decided to upgrade to a more performant PLC model
    >> (CPU 319, which processes the code about 10 times faster than the CPU 315 model, and has about 10 times more memory).
    >> It is our understanding that others are also following this path.

So, I'll draw my personal conclusions, which are very personal, and certainly biased by my oldfashioned look at computer science and at computing, but I will not at all feel insulted, demeaned or even relieved if you'll just choose to ignore them. The project is not fit for production operation, and were I a member of the Pixel Collaboration I would feel extremely unsafe knowing that the health of my detector depends on it. From my experience, it should be considered as a good prototype, but major rework is needed before it can be considered sufficiently reliable. The first points that should be addressed are, not necessarily in this order, the lack of the diagnostic Object Blocks (that's a show-stopper on its own), the needless and dangerous ability to reconfigure the hardware from PVSS, and the timing of OB35 and of the WatchDog Timer. Once these have been solved, a lot more work should be devoted to cleaning up the distinction between which actions are to be executed at startup, which at the reception of a downloaded block from PVSS, and which at every cyclic execution. When this is cleared, I still believe there will be need for a re-exam of the parts of the software that validate (sanity-check) the PVSS downloads (e.g. low alarm thresholds higher than high alarm thresholds). I apologize for not going in further detail, but the amount of time I was able to dedicate to this did not match my hopes.