Kernel module + Core 3 questions

cseiler · Post by **cseiler** » 19 May 2017, 10:22

Hello,

first of all thank you for the great product!

In order to understand better how to develop applications for the RevPi, I've taken a look at the current kernel module source and have a couple of questions about it:

The kernel module parses the configuration JSON file from the filesystem directly - and uses the set_fs(DS_KERNEL) trick to use the VFS API to read the file. Wouldn't it be better for the kernel module to provide a configfs-based system to configure the driver (see e.g. the LIO iSCSI target or the USB Gadget support for examples of the use of configfs for kernel configuration) and use a userspace helper to parse the current JSON configuration and map that to the configfs backend? (Note that set_fs is likely to be removed from the kernel at some point in the future, see https://lwn.net/Articles/722267/ for details.)
Why use a global lock piDev_g.lockPI for the process image for everything? Typically values that are read/written from/to that region are 4 bytes or smaller, where atomic operations can be used to ensure that these writes are consistent. Changing individual values could be sped up quite a bit by relying on atomic operations on the kernel <-> userland side. Sure, for large read/write calls one would still need the lock, and the gate thread would have to reimplement the copy operation in terms of a loop of atomic loads / saves (in addition to taking the lock) - and while I haven't measured that, I suspect that this could improve performance for the cases where the user only wants to set a single value.
In lieu of this: wouldn't it also be much faster to be able to mmap() the process image into a program's address space, and use direct memory operations from the program, instead of having to execute system calls? Sure, if an application wants to have consistency for areas larger than 4 bytes (the largest atomic op on 32bit ARM), it would still have to resort to read()/write() system calls, but for just switching a single input / output, for example, (or reading/writing single values of 4 bytes or less from/to field busses) this could speed up applications even further, because you'd save one or even two context switches to the kernel. Furthermore, the bounds checks could all happen on the initial mmap() system call (+ page faults by the CPU), further increasing performance.
Why implement waiting for events via a ioctl, KB_WAIT_FOR_EVENT? Why not make the file descriptor pollable via select/poll/epoll? This would allow easy integration into existing event loops.

In light of the recent announcement that the Core 3 is now available, I also have a couple of questions:

The Core 3 has an ARMv8 chip - but Raspbian only supports the 32bit architecture (and the chip in an effective ARMv7 mode). I would expect that the Revolution Pi would also only work in 32bit mode? Since there's only 1 GiB of RAM anyway, from that perspective this isn't a limitation - however, the ARMv8 architecture supports additional instructions (especially related to floating point arithmetics) in ARMv8 mode. Do you know if userspace applications are able to make use of the processor improvements even in 32 bit mode?
Alternatively, Debian itself appears to run natively (without any Raspbian patches) on the Raspberry Pi 3 in ARMv8 mode (with aarch64 as architecture), at least if I read the wiki correctly - haven't tried it myself. Are there any plans to support the Revolution Core 3 under aarch64 based directly off Debian (+ some firmware packages)? Especially since Debian Stretch (and by extension jessie-backports) now has kernel 4.9 + RT patch already included on some architectures, albeit not ARM?

Thanks in advance you for your replies!

Regards,
Christian

volker · Post by **volker** » 19 May 2017, 15:32

Hi Christian,
thanks for these awsome analysis and comments! While most of the topics need to be discussed with our kernel experts (and I'm sure they will give their comments and answers if you please have patience and give them some time) I would like to comment on one specific point which is elementary to a PLC like system based on EN61131-3:
Copying whole process images is a possible way of interaction between Io-drivers and application software and in fact we are currently using this way with logi.RTS. But on the other hand it is not a wise idea if different applications are using their own copy of the process image and start copying asynchronously data to and from the central process image used by the hard wired IO world. The main idea of EN61131 like systems to be able to predict system behavior in terms of guarantied cycle times etc. gets heavily undermined when choosing this way for multiple applications. The EN61131 idea is to have a strict linear process of decisions based on input data content in order to set synchronously all output data at the end of a cycle. As soon as you start discussing event driven sw including PLC then we are entering a complete different world with different rules. In this different world you may view different applications as encapsulated objects which each have their own decisions based on their rules. Concurrent influence on process values has to been handled by such systems to make things predictive. This is what makes modern operating systems so complex and often buggy and what is on the other hand something absolutely unacceptable in industrial automation. That is one reason why En61131 has been the major way of programming PLCs for such a long period of time and why EN61499 has such a slow start. But we at KUNBUS are open to such new ways of thinking and one of our current projects is a partnership to a software house implementing EN61499 software for our RevPi. So discussions of different ways of IO access are going on and we will see what the future brings...

lukas · Post by **lukas** » 19 May 2017, 17:22

Thanks cseiler for opening this thread, I'm delighted to get some hardcore questions here.

cseiler wrote:The kernel module parses the configuration JSON file from the filesystem directly - and uses the set_fs(DS_KERNEL) trick to use the VFS API to read the file. Wouldn't it be better for the kernel module to provide a configfs-based system to configure the driver (see e.g. the LIO iSCSI target or the USB Gadget support for examples of the use of configfs for kernel configuration) and use a userspace helper to parse the current JSON configuration and map that to the configfs backend? (Note that set_fs is likely to be removed from the kernel at some point in the future, see https://lwn.net/Articles/722267/ for details.)

One advantage of the JSON file is that it can be copied to other RevPi Cores, allowing for easy duplication of a configuration to multiple hosts. I'm not sure if the same could be achieved as easily with a configfs-based approach.

That said, I agree that kernel-native mechanisms are preferrable and the idea itself is great.

Thanks a lot for the heads-up regarding the removal of set_fs.

cseiler wrote:Why use a global lock piDev_g.lockPI for the process image for everything? Typically values that are read/written from/to that region are 4 bytes or smaller, where atomic operations can be used to ensure that these writes are consistent. Changing individual values could be sped up quite a bit by relying on atomic operations on the kernel <-> userland side. Sure, for large read/write calls one would still need the lock, and the gate thread would have to reimplement the copy operation in terms of a loop of atomic loads / saves (in addition to taking the lock) - and while I haven't measured that, I suspect that this could improve performance for the cases where the user only wants to set a single value.

Also a great suggestion. I guess the global lock is motivated by simplicity (and/or laziness

) and we could add fast paths if the values read/written are <= 4 bytes.

For the record, we would gladly accept pull requests on GitHub for contributions like this.

cseiler wrote:In lieu of this: wouldn't it also be much faster to be able to mmap() the process image into a program's address space, and use direct memory operations from the program, instead of having to execute system calls? Sure, if an application wants to have consistency for areas larger than 4 bytes (the largest atomic op on 32bit ARM), it would still have to resort to read()/write() system calls, but for just switching a single input / output, for example, (or reading/writing single values of 4 bytes or less from/to field busses) this could speed up applications even further, because you'd save one or even two context switches to the kernel. Furthermore, the bounds checks could all happen on the initial mmap() system call (+ page faults by the CPU), further increasing performance.

As Volker has already said, this architecture is largely driven by the traditional EN61131 model where access to the process image is controlled by a single, central entity to avoid synchronization issues. On top of that, this architecture allows for fine-grained access control (though this is not implemented in piControl currently), i.e. granting read or write access for specific regions in the process image only to certain clients.

However I do see how an mmap-able process image could be useful for some use cases. This could be enabled on driver load via a module parameter.

cseiler wrote:Why implement waiting for events via a ioctl, KB_WAIT_FOR_EVENT? Why not make the file descriptor pollable via select/poll/epoll? This would allow easy integration into existing event loops.

I think the event support we currently have is not the last word and will be the subject of further discussions internally. We welcome comments like this to come up with an architecture that works for everyone.

cseiler wrote:The Core 3 has an ARMv8 chip - but Raspbian only supports the 32bit architecture (and the chip in an effective ARMv7 mode). I would expect that the Revolution Pi would also only work in 32bit mode? Since there's only 1 GiB of RAM anyway, from that perspective this isn't a limitation - however, the ARMv8 architecture supports additional instructions (especially related to floating point arithmetics) in ARMv8 mode. Do you know if userspace applications are able to make use of the processor improvements even in 32 bit mode?

Unless I'm missing something, the 32-bit kernel is unable to execute 64-bit ELF binaries. It might be possible to use 64-bit inline Assembler in programs.

I'm aware of the BCM2837's 64-bit capabilities and think it's unfortunate that Raspbian doesn't make use of it (beyond unofficial efforts by the community). The Foundation's rationale is basically lack of manpower/time, the same problem as everywhere else.

So the upcoming Jessie image is based on Raspbian, which means everything is compiled for 32-bit ARMv6. Not ARMv7, which isn't supported by the BCM2835 on the CM1. The userland is identical on CM1 and CM3, only the kernel is special for each model because the CPU is initialized differently.

cseiler wrote:Alternatively, Debian itself appears to run natively (without any Raspbian patches) on the Raspberry Pi 3 in ARMv8 mode (with aarch64 as architecture), at least if I read the wiki correctly - haven't tried it myself. Are there any plans to support the Revolution Core 3 under aarch64 based directly off Debian (+ some firmware packages)? Especially since Debian Stretch (and by extension jessie-backports) now has kernel 4.9 + RT patch already included on some architectures, albeit not ARM?

The prospect of utilizing the 64-bit capabilities is tantalizing, but doing this properly and creating a fully-fledged, well-tested product, in addition to our Raspbian-based image, would require double the existing manpower. Of course if a customer shows up tomorrow and orders thousands of devices on the condition of 64-bit support, it would be a different story.

I haven't tested Debian aarch64 either but cannot imagine that it will support the chips on the RevPi Core logic board out of the box. The Foundation's kernel tree contains dozens of custom patches, e.g. the dwc_otg driver for the USB chip on the RevPi Core, which is the same as on the RasPi, as well as drivers for bcm283x cpufreq & thermal sensor. Many of these patches are not upstreamable for various reasons and they're definitely not included in a stock Debian kernel, so off the top of my head I'd expect at least USB and CPU frequency scaling to not work properly. There's a driver for the thermal sensor in 4.12 mainline, so that problem is solved.

Long story short, 64-bit is definitely interesting and we'd welcome reports from anyone who is trying this, but it's probably too experimental and not something that we or anyone else will be able to ship as an official product at least in the short term.

Once again thanks for your comments, keep 'em coming.

Lukas

cseiler · Post by **cseiler** » 19 May 2017, 19:47

Hello,

thank you very much for your quick responses, very much appreciated.

lukas wrote:I'm delighted to get some hardcore questions here.

lukas wrote:
cseiler wrote:The kernel module parses the configuration JSON file from the filesystem directly - and uses the set_fs(DS_KERNEL) trick to use the VFS API to read the file. Wouldn't it be better for the kernel module to provide a configfs-based system to configure the driver (see e.g. the LIO iSCSI target or the USB Gadget support for examples of the use of configfs for kernel configuration) and use a userspace helper to parse the current JSON configuration and map that to the configfs backend? (Note that set_fs is likely to be removed from the kernel at some point in the future, see https://lwn.net/Articles/722267/ for details.)
One advantage of the JSON file is that it can be copied to other RevPi Cores, allowing for easy duplication of a configuration to multiple hosts. I'm not sure if the same could be achieved as easily with a configfs-based approach.

II just think that parsing JSON in the kernel is probably something to avoid. For example, I was a bit impatient and upgraded my Pi to Jessie manually (currently we have one at our company for development), but kept the kernel, since Debian Jessie officially comes with 3.16, and your Wheezy image already had 4.1, so I assumed that userland shouldn't be a problem. The JSON files generated by the updated Pictory 1.2 caused a NULL pointer dereference in the old piControl.ko module though, until I removed 3 fields ("GUID" : "...", "Connections": [] and "extend" : []) from the JSON file. (Then it worked.) I'm confident that this specific issue won't occur with the Jessie version of the piControl.ko module, but this example illustrates the possible pitfalls of doing parsing in the kernel.

That said, I didn't suggest to get rid of the JSON file - I'm pretty sure you don't want to completely rewrite your Pictory configuration tool.

I merely suggested that a program is run at boot that parses the JSON file and then creates the corresponding configfs entries in order to configure the kernel module. This tool could also be run to reload the configuration. This offers a much cleaner interface from the perspective of the kernel module (while at the same time allowing for a better customization for people who want to do their own thing), and at the same time being compatible with the existing stack.

Obviously this will involve a certain amount of work, and I don't expect anything to happen on this front in the near future, my question was geared at pointing out how a long-term perspective could look like. If you do decide to work on this at some point, I'd be happy to discuss implementation details and provide feedback.

lukas wrote:
cseiler wrote:Why use a global lock piDev_g.lockPI for the process image for everything? Typically values that are read/written from/to that region are 4 bytes or smaller, where atomic operations can be used to ensure that these writes are consistent. Changing individual values could be sped up quite a bit by relying on atomic operations on the kernel <-> userland side. Sure, for large read/write calls one would still need the lock, and the gate thread would have to reimplement the copy operation in terms of a loop of atomic loads / saves (in addition to taking the lock) - and while I haven't measured that, I suspect that this could improve performance for the cases where the user only wants to set a single value.
Also a great suggestion. I guess the global lock is motivated by simplicity (and/or laziness ) and we could add fast paths if the values read/written are <= 4 bytes.

For the record, we would gladly accept pull requests on GitHub for contributions like this.

I can't guarantee that I'll find time for this, but I may take a jab at the issue.

lukas wrote:
cseiler wrote:In lieu of this: wouldn't it also be much faster to be able to mmap() the process image into a program's address space, and use direct memory operations from the program, instead of having to execute system calls? Sure, if an application wants to have consistency for areas larger than 4 bytes (the largest atomic op on 32bit ARM), it would still have to resort to read()/write() system calls, but for just switching a single input / output, for example, (or reading/writing single values of 4 bytes or less from/to field busses) this could speed up applications even further, because you'd save one or even two context switches to the kernel. Furthermore, the bounds checks could all happen on the initial mmap() system call (+ page faults by the CPU), further increasing performance.
As Volker has already said, this architecture is largely driven by the traditional EN61131 model where access to the process image is controlled by a single, central entity to avoid synchronization issues.

Well, but even with the current design you have synchronization issues in some cases: say you want to atomically updated two variables that are not directly next to each other. Then you have three possibilities:

Don't care about anything in between and do a single large write that encompasses both variables. This will overwrite anything in between, which might not be a good idea.
Do two consecutive writes. However, that will not be atomic anymore.
Do a read of the area containing both values, change both in the local copy, and write it back. However, this will again overwrite anything in between, this time maybe not with garbage (as in the first suggestion), but possibly with stale values.

Only for consecutive or single values will this provide an effective synchronization mechanism. And for single values you can easily use mmap() + atomic operations with much better performance characteristics.

lukas wrote:On top of that, this architecture allows for fine-grained access control (though this is not implemented in piControl currently), i.e. granting read or write access for specific regions in the process image only to certain clients.

Ok, yes, that's not possible with mmap - because access control there is only possible with page granularity, and the entire process image fits into a single page. (And there's no reason to increase the size, as far as I can tell.)

However, with that in mind: why not create two char devices? One that supports mmap and one that doesn't? This way it is possible to implement access control later (on the char device that doesn't support mmap), and adjust the permissions on other char device in such a way that it's not accessible to those who are supposed to have restricted access. (Maybe default permissions for the mmap-able chardev to root:root:0600, so that users that want mmap need to adjust permissions themselves?)

lukas wrote:
cseiler wrote:Why implement waiting for events via a ioctl, KB_WAIT_FOR_EVENT? Why not make the file descriptor pollable via select/poll/epoll? This would allow easy integration into existing event loops.
I think the event support we currently have is not the last word and will be the subject of further discussions internally. We welcome comments like this to come up with an architecture that works for everyone.

If you want to react to events dynamically, then having a poll()able file descriptor is probably the easiest way to do so, because this will automatically integrate into any event loop. There's a reason why the kernel implements file descriptors for eventfd(), signalfd(), timerfd(), etc. That said: it might be useful to have a ioctl() to generate a new file descriptor specifically for polling events, and not reuse the file descriptor of the char device itself. I have to admit that this is only a minor detail I noticed while skimming the code, I don't think I'll be using events myself at the moment.

lukas wrote:
cseiler wrote:The Core 3 has an ARMv8 chip - but Raspbian only supports the 32bit architecture (and the chip in an effective ARMv7 mode). I would expect that the Revolution Pi would also only work in 32bit mode? Since there's only 1 GiB of RAM anyway, from that perspective this isn't a limitation - however, the ARMv8 architecture supports additional instructions (especially related to floating point arithmetics) in ARMv8 mode. Do you know if userspace applications are able to make use of the processor improvements even in 32 bit mode?
Unless I'm missing something, the 32-bit kernel is unable to execute 64-bit ELF binaries.

No, it isn't. But that wasn't my question. For example, ARMv8 has twice as many VFP registers as compared to ARMv7 - I wondered whether it is possible to use these even with ARM32 code (it's been quite a while since I've actually written ARM assembly). But from your reaction my guess is the answer is "no".

lukas wrote:So the upcoming Jessie image is based on Raspbian, which means everything is compiled for 32-bit ARMv6. Not ARMv7, which isn't supported by the BCM2835 on the CM1. The userland is identical on CM1 and CM3, only the kernel is special for each model because the CPU is initialized differently.

Well, but the userland is not the main issue here - the performance-critical stuff is going to be in my own code, which I can compile for ARMv7.

(Btw.: it would totally be sufficient for Rasbpian to just create an aarch64 kernel, the 32bit userland will still work with that, but then one could have individual programs make use of advanced ARMv8 features.)

lukas wrote:
cseiler wrote:Alternatively, Debian itself appears to run natively (without any Raspbian patches) on the Raspberry Pi 3 in ARMv8 mode (with aarch64 as architecture), at least if I read the wiki correctly - haven't tried it myself. Are there any plans to support the Revolution Core 3 under aarch64 based directly off Debian (+ some firmware packages)? Especially since Debian Stretch (and by extension jessie-backports) now has kernel 4.9 + RT patch already included on some architectures, albeit not ARM?
The prospect of utilizing the 64-bit capabilities is tantalizing, but doing this properly and creating a fully-fledged, well-tested product, in addition to our Raspbian-based image, would require double the existing manpower. Of course if a customer shows up tomorrow and orders thousands of devices on the condition of 64-bit support, it would be a different story.

Sure, completely understandable. And sorry, while it's likely we'll order more devices in the future, it won't be on the order of 1000s.

(Note thought that 64bit itself is not really that interesting on the RevPI, as even the CM3's RAM and Flash fit into 32 bits each, the interesting part is the new instruction set, more capabilities by the CPU such as more registers, and the (k)ASLR possibilities when having a larger address space.)

lukas wrote:I haven't tested Debian aarch64 either but cannot imagine that it will support the chips on the RevPi Core logic board out of the box.

i only got my information from https://wiki.debian.org/RaspberryPi3, which doesn't mention USB at all, just that GPU, WiFi and Bluetooth doesn't work. WiFi and Bluetooth isn't available anyway in the RevPi, so only the GPU could be an isssue. (But at least not a huge deal for us, as we're not using it even with the current RevPi CM1 we have here.) Of course, people might not have tested the other parts at all and they might also not work.

Anyway, thanks again for your quick reply and thanks for the great product!

Regards,
Christian

cseiler · Post by **cseiler** » 22 May 2017, 10:12

Hello again,

cseiler wrote:
lukas wrote:
cseiler wrote:Why use a global lock piDev_g.lockPI for the process image for everything? Typically values that are read/written from/to that region are 4 bytes or smaller, where atomic operations can be used to ensure that these writes are consistent. Changing individual values could be sped up quite a bit by relying on atomic operations on the kernel <-> userland side. Sure, for large read/write calls one would still need the lock, and the gate thread would have to reimplement the copy operation in terms of a loop of atomic loads / saves (in addition to taking the lock) - and while I haven't measured that, I suspect that this could improve performance for the cases where the user only wants to set a single value.
Also a great suggestion. I guess the global lock is motivated by simplicity (and/or laziness ) and we could add fast paths if the values read/written are <= 4 bytes.

For the record, we would gladly accept pull requests on GitHub for contributions like this.
I can't guarantee that I'll find time for this, but I may take a jab at the issue.

I did have a look at this during the weekend and I would first like to understand a bit more what kind of concurrency model you're going for here before actually writing some code - the things is: my questions about this would be far more technical than the ones I've already asked here - so should I do that here or simply open an issue about this on github?

Regards,
Christian

lukas · Post by **lukas** » 22 May 2017, 14:10

cseiler wrote: I did have a look at this during the weekend and I would first like to understand a bit more what kind of concurrency model you're going for here before actually writing some code - the things is: my questions about this would be far more technical than the ones I've already asked here - so should I do that here or simply open an issue about this on github?

IMHO either way is fine. Asking questions in the forum has the advantage that lurkers (both inside Kunbus and externally) are enabled to learn, OTOH this user interface may not be the most convenient if you're used to GitHub or mailing lists.

Thanks,

Lukas

Kernel module + Core 3 questions

Kernel module + Core 3 questions

Re: Kernel module + Core 3 questions

Re: Kernel module + Core 3 questions

Re: Kernel module + Core 3 questions

Re: Kernel module + Core 3 questions

Re: Kernel module + Core 3 questions