
Thursday, September 15, 2022

16384 columns in Collabora Online

After the work to support 16384 columns in LibreOffice Calc, I have also made sure that the feature works in Collabora Online as well, thanks to funding from NGI.

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 871498.


Since the Calc core is the same, the underlying functionality works just like in LibreOffice. But the online functionality presented some additional challenges that needed handling, as having more cells in a spreadsheet may mean more data sent over the network, slowing things down. There definitely used to be problems with large spreadsheets; the tiled rendering used by Online was in fact limited to 500000 rows, compared to the 1048576 rows of the desktop version (git history shows that this started at only 1000 rows and was then gradually raised over time as things improved). So together with raising the column count to 16384, I have raised this to the normal 1048576 rows as well.

While working on this, one of the problems I needed to handle was a rather unusual one - some automated tests were failing because of timing out. Only with debug builds, because those of course do more checks than optimized release builds. The problem turned out to be how some tests checked the resulting spreadsheets after some operations. Since testing Calc running in a browser is more complicated than testing the desktop version, what the tests did was to select a large number of cells (e.g. one entire row) and copy that to the clipboard; the clipboard data was then present in the browser's HTML data, so the testing framework could inspect it and verify it was as expected. And this failed with unoptimized debug builds because even though the spreadsheet was almost empty, the unoptimized code checked every cell involved, and together that added up to enough to exceed the test timeout. Ironically enough, a significant portion was spent in code that was trying to optimize the size of the clipboard data. So I needed to improve the tests and optimize the handling of unoptimized code, as strange as that may sound.

Now with hopefully all performance problems solved, Collabora Online 22.05 should support these spreadsheet sizes just as well as the desktop version.






Friday, July 15, 2022

Making unsorted lookups in Calc fast

The VLOOKUP spreadsheet function by default requires the searched data to be sorted, and in that case it performs a fast binary search. If the data is not sorted (for example because it would be impractical to have the data that way), it is possible to explicitly tell VLOOKUP that the data is not sorted, in which case Calc did a linear one-by-one lookup. And there are other functions such as COUNTIF or SUMIF that essentially do a lookup too, and those cannot even be told that the data is sorted, so they always processed the data linearly. With large spreadsheets this can actually take a noticeably long time. Bug reports such as tdf#139444, tdf#144777 or tdf#146546 say operations in such spreadsheets take minutes to complete, or even "freeze".
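(Just for illustration, not a formula from those bug reports: the fourth VLOOKUP parameter is what says this, so e.g. =VLOOKUP(A2;D1:E100000;2;0) tells Calc that the lookup range is not sorted and that an exact match is wanted.)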

I wanted to do something about those for quite a while, as with the right idea making them much faster should actually be fairly simple. And the simple idea I had was to let Calc sort the data first and then use a fast binary search. These documents usually do lookups in the same fixed range of cells, so repeating a linear search over the same unsorted data was rather a waste. Surely Calc should be able to sort the data just once, cache it and then use that cached sort order repeatedly. In fact VLOOKUP already had a cache for the results of lookups in the same area, used when doing a lookup in the same row but different columns.
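Just to illustrate the idea (this is a simplified standalone sketch, not the actual Calc code, and all the names are made up), the cached "sort order" can simply be a list of row indices sorted by the values they point to, and every lookup then becomes a binary search over that list:

#include <algorithm>
#include <numeric>
#include <vector>

struct LookupCache
{
    std::vector<int> sortedRows; // row indices of the queried range, sorted by cell value
};

// Built once per queried range, then reused for every lookup into that range.
LookupCache buildCache(const std::vector<double>& column)
{
    LookupCache cache;
    cache.sortedRows.resize(column.size());
    std::iota(cache.sortedRows.begin(), cache.sortedRows.end(), 0);
    std::sort(cache.sortedRows.begin(), cache.sortedRows.end(),
              [&column](int a, int b) { return column[a] < column[b]; });
    return cache;
}

// Binary search using the cached order: O(log n) per lookup instead of a linear scan.
int lookupRow(const LookupCache& cache, const std::vector<double>& column, double value)
{
    auto it = std::lower_bound(cache.sortedRows.begin(), cache.sortedRows.end(), value,
                               [&column](int row, double v) { return column[row] < v; });
    return (it != cache.sortedRows.end() && column[*it] == value) ? *it : -1;
}

Building the cache once is O(n*log(n)), and every lookup after that is O(log(n)) instead of O(n), which is where the dramatic difference in the numbers below comes from.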

I finally found the time to do something about this when SUSE filed a bug with Collabora about their internal documents freezing on load and then crashing after 10 minutes. The LO thread pool class has a 10-minute timeout as a safety measure, after which it aborts, and the large number of lookups in the documents actually managed to exceed that timeout. So I can't actually say how slow it was before :), but I can quote Gerald Pfeifer from SUSE reporting the final numbers:

Before                                      After
76m:56s CPU time (crash)                    9s CPU time (6s clock time)
126m:23s CPU time (crash)                   25s CPU time (15s clock time)
160m+ CPU time (crash)                      38s CPU time (23s clock time)
8m:56s CPU time (8m:56s clock time)         14s CPU time (5s clock time)

This work is available in LibreOffice 7.4, and the TDF bugreports show similar improvements. In fact tdf#144777, titled "countifs() in Calc is slower than Excel's countifs()", now has a final comment saying that MS Office 2021 can do a specific document in 26s and LO 7.4 can do it in 2s. Good enough, I guess :).



Thursday, May 12, 2022

Improving Calc support for 16384 columns

So I enabled support for up to 16384 columns in Calc by default some time ago, but just getting it to work was not necessarily the end of the work. Making Calc have 16 times more columns means that any operation that works on entire columns is suddenly 16 times slower, or even worse. Similarly this could easily lead to 16x more memory used. So the support not only needs to work, but it also needs to be usable.

In theory, adding a number of empty columns to the end of a spreadsheet should not make a difference, but in practice it does. With 1024 columns it is not as necessary to ignore those empty columns as it is with 16k, and a lot of the code dates back to the times when Calc supported even fewer columns (256?), where being a little inefficient here or there didn't show. But now it suddenly did.

For example, if you protect or hide all unused columns up to the end of the spreadsheet, then hitting the right arrow key on the last accessible cell makes Calc check all cells to the right for whether it's possible to go into them. And checking whether a column is hidden requires searching the list of column information, which is not trivial (it's compacted in order not to waste memory). The barely noticeable cost of this with 1024 columns got large enough to cause noticeable delays. Fortunately the ColHidden() function is smart enough to also return the first and last column of the compacted range sharing the flag; the cursor navigation code just hadn't bothered to use that information until now, but now it needed to.
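A tiny standalone model of the difference (this is not the actual Calc code; the real ColHidden() and the compacted column storage look different, but the principle is the same): instead of asking about every single column, the navigation can jump over whole hidden ranges at once.

#include <vector>

struct HiddenRange { int firstCol; int lastCol; };     // compacted storage: ranges of hidden columns
using HiddenRanges = std::vector<HiddenRange>;

// Is nCol hidden? If so, also report the last column of the range sharing the flag.
bool colHidden(const HiddenRanges& ranges, int nCol, int& rLastInRange)
{
    for (const HiddenRange& r : ranges)
        if (nCol >= r.firstCol && nCol <= r.lastCol)
        {
            rLastInRange = r.lastCol;
            return true;
        }
    return false;
}

// Cursor navigation: skip each hidden range in one step instead of testing 16384 columns one by one.
int nextVisibleColumn(const HiddenRanges& ranges, int nStartCol, int nMaxCol)
{
    int nCol = nStartCol;
    while (nCol <= nMaxCol)
    {
        int nLastInRange;
        if (!colHidden(ranges, nCol, nLastInRange))
            return nCol;            // found a visible column, the cursor can go there
        nCol = nLastInRange + 1;    // jump over the whole hidden range at once
    }
    return -1;                      // everything up to nMaxCol is hidden
}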

Another example, and that's quite a large topic, is allocating columns. If most of those new columns are not actually used, then it makes sense to allocate them only when needed, right? That will save memory, and it will make things faster too, because there is no need to check those empty columns. That idea got implemented back when this 16k work was started by others, adding e.g. the GetColumnsRange() function that clamped a range to the allocated columns, but in practice it is not as simple as that.

One of the issues here is, let's say, the case of selecting an entire row (clicking the row number to the left of the table does that easily) and then hitting Ctrl+B to make the entire row bold. That should not clamp the column range to the allocated columns, because if I later enter something into cells in those columns, I expect that to be bold. But if Calc allocates all columns for this, and maybe I do not intend to enter values anywhere else except the first rows, then allocating all columns will be a waste. The solution to this is having default column data. The ScTable class now, besides having a list of allocated ScColumns, also has a ScColumnData member that stores some data for all not-yet-allocated columns. Set the bold flag for all allocated columns and also in the default, and problem solved.
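As a toy model of the default-column-data idea (the names only loosely follow the real ScTable/ScColumnData classes, this is not the actual code):

#include <cstddef>
#include <vector>

struct ColumnData { bool bBold = false; };      // per-column attributes

struct Table
{
    std::vector<ColumnData> aAllocatedCols;     // only the columns actually in use
    ColumnData aDefaultColData;                 // applies to every not-yet-allocated column

    void setBoldForAllColumns()
    {
        for (ColumnData& rCol : aAllocatedCols) // the existing columns
            rCol.bBold = true;
        aDefaultColData.bBold = true;           // and everything allocated later inherits this
    }

    bool isBold(std::size_t nCol) const
    {
        return nCol < aAllocatedCols.size() ? aAllocatedCols[nCol].bBold
                                            : aDefaultColData.bBold;
    }
};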

Except then, GetColumnsRange() clamping to allocated columns becomes incorrect, because now it's possible to have set data even beyond allocated columns, such as this bold flag. So I changed GetColumnsRange() to simply return the given range, without any adjustments, and then added the better-named GetAllocatedColumnsRange() for cases where the code knows it wants only the allocated range.

Somewhat similarly to the bold case, merely showing or going to an unallocated column should not allocate it. Otherwise hit e.g. Ctrl+Right one time too many and the cursor going to column XFD would make all columns get allocated. But that causes yet another complication - I am now at an unallocated column, and all operations should either detect that the column is not allocated and return, or allocate the column if needed. The initial 16k work added CreateColumnIfNotExists() exactly to protect such accesses and allocate the column if needed. It's just that this needed adding to quite a few places, and some were still missing it, while others were calling it unnecessarily, causing unnecessary column allocations. So I needed to work on these over time. I eventually went as far as changing Calc to initially allocate just one column. Before that Calc used to allocate 64 columns by default, so a number of places missing such checks kept working because normally people didn't work with more than 64 columns (and so this 64 default was a reasonable choice at the time, as there was really a lot to check and fix). Now that I have changed this to just one column and fixed all tests, it looks like I've rooted them all out (at least I'm still getting only very few bugreports about something breaking :) ).
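Reusing the toy Table from the sketch above (again only an illustration; the real Calc function is CreateColumnIfNotExists()), the access pattern described here looks roughly like this: writes allocate the column on demand, reads must not allocate and instead fall back to the default column data.

// Writing: the column has to exist, so allocate it (and anything before it) if needed.
void setCellBold(Table& rTable, std::size_t nCol)
{
    if (nCol >= rTable.aAllocatedCols.size())   // CreateColumnIfNotExists() in real Calc
        rTable.aAllocatedCols.resize(nCol + 1, rTable.aDefaultColData);
    rTable.aAllocatedCols[nCol].bBold = true;
}

// Reading: must NOT allocate, unallocated columns simply report the default data.
bool isCellBold(const Table& rTable, std::size_t nCol)
{
    return rTable.isBold(nCol);
}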

Drawing, somewhat unexpectedly, turned out to be a possible performance problem too. There are a few ways in which cells to the left can affect the drawing of cells to the right. If you enter a too-long text into a cell, it will overflow to the right, into the space of the next cell, or possibly even several cells. So when Calc is drawing, let's say, a couple of cells around the 10000th column, it actually needs to check also all the 10000 columns before them. Somebody back in the day thought about optimizing this, and so before Calc draws cells, the FillInfo() function first collects information about all the cells to draw and also all the cells to the left. What possibly(?) was an optimization with 256 or 1024 columns is a problem with 16384 columns. Even allocating and clearing all the memory actually had a noticeable performance impact. Sadly, as sometimes happens to be the case with optimizations from the OpenOffice.org times, whoever wrote this made it slow. FillInfo() collects all data necessary for drawing a cell into the struct CellInfo, and all that info is collected also for all the cells to the left, even though most of it is not used for them. So I had to find out what was necessary and split that out (and provide proper abstraction, because real programmers back in the day used direct data access, right?).


Some of the problems can even be a bit funny. Have you created e.g. a named range called DAY1, NUM1, LOG10 or even DOG10? Well, now you can't, since those are now valid cell addresses, going up to XFD1. So Calc now needed special backwards compatibility code for this.

I expect the real test of this comes when it becomes part of the LibreOffice 7.4 release or Collabora Online. But so far it seems to work rather well.

This work is funded/sponsored by DEVxDAO as part of its mission to support open source and transparent research and development of emerging technologies and frameworks.


Wednesday, April 20, 2022

Improving text layout performance

So I've been working on improving LO text layout performance, as specified by the TDF tender. As it says, text layout in LO can be rather slow, because of problems like repeated text layout calls for the same text.

Let's have a look at a perf profile for PDF export of the document from bug#116400:

There are two major costs here:

The first one is splitting text into script runs (separating runs of e.g. Latin text from RTL text). About 61% of the time is spent in vcl::text::TextLayoutCache. Which is rather strange for something called a 'cache'. But this is one of those cases of poor naming: the class is actually not a cache, it is the result of the script run splitting. It is called TextLayoutCache probably because callers are supposed to cache it and pass the same item to several OutputDevice calls ... which mostly does not happen. So whenever a text is to be drawn, this gets recreated. To make things even worse, it is done for the entire string, even if only a part of it is drawn, so this supposed cache actually makes things slower.

This can be fairly easily fixed by turning the class into an actual cache. Each TextLayoutCache instance depends only on the string, so it's easy to keep a reasonable number of them in one global cache.
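A bare-bones sketch of what that means (illustrative only; the real VCL code is more careful about cache size, threading and the actual data stored):

#include <map>
#include <memory>
#include <string>

struct TextLayoutData { /* the result of splitting one string into script runs */ };

std::shared_ptr<const TextLayoutData> getTextLayoutData(const std::u16string& rText)
{
    static std::map<std::u16string, std::shared_ptr<const TextLayoutData>> aCache;
    auto it = aCache.find(rText);
    if (it != aCache.end())
        return it->second;                  // reuse the runs computed for this string earlier
    if (aCache.size() >= 100)               // keep the cache bounded
        aCache.clear();
    auto pData = std::make_shared<const TextLayoutData>(/* compute the script runs here */);
    aCache.emplace(rText, pData);
    return pData;
}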

The second problem is breaking text into multiple lines at suitable places. In this case the problem was in the ICU library we use. The common scenario when breaking text is finding the first break and then continuing with the same text to find the following break, and so on. The ICU code tries to cache this if the position in the text is close enough to the previous one, and if it's not close enough but after the last position, it tries to only walk back a bit instead of repeating the entire work from the beginning of the string. But for whatever strange reason this walking back didn't work, and it walked back to the very beginning. And the code handling the result of the walk-back didn't check how far it had actually gone and simply reset the position to whatever the result was. So in practice the breaking almost always started from the beginning, even if the last position was a way more reasonable place to start breaking from. So 26% of the time is spent breaking the same text over and over (and it's only 26% because the script run splitting is even more expensive).
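For reference, the typical calling pattern looks roughly like this (plain ICU API, not the LO wrapper code, just to show what "continuing with the same text" means):

#include <unicode/brkiter.h>
#include <memory>

void findLineBreaks(const icu::UnicodeString& rText)
{
    UErrorCode status = U_ZERO_ERROR;
    std::unique_ptr<icu::BreakIterator> pBreaker(
        icu::BreakIterator::createLineInstance(icu::Locale::getDefault(), status));
    if (U_FAILURE(status))
        return;
    pBreaker->setText(rText);
    // Each iteration asks for the break following the previous position, i.e. the positions
    // keep moving forward in the same text - exactly the case the ICU caching is meant for.
    for (int32_t nPos = pBreaker->first(); nPos != icu::BreakIterator::DONE; nPos = pBreaker->next())
    {
        // nPos is a candidate line-break position
    }
}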

I've reported this to ICU together with a suggested fix to not reset position to beginning if the last position is better, they've confirmed the problem, but apparently want to look at why the walking back doesn't work in the first place, and there has not been an actual fix from them yet. So I've at least pushed my patch for now.

The resulting perf profile now looks much better:

As can be seen from the number of cycles at the bottom, this is now almost 10x faster (well, ok, 8x to be more precise). The script run splitting can't be seen anymore (it's ~0.1% now), text breaking is still there, but way smaller (6%, and that's 6% of a total that's 8x smaller, so it would be 0.75% compared to the original 26%). Not bad. The PDF generation still takes a couple of seconds (it's 400 pages after all), but it's way faster.

Another problem I noticed while working on this was the related bugreport #144515 (and a couple more that I've closed as its duplicates):

The primary cost here is OutputDevice::ImplLayout(), which lays out text into glyphs and their positions. It is possible to cache this using the SalLayoutGlyphs class, and e.g. Writer has already started caching that for repeated calls, but in this case it's Calc using the EditEngine class, which does no caching.

So as a fix I've moved the caching code to VCL and turned it into a generic SalLayoutGlyphsCache class, and then made this place use that cache ... which didn't really help that much. After investigation it turned out that EditEngine tries to fit the given text into the given paper size (Calc's cell in this case), and so it repeatedly asks to lay out the entire long string in the cell, then breaks the line at the needed width, and then it repeats the same with the rest of the string for the next line, and so on. Which again results in horrible O(N^2) performance that mostly repeats the same work over and over again.
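A toy illustration of why that is quadratic (nothing to do with the real EditEngine code): if every line starts by laying out the entire remaining text, the same tail gets processed over and over.

#include <cstddef>
#include <string>

std::size_t layoutWork(const std::string& text, std::size_t charsPerLine)
{
    std::size_t work = 0;
    for (std::size_t start = 0; start < text.size(); start += charsPerLine)
        work += text.size() - start;    // each line lays out everything from 'start' to the end
    return work;                        // roughly N^2 / (2 * charsPerLine) for N characters
}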

But that should be possible to avoid, right? If it's repeatedly the same text, just a subset with an increasing starting index, then presumably the glyphs are the same, and their positions are also the same, just at an offset. Well, it's not actually that simple, as in some cases it's not possible to cut glyphs for text at a random place and hope it'll be exactly the same as what a full text layout would give, since text layout may try e.g. to position several adjacent spaces more nicely. But after a couple of attempts Noel pointed out to me that HarfBuzz actually provides information about places where it's not safe to break. Finally, I noticed that the problem with a number of those bugreports is people having small Calc cells with long text, where most of the text is not seen. So for the default case it can be made to show only the start of the text that fits, and so I made it possible for EditEngine to stop breaking lines once the given size is filled, instead of pointlessly processing lines that won't be needed.

Again, the resulting profile looks much better:

The operation is now almost 100x faster (*cough*, ok, 62x) and is just a relatively small part of the total picture. Let's call that good enough.

In other somewhat related news, the fontconfig library has a new stable release that finally includes some performance improvements when looking up fonts (its webpage has not been updated, but it is available for download in the release list).

BTW, since this was a TDF tender work, these improvements have been funded by TDF donations.


Thursday, March 31, 2022

Clang 14 faster at building LibreOffice

So I've updated my Clang to the newly released version 14, and while doing the LibreOffice rebuild afterwards I noticed that it seemed to build faster. I didn't measure it, but the build finished sooner than I expected. So of course I measured it.

As a simple reference I used my year-old post about Clang 11 building faster with PCH, where Calc's Library_sc built in 4 minutes and 39 seconds. And indeed now it's faster:

 1105.84user 88.35system 2:55.07elapsed 682%CPU (0avgtext+0avgdata 1666576maxresident)k
 180904inputs+2272520outputs (41390major+23529938minor)pagefaults 0swaps

Just to make sure it's not something else, I went back to Clang 13 to test it:

 1581.76user 93.84system 4:14.75elapsed 657%CPU (0avgtext+0avgdata 1916380maxresident)k
 153912inputs+2316696outputs (41891major+27245569minor)pagefaults 0swaps

So yes, it's real, and it's the compiler. It also suggests that there was an improvement between Clang 11 and Clang 13 as well, although not as noticeable as this one.

I have no idea why that is, it seems too big a difference to be just something random, but I see nothing relevant in the release notes (and it's not DWARF5, I tested that one). I also have no idea if it's a code change or if it's the compiler being faster because it generates faster code and is self-built. But hey, it's nice. I still vaguely remember the times when I was trying to avoid full Calc rebuilds like the plague, but that seems like a long time ago.


Tuesday, March 8, 2022

Enabling Calc support for 16384 columns by default

For the last couple of weeks I have been working on the 16k columns support in Calc. There's been a lot of work on this already by Noel and others, but so far it has been hidden behind an experimental option, and normally documents open only with the "normal" 1024 columns support. The goal of this work is to make the 16k support stable enough for it to be the default, so that people who need this many columns can finally get them without any complications.

As of now all Calc tests pass with the default switched to 16k, and I've also dealt with all the known problems from tdf#133764 (minus a few rare corner cases that I can deal with later). But I'm pretty sure there are more hidden problems lurking, either crashes because of incorrect bounds checking, or performance problems when some code suddenly deals with 16x more columns. So the next step is to enable this by default in master and collect compl... feedback from guin... testers :).

I have the change already in Gerrit and expect to push it later today. If you'll be lucky unlucky enough to notice any problems, please report them in Bugzilla and set your bugreport to block tdf#133764, and I'll sort it out. One of the things that remains to be decided is how to handle this from the user's point of view. So far it seems it'll be fine to just load everything in 16k without saying anything, but a part of collecting feedback is checking whether and how much backwards compatibility handling would be needed. If all goes well, and so far I don't see why it shouldn't, 7.4 will ship with 16k columns being the default.

Just to be clear, this is only about columns, not rows. The experimental option also changes rows from 1048576 to 16777216, but that's not the focus of this work, so the default will stay at 1048576. First of all, Excel also only supports 1m rows, so there's not(?) such a demand for it, and it would also require more work (16m rows is a lot of pixels, and it doesn't fit into 32 bits).

This work is funded/sponsored by DEVxDAO as part of its mission to support open source and transparent research and development of emerging technologies and frameworks.

Fun meaningless fact: Column 16384 is called XFD.


Friday, October 22, 2021

Optimizing LibreOffice for a larger number of users

Have you ever edited a document in LibreOffice in more than one window? Right, neither have I. Who'd think about LibreOffice and more than one user at the same time, right? Except ... somebody did and that's how collaborative editing based on LibreOffice works. For whatever strange reason, somewhen in the past somebody thought that implementing multiple views for one document in OpenOffice (StarOffice?) was a good idea. Just select Window->New Window in the menu and you can edit your favourite document in 50 views that each show a different part of the document and update in real time. And that, in fact, is how collaborative editing such as with Collabora Online works - open a document, create a new view for every user, and there you go.

But, given that this has never really been used that much, how well did the original relevant code perform and scale for more users? Not that well, it turns out. Not a big surprise, considering that presumably back when that code was written nobody thought the same document could be edited by numerous users at the same time. But I've been looking exactly into this recently as part of optimizing Collabora Online performance, and boy, are there gems in there. You thought that showing the same document in more views would just mean more painting also in those views? Nah, think again, this is OpenOffice code, the land of programming wonders.

Profiling the code

When running Online's perf-test, which simulates several users typing in the same document, most of the time is actually spent in SwEditShell::EndAllAction(). It's called whenever Writer finishes an action such as adding another typed character, and one of the things it does is telling other views about the change. So here LO spends a little time adding the character, and then the rest of the time is spent in various code parts "talking" about it. A good part of that is that whenever an action is finished, that view tells the others about it happening, and then all those views tell all other views how they reacted to it, making every change O(n^2) with the number of views. That normally does not matter, since on the desktop n generally tends to be 1, but hey, add a few more views, and it can be an order of magnitude slower or more.

Redrawing, for example, is rather peculiar. When a part of the document changes, relevant areas of the view need redrawing. So all views get told about the rectangles that need repainting. In the desktop case those can be cropped by the window area, but for tiled rendering used by Online the entire document is the "window" area, so every view gets told about every change. And each view collects such rectangles, and later on it processes them and tells all other views about the changes. Yes, again. And it seems that in rare cases each view really needs its own repaint changes (even besides the cropping, as said before). So there's a lot of repeated processing of usually the same rectangles over and over again.

One of the functions featuring prominently in the CPU costs is SwRegionRects::Compress(), which ironically is supposed to make things faster by compressing a group of rectangles into a smaller set of rectangles. I guess this is one of those cases in OpenOffice where the developer theoretically heard that optimizations are supposed to make things faster, but somehow the practical side of things just wasn't there. What happens here is that the function compares each rectangle with each other one, checking if they can be merged ... and then, if yes and that changes the set of rectangles, it restarts the entire operation. Which easily makes the entire thing O(n^3). I have not actually found out why the restarting is there. I could eventually think of a rather rare case where restarting makes it possible to compress the rectangles more, but another possibility is that the code dates back to a time when it was not safe to continue after modifying whichever container SwRegionRects was using back then, and it has stayed there even though that class switched to std::vector a long time ago.

Another kind of interesting take on things is the SwRegionRects::operator-= in there. Would you expect that rectangles would be collected by simply, well, collecting them and then merging them together? Maybe you would, but that's not how it's done here. See, somebody apparently thought that it'd be better to start with the whole area, then remove the rectangles to paint from it, and then at the end invert the whole thing. The document area is limited, so maybe this was done to "easily" crop everything by that area? It works, except, of course, it is way slower. Just not slow enough to really notice when n is 1.

Other code that works fine with small numbers but fails badly with larger ones is VCLEventListeners, a class for getting notified about changes to VCL objects such as windows. It's simply a list of listener objects, and normally there aren't that many of those. But if LO core gets overloaded, this may grow. And since each listener may remove itself from the list at any point, the loop calling all of them always checks for each of them if the listener is still in the list. So, again, O(n^2). And, of course, it's only rarely that any listener removes itself, so the code spends a lot of time doing checks just in case.

But so that I do not talk only about old code, new code can do equally interesting things. Remote rendering uses LOK (LibreOfficeKit), which uses text-based messages to send notifications about changes. And the intuitive choice for writing text is C++ iostreams, which are flexible, and slow. So there will be a lot of time spent creating text messages, because, as said above, there are many changes happening, repeatedly. And since there are so many repeated messages, it makes sense to have an extra class, CallbackFlushHandler, that collects these messages and drops duplicates. Except ... for many of the checks it first needs to decode text messages back to binary data, using C++ iostreams. And in most cases, it will find out that it can drop some message duplicates, so all these string conversions were done for nothing. Oops.

And there are more ways in which things can get slower rather than faster. CallbackFlushHandler uses an idle timer in order to process all data in bulk and flush it at once only when idle. Except if it gets too busy to keep up, and it can easily get too busy because of all the things pointed out above, it may take a very long time before any data is flushed. To make things even worse, the queue of collected messages will keep getting longer and longer, which means searching it for duplicates and compressing it will take longer. Which in turn will make everything even slower, which again in turn will make everything even slower. Bummer.

All in all, if unlucky, it may not take that much for everything to slow down very noticeably. Online's perf-test, which simulates only 6 users typing, can easily choke itself for a long time. Admittedly, it simulates them all typing at the same time and rather fast, which is not a very realistic scenario, but typing hitting the keyboard randomly and quickly is exactly how we all test things, right? So I guess it could be said that Collabora Online's perf-test simulates users testing Collabora Online performance :). Realistic scenarios are not going to be this bad.

Anyway. In this YT video you can see in the top part how perf-test performs without any optimizations. The other 5 simulated users are typing elsewhere in the document, so it's not visible, but it affects performance.

Improved performance

But as you can see in the other two parts, this poor performance is actually already a thing of the past. The middle part shows how big a difference even one change can make. In this specific case, the only difference is adding an extra high-priority timer to CallbackFlushHandler, which tries to flush the message queue before it becomes too big.

The bottom part is all the improvements combined, some of them already in git, some of them I'm still cleaning up. That includes changes like:

  • SwRegionRects::Compress() is now roughly somewhere around O(n*log(n)), I think. I've fixed the pointless restarts on any change, and implemented further optimizations such as Noel's idea to first sort the rectangles and not compare ones that cannot possibly overlap (see the simplified sketch after this list).
  • I have also changed the doubly-inverted paint rectangles handling to simply collecting them, cropping them at the end and compressing them.
  • One of the things I noticed when views collect their paint rectangles is that often they are adjacent and together form one large rectangle. So I have added a rather simple optimization of detecting this case and simply growing the previous rectangle.
  • Since it seems each Writer view really needs to collect its own paint rectangles, I have at least changed it so that they do not keep telling each other about them all the time in LOK mode. Now they collect them, and only once at the end are they all combined together and compressed, and often thousands of rectangles become just tens of them.
  • Another thing Writer views like to announce all the time in LOK mode are cursor and selection positions. Now they just set a flag and compute and send the data only once at the end if needed.
  • VCLEventListeners now performs checks only if it knows that a listener has actually removed itself. Which it knows, because it manages the list.
  • Even though LOK now uses tiled rendering to send the view contents to clients, LO core was still rendering also to windows, even though those windows are never shown. That's now avoided by ignoring window invalidations in LOK mode.
  • Noel has written a JSON writer which is faster than the Boost one, and made non-JSON parts use our OString, which is faster and better suited for the fixed-format messages.
  • I have converted the from-message conversions to also use our strings, but more importantly I have changed internal LOK communications to be binary rather than text based, so that they usually do not even have to be converted. Only at the end those relatively few non-duplicated text messages are created.
  • Noel has optimized some queue handling in CallbackFlushHandler and then I have optimized it some more. Including the high-priority timer mentioned above.
  • There have been various other improvements from others from Collabora as part of the recent focus on improving performance.
  • While working on all of this, I noticed that even though we have support for Link Time Optimization (LTO), we do not use it, probably because it was broken on Windows. I've fixed this and sorted out a few other small problems, and future releases should get a couple percent better performance across the board from this.
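To give an idea of the Compress() change mentioned in the first item (a simplified standalone sketch, not the actual SwRegionRects code, and it only merges horizontally adjacent rectangles of the same height): sorting first puts mergeable rectangles next to each other, so one pass is enough and nothing needs to restart.

#include <algorithm>
#include <vector>

struct Rect { long left, top, right, bottom; };

static bool canMerge(const Rect& a, const Rect& b)
{
    // same vertical extent and touching/overlapping horizontally -> they form one rectangle
    return a.top == b.top && a.bottom == b.bottom && b.left <= a.right + 1;
}

void compress(std::vector<Rect>& rects)
{
    // sort so that rectangles that could possibly merge end up next to each other
    std::sort(rects.begin(), rects.end(), [](const Rect& a, const Rect& b)
        { return a.top != b.top ? a.top < b.top : a.left < b.left; });

    std::vector<Rect> merged;
    for (const Rect& r : rects)
    {
        if (!merged.empty() && canMerge(merged.back(), r))
            merged.back().right = std::max(merged.back().right, r.right); // grow instead of adding
        else
            merged.push_back(r);
    }
    rects.swap(merged);
}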

This is still work in progress, but it already looks much better, as now most of the time is actually spent doing useful things like making the actual document changes or drawing and sending document tiles to clients. LOK and Collabora Online performance should improve noticeably; recent Collabora versions 6.4.x should already include some of the improvements, and the upcoming Collabora Online 2021 should have all of them.

And even though this has been an exercise in profiling LibreOffice performance for something nobody thought of back when the original OpenOffice code was written, some of these changes should matter even for desktop LibreOffice and will be included starting with LO 7.3.



Wednesday, August 25, 2021

Skia on Mac

So, the current LibreOffice git version now has support for Skia-based drawing also on the Mac, both Raster and Metal (the Mac GPU framework). Below is the obligatory screenshot, and here is the hey-it-can-be-faster video.



Sunday, April 18, 2021

The effect of CPU, link-time (LTO) and profile-guided (PGO) optimizations on the compiler itself

 In other words, how much faster will a compiler be after it's been built with various optimizations?

Given the recent Clang12 release, I've decided to update my local build of Clang11 that I've been using for building LibreOffice. I switched to using my own Clang build instead of the openSUSE packages somewhen in the past because it was faster. I've meanwhile forgotten how much faster :), and the openSUSE packages now build with LTO, so I've built Clang12 in several different ways to test the effect, and this is it:


The file compiled is LO Calc's document.cxx, a fairly large source file, in a debug LO build. The compilation of the file is always the same, the only thing that differs is the compiler used and whether LO's PCH support is enabled. And the items are:

  1. Base - A release build of Clang12, with (more or less) the default options.
  2. CPU - As above, with -march=native -mtune=native added.
  3. LTO - As above, with link-time optimization used. Building Clang this way takes longer.
  4. LTO+PGO - As above, also with profile-guided optimization used. Building Clang this way takes even longer, as it needs two extra Clang builds to collect the PGO data.
  5. Base PCH - As Base, and the file is built with PCH used.
  6. LTO+PGO PCH - As LTO+PGO, again with PCH used.

Or, if you want this as numbers, then with Base being 100%, CPU is 85%, LTO is 78%, LTO+PGO is 59%, Base PCH is 37% and LTO+PGO PCH is 25%. Not bad.

Mind you, this is just for one randomly selected file. YMMV. For the build from the video from the last time, the original time of 4m39s with Clang11 LTO PCH goes down to 3m31s for Clang12 LTO+PGO PCH, which is 76%, which is consistent with the LTO->LTO+PGO change above.


 

Tuesday, April 6, 2021

Clang precompiled headers and improving C++ compile times, conclusion

 With Clang12 almost released, I guess it's high time to write a conclusion to the Clang11 changes that improve compilation times with PCHs. I originally planned to do this after the Clang11 release, but with the process to get the changes reviewed and merged having been so tedious I was glad it was finally over and I couldn't at the time muster the little extra effort to also write this down (I spent way more time repeatedly writing 'ping' and waiting for a possible reaction than writing the code, which was really demotivating). But although the new options are described in the Clang11 release notes, I think it'd be useful to write it down in more detail.

First of all, I've already written about why C++ developers might care, but a picture can be worth more than a thousand words when it shows how this can save you as much as 60% of the build time:


 In case you'd like to see a similar change in your LibreOffice compilation times, it should be as simple as this:

  1. Get Clang11 or newer and use it to build LibreOffice.
  2. Use --enable-pch (preferably --enable-pch=full) to enable PCH use.
  3. Profit.

You may want to check that your ccache and icecream are not too ancient if you use them, but by now any reasonably recent version should do. And if you need to do changes that would repeatedly trigger larger rebuilds (such as changing a header), the trick to temporarily disable PCH builds is to do 'make BLOCK_PCH=1'. And since PCH builds sometimes cause build errors in non-PCH builds because of missing #include's of headers, it's a good idea to touch all your changed files and do 'make BLOCK_PCH=1' before pushing your change to Gerrit. This is all you should need to know, the LO build system will take care of everything else.

As for the rest of the world, this boils down to the two PCH-related sections in the Clang11 release notes.

The first one is using -fpch-instantiate-templates. It needs to be set when building the PCH, but it will work even if you just add it to the CXXFLAGS used for building everything. Recent enough CMake versions should handle this option automatically; I have no idea about other build systems. It should be safe to enable the option for any build that uses a PCH. It's not enabled by default in Clang only because of really rare corner cases with PCHs that are not self-contained. In other words, as long as your PCH works with an empty .cpp file, it's safe, and if your PCH is not self-contained, you'd be better off fixing that anyway.

The second part, using -fpch-codegen -fpch-debuginfo, is more complicated, as it requires build system support, and I'm not aware of any build system besides LibreOffice's providing it. This discussion in a CMake ticket provides an example of how to use the options that seems rather simple. For other build systems, have a look at the description in the Clang11 release notes for all the details and possible complications, and consider whether it'd be worth it. Which it sometimes may not be, as this graph from my previous post shows (Clang means a normal PCH build, Clang+ means only -fpch-instantiate-templates, Clang++ means all 3 options).

Note that you may get rather different results depending on how much you put in your PCHs. Unlike before, now the general rule should be that the more you add to your PCHs, even if shared only by several source files, the faster the builds usually will be. And since these options move building some code to the PCH-related object file, the improvement is usually even better for incremental builds than full rebuilds. I've been using PCHs this way for slightly more than a year, and I forgot already quite some time ago how slow it used to be.


Tuesday, January 21, 2020

Skipping functions from entire directories while debugging (e.g. skip all functions from system headers)

So, today I finally got so tired of navigating (or explicitly stepping over) all the internal functions in gdb (you know, all the inline functions from STL containers, from Boost, from this pointer wrapper class, that string class) that I finally googled 'gdb skip system functions'. And guess what, it's been there since gdb 7.12, from 3 years ago, and it's almost trivial, just adding something like this to ~/.gdbinit:

skip -gfi /usr/include/*
skip -gfi /usr/include/*/*
skip -gfi /usr/include/*/*/*
skip -gfi /usr/include/*/*/*/*

 I feel so stu^H^H^Hproud for catching up only 3 years late.

Thursday, November 28, 2019

Skia branch merged to master

So, the branch implementing VCL drawing based on the Skia graphics library has been merged in.
All(?) the necessary info about how to enable it etc. is in this mail, but there are things that better fit a blog post than a mail, and in this case that's going to be a table and a picture showing how well it may perform. Note that these results are from running visualbackendtest, which is not really a benchmark, so these numbers should be taken with a grain of salt. It's just a test that draws a gradient, several big polygons (each circle is actually 720 lines) and short text.
And LibreOffice of course does many more things than just paint on the screen. And it's not just about performance of drawing (some of these e.g. do not double-buffer, which makes things like alpha blending complicated and slow). And for some of these we could discuss the complicated reasons for why the numbers are what they are. But still, some of the numbers are interesting:
Render method                  FPS
Linux gen (X11)                86
Linux gtk3                     70-90
Linux OpenGL                   45
Linux Skia Vulkan (GPU)        65-90
Linux Skia raster (CPU)        5
Windows GDI                    64
Windows OpenGL                 40-60
Windows Skia Vulkan (GPU)      175-185
Windows Skia raster (CPU)      75-85


Saturday, November 9, 2019

Clang precompiled headers and improving C++ compile times, take #2

It's been almost half a year since I mentioned how precompiled headers do (not) improve C++ compile times. Quite a long time, filled with doing other things, life, occasionally working on getting my patch production-ready and, last but definitely not least, abandoning that patch and starting from scratch again.
It turns out the problems I mentioned last time had already been more or less solved in Clang. But only for C++ modules, not for precompiled headers. *sigh* I had really mixed feelings when I finally realized that. First of all, not knowing Clang internals that well, it took me quite a long time to figure it all out, probably longer than it should have. Second, I've been using C++ modules when building Clang itself, and while it's usable, I don't consider it ready (for example, sometimes it actually makes the build slower), not to mention that it's non-trivial to set up, not standardized yet, and other compilers (AFAIK) do not yet support C++ modules. And finally, WTH has nobody else yet noticed this and done the work for precompiled headers too? After all the trouble with finding out how the relevant Clang parts work, the necessary patches mostly border on being trivial. Which, on the other hand, is at least the good news.
And so, for building LibreOffice, I'm switching to my patched build of Clang. For the motivation, maybe let's start with an updated picture from last time:
This is again column2.cxx, a larger C++ file from Calc. The first row is again compilation without any PCH involved. The second row is unpatched Clang with --enable-pch=full, showing again that way too large PCHs do not really pay off (here it does, because the source file is large, but for small ones such as bcaslots.cxx shown last time it makes things slower). In case you notice the orange 'DebugType' in the third row that looks like it should be in the second row too, it should be there, but that's one of these patches of mine that the openSUSE package does not have.
The third row is with one patch that performs the PerformPendingInstantiations phase already while building the PCH. The patch is pretty much a one-liner, not counting the fallout from some Clang tests failing because of stuff getting slightly reordered by it. Even now I still don't understand why PCH generation had to delay this until every single compilation using the PCH. The commit introducing this had a commit message that didn't make much sense to me, and the test it added works perfectly fine now. Presumably it's been fixed by the C++ modules work. Well, who cares, it apparently works.
The last row adds also Clang options -fmodules-codegen -fmodules-debuginfo. They do pretty much what I was trying to achieve with my original patch, they just approach the problem from a different side (and they also do not have the technical problems that made me abandon my approach ...). They normally work only for C++ modules, so that needed another patch, plus a patch fixing some problems. Since this makes Clang emit all kinds of stuff from the PCH into one specific object file in the hopes that all the compilations using the PCH will need that too but will be able to reuse the shared code instead, LibreOffice now also needs to link with --gc-sections, which throws away all the possibly problematic parts where Clang guessed wrong. But hey, it works. Even with ccache and Icecream (if you have the latest Icecream, that is, and don't mind that it "implements" PCHs for remote compilations by simply throwing the PCH away ... it still pays off).
So, that's it for a single compilation. How much does it help with building in practice? Time for more pretty colorful pictures:

This is a debug LO build on my 4-core (8 HT) Ryzen laptop, Library_sm is relatively small (36 source files), Library_scfilt is larger (154 source files). Plain 'Clang' means unpatched Clang(v9), 'Clang+' is with the PerformPendingInstantiations patch (i.e. the third row above), 'Clang++' is both patches (i.e. the fourth row above). The setting is either --enable-pch=base for including only system and base LO headers in the PCH, or --enable-pch=full for including everything that makes sense. It clearly shows that using large PCHs with GCC or unpatched Clang just doesn't make sense.
Note that GCC(v9) and MSVC(v2017) are there more as a reference than a fair comparison. MSVC runs on a different OS and the build may possibly be slightly handicapped by some things taking longer in Cygwin/Windows. GCC comes from its openSUSE package, which AFAICT is built without LTO (unlike the Clang package, where it makes a noticeable difference).
And in case the graphs don't seem impressive enough, here's one for Library_sc, which with its 598 source files is too big for me to bother measuring it in all cases. This is the difference PCHs can make. That's 11:37 to 4:34, almost down to one third:
As for building entire LO from scratch, it can be like in the picture below (or even better). The effect there is smaller, because the build consists of other things than just building libraries, and some of the code built doesn't use PCHs. And it's even smaller than it could be, because I used --enable-pch=base, as that's what I've been using up to now (although now I'll switch to a higher level). That's about 1h42m without PCHs to 1h14m with unpatched Clang (27% saved), and 1h06m with the patches (and the 8-minute difference is still 11% of the unpatched time). Not bad, given that this is the entire LO build. Those 6 minutes for ccache are there to show the maximum possible improvement (or rather, nowhere near possible, since the compiler normally still has to do the work of actually compiling the code somehow).
In case you'd want to use this too, that's not up to me now. The patches are now sitting and waiting in the LLVM Phabricator. Hopefully somebody there still cares about PCHs too.

Thursday, October 3, 2019

LibreOffice drawing using Skia

This below is a screenshot of my LibreOffice build using the Skia library.
Now, admittedly, this looks way better than it should, as it is actually still far from finished. It is so far X11-only, using the venerable not-that-performant XPutImage(). No Windows, no Vulkan. Yet. Also, while it passes all VCL unit tests, that rather says something about the poor state of coverage of those tests, as they fail to hit any of those abort() calls I still have in a number of places. Well, maybe I should rather post the screenshot from yesterday:

Wednesday, September 11, 2019

Icecream 1.3 and Icemon 3.3 released

A new version 1.3 of the distributed C/C++ compilation tool Icecream has been released. To accompany it, version 3.3 of the GUI monitor Icemon has been released as well.

The changelogs are here and here. In a less changelog-y way, the changes are:
  • Compiler locations are no longer hardcoded anywhere. Previously the compiler automatically packaged and sent to remote nodes was always /usr/bin/gcc (g++, clang, clang++). That might not match the actual compiler used, and the workaround was to manually package the proper one using icecc-create-env. But now it's possible to build even with e.g. CXX=/my/own/build/of/clang and it'll simply work. This should also mean that explicitly setting $ICECC_VERSION should now be needed only for cross-compiling.
  • Slightly better job scheduling, both for remote and local builds. For example, the local machine should no longer be possibly overloaded by running way too many local preprocessor steps.
  • Better compression, both for sending data and for packaged compilers. Compilation data is compressed using zstd if the other node supports it, and compiler environments can be compressed using zstd or xz. This improves performance by reducing both network and CPU usage. Note that while compilation compression falls back to the older method if not supported by the other side, for compiler environments this is more tricky and so it has to be set up manually. You can set e.g. ICECC_ENV_COMPRESSION=xz, but the daemon will not fall back to using any other mechanism. Which means it will use only nodes that are at least version 1.3, the scheduler should also be from 1.3 (run another one if needed, the newest one wins) and the remote node needs to support the compression (1.3 newly uses libarchive, which supports zstd only in its relatively recent releases). So this is mainly useful if you have full control over the Icecream cluster, and by default the compression is the old gzip, for backwards compatibility.
  • Speaking of which, the maximum cache size for compiler environments now defaults to 256MiB. Use the --cache-size option of iceccd for different sizes.
  • Objective C/C++ support has been fixed.
  • Some special workarounds for GCC's -fdirectives-only option that is used when sending sources to remote nodes, as it breaks in some corner cases.
  • The --interface option of the daemons (and scheduler) now allow binding only to a specific network interface, if needed. Note that Icecream still assumes it runs in a trusted network and if that's not so it's up to you to ensure it by using tools such as a firewall.
  • Icemon now displays in the detailed host view which protocol a node supports (1.3 has protocol version 42, env_xz/env_zstd mean it supports compiler environments compressed using xz/zstd).
  • And various other fixes.

Thursday, May 23, 2019

Why precompiled headers do (not) improve C++ compile times

Would you like your C++ code to compile twice as fast (or more)?

Yeah, so would I. Who wouldn't? C++ is notorious for taking its sweet time to get compiled. I never really cared about PCHs when I worked on KDE; I think I might have tried them once for something and it didn't seem to do a thing. In 2012, while working on LibreOffice, I noticed its build system used to have PCH support, but it had been nuked, with the usual poor OOo/LO style of a commit message stating the obvious (what) without bothering to state the useful (why). For whatever reason, that caught my attention; reportedly PCHs saved a lot of build time with MSVC, so I tried it and they did. And me having brought the PCH support back from the graveyard means that e.g. the Calc module does not take 5:30m to build on a (very) powerful machine, but only 1:45m. That's only one third of the time.

In line with my previous experience, on Linux that did nothing. I made the build system support also PCH with GCC and Clang, because it was there and it was simple to support it too, but there was no point. I don't think anybody has ever used that for real.

Then, about a year ago, I happened to be working on a relatively small C++ project that used some kind of an obscure build system called Premake I had never heard of before. While fixing something in it I noticed it also had PCH support, so guess what, I of course enabled it for the project. It again made the project build faster on Windows. And, on Linux, it did too. Color me surprised.

The idea must have stuck with me, because a couple of weeks back I got the idea to look at LO's PCH support again and see if it can be made to improve things. See, the point is, the PCH for that small project was rather small, it just included all the std stuff like <vector> and <string>, which seemed like it shouldn't make much of a difference, but it did. Those standard C++ headers aren't exactly small or simple. So I thought that maybe if LO on Linux used PCHs just for those, it would also make a difference. And it does. It's not breath-taking, but passing --enable-pch=system to configure reduces the Calc module build time from 17:15m to 15:15m (that's a less powerful machine than the Windows one). Adding LO base headers containing stuff like OUString makes it go down to 13:44m, and adding more LO headers except for Calc's own leads to 12:50m. And adding even Calc's headers results in 15:15m again. WTH?

It turns out there's some limit where PCHs stop making things faster and either don't change anything, or even make things worse. Trying with the Math module, --enable-pch=system and then --enable-pch=base again improve things in a similar fashion, and then --enable-pch=normal or --enable-pch=full just doesn't do a thing. Where is that 2/3 time reduction that --enable-pch=full brings with MSVC?

Clang has recently received a new option, -ftime-trace, which shows in a really nice and simple way where the compiler spends the time (take that, -ftime-report). And since things related to performance simply do catch my attention, I ended up building the latest unstable Clang just to see what it does. And it does:
So, this is bcaslots.cxx, a smaller .cxx file in Calc. The first graph is without PCH, the second one is with --enable-pch=base, the third one is --enable-pch=full. This exactly confirms what I can see. Making the PCH bigger should result in something like the 4th graph, as it does with MSVC, but it results in things actually taking longer. And it can be seen why. The compiler does spend less and less time parsing the code, so the PCH works, but it spends more time in this 'PerformPendingInstantiations', which is handling templates. So, yeah, in case you've been living under a rock, templates make compiling C++ slow. Every C++ developer feeling really proud about themselves after having written a complicated template, raise your hand (... that includes me too, so let's put them back down, typing with one hand is not much fun). The bigger the PCH the more headers each C++ file ends up including, so it ends up having to cope with more templates. With the largest PCH, the compiler needs to spend only one second parsing code, but then it spends 3 seconds sorting out all kinds of templates, most of which the small source file does not need.

This one is column2.cxx, a larger .cxx file in Calc. Here, the biggest PCH mode leads to some improvement, because this file includes pretty much everything under the sun and then some more, so less parsing makes some savings, while the compiler has to deal with a load of templates again, PCH or not. And again, one second for parsing code, 4 seconds for templates. And, if you look carefully, 4 seconds more to generate code, most of it for those templates. And after the compiler spends all this time on templates in all the source files, it gets all passed to the linker, which will shrug and then throw most of it away (and that will too take a load of time, if you still happen to use the BFD linker instead of gold/lld with -gsplit-dwarf -Wl,--gdb-index). What a marvel.

Now, in case there seems to be something fishy about the graphs, the last graph indeed isn't from MSVC (after all, its reporting options are as "useful" as -ftime-report). It is from Clang. I still know how to do performance magic ...



Monday, May 20, 2019

Linux perf and KCachegrind

If you occasionally do performance profiling as I do, you probably know Valgrind's Callgrind and the related UI KCachegrind. While Callgrind is a pretty powerful tool, running it takes quite a while (not exactly fun to do with something as big as e.g. LibreOffice).

Recently I finally gave Linux perf a try. Not quite sure why I didn't use it before; IIRC when I tried it somewhen long ago, it was probably difficult to set up or something. Using perf record has very little overhead, but I wasn't exactly thrilled by perf report. I mean, it's a text UI, and it just gives a list of functions, so if I want to see anything close to a call graph, I have to manually expand one function, expand another function inside it, expand yet another function inside that, and so on. Not that it wouldn't work, but compared to just looking at what KCachegrind shows and seeing ...

When figuring out how to use perf, while watching a talk by Milian Wolff, I noticed a mention of a Callgrind script on one slide. Of course I had to try it. It was a bit slow, but hey, I could finally look at perf results without feeling like it was an effort. Well, and then I improved the part of the script that was slow, so I guess I've just put the effort elsewhere :).

And I thought this little script might be useful for others. After mailing Milian, it turned out he had just created the script as a proof of concept and wasn't interested in it anymore, instead developing Hotspot as a UI for perf. Fair enough, but I think I still prefer KCachegrind; I'm used to it, and I don't have to switch the UI when switching between perf and Callgrind. So, with his agreement, I've submitted the script to KCachegrind. If you would find it useful, just download this and do something like:

$ perf record -g ...
$ perf script -s perf2calltree.py > perf.out
$ kcachegrind perf.out



Tuesday, February 18, 2014

$ICECC_VERSION

It's been brought to my attention that the Icecream documentation more or less suggests it is necessary to manually set up $ICECC_VERSION (which also involves creating the environment tarball with the compiler and so on). That is incorrect. I've already updated the documentation to say that, like with pretty much everything, Icecream simply figures it out itself by default.

So if you happen to use $ICECC_VERSION, unless you know why you do that (e.g. cross-compilation), don't. It's not only simpler but also better to leave it up to Icecream to package the system compiler as necessary, as it simply works, and avoids possible problems (such as updating the system compiler and forgetting to update the tarball).

Friday, July 12, 2013

Are you enjoying your Icecream?

A bug has crept into the Icecream 1.0 release that makes the daemon crash on its first start after boot if /var/run is cleared (e.g. it's tmpfs-mounted by systemd). In practice this means that on such systems Icecream does not work unless started manually. So in case your compiles have felt a tad bit slower recently, upgrade to the just-released version 1.0.1. If you run openSUSE, an online update with the fix has already been published.

Thursday, July 11, 2013

Clang's AST dump AKA 'WTH is the compiler doing???'

It's no secret that I use the Clang compiler for development. Although GCC is still somewhat better when things like the performance of the resulting code matter, there are other features that matter more during development. And although again competition helps (it's not difficult to guess where the inspiration for the new error reporting in GCC 4.8 comes from), there are features where I expect it'd be hard for GCC to match Clang. The capabilities and ease of writing Clang plugins are one thing, but there are more hidden secrets, like the AST dump.

If Clang is invoked with the additional options -Xclang -ast-dump, it'll dump its internal representation of the compiled source. This can be pretty useful when the source code doesn't actually mean what one expects, or if something unexpected from elsewhere is interfering. Consider the following (simple, for clarity) example:

#include <iostream>
using namespace std;

class A
    {
    };

class B
    {
    public:
        operator A() const { return A(); }
    };

class C
    : public B
    {
    };

void foo( const A& )
    {
    cout << "A" << endl;
    }

void foo( B& )
    {
    cout << "B" << endl;
    }

int main()
    {
    foo( C());
    } 


Looking only at class C, it may perhaps come as a surprise to some that this prints "A" and not "B". Overlooking the missing const, or not knowing that it will prevent passing the temporary to the function, certainly helps with the surprise, but even if not, still, what is actually going on? With a larger codebase, that can take a lot of time to find out. But finding out what the compiler thinks about the code can help:

$ clang++ -Wall a.cpp -Xclang -ast-dump
...
 `-FunctionDecl 0x90ffb90 <line:29:1, line:32:5> main 'int (void)'
  `-CompoundStmt 0x9100078 <line:30:5, line:32:5>
    `-CallExpr 0x90fffb8 <line:31:5, col:13> 'void'
      |-ImplicitCastExpr 0x90fffa8 <col:5> 'void (*)(const class A &)' <FunctionToPointerDecay>
      | `-DeclRefExpr 0x90fff74 <col:5> 'void (const class A &)' lvalue Function 0x90fec80 'foo' 'void (const class A &)'
      `-MaterializeTemporaryExpr 0x9100068 <col:10, col:12> 'const class A' lvalue
        `-ImplicitCastExpr 0x9100058 <col:10, col:12> 'const class A' <NoOp>
          `-ImplicitCastExpr 0x9100048 <col:10, col:12> 'class A' <UserDefinedConversion>
            `-CXXMemberCallExpr 0x9100028 <col:10, col:12> 'class A'
              `-MemberExpr 0x9100008 <col:10, col:12> '<bound member function type>' .operator A 0x90fe740
                `-ImplicitCastExpr 0x90ffff8 <col:10, col:12> 'const class B' <UncheckedDerivedToBase (B)>
                  `-CXXTemporaryObjectExpr 0x90ffdd8 <col:10, col:12> 'class C' 'void (void)' zeroing


Knowing a bit about how compilers work helps a lot, but even without it this is quite simple to read. From bottom to top, there's a temporary object of class C created and it's cast to its base class B. That's the expected part; the unexpected part is 3 AST nodes higher up, which show that the object is converted to class A by a user-defined conversion using operator A(). Which, as the rest of this part of the AST dump shows, results in calling foo( const A& ). Mystery solved.

(Fun trivia: I once helped a GCC developer to disentangle a problem in a complex C++ testsuite using this. But don't tell anyone ;) . )