Add and enable USE_DUFF_LOOP for Render_Glyph() and Render_Glyph_Blended(), which should give a slightly faster blit on mobile platforms.
Since lots of functions require a conversion to UTF-8 ({unicode,text}{solid,blended,shaded}{normal,wrapped} == 12 functions),
this conversion has been moved into TTF_Size_Internal(), TTF_Render_Internal() and TTF_Render_Wrapped_Internal(), while still being stack allocated.
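For readers unfamiliar with the technique, here is a minimal sketch of a Duff's-device style unrolled loop next to the plain loop it replaces. The helper names or_row_duff and or_row_plain are hypothetical and only illustrate the idea; they are not the macros or functions used in the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* OR `width` source bytes into the destination, unrolled 4x with
 * Duff's device. Hypothetical helper, not the SDL_ttf code. */
static void or_row_duff(uint8_t *dst, const uint8_t *src, size_t width)
{
    size_t n;

    if (width == 0) {
        return;
    }
    n = (width + 3) / 4;      /* number of passes through the unrolled body */
    switch (width & 3) {      /* jump into the body to handle the remainder */
    case 0: do { *dst++ |= *src++;
    case 3:      *dst++ |= *src++;
    case 2:      *dst++ |= *src++;
    case 1:      *dst++ |= *src++;
            } while (--n > 0);
    }
}

/* Reference version: the plain loop, which the compiler may unroll itself. */
static void or_row_plain(uint8_t *dst, const uint8_t *src, size_t width)
{
    while (width--) {
        *dst++ |= *src++;
    }
}
```

Both functions produce the same output; which one is faster depends on the compiler and target, so it has to be measured.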
Other minor changes:
Find_Glyph() gives back the glyph and either glyph->pixmap or glyph->bitmap depending on the data requested.
Rename font->outline as it somehow "shadows" "FT_Outline outline".
font->style isn't correctly initialized: it doesn't add the font_face style the first time.
Re-wrote this more clearly so that font->style holds exactly the styles handled by SDL_ttf.
Switch to upper case for "ft_render_mode_normal", "ft_kerning_default", "ft_render_mode_mono":
freetype.h says those lower-case names are deprecated.
Check that the TTF_Font input pointer is not null.
On 2018-11-06 13:14:07 +0000, Sylvain wrote:
Created attachment 3447
patch
The patch is quite long, but the final code is 80 lines shorter (with more functions, more comments, and Duff loops).
On 2018-11-06 14:43:26 +0000, Sam Lantinga wrote:
Can you rebase this patch on current mercurial?
Thanks!
On 2018-11-06 14:43:59 +0000, Sam Lantinga wrote:
Also, have you done performance testing to make sure this doesn't introduce regression on various platforms?
On 2018-11-06 15:37:43 +0000, Sylvain wrote:
Created attachment 3449
patch
Patch updated.
I'll try to run some performance benchmarks.
On 2018-11-06 16:16:03 +0000, Sylvain wrote:
A quick test, on my PC, with clang and -O2,
rendering the same string 50 times.
SOLID/SHADED (ends up using Render_Glyph):
before the current patch, it takes 22-24 ms.
with the patch and DUFF_LOOPS activated, it takes 19-21 ms.
with the patch and without DUFF_LOOPS activated, it also takes 22-24 ms.
So it is a little faster thanks to duff_loops.
BLENDED (ends up using Render_Glyph_Shaded):
59-64 ms before, or with patch + no duff_loops
65-70 ms with patch + duff_loops
Here, we shouldn't activate the duff_loops. But I believe on mobile, it will be faster.
(BTW the times take into account 50x SDL_FreeSurface.)
On 2018-11-07 10:25:54 +0000, Sylvain wrote:
So I tried on an Android S7.
Only 1 string (the alphabet repeated multiple times), at size 50, which ends up being a texture of 19380x59.
Not taking SDL_FreeSurface into account.
Rendering in: (Blended, Shaded, Solid)
with code:
OLD: before this patch
DUFF_LOOPS: with this patch and duff loops
NO_DUFF: with this patch, and without the duff loops
Using FreeType 2.9.1 (but it shouldn't matter since glyphs are rendered and cached).
Forcing -O2 in CFLAGS of SDL_ttf.
Trying {arm,thumb}x{armeabi-v7a,arm64-v8a}:
thumb: when setting LOCAL_SRC_FILES := SDL_ttf.c
arm: when setting LOCAL_SRC_FILES := SDL_ttf.c.arm
(not sure if this has the same meaning on arm64)
Haven't tried NEON...
The current default for SDL_ttf.c is thumb.
I always took the best time of around 50 tries, because the CPU timings have some variability.
most of the time the best times are (6ms, 2ms, 1ms) for (Blended, Shaded, Solid).
except:
arm64 + DUFF_LOOPS where it's (5,2,1) (in arm)
arm64 + DUFF_LOOPS where it's (5,1,1) (in thumb)
armeabi-v7a + DUFF_LOOPS where it's (5,1,1) (in thumb)
armeabi-v7a + NO_DUFF_LOOPS where it's (6,1,1) (in thumb)
Which means that DUFF_LOOPS should be activated for this target.
On 2018-11-26 15:01:42 +0000, Sylvain wrote:
Created attachment 3503
SDL_ttf.c
Hey, a new version with more things:
The current advance isn't accurate:
First, with kerning, because it rounds and then sums, instead of summing and then rounding, so we lose precision.
Second, because FreeType has improvements to the algorithm:
Added the "smart" kerning mode. This is always enabled.
From FreeType.h:
If you use strong auto-hinting, you must apply these delta values!
Otherwise you will experience far too large inter-glyph spacing at
small rendering sizes in most cases. Note that it doesn't harm to use
the above code for other hinting modes also, since the delta values
are zero then.
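To illustrate the first point (rounding then summing versus summing then rounding), here is a small sketch using 26.6 fixed-point values (1/64 pixel units). The function names are hypothetical, and the sketch assumes non-negative values so that the right shift is well defined; real kerning values can be negative.

```c
#include <stddef.h>

/* Round a non-negative 26.6 fixed-point value to whole pixels. */
static long round_26_6(long v)
{
    return (v + 32) >> 6;
}

/* Behaviour described as inaccurate: round each step, then sum. */
static long width_round_then_sum(const long *adv, size_t n)
{
    long px = 0;
    for (size_t i = 0; i < n; i++) {
        px += round_26_6(adv[i]);
    }
    return px;
}

/* More precise: accumulate in 26.6, round once at the end. */
static long width_sum_then_round(const long *adv, size_t n)
{
    long pos = 0;
    for (size_t i = 0; i < n; i++) {
        pos += adv[i];
    }
    return round_26_6(pos);
}
```

With three steps of 680/64 = 10.625 px each, the first version gives 33 px and the second gives 32 px, so the rounding error grows with the length of the string.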
Added subpixel text rendering, as it's only a matter of translating glyphs by less than 1px and using 'light' hinting.
It's the modern way: letters look smoother and more uniformly spaced.
This mode is ten times slower as no caching is possible, though it remains fast.
To activate it, call TTF_SetFontHinting(font, TTF_HINTING_LIGHT_SUBPIXEL);
Independently of the previous changes, there is an issue with the current SDL_ttf process:
Once FT has rendered the glyph, the metrics change (width/height), but so do the offsets where the bitmap should be copied.
It can happen with subpixel, but also with italic, where the space between letters is sometimes totally wrong.
Strictly speaking, this can happen even in normal mode.
In fact, you only get the real position and size once the glyph is rasterized.
So this is fixed, but we have to clip the glyph against the whole surface before trying to render.
This is some kind of bound check, at glyph level, not at rendering time, so it doesn't hurt perf.
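The glyph-level clipping described above could look roughly like the following sketch. The GlyphRect type and clip_glyph function are hypothetical names for illustration; the actual code in the attachment may be organised differently.

```c
/* Position and size of a rasterized glyph on the target surface
 * (hypothetical type for this sketch). */
typedef struct {
    int x, y, w, h;
} GlyphRect;

/* Clip the glyph rectangle against a surface of surf_w x surf_h.
 * src_x/src_y receive the offset into the glyph bitmap at which
 * copying must start. Returns 0 when nothing is visible. */
static int clip_glyph(GlyphRect *r, int *src_x, int *src_y,
                      int surf_w, int surf_h)
{
    *src_x = 0;
    *src_y = 0;
    if (r->x < 0) {            /* cut off the part left of the surface */
        *src_x = -r->x;
        r->w += r->x;
        r->x = 0;
    }
    if (r->y < 0) {            /* cut off the part above the surface */
        *src_y = -r->y;
        r->h += r->y;
        r->y = 0;
    }
    if (r->x + r->w > surf_w) {
        r->w = surf_w - r->x;
    }
    if (r->y + r->h > surf_h) {
        r->h = surf_h - r->y;
    }
    return r->w > 0 && r->h > 0;
}
```

Because the check runs once per glyph, the inner copy loops do not need any per-pixel bound checks.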
Since it becomes more complex, I have added a buffer to store positions between size() and render_line().
It's simpler, and the string is decoded only once.
And, if you add text shaping, it's also more convenient because render_line() remains the same.
One more bug fix: Wrap() behaves badly with only one line that is unbreakable.
It draws it at full length.
But in fact, it should clip to wrapLength (like it would do if this unbreakable line was in the middle of other lines).
I have tested this, and also compared the subpixel result with the FT demo programs to make sure it was OK.
I have run some random tests (and will do more) and also some pixel-to-pixel comparisons (with text shaping).
No patch this time, but the full SDL_ttf.{c,h} files!
On 2018-11-26 15:02:18 +0000, Sylvain wrote:
Created attachment 3504
SDL_ttf.h
header file
On 2018-11-28 16:34:54 +0000, Sylvain wrote:
Created attachment 3506
Italic screenshot
Here's the italic rendering issue fixed now.
On 2018-11-28 16:40:46 +0000, Sylvain wrote:
Created attachment 3507
SDL_ttf.c
Here's a new version (again).
Add a cache for FT_Get_Char_Index() (char -> index conversion, for the first 127 ASCII values).
Current head is slower at small sizes, because we call FT_Get_Char_Index() twice to convert a char to its index:
once to access cache_metrics and once more to access cache_bitmap/pixmap.
This has been happening since the index started being used as the cache key :(
It was indirectly fixed by the previous patch, because we only called FT_Get_Char_Index() once and stored the result.
This is now improved further with the added cache, which stays valid even after a style/hinting change.
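A minimal sketch of such a char -> index cache is shown below. Everything here is hypothetical: IndexLookup stands in for FT_Get_Char_Index(), the table size of 128 is an assumption covering the ASCII range, and demo_lookup only exists to count calls in this example.

```c
#include <stdint.h>

/* Stand-in for the real lookup (FT_Get_Char_Index in FreeType). */
typedef uint32_t (*IndexLookup)(void *face, uint32_t ch);

/* Zero-initialise before use. The size of 128 is an assumption. */
typedef struct {
    uint32_t index[128];
    uint8_t  valid[128];
} AsciiIndexCache;

/* Return the glyph index, calling `lookup` at most once per ASCII char.
 * The mapping depends only on the face, not on style or hinting. */
static uint32_t cached_index(AsciiIndexCache *c, IndexLookup lookup,
                             void *face, uint32_t ch)
{
    if (ch < 128) {
        if (!c->valid[ch]) {
            c->index[ch] = lookup(face, ch);
            c->valid[ch] = 1;
        }
        return c->index[ch];
    }
    return lookup(face, ch);   /* outside the cached range */
}

/* Counting stand-in used to exercise the cache: `face` points to an
 * int call counter, and the "glyph index" is simply ch + 1. */
static uint32_t demo_lookup(void *face, uint32_t ch)
{
    int *calls = (int *)face;
    (*calls)++;
    return ch + 1;
}
```

Looking up 'A' twice through cached_index() calls demo_lookup only once, while characters outside the table always go to the real lookup.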
Change the minx/yoffset naming to left/top, now that matches FreeType examples.
Don't use DUFFS_LOOP...
From size 1 to 50 it is faster (this is what I tried before!), but beyond that it's much slower.
In fact the Duffs_Loop timings are quite constant; it is the non-duff-loop version which became faster.
We see that in all benchmarks with no Duffs loop:
rendering at size 80 takes the same time as rendering at size 60.
And (not dumped in the next log) rendering at size 70 is faster than at size 60.
I believe we hit some kind of compiler optimisation at this size...
Re-wrote render_glyph a little bit and unrolled render_line, after various tries, to make it faster.
Wrote a more precise benchmark with PerformanceCounter().
Now:
at size 8, it's 2x faster (old 25000, head 41000, new 13000)
at size 100, it's 1.5x faster (old 106000, head 98000, new 61000)
Also some subpixel benchmarks (only on the new version).
On 2018-11-28 16:42:39 +0000, Sylvain wrote:
Created attachment 3508
bench app
Small bench app for various size / modes
On 2018-11-28 16:47:39 +0000, Sylvain wrote:
Created attachment 3509
bench logs
bench outputs:
old: old version (~3 months ago, before I started adding bugs).
currentHead: current head
new: the previous SDL_ttf.c
new_With_Duff_Loops: the previous SDL_ttf.c if you add Duffs Loop (faster up to size 50, slower after...)
On 2018-11-29 13:13:17 +0000, Sylvain wrote:
Created attachment 3510
bench logs android (armeabi-v7a, arm64-v8a)
The previous bench was on Linux, i7-3610QM CPU @ 2.30GHz.
These ones are on a Samsung S7.
on arm64-v8a:
the new version is better than the old version.
starting at size 60-70-80, same phenomenon: new-no-duffs-loop is clearly better than new-with-duffs-loop
on armeabi-v7a: at size 70-80, duffs loop is very slightly better.
So in the end, because of this effect, we shouldn't enable the Duff's loop.
On Android, the usual sizes in practice are 30 to 90.
NB:
on v7a, int64 comparison is not working; need to use "clock_gettime(CLOCK_REALTIME, &res);" instead of PerformanceCounter
all Android benches: there are high variations within the same run.
On 2018-11-29 13:21:09 +0000, Sylvain wrote:
Created attachment 3511
SDL_ttf.c
New version with a small adjustment for the Outline style so that it remains centred.
For Blended, pre-compute alpha_table, so that instead of doing
*dst++ |= pixel | ((Uint32)alpha_table[alpha] << 24);
it can be rewritten as:
*dst++ |= alpha_table[alpha];
A few percent of improvement.
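As a rough sketch of this pre-computation: the table below stores, for each coverage value, the complete 32-bit pixel (colour plus alpha), so the inner loop is a single table lookup and OR. The function names, the ARGB8888 layout and the scaling of coverage by the foreground alpha are assumptions made for this illustration.

```c
#include <stdint.h>

/* Build a table mapping glyph coverage (0..255) to a full pixel value:
 * the RGB part of the colour plus the coverage scaled by fg_alpha in
 * the top byte. Assumes an ARGB8888 layout. */
static void build_alpha_table(uint32_t table[256], uint32_t pixel_rgb,
                              uint8_t fg_alpha)
{
    for (int c = 0; c < 256; c++) {
        uint32_t a = (uint32_t)(c * fg_alpha) / 255;
        table[c] = (pixel_rgb & 0x00FFFFFFu) | (a << 24);
    }
}

/* Inner loop: one lookup and one OR per pixel, no shift or extra OR. */
static void blend_row(uint32_t *dst, const uint8_t *coverage, int width,
                      const uint32_t table[256])
{
    while (width--) {
        *dst++ |= table[*coverage++];
    }
}
```

The table costs 256 multiplications per render call, which only pays off when the text contains more than a few hundred pixels.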
For Solid/Shaded, we can 'ceil' the glyph width to a multiple of the integer size and copy faster with 32-bit or 64-bit instructions.
(64-bit fails on Android arm-v7a, so only 32-bit is activated.)
Gain is quite good:
on linux:
Size 61 shaded: instead of 42 us, it takes 27 us.
from old versions, it means: 71 us -> 27 us
Size 80 shaded: 40 -> 36
from old versions, it means: (also)70 us -> 36 us
on android, even better
armv7a:
Size 61 shaded: 106us, => 57 us
Size 80 shaded: 201 us => 96 us
arm64:
Size 61 shaded: 84 us => 44 us
Size 80 shaded: 106 us => 73us
It doesn't change the output, since the metrics during size calculation aren't changed; only the rendered glyph gets extra padding.
It doesn't add code complexity, since there is already a fallback to clip the glyph if it is outside the output surface.
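A minimal sketch of the padded-width idea for the 32-bit case follows. The names pad_width_32 and or_row_32 are hypothetical, the rows are combined with OR as an assumption, and memcpy is used for the 32-bit accesses so that the sketch is safe on unaligned addresses (compilers typically turn these into plain loads and stores).

```c
#include <stdint.h>
#include <string.h>

/* Round a glyph width in bytes up to the next multiple of 4. */
static int pad_width_32(int width)
{
    return (width + 3) & ~3;
}

/* OR one glyph row into the destination, 4 bytes at a time.
 * padded_width must be a multiple of 4, and both buffers must have
 * that many bytes available (the glyph bitmap is padded with zeros). */
static void or_row_32(uint8_t *dst, const uint8_t *src, int padded_width)
{
    for (int i = 0; i < padded_width; i += 4) {
        uint32_t s, d;
        memcpy(&s, src + i, 4);
        memcpy(&d, dst + i, 4);
        d |= s;
        memcpy(dst + i, &d, 4);
    }
}
```

Since the padding bytes of the glyph are zero, OR-ing them leaves the destination pixels beyond the real glyph width untouched.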
On 2018-11-29 22:06:46 +0000, Sylvain wrote:
Created attachment 3514
bench logs for render_glyph_32
On 2018-12-03 10:12:20 +0000, Sylvain wrote:
Created attachment 3519
SDL_ttf.c
New version! I've added SSE2 and NEON intrinsics versions of Render_Glyph.
They work on unaligned memory (loadu, storeu on SSE);
only the glyph width is rounded, as before.
I'm not very familiar with SSE2 or NEON, but I got them working.
NEON doesn't seem to run faster, so I have left it commented out for now.
(There is probably room for improvement, using prefetch or other instructions...)
Also, this is now built with macros so the compiler knows how to optimise things.
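For illustration, an SSE2 row routine using unaligned loads and stores might look like the sketch below. The function name or_row_128 is hypothetical, the rows are combined with OR as an assumption, and a scalar fallback is included so the sketch also compiles on targets without SSE2.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* OR one glyph row into the destination, 16 bytes at a time.
 * padded_width must be a multiple of 16; no alignment is required. */
static void or_row_128(uint8_t *dst, const uint8_t *src, int padded_width)
{
#if defined(__SSE2__)
    for (int i = 0; i < padded_width; i += 16) {
        __m128i s = _mm_loadu_si128((const __m128i *)(src + i));
        __m128i d = _mm_loadu_si128((const __m128i *)(dst + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_or_si128(d, s));
    }
#else
    /* Scalar fallback with the same result. */
    for (int i = 0; i < padded_width; i++) {
        dst[i] |= src[i];
    }
#endif
}
```

As with the 32-bit variant, only the glyph width has to be rounded up; the destination address can be anything.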
Same metrics, on Linux:
Size 61 shaded:
=> 25 us, with render_glyph_32
=> 22 us, with render_glyph_64
=> 23 us, with render_glyph sse2
Size 80:
=> 35 us, with render_glyph_32
=> 33 us, with render_glyph_64
=> 31 us, with render_glyph sse2
arm v7a (Render_Glyph_32):
Size 61 shaded: => 59 us
Size 80 shaded: => 114 us (a little slower, but that might be down to the testing as well).
arm 64 (Render_Glyph_64):
Size 61 shaded: => 40 us
Size 80 shaded: => 70 us
Also fix the allocation for non-scalable fonts when converting them
(nonscalable/pvfixed_20b.pcf.gz).
For instance, with:
src->pixel_mode = 1 (MONO) src->width=2 src->pitch=4
it would previously allocate, by doing pitch * 8:
dst->width=2 dst->pitch=32
It now allocates:
dst->width=2 dst->pitch=8
On 2018-12-03 10:13:52 +0000, Sylvain wrote:
Created attachment 3520
bench logs for linux/android
Bench logs for previous version
On 2018-12-11 16:24:09 +0000, Sylvain wrote:
Created attachment 3541
SDL_ttf.c
Here's a new version.
Spotted and fixed two new bugs:
Blended mode: the line colour (underline/strikethrough) is always fully opaque when it should use the user's alpha opacity value (colour fg.a).
The issue is that it would display a very bright line when the text is at maybe 10% brightness (depending on the chosen alpha).
Shaded mode: the alpha palette is incorrect. For instance, when the colour alpha is 0, it is then set to OPAQUE (255), which disturbs the a-diff ratio.
So the palette alpha is not regular from 0 to 255 and a text with 10% brightness is not displayed correctly: it looks like it has brightened edges.
A few additional optimisations:
Shaded mode: optimise surface creation:
a division by 255 (x/255) can be optimised as
(x + 1 + (x>>8))>>8, when positive
(x + 255 + (x>>8))>>8 when negative
(which is a little bit faster, and non-negligible at small font sizes compared to the total time taken).
(NB: at small sizes, kerning must be turned off to get a realistic bench value, because its timing cost is too high.)
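The non-negative form of this identity is easy to check exhaustively. The sketch below uses a hypothetical helper name, div255, and only covers the range 0..255*255, which is what a product of two 8-bit values can reach. The negative form is left out because right-shifting a negative value is implementation-defined in C.

```c
#include <stdint.h>

/* Divide by 255 without a division instruction.
 * Exact for 0 <= x <= 255 * 255 (the product of two 8-bit values). */
static uint32_t div255(uint32_t x)
{
    return (x + 1 + (x >> 8)) >> 8;
}
```

For example, div255(255 * 255) gives 255 and div255(254) gives 0, matching integer division.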
Blended mode: optimise surface creation:
provide memory buffer to avoid an extra memset of the full surface from SDL_CreateRGBSurface.
All Blit Glyph routines are now available as duffs loop macros, and cleaned up.
(Still not sure if this is worth enabling on Android.)
Blended mode:
alpha_table doesn't need to be computed as "color | (alpha << 24)" but only as "alpha << 24", since the background is already filled.
Also split into two cases depending on whether Blended is opaque or not:
opaque doesn't need to look up alpha_table at all and can be done faster on the fly.
Optimised the SSE and NEON blending routines.
Especially in opaque blending:
NEON can process 16 pixels at once using vzipq_u8 (interleave),
which produces a two-lane value as output (uint8x16x2_t).
Rendering Blended text is now twice as fast.
On 2018-12-11 16:39:28 +0000, Sylvain wrote:
Created attachment 3542
bench logs
Bench log with the SSE2 version:
note that kerning (the call to FT_Get_Kerning()) takes ~8300 us in both logs.
non-opaque means we have to use the alpha_table (as in all previous versions),
whereas opaque is done on the fly (the reason is that there is nothing to compute for this value).
Now SSE/NEON Blended Non-Opaque also computes the alpha table on the fly and performs 10-20% better
(and I removed the alpha_table).
When blitting a glyph, 'srcskip' is not needed: we choose the width and pitch so that srcskip is 0 and can be removed, except when the glyph is clipped.
(For fixed fonts, the pitch may be larger for decoding, so one post-processing step may be needed to shrink them.)
This bug report was migrated from our old Bugzilla tracker.
These attachments are available in the static archive:
SDL_ttf.c (SDL_ttf.c, text/x-csrc, 2018-11-26 15:01:42 +0000, 71464 bytes)
SDL_ttf.c (SDL_ttf.c, text/x-csrc, 2018-11-28 16:40:46 +0000, 73696 bytes)
SDL_ttf.c (SDL_ttf.c, text/x-csrc, 2018-11-29 13:21:09 +0000, 73732 bytes)
SDL_ttf.c (SDL_ttf.c, text/x-csrc, 2018-11-29 22:05:20 +0000, 75406 bytes)
SDL_ttf.c (SDL_ttf.c, text/x-csrc, 2018-12-03 10:12:20 +0000, 94135 bytes)
SDL_ttf.c (SDL_ttf.c, text/x-csrc, 2018-12-11 16:24:09 +0000, 102792 bytes)
SDL_ttf.c (SDL_ttf.c, text/x-csrc, 2018-12-11 21:56:02 +0000, 103540 bytes)
Reported in version: unspecified
Reported for operating system, platform: Linux, x86_64
Comments on the original bug report:
On 2018-11-06 13:10:30 +0000, Sylvain wrote:
On 2018-11-06 13:14:07 +0000, Sylvain wrote:
On 2018-11-06 14:43:26 +0000, Sam Lantinga wrote:
On 2018-11-06 14:43:59 +0000, Sam Lantinga wrote:
On 2018-11-06 15:37:43 +0000, Sylvain wrote:
On 2018-11-06 16:16:03 +0000, Sylvain wrote:
On 2018-11-07 10:25:54 +0000, Sylvain wrote:
On 2018-11-26 15:01:42 +0000, Sylvain wrote:
On 2018-11-26 15:02:18 +0000, Sylvain wrote:
On 2018-11-28 16:34:54 +0000, Sylvain wrote:
On 2018-11-28 16:40:46 +0000, Sylvain wrote:
On 2018-11-28 16:42:39 +0000, Sylvain wrote:
On 2018-11-28 16:47:39 +0000, Sylvain wrote:
On 2018-11-29 13:13:17 +0000, Sylvain wrote:
On 2018-11-29 13:21:09 +0000, Sylvain wrote:
On 2018-11-29 22:05:20 +0000, Sylvain wrote:
On 2018-11-29 22:06:46 +0000, Sylvain wrote:
On 2018-12-03 10:12:20 +0000, Sylvain wrote:
On 2018-12-03 10:13:52 +0000, Sylvain wrote:
On 2018-12-11 16:24:09 +0000, Sylvain wrote:
On 2018-12-11 16:39:28 +0000, Sylvain wrote:
On 2018-12-11 21:56:02 +0000, Sylvain wrote:
On 2018-12-12 17:11:24 +0000, Sylvain wrote:
On 2018-12-18 15:24:53 +0000, Sylvain wrote:
On 2018-12-18 15:26:29 +0000, Sylvain wrote:
On 2018-12-18 15:26:59 +0000, Sylvain wrote:
On 2019-01-31 12:52:04 +0000, Sylvain wrote: