[Repost] Java Warmup


Editor's note

This is a very interesting and informative post about Java warmup. The flame graphs provide a clear and detailed visualization of the stages of JVM startup, from initial heap allocation through JIT compilation to steady-state execution.

**Key takeaways:**

* Java warm-up is a complex process involving several stages and optimizations.
* The flame graphs illustrate these stages clearly and concisely.
* Understanding the warm-up process can help optimize Java applications and improve performance.

**Additional points:**

* Early in the warm-up, the majority of CPU time is spent in the C2 compiler.
* A significant amount of time is also spent in kernel code, including page faults and file deletion operations.
* The profiled service is essentially warmed up by the 600-630 second mark.
* The flame graphs provide valuable insight into the dynamics of Java warm-up.

**Overall, this is a well-written and informative post that gives a clear and concise understanding of Java warm-up.**

Main text

https://www.brendangregg.com/blog/2016-09-28/java-warmup.html

 

I gave a talk at JavaOne last week on flame graphs (slides here), and in Q&A was asked about Java warmup. I wanted to show some flame graphs that illustrated it, but didn't have them on hand. Here they are.

These are 30 second flame graphs from a Java microservice warming up and beginning to process load. The following 30 flame graphs (3 rows of 10) have been shrunk so small that the individual frames can't be read, but at this level you can see how processing changes over the first 10 minutes and then settles down. Click to zoom (a little):

 

 

The hues are: yellow is C++, green is Java methods, orange is kernel code, and red is other user-level code. I'm using the Linux perf profiler, which can see everything, including kernel code execution with Java context. JVM profilers (which are what almost everyone is using) cannot do this, and have blind spots.
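A note on setup, since it isn't shown in this post: for perf to walk Java stacks like this, the JVM generally needs to be started with frame pointers preserved, plus a JIT symbol map from perf-map-agent (the jmaps helper described later). A minimal sketch, where the jar name is just an example:

# JDK 8u60+ added -XX:+PreserveFramePointer, which lets perf's frame-pointer
# stack walker see Java frames all the way down into the kernel.
java -XX:+PreserveFramePointer -jar myservice.jar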

Now I'll zoom in to the first two:

Time = 0 - 30s

 

 

 

This is the first 30 seconds of Java after launch.

On the left, the Java heap is being pre-allocated by the JVM code (C++, yellow). The os::pretouch_memory() function is triggering page faults (kernel code, orange). Click Universe::initialize_heap to get a breakdown of where time (CPU cycles) is spent. Neat, huh?
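That pre-touch tower is what you get from a fixed, pre-touched heap. If you want to reproduce it, something like this (standard HotSpot flags; the 8 Gbyte heap and jar name are just examples):

# Commit and touch every heap page during JVM initialization, so these
# page faults happen up front rather than later, during request processing.
java -Xms8g -Xmx8g -XX:+AlwaysPreTouch -jar myservice.jar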

Because you're probably wondering, here's clear_page_c_e() from arch/x86/lib/clear_page_64.S:

ENTRY(clear_page_c_e)
    movl $4096,%ecx    # byte count: one 4096-byte page
    xorl %eax,%eax     # fill value: zero (in AL)
    rep stosb          # store AL to [RDI], ECX times
    ret
ENDPROC(clear_page_c_e)

It uses a REP STOSB instruction to wipe a page (4096 bytes) of memory, with the XORL just zeroing the fill register first. And that's a performance optimization added to Linux by Intel (Linux gets such micro-optimizations frequently). Now zoom back out using "Reset Zoom" (top left) or click a lower frame.
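That clear_page_c_e() variant is the one the kernel patches in when the CPU advertises enhanced REP MOVSB/STOSB (ERMS); a quick way to check whether yours does:

# Prints "erms" if the CPU supports Enhanced REP MOVSB/STOSB; otherwise the
# kernel's alternatives mechanism falls back to a rep-stosq or mov-based clear.
grep -m1 -ow erms /proc/cpuinfo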

The tower on the right has many "Interpreter" frames in red (user-mode): these are Java methods that likely haven't hit the CompileThreshold tunable yet (by default: 10000). If these methods keep being called, they'll switch from being interpreted (executed by the JVM's "Interpreter" function) to being compiled and executed natively. At that point, the compiled methods will be running from their own addresses which can be profiled and translated: in the Java flame graph, they'll be colored green.
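You can watch this interpreted-to-compiled transition from the JVM side too, which is a handy cross-check against the flame graphs (standard HotSpot flags, example jar name; note that with tiered compilation, the JDK 8 default, the per-tier thresholds effectively replace the plain CompileThreshold):

# Log each method as it gets JIT-compiled; the flood of output during
# warmup corresponds to the C2 and Interpreter towers in these profiles.
java -XX:+PrintCompilation -jar myservice.jar

# The classic single threshold only behaves as described with tiering off.
java -XX:-TieredCompilation -XX:CompileThreshold=10000 -jar myservice.jar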

The tower in the middle around C2Compiler::compile_method() shows the C2 compiler working on compiling Java just-in-time.

Time = 30 - 60s

 

 

 

We're done initializing the heap (that tower is now gone), but we're still compiling: about 55% of the CPU cycles are in the C2 compiler, and lots of Interpreter frames are still present. Click CompileBroker::compiler_thread_loop to zoom in to the compiler. I can't help but want to dig into these functions further, to look for optimizations in the compiler itself. Zooming back out...

There's a new tower, on the left, containing ParNewGenTask::work. Yes, GC has kicked in by the 30-60s profile. (Not all frames there are colored yellow, but they should be; it's just a bug in the flame graph software.)
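If you want to line these GC towers up with actual collections, JDK 8-style GC logging is the easy cross-check (the flags below are the standard ones from that era; ParNew is normally the young collector paired with CMS):

# Write timestamped GC events to gc.log; ParNew pauses there should match
# the windows where ParNewGenTask::work shows up in the profiles.
java -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:gc.log -jar myservice.jar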

See the orange frames in the top right? That's the kernel deleting files, called from java/io/UnixFileSystem::delete0(). It's called so much that its CPU footprint is showing up in the profile. How interesting. So what's going on? The answer should be down the stack, but it's still Interpreter frames. I could do another warmup run and use my tracing tools (perf-tools or bcc). But right now I haven't done that, so I don't know -- a mystery. I'd guess log rotation.
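If I were to chase it, one low-effort start is to trace delete syscalls system-wide for a window and see which files and processes are responsible (a sketch; whether the JVM lands in unlink or unlinkat may vary):

# Record file-deletion syscalls with stacks for 30 seconds, then inspect.
perf record -e syscalls:sys_enter_unlink -e syscalls:sys_enter_unlinkat -a -g -- sleep 30
perf script | less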

The frames with "[perf-79379.map]": these used to say "unknown", but a recent flame graph change defaults to the library name if the function symbol couldn't be found. In this case, it's not a library, but a map file which should have Java methods. So why weren't these symbols in it? The way I'm currently doing Java symbol translation is to take snapshots at the end of the profile, and it's possible those symbols were recompiled and moved by that point.
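For reference, those map files live in /tmp with one JIT-compiled symbol per line: a hex start address, a hex size, and the method name. Eyeballing one (the pgrep is just a convenient way to find a java PID):

# Show the first few JIT symbol entries that perf-map-agent wrote for the
# most recently started java process: START_ADDR SIZE SYMBOL per line.
head /tmp/perf-$(pgrep -n java).map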

See the "Interpreter" tower on the far left? Notice how the stacks don't begin with start_thread(), java_start(), etc.? This is likely a truncated stack, due to the default setting of Linux perf to only capture the top 127 frames. Truncated stacks break flame graphs: it doesn't know where to merge them. This will be tunable in Linux 4.8 (kernel.perf_event_max_stack). I'd bet this tower really belongs as part of the tallest Interpreter tower on the right -- if you can imagine moving the frames over there and making it wider to fit.
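Once on a kernel with that tunable, raising the limit is a one-liner (512 is just an example; deeper stacks make each sample a bit more expensive to record):

# As root: let perf capture deeper stacks than the old 127-frame limit.
sysctl -w kernel.perf_event_max_stack=512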

Time = 90+s

Going back to the montage, you can see different phases of warmup come and go, and different patterns of Java methods in green for each of them:

 

 

I was hoping to document and include more of the warmup, as each stage can be clearly seen and inspected (and there are more interesting things to describe, including a lock contention issue in CFLS_LAB::alloc() that showed up), but this post would get very long. I'll skip ahead to the warmed-up point:

Time = 600 - 630s

 

 

At the 10 minute mark, the microservice is running its production workload. Over time, the compiler tower shrinks further and we spend more time running Java code. (This image is a PNG, not an SVG.)

The small green frames on the left are more evidence of truncated stacks.

Generation

To generate these flame graphs, I ran the following at the same moment I began applying load:

for i in `seq 100 130`; do perf record -F 99 -a -g -- sleep 30; ./jmaps
    perf script > out.perf_warmup_$i.txt; echo $i done; done

The jmaps program is an unsupported helper script for automating a symbol dump using Java perf-map-agent.
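If you don't have jmaps handy, the core of such a helper is small. A rough sketch of the idea, assuming perf-map-agent is built somewhere like the path below (which is just an example; the real jmaps also handles ownership and cleanup of the map files):

# jmaps-like sketch: dump a /tmp/perf-<pid>.map JIT symbol file for every
# running java process, so "perf script" can translate Java frames.
AGENT=/opt/perf-map-agent          # example install location
for pid in $(pgrep -x java); do
    user=$(ps -o user= -p "$pid")
    # must run as the JVM's own user for the attach to succeed
    sudo -u "$user" "$AGENT/bin/create-java-perf-map.sh" "$pid"
done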

Then I post-processed with the FlameGraph software:

for i in `seq 100 130`; do ./FlameGraph/stackcollapse-perf.pl --kernel \
    < out.perf_warmup_$i.txt | grep '^java' | grep -v cpu_idle | \
    ./FlameGraph/flamegraph.pl --color=java --width=800 --minwidth=0.5 \
    --title="at $(((i - 100) * 30)) seconds" > warmup_$i.svg; echo $i done; done

This is using the older way of profiling with Linux perf: via a perf.data file and "perf script". A newer way has been in development that uses BPF for efficient in-kernel summaries, but that's a topic for another post...
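As a teaser, that newer path can look like this with the bcc profile tool, which aggregates folded stacks in kernel context and avoids writing a large perf.data file (the path and options are an example; it still relies on the perf-map-agent symbol dump for Java frames):

# Sample all CPUs at 99 Hz for 30 seconds and emit folded stacks (-f),
# which feed straight into flamegraph.pl without a perf.data step.
/usr/share/bcc/tools/profile -F 99 -f 30 > out.folded
./FlameGraph/flamegraph.pl --color=java --width=800 < out.folded > profile_bcc.svg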
