<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.amddevcentral.com/~d/styles/itemcontent.css"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:creativeCommons="http://backend.userland.com/creativeCommonsRssModule" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">

<channel>
	<title>AMD Developer Central</title>
	
	<link>http://blogs.amd.com/developer</link>
	<description>Your central resource for tools, technologies, best practices, and expert guidance to optimize your software solution performance on AMD platforms.</description>
	<lastBuildDate>Tue, 31 Jan 2012 21:51:04 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.amddevcentral.com/AmdDeveloperBlogs" /><feedburner:info uri="amddeveloperblogs" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><creativeCommons:license>http://creativecommons.org/licenses/by-nd/2.0/</creativeCommons:license><image><link>http://creativecommons.org/licenses/by-nd/2.0/</link><url>http://creativecommons.org/images/public/somerights20.gif</url><title>Some Rights Reserved</title></image><feedburner:emailServiceId>AmdDeveloperBlogs</feedburner:emailServiceId><feedburner:feedburnerHostname>http://feedburner.google.com</feedburner:feedburnerHostname><item>
		<title>SlotMaximizer extension to LLVM can help you to quickly optimize OpenCL™ applications</title>
		<link>http://feeds.amddevcentral.com/~r/AmdDeveloperBlogs/~3/Peas1M1i410/</link>
		<comments>http://blogs.amd.com/developer/2012/01/23/slotmaximizer-extension-to-llvm-can-help-you-to-quickly-optimize-opencl%e2%84%a2-applications/#comments</comments>
		<pubDate>Mon, 23 Jan 2012 16:52:35 +0000</pubDate>
		<dc:creator>Mark Ireton</dc:creator>
				<category><![CDATA[AMD APP]]></category>
		<category><![CDATA[Inside Dev Central]]></category>
		<category><![CDATA[Code Optimization]]></category>
		<category><![CDATA[compiler]]></category>
		<category><![CDATA[Fusion]]></category>
		<category><![CDATA[GPGPU]]></category>
		<category><![CDATA[heterogeneous computing]]></category>
		<category><![CDATA[multi-core]]></category>
		<category><![CDATA[OpenCL]]></category>
		<category><![CDATA[Optimization]]></category>
		<category><![CDATA[Parallel Computing]]></category>
		<category><![CDATA[Parallel Programming]]></category>

		<guid isPermaLink="false">http://blogs.amd.com/developer/?p=2310</guid>
		<description><![CDATA[I would like to introduce you to SlotMaximizer.  SlotMaximizer is a transformation tool that automatically tunes OpenCL™ kernels, helping to increase developer productivity.  It helps developers obtain increased performance, higher throughput, and better hardware utilization from their kernels with &#8230; <a href="http://blogs.amd.com/developer/2012/01/23/slotmaximizer-extension-to-llvm-can-help-you-to-quickly-optimize-opencl%e2%84%a2-applications/">Continue reading</a>]]></description>
			<content:encoded><![CDATA[<p>I would like to introduce you to SlotMaximizer.  SlotMaximizer is a transformation tool that automatically tunes OpenCL™ kernels, helping to increase developer productivity.  It helps developers obtain increased performance, higher throughput, and better hardware utilization from their kernels with minimal effort while maintaining a small, readable and maintainable code base.  SlotMaximizer lets developers focus on their original problems and algorithm strategies and leave the details of optimizing the code to the compiler. </p>
<p>In Auto-Tuning mode, SlotMaximizer attempts many potential optimizations on the kernels the developer selects.  Each optimized variant of a given kernel is executed using the actual data being processed by the application, and performance statistics for each variant are reported so the developer can select the version to use.</p>
<p>In production code the original kernel is used, and the selected configuration for the preferred optimizations is indicated to the compiler using an attribute that is provided to the developer as part of the Auto-Tuning output data.</p>
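<p>The auto-tuning idea described above can be illustrated with a minimal sketch. This is not SlotMaximizer's interface, which is not shown in this post: the two variants, the <code>auto_tune</code> helper and the timing strategy are all hypothetical stand-ins, written in plain Python rather than OpenCL. The sketch only shows the general pattern of running each candidate on the real input, checking that results agree, and reporting timings so a developer can choose.</p>

```python
import time

def scale_simple(data):
    # Baseline variant: one element per loop iteration.
    return [x * 2.0 for x in data]

def scale_unrolled(data):
    # Hypothetical "optimized" variant: four elements per loop iteration.
    out = []
    n = len(data) - len(data) % 4
    for i in range(0, n, 4):
        a, b, c, d = data[i:i + 4]
        out.extend((a * 2.0, b * 2.0, c * 2.0, d * 2.0))
    out.extend(x * 2.0 for x in data[n:])  # leftover elements
    return out

def auto_tune(variants, data, repeats=3):
    """Time every variant on the real input; return (fastest_name, timings)."""
    reference = None
    timings = {}
    for name, fn in variants.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            result = fn(data)
            best = min(best, time.perf_counter() - start)
        if reference is None:
            reference = result
        elif result != reference:
            # An optimization must never change what the kernel computes.
            raise ValueError(name + " changed the output")
        timings[name] = best
    return min(timings, key=timings.get), timings

variants = {"simple": scale_simple, "unrolled": scale_unrolled}
fastest, timings = auto_tune(variants, [float(i) for i in range(10001)])
print(fastest, timings)
```

<p>Which variant wins depends on the machine and the input, which is the reason for measuring on the application's own data instead of guessing.</p>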
<p>SlotMaximizer is already incorporated into the AMD Catalyst™ drivers as a preview and can be used by anyone developing applications using the APP SDK.  It will be turned on by default to support application execution on end-user systems later in the year.</p>
<p>Improvements continue to be made to SlotMaximizer. Download the latest beta version from <a href="http://multicorewareinc.com/index.php?option=com_content&amp;view=article&amp;id=68&amp;Itemid=64">here</a>.  You will need to register on the MulticoreWare website, but this will also let you report issues, ask questions, etc. using their forums.  The download installs over the version included with your AMD Catalyst™ drivers and includes documentation on using SlotMaximizer.</p>
<p><strong>Mark Ireton is the Product Manager for Compute Solutions at AMD.</strong> His postings are his own opinions and may not represent AMD’s positions, strategies or opinions. Links to third party sites, and references to third party trademarks, are provided for convenience and illustrative purposes only. Unless explicitly stated, AMD is not responsible for the contents of such links, and no third party endorsement of AMD or any of its products is implied.</p>
]]></content:encoded>
			<wfw:commentRss>http://blogs.amd.com/developer/2012/01/23/slotmaximizer-extension-to-llvm-can-help-you-to-quickly-optimize-opencl%e2%84%a2-applications/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		<feedburner:origLink>http://blogs.amd.com/developer/2012/01/23/slotmaximizer-extension-to-llvm-can-help-you-to-quickly-optimize-opencl%e2%84%a2-applications/</feedburner:origLink></item>
		<item>
		<title>Just Released! AMD CodeAnalyst 3.2 for Linux</title>
		<link>http://feeds.amddevcentral.com/~r/AmdDeveloperBlogs/~3/DtZkY8-MRJk/</link>
		<comments>http://blogs.amd.com/developer/2012/01/23/just-released-amd-codeanalyst-3-2-for-linux/#comments</comments>
		<pubDate>Mon, 23 Jan 2012 14:28:05 +0000</pubDate>
		<dc:creator>Gnanabaskaran</dc:creator>
				<category><![CDATA[Hard-Core Software Optimization]]></category>
		<category><![CDATA[CodeAnalyst]]></category>
		<category><![CDATA[CPU Performance Profiling]]></category>
		<category><![CDATA[Profiling]]></category>

		<guid isPermaLink="false">http://blogs.amd.com/developer/?p=2301</guid>
		<description><![CDATA[AMD CodeAnalyst for Linux® 3.2 has been released and can be downloaded from the AMD CodeAnalyst for Linux website: http://developer.amd.com/cpu/codeanalyst/codeanalystlinux Last quarter, we saw a lot of active users suggesting new enhancements, posting queries and reporting bugs. You are helping us to &#8230; <a href="http://blogs.amd.com/developer/2012/01/23/just-released-amd-codeanalyst-3-2-for-linux/">Continue reading</a>]]></description>
			<content:encoded><![CDATA[<p>AMD CodeAnalyst for Linux® 3.2 has been released and can be downloaded from the AMD CodeAnalyst for Linux website:</p>
<p><a href="http://developer.amd.com/cpu/codeanalyst/codeanalystlinux">http://developer.amd.com/cpu/codeanalyst/codeanalystlinux</a></p>
<p>Last quarter, we saw a lot of active users suggesting new enhancements, posting queries and reporting bugs. You are helping us to make AMD CodeAnalyst better. Thank you very much for your feedback.</p>
<p>In the AMD CodeAnalyst 3.2 release, we have added support to profile Java inline methods. To profile Java inline methods, AMD CodeAnalyst requires JDK version 1.7 or higher. We have added support for the new Linux distributions RHEL 5 Update 7 and RHEL 6 Update 2.</p>
<p>Multiple bugs have been fixed to improve the profiling experience for Java applications. Incorrect block analysis in the Disassembly and Module Data views has been fixed, support for nested Java inline functions has been added, and various GUI crashes in the Module Data view have been fixed.</p>
<p>Up until version 3.1, AMD CodeAnalyst was using Qt version 3.3. In this release we have migrated to Qt version 4.x. This lets us use the new features of Qt to improve the user experience.</p>
<p>Apart from adding new features, we are also committed to making our product more robust by fixing more bugs. If you have any suggestions for improving the tool, or if you run into a bug, please let us know through our <a href="http://forums.amd.com/devforum/categories.cfm?catid=218">forums</a> or in comments on this blog.</p>
]]></content:encoded>
			<wfw:commentRss>http://blogs.amd.com/developer/2012/01/23/just-released-amd-codeanalyst-3-2-for-linux/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://blogs.amd.com/developer/2012/01/23/just-released-amd-codeanalyst-3-2-for-linux/</feedburner:origLink></item>
		<item>
		<title>Most awaited AMD CodeAnalyst v3.5 for Windows (Q4 release) is now available!!!</title>
		<link>http://feeds.amddevcentral.com/~r/AmdDeveloperBlogs/~3/j7lepL0o8vY/</link>
		<comments>http://blogs.amd.com/developer/2012/01/18/most-awaited-amd-codeanalyst-v3-5-for-windows-q4-release-is-now-available/#comments</comments>
		<pubDate>Wed, 18 Jan 2012 21:39:41 +0000</pubDate>
		<dc:creator>Sharon Troia</dc:creator>
				<category><![CDATA[Inside Dev Central]]></category>
		<category><![CDATA[Code Optimization]]></category>
		<category><![CDATA[Code Profiler]]></category>
		<category><![CDATA[CodeAnalyst]]></category>
		<category><![CDATA[OpenCL]]></category>
		<category><![CDATA[Parallel Computing]]></category>

		<guid isPermaLink="false">http://blogs.amd.com/developer/?p=2285</guid>
		<description><![CDATA[AMD CodeAnalyst v3.5 for Windows has been released and can be downloaded from the AMD CodeAnalyst for Windows page at the following URL: http://developer.amd.com/cpu/codeanalyst/codeanalystwindows Based on many valuable suggestions and feedback we received from our end-users, we are excited to say that &#8230; <a href="http://blogs.amd.com/developer/2012/01/18/most-awaited-amd-codeanalyst-v3-5-for-windows-q4-release-is-now-available/">Continue reading</a>]]></description>
			<content:encoded><![CDATA[<p>AMD CodeAnalyst v3.5 for Windows has been released and can be downloaded from the AMD CodeAnalyst for Windows page at the following URL:</p>
<p><a href="http://developer.amd.com/cpu/codeanalyst/codeanalystwindows">http://developer.amd.com/cpu/codeanalyst/codeanalystwindows</a></p>
<p>Based on many valuable suggestions and feedback we received from our end-users, we are excited to say that the following new features have been added in this release:</p>
<ol>
<li>Cache Line Utilization</li>
<li>Faster data translation using multiple threads</li>
<li>OpenCL™ Kernel Source Code level support</li>
<li>Improved Source &amp; Assembly Profile display</li>
<li>Importing a standalone oclt file into a project session</li>
<li>PID/TID Filtering</li>
<li>Improved View Management for individual cores</li>
<li>Timeline Navigation</li>
<li>User-configurable maximum number of run lanes</li>
<li>An array of bug fixes</li>
</ol>
<p>We hope that this release helps you do your work efficiently and pleasantly. If you encounter any bugs or see where AMD CodeAnalyst could be improved, please reach us through our forums or by replying to this blog post.</p>
<p><strong>Jeganathan Swaminathan is a Member of Technical  Staff at AMD.</strong> His postings are his own opinions and may not represent AMD’s positions, strategies or opinions. Links to third party sites, and references to third party trademarks, are provided for convenience and illustrative purposes only. Unless explicitly stated, AMD is not responsible for the contents of such links, and no third party endorsement of AMD or any of its products is implied.</p>
<p><em>OpenCL and the OpenCL logo are trademarks of Apple Inc. used with permission by Khronos</em></p>
]]></content:encoded>
			<wfw:commentRss>http://blogs.amd.com/developer/2012/01/18/most-awaited-amd-codeanalyst-v3-5-for-windows-q4-release-is-now-available/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://blogs.amd.com/developer/2012/01/18/most-awaited-amd-codeanalyst-v3-5-for-windows-q4-release-is-now-available/</feedburner:origLink></item>
		<item>
		<title>AMD OpenCL™ APP SDK 2.6: introducing OpenCL™ 1.2 preview and Static C++ kernel language</title>
		<link>http://feeds.amddevcentral.com/~r/AmdDeveloperBlogs/~3/8yXlmPBIYWY/</link>
		<comments>http://blogs.amd.com/developer/2012/01/10/amd-opencl%e2%84%a2-app-sdk-2-6-introducing-opencl%e2%84%a2-1-2-preview-and-static-c-kernel-language/#comments</comments>
		<pubDate>Tue, 10 Jan 2012 21:30:31 +0000</pubDate>
		<dc:creator>Mark Ireton</dc:creator>
				<category><![CDATA[AMD APP]]></category>
		<category><![CDATA[Inside Dev Central]]></category>
		<category><![CDATA[APU]]></category>
		<category><![CDATA[GPGPU]]></category>
		<category><![CDATA[heterogeneous computing]]></category>
		<category><![CDATA[multi-core]]></category>
		<category><![CDATA[OpenCL]]></category>
		<category><![CDATA[Parallel Computing]]></category>
		<category><![CDATA[Parallel Programming]]></category>

		<guid isPermaLink="false">http://blogs.amd.com/developer/?p=2281</guid>
		<description><![CDATA[Several  weeks or so ago in my prior blog I included a teaser on the imminent availability of not only the latest AMD APP SDK, but also that we would be releasing a preview of many OpenCL™ 1.2 features.  In &#8230; <a href="http://blogs.amd.com/developer/2012/01/10/amd-opencl%e2%84%a2-app-sdk-2-6-introducing-opencl%e2%84%a2-1-2-preview-and-static-c-kernel-language/">Continue reading</a>]]></description>
			<content:encoded><![CDATA[<p>Several weeks ago, in my <a href="../2011/11/28/amd-opencl%E2%84%A2-app-sdk-preview/">prior blog</a>, I included a teaser about the imminent availability of the latest AMD APP SDK and about our plan to release a preview of many OpenCL™ 1.2 features.  In fact, so many great new features are being introduced that I am struggling to order them.  The new SDK and other supporting drivers and files discussed in this blog can be downloaded <a href="http://developer.amd.com/SDKS/AMDAPPSDK/DOWNLOADS/Pages/default.aspx">here</a>.</p>
<p>Let’s start with the OpenCL 1.2 preview. Key new features supported in the preview are host access flags for memory objects, greater flexibility for 1D and 2D image and buffer representations, memory object migration, a new generalized image creation API, and more (see the list below). These preview features are currently only supported on the GPU, but during the first half of 2012 expect a complete implementation of all required OpenCL 1.2 features and select extensions on both CPU and GPU.  Start making your plans now to join us <a href="http://developer.amd.com/afds">at AFDS ’12 in Bellevue, WA</a> this coming June to see demonstrations using the new capabilities.</p>
<p>OpenCL 1.2 aside, one of the new features that I am really excited about is the preview of the OpenCL Static C++ Kernel Language Extension. This is a version of C++ that can be used to write OpenCL kernels. “This is huge!” was a comment from one of our ISVs when I disclosed this capability during a meeting at SC’11.  Key capabilities of OpenCL Static C++ include kernel and function overloading, kernel and member templates, inheritance, friend classes, and more.  For full details on supported and unsupported C++ capabilities, please read the specification for the OpenCL Static <a href="http://developer.amd.com/sdks/AMDAPPSDK/assets/cplus_kernel_language.docx">C++ Kernel Language Extension</a>.</p>
<p>Both the OpenCL 1.2 and the OpenCL Static C++ Kernel Language Extension previews are supported in a special driver release that can be downloaded from the OpenCL SDK download page.</p>
<p>Continuing our commitment to make OpenCL easier to use, we are including the new Khronos C++ Wrapper API with this SDK.  This new API brings many benefits to developers doing host-side programming in C++.  It is no longer necessary to specify platform, context, or queue OpenCL host-side objects, which greatly simplifies host-side code; programs are automatically built on creation for all devices in the associated context; and error checking is included by default.  Finally, the new make kernel function replaces the need for many OpenCL commands and helps improve type checking and overall code robustness.  All of these changes taken together mean that it is now possible to write a complete OpenCL Hello World, including host-side code, in just 20 lines of C++.</p>
<p>This  release was designed to provide comprehensive performance improvements, both on the CPU side with key run-time performance improvements and support for both AVX and FMA4 instructions, and on the GPU side with the addition of asynchronous data copy and kernel execution (preview), support for atomic counters, and support for the cl_amd_media_ops2 OpenCL extension.</p>
<p>Starting with AMD Catalyst™ software 11.11, we now also include the OpenCL runtime in the AMD Catalyst software Linux drivers.  This ensures that as many end users as possible are able to run OpenCL applications.</p>
<p>Key features supported in SDK 2.6 and the AMD Catalyst™ software 11.12 drivers include:</p>
<ul>
<li>Inclusion of the Khronos C++ Wrapper API</li>
<li>OpenCL runtime integration into Linux in addition to Windows® AMD Catalyst™ drivers.</li>
<li>Multi GPU support on Linux platforms.</li>
<li>PX5 support.</li>
<li>Support for AVX extensions for CPUs that support this extension.</li>
<li>Support for FMA4 extensions in OpenCL built-in function libraries for CPUs that support this extension.</li>
<li>Kernel reflection: query kernel parameters to enable use of OpenCL kernels in data-driven applications.</li>
<li>Support for atomic counters on Fusion devices.</li>
<li>cl_khr_fp64 support on AMD Radeon™ HD 69xx graphics devices.</li>
<li>A redesigned OpenCL run-time on the CPU that helps improve performance significantly.</li>
<li>Support for the cl_amd_media_ops2 extension, exposing hardware capabilities for accelerating image related processing.</li>
<li>Async copies preview (set environment variable GPU_ASYNC_MEM_COPY=2 to enable).</li>
</ul>
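<p>As a rough illustration of the last item, the sketch below sets GPU_ASYNC_MEM_COPY=2 for a single child process instead of for the whole session. The helper names are hypothetical, and the child process here is only a stand-in that echoes the variable, not a real OpenCL application. It also assumes the variable needs to be in place before the application starts, which this post does not state.</p>

```python
import os
import subprocess
import sys

def env_with_async_copies(base_env=None):
    """Return a copy of the environment with the async copy preview enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["GPU_ASYNC_MEM_COPY"] = "2"  # value quoted in the feature list above
    return env

def run_with_async_copies(command):
    """Run `command` (a list of arguments) with the variable set for that process only."""
    return subprocess.run(command, env=env_with_async_copies(), check=True)

# Stand-in for an OpenCL application: a child process that echoes the variable.
child = [sys.executable, "-c", "import os; print(os.environ['GPU_ASYNC_MEM_COPY'])"]
run_with_async_copies(child)
```

<p>Scoping the variable to one process keeps a preview feature from affecting other OpenCL applications launched from the same shell.</p>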
<p>The OpenCL 1.2 preview includes the following capabilities:</p>
<ul>
<li>Host access flags for memory objects enables more efficient buffer handling</li>
<li>Pattern based GPU buffer and image initialization eliminates need for certain buffer/image transfers</li>
<li>Memory object migration supports transferring buffers before they are needed</li>
<li>New generalized image creation API</li>
<li>Enhanced image/buffer map operations</li>
<li>OpenCL 1.2 CPU device partition including partition of a CPU after addition to a context</li>
<li>Generalized 1D and 2D images, image arrays, and image &lt;-&gt; buffer interop</li>
</ul>
<p>More details on how to access the OpenCL 1.2 preview are provided on the <a href="http://developer.amd.com/appsdk">OpenCL SDK 2.6 developer  page</a>.</p>
<p>gDEBugger version 6.1 is a major performance and robustness improvement over version 6.0 and can be downloaded for use with this SDK from <a href="http://developer.amd.com/gDEBugger">http://developer.amd.com/gDEBugger</a>.</p>
<ul>
<li>Integrated with Microsoft® Visual Studio®</li>
<li>Stand alone version</li>
<li>Registration no longer necessary for obtaining a license</li>
</ul>
<p>APP KernelAnalyzer v 2.0:</p>
<ul>
<li>Support for AMD Radeon™ HD 7000 series GPUs (compilation only – no analysis)</li>
<li>Support for AMD Catalyst™ software revisions through 11.11</li>
<li>Support for compiling kernels with the installed driver (select Installed Driver under the CAL version in the Options panel)</li>
<li>Format and Target Object Code are now separated.</li>
</ul>
<p>APP Profiler v2.4 includes several key new features, including:</p>
<ul>
<li>A kernel occupancy analyzer which estimates, for each kernel dispatch, the number of in-flight wavefronts on a compute unit as a percentage of the theoretical maximum number of wavefronts that the compute unit can support. In addition to reporting the occupancy percentage, the profiler can display a report which can help the developer to achieve a higher occupancy percentage.</li>
<li>The ability to navigate from the API trace to the source code that called an OpenCL API.</li>
<li>OpenCL API analysis which provides performance suggestions to the developer.</li>
<li>The ability to filter which OpenCL APIs will be traced.</li>
<li>Several UI enhancements, including the ability to rename sessions from the Session Explorer window, and the ability to automatically delete profiler sessions when closing a Microsoft® Visual Studio® solution.</li>
<li>Preview: Support for profiling with AMD Radeon™ HD 7000 series GPUs (requires AMD APP SDK v2.6 and an AMD Catalyst™ software version that supports this hardware).</li>
</ul>
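<p>The occupancy figure described in the first item above is a simple ratio, sketched below. The function name, kernel names and wavefront counts are made up for illustration; they are not limits of any particular GPU, and the profiler's own estimate of in-flight wavefronts is not reproduced here.</p>

```python
def occupancy_percent(in_flight_wavefronts, max_wavefronts):
    """In-flight wavefronts on a compute unit as a percentage of the theoretical maximum."""
    if max_wavefronts == 0:
        raise ValueError("max_wavefronts must be positive")
    # Clamp so the result always stays between 0 and 100.
    clamped = min(max(in_flight_wavefronts, 0), max_wavefronts)
    return 100.0 * clamped / max_wavefronts

# Hypothetical kernel dispatches: (name, in-flight wavefronts, theoretical maximum).
dispatches = [("blur", 16, 32), ("reduce", 24, 32)]
for name, in_flight, limit in dispatches:
    print(name, occupancy_percent(in_flight, limit))  # blur 50.0, reduce 75.0
```

<p>A low percentage for a dispatch points to some resource keeping additional wavefronts from being scheduled, which is what the profiler's occupancy report is meant to help the developer track down.</p>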
<p>On a final note – remember to install the AMD Catalyst™ software 12.1 drivers when they are released in January – there are significant new capabilities that are going to be made available then.  But more on that later.</p>
<p><strong>Mark Ireton is the Product Manager for Compute Solutions at AMD.</strong> His postings are his own opinions and may not represent AMD’s positions, strategies or opinions. Links to third party sites, and references to third party trademarks, are provided for convenience and illustrative purposes only. Unless explicitly stated, AMD is not responsible for the contents of such links, and no third party endorsement of AMD or any of its products is implied.</p>
]]></content:encoded>
			<wfw:commentRss>http://blogs.amd.com/developer/2012/01/10/amd-opencl%e2%84%a2-app-sdk-2-6-introducing-opencl%e2%84%a2-1-2-preview-and-static-c-kernel-language/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		<feedburner:origLink>http://blogs.amd.com/developer/2012/01/10/amd-opencl%e2%84%a2-app-sdk-2-6-introducing-opencl%e2%84%a2-1-2-preview-and-static-c-kernel-language/</feedburner:origLink></item>
		<item>
		<title>New Heterogeneous Compute Tools from MulticoreWare Inc.</title>
		<link>http://feeds.amddevcentral.com/~r/AmdDeveloperBlogs/~3/uIkmrRSIjuA/</link>
		<comments>http://blogs.amd.com/developer/2011/12/13/new-heterogeneous-compute-tools-from-multicoreware-inc/#comments</comments>
		<pubDate>Tue, 13 Dec 2011 19:25:38 +0000</pubDate>
		<dc:creator>Sharon Troia</dc:creator>
				<category><![CDATA[Inside Dev Central]]></category>
		<category><![CDATA[Developer tools]]></category>
		<category><![CDATA[GPGPU]]></category>
		<category><![CDATA[heterogeneous computing]]></category>

		<guid isPermaLink="false">http://blogs.amd.com/developer/?p=2249</guid>
		<description><![CDATA[We are excited to announce the beta availability of three new heterogeneous compute tools from MulticoreWare Inc (MCW).  These tools provide capabilities such as global memory for accelerators, global task management and path analysis to help developers achieve maximum benefit &#8230; <a href="http://blogs.amd.com/developer/2011/12/13/new-heterogeneous-compute-tools-from-multicoreware-inc/">Continue reading</a>]]></description>
			<content:encoded><![CDATA[<p>We are excited to announce the beta availability of three new heterogeneous compute tools from MulticoreWare Inc (MCW).  These tools provide capabilities such as global memory for accelerators, global task management and path analysis to help developers achieve maximum benefit from their OpenCL™ investment.</p>
<p>As you may be well aware, here at AMD we are continuing our commitment to provide a rich and comprehensive set of tools to help developers extract the benefits of heterogeneous computing platforms. We firmly believe industry standards, such as OpenCL, help achieve true freedom of choice in GPU compute platforms. In order to drive further adoption of these standards, it is critical that we enable our software ecosystem partners, including tool vendors such as MulticoreWare Inc., to fully support OpenCL. We all share a common set of goals: helping developers extract maximum performance and achieve overall productivity on heterogeneous compute platforms.</p>
<p>Here is a short summary of each MCW tool:</p>
<ol>
<li>Global Memory for Accelerators (GMAC) enables development of code for CPUs and GPUs in a single pointer space, without requiring the developer to be aware of, or manage, the different memory domains and allocations.</li>
<li>Task Manager™ ensures effective sequencing of operations across both CPUs and GPUs, with resulting optimization of memory accesses, vector operations and more.</li>
<li>Parallel Path Analyzer (PPA) provides an interactive drill-down and Gantt-chart view of the execution time of various OpenCL kernels within a specific code-base. This can be used to optimize kernels within the critical path to achieve better performance.</li>
</ol>
<p>These are beta tools so we need your help to make sure they stand up to real world use.  We are working on a reward system for developers who give valuable feedback.  <span style="text-decoration: underline">Details will be coming soon</span>! </p>
<p>How to participate:</p>
<ol>
<li>Register at <a href="http://www.multicorewareinc.com/">www.multicorewareinc.com</a></li>
<li>Download and use the tool(s)</li>
<li>Share your comments, ask questions, or report issues in the <a href="http://www.multicorewareinc.com/index.php?option=com_kunena&amp;view=entrypage&amp;defaultmenu=78&amp;Itemid=76">interactive user forum</a></li>
</ol>
<p>By the way, did I mention that these tools are free for use on AMD platforms, just like all of the AMD tools? We look forward to hearing from you.</p>
<p>-Milind Kukanur</p>
<p><strong><em>Milind Kukanur is the Product Manager for Software Tools at AMD.</em></strong><em> His postings are his own opinions and may not represent AMD’s positions, strategies or opinions. Links to third party sites, and references to third party trademarks, are provided for convenience and illustrative purposes only. Unless explicitly stated, AMD is not responsible for the contents of such links, and no third party endorsement of AMD or any of its products is implied.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://blogs.amd.com/developer/2011/12/13/new-heterogeneous-compute-tools-from-multicoreware-inc/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://blogs.amd.com/developer/2011/12/13/new-heterogeneous-compute-tools-from-multicoreware-inc/</feedburner:origLink></item>
		<item>
		<title>AMD OpenCL™ APP SDK Preview</title>
		<link>http://feeds.amddevcentral.com/~r/AmdDeveloperBlogs/~3/q9HpmZCVLQg/</link>
		<comments>http://blogs.amd.com/developer/2011/11/28/amd-opencl%e2%84%a2-app-sdk-preview/#comments</comments>
		<pubDate>Tue, 29 Nov 2011 00:23:08 +0000</pubDate>
		<dc:creator>Mark Ireton</dc:creator>
				<category><![CDATA[AMD APP]]></category>
		<category><![CDATA[Inside Dev Central]]></category>
		<category><![CDATA[AMD Developer Inside Track]]></category>
		<category><![CDATA[APU]]></category>
		<category><![CDATA[Fusion]]></category>
		<category><![CDATA[GPGPU]]></category>
		<category><![CDATA[heterogeneous computing]]></category>
		<category><![CDATA[OpenCL]]></category>
		<category><![CDATA[Parallel Computing]]></category>
		<category><![CDATA[Parallel Programming]]></category>

		<guid isPermaLink="false">http://blogs.amd.com/developer/?p=2246</guid>
		<description><![CDATA[It’s that time again, AMD APP SDK 2.6 is almost upon us and I would like to take this opportunity to introduce some key changes as we continue driving OpenCL™ development and continually roll out new features and enhancements. You &#8230; <a href="http://blogs.amd.com/developer/2011/11/28/amd-opencl%e2%84%a2-app-sdk-preview/">Continue reading</a>]]></description>
			<content:encoded><![CDATA[<p>It’s that time again: AMD APP SDK 2.6 is almost upon us, and I would like to take this opportunity to introduce some key changes as we continue driving OpenCL™ development and rolling out new features and enhancements.</p>
<p>You will have seen the <a href="http://www.khronos.org/news/press/khronos-releases-opencl-1.2-specification">recent announcement</a> of the ratification of the OpenCL 1.2 standard, to which AMD was one of the key contributors.  Demonstrating our commitment to OpenCL, SDK 2.6 (available on December 13) will be previewing core features of OpenCL 1.2.  OpenCL continues to increase in popularity as I discussed in my <a href="http://blogs.amd.com/developer/2011/11/14/opencl%E2%84%A2-the-leading-choice-for-gpgpu-developers/">prior blog</a>, and just this week Altera has also announced <a href="http://www.altera.com/corporate/news_room/releases/2011/products/nr-opencl.html">support for OpenCL</a>.  During the first half of next year we will be rolling out a complete implementation of OpenCL 1.2 for both CPU and GPU and demonstrating the latest capabilities.</p>
<p>As you are probably already aware, many of the key features of OpenCL are incorporated into the OpenCL run-time, which since last summer has been delivered on Windows® with the monthly Catalyst driver releases as well as with the SDK.  As of the Catalyst 11.11 driver release this month, the OpenCL run-time for Linux is now also included in the Linux Catalyst drivers.  This heralds an important change in how, and how often, we can provide updates for OpenCL.  As we move into 2012, in addition to providing two to three SDK releases a year with updated samples, APIs, documentation, etc., we will also be upgrading our OpenCL solution more frequently through the regular monthly Catalyst driver updates.  Look for a new section in the Catalyst driver release notes specific to OpenCL capabilities.  Including the OpenCL run-time in the Catalyst drivers is a key component in ensuring that the millions of PCs out there that include AMD GPU technology can be leveraged by your OpenCL-enabled application.</p>
<p>Of course there are many new enhancements that will be available with SDK 2.6 and Catalyst 11.12.  Join me here again in mid-December to read more.</p>
<p><strong>Mark Ireton is the Product Manager for Compute Solutions at AMD.</strong> His postings are his own opinions and may not represent AMD’s positions, strategies or opinions. Links to third party sites, and references to third party trademarks, are provided for convenience and illustrative purposes only. Unless explicitly stated, AMD is not responsible for the contents of such links, and no third party endorsement of AMD or any of its products is implied.</p>
]]></content:encoded>
			<wfw:commentRss>http://blogs.amd.com/developer/2011/11/28/amd-opencl%e2%84%a2-app-sdk-preview/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		<feedburner:origLink>http://blogs.amd.com/developer/2011/11/28/amd-opencl%e2%84%a2-app-sdk-preview/</feedburner:origLink></item>
		<item>
		<title>OpenCL™: the leading choice for GPGPU developers</title>
		<link>http://feeds.amddevcentral.com/~r/AmdDeveloperBlogs/~3/gbAHTRku_58/</link>
		<comments>http://blogs.amd.com/developer/2011/11/14/opencl%e2%84%a2-the-leading-choice-for-gpgpu-developers/#comments</comments>
		<pubDate>Mon, 14 Nov 2011 20:28:13 +0000</pubDate>
		<dc:creator>Mark Ireton</dc:creator>
				<category><![CDATA[AMD APP]]></category>
		<category><![CDATA[Inside Dev Central]]></category>
		<category><![CDATA[AMD Developer Inside Track]]></category>
		<category><![CDATA[APU]]></category>
		<category><![CDATA[Fusion]]></category>
		<category><![CDATA[GPGPU]]></category>
		<category><![CDATA[heterogeneous computing]]></category>
		<category><![CDATA[OpenCL]]></category>
		<category><![CDATA[Parallel Computing]]></category>
		<category><![CDATA[Parallel Programming]]></category>

		<guid isPermaLink="false">http://blogs.amd.com/developer/?p=2240</guid>
		<description><![CDATA[It&#8217;s quite amazing how far OpenCL™ has come these last two years, further than many people realize. A recent survey by Evans Data Corporation shows the inevitable success of OpenCL. In this survey respondents ranked the most popular APIs for &#8230; <a href="http://blogs.amd.com/developer/2011/11/14/opencl%e2%84%a2-the-leading-choice-for-gpgpu-developers/">Continue reading</a>]]></description>
			<content:encoded><![CDATA[<p>It&#8217;s quite amazing how far OpenCL™ has come these last two years, further than many people realize. A recent survey by Evans Data Corporation shows the inevitable success of OpenCL. In this survey respondents ranked the most popular APIs for multi-threaded development.  OpenCL is ranked #2 in North America, #2 in Europe and Middle East, and #3 in Asia-Pacific countries. The highest ranking that CUDA achieves, in any region, is #5.  Let me say that again, OpenCL is more popular than CUDA in all the high-tech regions of the world &#8211; everywhere!</p>
<p>Let me reflect for a moment on the factors that I believe have led to this. OpenCL has always been an open standard driven by key semiconductor and software companies, but this in itself is not enough.  No, in my mind, the real change is that the promise of multi-vendor cross-platform support is now a tangible reality.  AMD, Intel, and Nvidia already have OpenCL SDKs publicly available, with AMD&#8217;s SDK supporting both GPU and x86 CPU. Furthermore, “…ARM is committed to OpenCL…..With our hardware architecture, and (just as importantly) our software architecture, we are supporting all the GPU computing APIs that are needed by the industry to take best advantage of heterogeneous computing” according to <a href="http://blogs.arm.com/multimedia/515-arm-discusses-future-in-keynote-heterogeneous-computing-cpus-and-gpus">Jem Davies</a><sup>1</sup>.  Software vendors are seeing the opportunity &#8212; already over <a href="http://www.amd.com/us/press-releases/Pages/amd-fusion-apus-acc-mr-50-app-2011mar07.aspx">50 leading applications are accelerated by AMD Fusion APUs</a><sup>2</sup>.</p>
<p>The other key element, of course, is the increasing availability of amazing compute capability in mainstream products. A few weeks ago my wife bought a new laptop containing an AMD A6-3400M APU with AMD Radeon™ HD graphics.  This platform is currently listed by HP as “Starting from $599”<sup>3</sup>.  AMD A-Series APUs can deliver supercomputer-like performance of more than <a href="http://www.amd.com/us/press-releases/Pages/amd-a-series-desktop-2011june30.aspx">500 GFLOPS of raw compute performance</a><sup>4</sup>, and data transfer rates between the CPU and the GPU can be as high as <a href="http://www.amd.com/us/press-releases/Pages/app-sdk-2011aug08.aspx">15GB per second</a><sup>5</sup>. But perhaps more importantly, with the exception of the desktop enthusiast space, AMD’s client roadmap is entirely based on APUs.  AMD had already shipped over <a href="http://www.amd.com/us/press-releases/Pages/amd-boosts-fusion-apus-2011aug22.aspx">12 million APUs</a><sup>6</sup> as of August of this year, and we&#8217;re still counting!</p>
<p>Here&#8217;s to a strong future for OpenCL!</p>
<p><sup>1</sup>Jem Davies, Fellow &amp; VP of Technology in the Media Processing Division of ARM</p>
<p><strong><a href="http://blogs.arm.com/multimedia/515-arm-discusses-future-in-keynote-heterogeneous-computing-cpus-and-gpus">http://blogs.arm.com/multimedia/515-arm-discusses-future-in-keynote-heterogeneous-computing-cpus-and-gpus</a></strong></p>
<p><sup>2</sup><a href="http://www.amd.com/us/press-releases/Pages/amd-fusion-apus-acc-mr-50-app-2011mar07.aspx">http://www.amd.com/us/press-releases/Pages/amd-fusion-apus-acc-mr-50-app-2011mar07.aspx</a></p>
<p><sup>3</sup><a href="http://www.shopping.hp.com/">http://www.shopping.hp.com</a>, October 21<sup>st</sup> 2011.</p>
<p><sup>4</sup><a href="http://www.amd.com/us/press-releases/Pages/amd-a-series-desktop-2011june30.aspx">http://www.amd.com/us/press-releases/Pages/amd-a-series-desktop-2011june30.aspx</a></p>
<p><sup>5</sup>A8-3800 with Radeon™ HD 6550D graphics, 8GB DDR3-1333.  <a href="http://www.amd.com/us/press-releases/Pages/app-sdk-2011aug08.aspx">http://www.amd.com/us/press-releases/Pages/app-sdk-2011aug08.aspx</a></p>
<p><sup>6</sup><a href="http://www.amd.com/us/press-releases/Pages/amd-boosts-fusion-apus-2011aug22.aspx">http://www.amd.com/us/press-releases/Pages/amd-boosts-fusion-apus-2011aug22.aspx</a></p>
<p><strong>Mark Ireton is the Product Manager for Compute Solutions at AMD.</strong> His postings are his own opinions and may not represent AMD’s positions, strategies or opinions. Links to third party sites, and references to third party trademarks, are provided for convenience and illustrative purposes only. Unless explicitly stated, AMD is not responsible for the contents of such links, and no third party endorsement of AMD or any of its products is implied.</p>
]]></content:encoded>
			<wfw:commentRss>http://blogs.amd.com/developer/2011/11/14/opencl%e2%84%a2-the-leading-choice-for-gpgpu-developers/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		<feedburner:origLink>http://blogs.amd.com/developer/2011/11/14/opencl%e2%84%a2-the-leading-choice-for-gpgpu-developers/</feedburner:origLink></item>
		<item>
		<title>CodeAnalyst Linux 3.1 Released !</title>
		<link>http://feeds.amddevcentral.com/~r/AmdDeveloperBlogs/~3/GMFi7EDA6AQ/</link>
		<comments>http://blogs.amd.com/developer/2011/11/08/codeanalyst-linux-3-1-released/#comments</comments>
		<pubDate>Tue, 08 Nov 2011 05:23:25 +0000</pubDate>
		<dc:creator>Gnanabaskaran</dc:creator>
				<category><![CDATA[Hard-Core Software Optimization]]></category>
		<category><![CDATA[CodeAnalyst]]></category>
		<category><![CDATA[OProfile]]></category>
		<category><![CDATA[Profiler]]></category>

		<guid isPermaLink="false">http://blogs.amd.com/developer/?p=2230</guid>
		<description><![CDATA[AMD CodeAnalyst software for Linux® 3.1 has been released and can be downloaded from the AMD CodeAnalyst for Linux website: http://developer.amd.com/cpu/codeanalyst/codeanalystlinux Last quarter, a lot of active users suggested new enhancements, posted queries and bugs. You are helping us to &#8230; <a href="http://blogs.amd.com/developer/2011/11/08/codeanalyst-linux-3-1-released/">Continue reading</a>]]></description>
			<content:encoded><![CDATA[<p>AMD CodeAnalyst software for Linux® 3.1 has been released and can be downloaded from the AMD CodeAnalyst for Linux website:</p>
<p><a href="http://developer.amd.com/cpu/codeanalyst/codeanalystlinux">http://developer.amd.com/cpu/codeanalyst/codeanalystlinux</a></p>
<p>Last quarter, a lot of active users suggested new enhancements, posted queries and bugs. You are helping us to make AMD CodeAnalyst software better. Thank you very much for your feedback.</p>
<p>In this AMD CodeAnalyst 3.1 release, we have enabled public support for “Comal” platforms, so users can now profile their applications on “Trinity” processor-based “Comal” platforms. We have also added support for a new Linux distribution, Ubuntu 11.10.</p>
<p>In the previous release, we redesigned the data navigation tabs. In this release, we have made some enhancements and fixed quite a few bugs in the “Source” view tab, all designed to improve the user experience while analyzing the profile data.</p>
<p>We have also added support for profiling applications compiled with the Open64 compiler suite. Users can now use AMD CodeAnalyst to profile and analyze the performance characteristics of applications built with the Open64 compilers.</p>
<p>Apart from adding new features, we are also committed to making our product highly robust by fixing bugs. If you have any suggestions for improving the tool or if you encounter a bug, please let us know through our <a href="http://forums.amd.com/devforum/categories.cfm?catid=218">forums</a> or in the comments on this blog.</p>
]]></content:encoded>
			<wfw:commentRss>http://blogs.amd.com/developer/2011/11/08/codeanalyst-linux-3-1-released/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		<feedburner:origLink>http://blogs.amd.com/developer/2011/11/08/codeanalyst-linux-3-1-released/</feedburner:origLink></item>
		<item>
		<title>A new NUMA option for Windows® on the Hotspot JVM</title>
		<link>http://feeds.amddevcentral.com/~r/AmdDeveloperBlogs/~3/1QGgAl1pQ7c/</link>
		<comments>http://blogs.amd.com/developer/2011/10/27/a-new-numa-option-for-windows%c2%ae-on-the-hotspot-jvm/#comments</comments>
		<pubDate>Thu, 27 Oct 2011 22:19:54 +0000</pubDate>
		<dc:creator>Tom Deneau</dc:creator>
				<category><![CDATA[AMD Java Labs]]></category>
		<category><![CDATA[NUMA]]></category>

		<guid isPermaLink="false">http://blogs.amd.com/developer/?p=2207</guid>
		<description><![CDATA[Here we discuss a new option that AMD contributed to the Hotspot JVM that helps improve performance on some applications on NUMA systems running Windows® OS. On some applications and heap configurations we saw up to a 2.5X performance improvement. &#8230; <a href="http://blogs.amd.com/developer/2011/10/27/a-new-numa-option-for-windows%c2%ae-on-the-hotspot-jvm/">Continue reading</a>]]></description>
			<content:encoded><![CDATA[<p>Here we discuss a new option that AMD contributed to the Hotspot JVM that helps improve performance on some applications on NUMA systems running Windows® OS. On some applications and heap configurations we saw up to a 2.5X performance improvement.</p>
<table style="width: 350px;" cellpadding="4" align="right">
<tbody>
<tr>
<td><em><a rel="attachment wp-att-2219" href="http://blogs.amd.com/developer/2011/10/27/a-new-numa-option-for-windows%c2%ae-on-the-hotspot-jvm/numa_blog_diagram/"><img style="border: 1px solid #cccccc;" title="numa_blog_diagram" src="http://blogs.amd.com/developer/files/2011/10/numa_blog_diagram.png" border="0" alt="" width="350" height="191" /></a></em></td>
</tr>
</tbody>
</table>
<h3>What is NUMA?</h3>
<p>Many multiprocessor systems today, including those designed with AMD processors, use a Non-Uniform Memory Access (NUMA) memory design. In a NUMA design, the total set of processor cores is divided into NUMA nodes. Each NUMA node has its own local memory controller with its own local memory, and the processors in a node can also access, in a cache-coherent manner, the memory owned by the other NUMA nodes. All the memory appears in a single address space. Generally the time to access local node memory is less than the time to access remote node memory, hence the name NUMA.</p>
<p> Now consider an application that has enough software threads that it will use the cores from many (or all) NUMA nodes. The following general guidelines for memory allocation can help performance in a NUMA system:</p>
<ol>
<li>Memory regions that are only accessed by the cores in one node should be allocated from that node.</li>
<li>Memory regions that are accessed globally by cores on many nodes should be interleaved across all those nodes.</li>
</ol>
<p>The first guideline seems obvious, since the access time to local memory is lower. However, in many NUMA designs the difference between remote and local access time is not large, so this guideline can be less critical.</p>
<p>A more important performance consideration can be to avoid having all the cores competing for the memory on one node. The memory controller on the heavily accessed node can be a bottleneck. Both guidelines above guard against this condition.</p>
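<p>As a rough illustration of the bottleneck, the toy model below counts how many pages each node's memory controller would have to serve under two placements. The function name, page count and node count are made up for the example, and the model ignores caches, access frequency and latency, so treat it as a sketch of the idea rather than a measurement.</p>

```python
from collections import Counter


def controller_load(num_pages, num_nodes, placement):
    """Count how many pages each node's memory controller serves (toy model)."""
    if placement == "single":
        # Every page lives on node 0, so one controller serves all traffic.
        nodes = [0] * num_pages
    elif placement == "interleaved":
        # Pages are dealt round-robin across the nodes.
        nodes = [page % num_nodes for page in range(num_pages)]
    else:
        raise ValueError("unknown placement: " + placement)
    counts = Counter(nodes)
    return [counts.get(node, 0) for node in range(num_nodes)]


print("single-node placement:", controller_load(1024, 4, "single"))
print("interleaved placement:", controller_load(1024, 4, "interleaved"))
```

<p>With one node holding every page, a single controller sees all of the traffic; with interleaving, each of the four controllers sees a quarter of it.</p>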
<h3>NUMA and the Java Heap</h3>
<p>All Java objects are allocated from the heap and are subject to being moved during garbage collection. How is the Java heap allocated by the JVM? The heap generally consists of contiguous virtual address space, but how it is mapped to physical memory depends on the OS’s underlying allocation policy and on whether the JVM does anything to override that policy.</p>
<p>The default allocation policies for Linux® and Windows OSs are quite different.</p>
<ul>
<li>Linux uses a “first touch” policy which means it will try to allocate on the node on which the core that first touched the page resides.</li>
<li>Windows uses a “process home node” policy. Each process is assigned a home node at random. The OS will try to allocate memory on that home node first, even if the core that is touching the memory is running on some other node.</li>
</ul>
<p>The thread allocation policy should also be noted here and is similar on both OSes:</p>
<ul>
<li>The first thread will be scheduled on the least busy node (the “home node” on Windows) and subsequent threads will also try to schedule on that same node until that node is overloaded and other nodes are idle.</li>
</ul>
<p>From the above, we see that on Linux the way the Java Heap gets allocated depends on the way the threads start up and touch memory, and on the total size of the heap.</p>
<ul>
<li>If a large number of threads (enough to be scheduled across several NUMA nodes) start up early and start touching memory, the heap will tend to be interleaved across the NUMA nodes from the multiple threads touching memory.</li>
<li>If a small number of threads (in the limit, a single thread) start up and touch enough memory before the larger number of threads spin up, the heap will tend to be allocated on one node first, then overflow to the next node. If the total size of the heap is close to the total amount of physical memory, the heap will tend to be spread across nodes; otherwise, it will tend to be heavily weighted toward the startup nodes. In addition, even if the heap is spread across nodes, very large ranges of virtual address space will be mapped to the same node, which can be very different from the recommendation that memory be interleaved across the nodes.</li>
</ul>
<p>On Windows, we always get the behavior described in the second bullet above, because of the “process home node” allocation policy.</p>
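<p>The difference between the two default policies can be sketched with a small toy model. Given the node of the core that first touches each page, it returns the node each page would land on. The function and policy names are invented for this example, and the model deliberately ignores node capacity and overflow to the next node, which the text above describes.</p>

```python
def place_pages(touch_nodes, policy, home_node=0):
    """Return the node chosen for each page under a simplified OS policy.

    touch_nodes[i] is the node of the core that first touches page i.
    Toy model only: node capacity and overflow are not simulated.
    """
    if policy == "first_touch":
        # Linux-style default: allocate on the toucher's node.
        return list(touch_nodes)
    if policy == "home_node":
        # Windows-style default: prefer the process home node.
        return [home_node for _ in touch_nodes]
    raise ValueError("unknown policy: " + policy)


# Many threads spread over 4 nodes touch the heap early.
many_threads = [0, 1, 2, 3, 0, 1, 2, 3]
# A single startup thread on node 0 touches the heap first.
one_thread = [0] * 8

print(place_pages(many_threads, "first_touch"))
print(place_pages(one_thread, "first_touch"))
print(place_pages(many_threads, "home_node"))
```

<p>Only the first case ends up spread across the nodes. Under the home-node policy the pages stay on one node regardless of which cores touch them.</p>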
<h3>Java Deployment Workarounds</h3>
<p>Are there things that can be done at Java deployment time to alleviate in particular the bottleneck of cores from many NUMA nodes trying to access memory on one NUMA node?</p>
<ul>
<li>We mentioned above that for a heap size that is closer to the total amount of physical memory, the heap must by definition be spread across the nodes, although the interleaving may be poor and large ranges of virtual address space can be mapped to the same node. Thus this by itself is not an ideal solution.</li>
<li>The allocation problems occur when the application has enough software threads that it will use the cores from many (or all) NUMA nodes. Sometimes it is possible to split an application into several processes, each with few enough threads that it will tend to use the cores from one NUMA node, and to pick a heap that is small enough to fit on one NUMA node. In those cases, the OS will tend to allocate the threads and memory on that one node. As a more extreme step, the OS can effectively be taken out of the picture by affinitizing the processes manually to NUMA nodes using “numactl” on Linux or “start /affinity” on Windows. You may have noticed several benchmark submissions taking this strategy.</li>
</ul>
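<p>For the manual affinity approach, a deployment script might build one launch command per NUMA node along the lines of the sketch below. The helper name, the <code>app.jar</code> file and the <code>cores_per_node</code> parameter are hypothetical. The Windows mask assumes logical processors are numbered contiguously per node and that the machine has no more than 64 of them, so check the actual topology before relying on it.</p>

```python
def per_node_commands(num_nodes, java_cmd, platform, cores_per_node=4):
    """Build one launch command per NUMA node (illustrative helper)."""
    commands = []
    for node in range(num_nodes):
        if platform == "linux":
            # numactl binds both the CPUs and the memory to one node.
            prefix = ["numactl",
                      "--cpunodebind=%d" % node,
                      "--membind=%d" % node]
        elif platform == "windows":
            # start /affinity takes a hexadecimal CPU mask.
            # Assumes cores are numbered contiguously within each node.
            mask = (2 ** cores_per_node - 1) * 2 ** (node * cores_per_node)
            prefix = ["cmd", "/c", "start", "/affinity", "%X" % mask]
        else:
            raise ValueError("unknown platform: " + platform)
        commands.append(prefix + list(java_cmd))
    return commands


java_cmd = ["java", "-jar", "app.jar"]  # app.jar is a placeholder
for cmd in per_node_commands(2, java_cmd, "linux"):
    print(" ".join(cmd))
for cmd in per_node_commands(2, java_cmd, "windows"):
    print(" ".join(cmd))
```

<p>Each resulting list can be handed to a process launcher. The heap size for each process would still need to be chosen so that it fits on one node.</p>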
<p>However, in many cases it is not possible to split an application into several processes, so you do need to run many threads in one process and you want a heap size that happens to fit on one or two NUMA nodes. In those cases, you’d like the JVM to help you achieve the allocation guidelines.</p>
<h3>JVM Memory Allocation Overrides</h3>
<p>The Hotspot JVM added an option a few years ago called UseNUMA. Internally this option divides the heap into two types of chunks, numa_local and numa_global. A numa_local chunk is local to one node, while a numa_global chunk is interleaved across all nodes. These types, when used wisely by the garbage collector, can help meet the allocation guidelines:</p>
<ul>
<li>Memory that could be accessed by many nodes, such as the areas that have survived garbage collections, will be numa_global (interleaved).</li>
<li>A thread-local allocation buffer, used for new allocations by a thread, will be selected from a pool of chunks that are numa_local to the node on which that thread is running.</li>
</ul>
<p>Note that this policy treats objects that survive collections as global (interleaved), whereas some of these may in fact still be accessed by only one node. Still, the UseNUMA policy can be a big improvement over just using the default OS allocation policy for the heap.</p>
<p>The internal design of numa_local and numa_global required that chunks could be resized dynamically and possibly have their type changed from local to global as the heap grew over time. While this meshed well with the memory allocation APIs of Linux and Solaris, it was unfortunately not possible to implement the full UseNUMA design on Windows, so the UseNUMA option was effectively a no-op there.</p>
<h3>Extending a subset of UseNUMA to Windows</h3>
<p>Here at AMD we noticed that while it was not possible to implement the full UseNUMA functionality on Windows, one could implement the numa_global or interleaved functionality. In this strategy, all parts of the heap would be interleaved across the NUMA nodes. The Windows API call VirtualAllocExNuma can be used to allocate memory on a specific node to achieve this interleaving.</p>
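<p>To make the interleaving idea concrete, here is a minimal sketch of how a heap's address range could be carved into chunks and assigned to nodes round-robin, with each chunk then being a candidate for a node-specific allocation call. The chunk size and the function name are assumptions made for the illustration. The post does not describe the granularity or bookkeeping the JVM actually uses.</p>

```python
def interleave_plan(heap_bytes, chunk_bytes, num_nodes):
    """Assign consecutive heap chunks to NUMA nodes round-robin.

    Returns a list of (offset, size, node) tuples covering the heap.
    """
    if not (heap_bytes > 0 and chunk_bytes > 0 and num_nodes > 0):
        raise ValueError("sizes and node count must be positive")
    plan = []
    offset = 0
    index = 0
    while heap_bytes > offset:
        # The final chunk may be shorter than chunk_bytes.
        size = min(chunk_bytes, heap_bytes - offset)
        plan.append((offset, size, index % num_nodes))
        offset += size
        index += 1
    return plan


MIB = 1024 * 1024
for offset, size, node in interleave_plan(10 * MIB, 4 * MIB, 2):
    print("offset %2d MiB, size %d MiB, node %d"
          % (offset // MIB, size // MIB, node))
```

<p>A smaller chunk size gives finer interleaving at the cost of more allocation calls. The right trade-off depends on the platform.</p>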
<p>We noticed that implementing just this numa_global interleaving subset could still result in some very good performance improvements on those applications where you do need to run many heap-accessing threads in one Java process, especially when you have a heap size that happens to fit on, say, one or two NUMA nodes. We saw up to a 2.5x improvement on some standard benchmarks under these conditions. (Note that an application that runs few enough threads that they would map to a single NUMA node should already be getting good memory usage from the default Windows policy, and so would not benefit from this memory interleaving.)</p>
<p>We submitted this functionality to the OpenJDK under a new option called UseNUMAInterleaving. This option is available on Windows, Linux and Solaris, although in most cases on Linux and Solaris you would just use the full UseNUMA option. In addition, on Windows, for now the UseNUMA option will map to UseNUMAInterleaving, since this subset is the only NUMA functionality available there. (It is possible in the future that a full UseNUMA would be implemented for Windows).</p>
<p>This new option is available in the early access releases of JDK7 Update 2, starting with build 08. So if you’re running a lot of threads in a single JVM, give it a try on your Windows Java deployment. We are interested in hearing of your experiences with this new option.</p>
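<p>If you want to script an A/B comparison, a small helper like the one below can build the two command lines. It relies on the usual HotSpot convention that a boolean option is switched on with <code>-XX:+Name</code>. The helper name, the <code>app.jar</code> file and the choice to pin the initial and maximum heap to the same size are assumptions for the sketch, not recommendations from the post.</p>

```python
def jvm_command(jar_path, heap_gb, interleave=True):
    """Build a java command line, optionally enabling NUMA interleaving."""
    # Fixing -Xms and -Xmx to the same value is an illustrative choice.
    cmd = ["java", "-Xms%dg" % heap_gb, "-Xmx%dg" % heap_gb]
    if interleave:
        cmd.append("-XX:+UseNUMAInterleaving")
    cmd += ["-jar", jar_path]
    return cmd


# app.jar is a placeholder for your own application.
print(" ".join(jvm_command("app.jar", 8, interleave=False)))
print(" ".join(jvm_command("app.jar", 8, interleave=True)))
```

<p>Running the same workload with both command lines on a multi-node Windows machine is a simple way to see whether the option helps your application.</p>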
]]></content:encoded>
			<wfw:commentRss>http://blogs.amd.com/developer/2011/10/27/a-new-numa-option-for-windows%c2%ae-on-the-hotspot-jvm/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://blogs.amd.com/developer/2011/10/27/a-new-numa-option-for-windows%c2%ae-on-the-hotspot-jvm/</feedburner:origLink></item>
		<item>
		<title>5 minutes work for up to 5% more Hadoop performance</title>
		<link>http://feeds.amddevcentral.com/~r/AmdDeveloperBlogs/~3/Md8xKIFebrU/</link>
		<comments>http://blogs.amd.com/developer/2011/10/20/5-minutes-work-for-up-to-5-more-hadoop-performance/#comments</comments>
		<pubDate>Thu, 20 Oct 2011 18:35:59 +0000</pubDate>
		<dc:creator>Eric Caspole</dc:creator>
				<category><![CDATA[AMD Java Labs]]></category>
		<category><![CDATA[Code Optimization]]></category>
		<category><![CDATA[Java performance]]></category>
		<category><![CDATA[Linux]]></category>

		<guid isPermaLink="false">http://blogs.amd.com/developer/?p=2124</guid>
		<description><![CDATA[Lately we have been working on studying and trying to improve the performance of Apache Hadoop. In the spring my coworker Shrinivas Joshi created a great Hadoop tuning reference poster: http://blogs.amd.com/developer/2011/07/12/apache-hadoop-performance-tuning-methodologies-and-best-practices-%E2%80%93-a-downloadable-reference-guide/ This week I made a pleasant discovery that would &#8230; <a href="http://blogs.amd.com/developer/2011/10/20/5-minutes-work-for-up-to-5-more-hadoop-performance/">Continue reading</a>]]></description>
			<content:encoded><![CDATA[<p>Lately we have been working on studying and trying to improve the performance of Apache Hadoop. In the spring my coworker Shrinivas Joshi created a great Hadoop tuning reference poster:</p>
<p><a title="poster" href="http://blogs.amd.com/developer/2011/07/12/apache-hadoop-performance-tuning-methodologies-and-best-practices-%E2%80%93-a-downloadable-reference-guide/" target="_blank">http://blogs.amd.com/developer/2011/07/12/apache-hadoop-performance-tuning-methodologies-and-best-practices-%E2%80%93-a-downloadable-reference-guide/</a></p>
<p>This week I made a pleasant discovery that would be easily overlooked by almost everyone, so I wanted to point it out for Hadoop users. One of my Hadoop test beds is a one-machine pseudo-cluster with one AMD Opteron™ 6128 processor and 5 hard drives. This system happens to have Ubuntu 10.10, Hadoop 0.20.3 and JDK 6u27 installed. My test case here is 16GB terasort.</p>
<p>As you may know, lzo compression (see <a title="lzo" href="http://www.oberhumer.com/opensource/lzo/" target="_blank">http://www.oberhumer.com/opensource/lzo/</a> ) is commonly used in Hadoop for compressing the map output, for example, see <a title="lzo-hadoop" href="http://wiki.apache.org/hadoop/UsingLzoCompression" target="_blank">http://wiki.apache.org/hadoop/UsingLzoCompression</a>. Compressing the map output can help improve the performance by decreasing the amount of writes to the disks, which should help reduce I/O wait time depending on your disk configuration.</p>
<p>I happened to download the latest lzo source package v2.06 and built it with the default options with the default gcc 4.4.5 in Ubuntu 10.10. This is a small library and it gets compiled in under a minute with parallel compilation. As usual it gets installed in /usr/local/lib. It turned out that the terasort run is noticeably faster with v2.06 than the Ubuntu 10.10 default lzo which is v2.03.</p>
<p style="text-align: center">
<div id="attachment_2151" class="wp-caption aligncenter" style="width: 247px"><a rel="attachment wp-att-2151" href="http://blogs.amd.com/developer/2011/10/20/5-minutes-work-for-up-to-5-more-hadoop-performance/graph-pct/"><img class="size-large wp-image-2151 " src="http://blogs.amd.com/developer/files/2011/10/graph-pct-237x211.jpg" alt="" width="237" height="211" /></a><p class="wp-caption-text">Percent improvement with v2.06 over default v2.03</p></div>
<p>Nothing else was changed in the environment. The execution time was about 5% lower with the latest v2.06 than with the default lzo library, with low run-to-run variance. I checked another testbed which is running Fedora 14, and it turns out Fedora 14 also has v2.03 by default. So I recommend checking the lzo version regardless of your Linux distro.</p>
<p>Of course your workload is different from terasort, and the benefit of this change depends on the size of your map output. But overall that is a nice 5% speedup in this workload for a couple minutes’ work downloading, building, and installing a newer version of an open source library that you are probably already using.</p>
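<p>One way to check which lzo version a machine would load is to ask the shared library itself. The sketch below uses Python's ctypes and the <code>lzo_version_string()</code> function exported by liblzo2. The helper names are made up for the example, and the copy it finds depends on the loader's search path, so it may not be the same copy your Hadoop native libraries pick up.</p>

```python
import ctypes
import ctypes.util


def installed_lzo_version():
    """Return liblzo2's version string, or None if it cannot be loaded."""
    name = ctypes.util.find_library("lzo2")
    if name is None:
        return None
    try:
        lib = ctypes.CDLL(name)
        lib.lzo_version_string.restype = ctypes.c_char_p
        return lib.lzo_version_string().decode("ascii")
    except (OSError, AttributeError):
        return None


def version_tuple(text):
    """Turn a dotted version such as '2.06' into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))


def is_older(found, wanted="2.06"):
    """True if the found version is older than the wanted one."""
    return version_tuple(wanted) > version_tuple(found)


version = installed_lzo_version()
if version is None:
    print("liblzo2 not found on this system")
elif is_older(version):
    print("lzo %s is older than 2.06, consider upgrading" % version)
else:
    print("lzo %s is 2.06 or newer" % version)
```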
<div><em>Eric Caspole is a Software Engineer at AMD. His postings are his own opinions and may not represent AMD’s positions, strategies or opinions. Links to third party sites, and references to third party trademarks, are provided for convenience and illustrative purposes only. Unless explicitly stated, AMD is not responsible for the contents of such links, and no third party endorsement of AMD or any of its products is implied.</em></div>
]]></content:encoded>
			<wfw:commentRss>http://blogs.amd.com/developer/2011/10/20/5-minutes-work-for-up-to-5-more-hadoop-performance/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://blogs.amd.com/developer/2011/10/20/5-minutes-work-for-up-to-5-more-hadoop-performance/</feedburner:origLink></item>
	</channel>
</rss>

