Co-Authored By:Mahmut Aktasoglu, Intel Software Solutions Group, [email protected] Shetty, Intel Software Solutions Group, [email protected] TQ8Minimum cost supplier Test Query ID4-socket Intel® Xeon® Processor E7-8890 v34-socket Intel® Xeon® Processor E7-8890 v4Performance Increase Test query 2 is the perfect value proposition scenario in our experiments. It is a query with a relatively simpler query plan, and spends most of query execution time in AVX2-optimized scan operator. This query calculates an aggregate value over a data set from a given year with some additional predicates (i.e., processing roughly 12 billion rows). Hence performance difference in query run times can be directly attributed to improvements from vectorizing the algorithms. As query plans gets more complicated, measurable benefits from utilizing instruction level parallelism will also reduce (following Amdahl’s law), but still impactful: A reduction of up to 25% in query run times on average with AVX2. We encourage readers of this blog to explore the possibilities with Intel SIMD instruction set extension for their use case by consulting to Intel® 64 and IA-32 Architectures Software Developer ManualsOpens in a new window and easy-to-use Intel® Intrinsics GuideOpens in a new window web browser tool. TQ12Small quantity order revenue TQ10670.59510.611.31 TQ4Promotion effect TQ5Top supplier TQ12593.99495.981.19 TQ9593.99451.821.31 TQ1161.23137.791.17 TQ8142.49122.301.16 TQ25.314.971.06 TQ11Shipping modes and order priority TQ1149.70137.791.08 TQ2Revenue change forecast TQ6105.0274.321.41 TQ779.2351.751.53 Test Query IDNotes TQ11113.7198.721.15 TQ5492.94401.171.22 TQ7Returned item report DisclaimerIntel, the Intel logo, Intel Xeon, Intel SSDs, and Intel Raid Controllers are trademarks of Intel Corporation in the U.S. and/or other countries.*Other names and brands may be claimed as the property of othersSoftware and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.For more information, visit www.intel.com/benchmarks. With an increasing amount of data being stored in modern data warehouses, organizations need solutions that scale proportionally. Although building clusters of servers and distributing data across the systems is one of the more common approaches to scale, it is not desirable from a performance or TCO point of view.Intel and Microsoft collaborate to ensure our data platform solutions delivers cutting edge performance and the best ROI. In a recent Accelerating Analytics WhitepaperOpens in a new window, we showcased the capabilities of a data platform, powered by Intel® Xeon® processor E7 v4 family, Intel® Data Center SSDs and Intel® RAID Controllers; coupled with Microsoft® SQL Server 2016 and Windows Server 2016. We loaded a 100TB dataset following TPC-HOpens in a new window specifications, which models a data warehouse with approximately 600 billion records in its largest table. Using SQL Server’s columnar database engine, the records were loaded into a clustered columnstore index at an impressive rate of 1.6TB/h. The resulting columnar-store database took about 35TB of total disk space.In this blog, we will show our new results from the same workload after upgrading our processors to the latest Intel® Xeon® processor E7-8890 v4. We kept the rest of the configuration unchanged, demonstrating how SQL Server 2016 performance scales by simply upgrading your processors. Spoiler Alert:Up to 36% shorter query run times on average on a 100TB database with 4-socket Intel® Xeon® processor E7-8890 v4 processor2Up to 2.3x performance improvement in the SQL Server 2016 columnstore engine by utilizing Intel® AVX2 instruction set.4Test QueriesWe used the same 7 queries from previous experimentsOpens in a new window, plus 5 additional ad-hoc queries derived from TPC-H specs which involves aggregations over large number of records, to highlight AVX2 performance gains. You can find brief descriptions of the queries in the table below. Average Performance Increase1.36x TQ1Pricing summary report Test Query IDScalarAVX2Performance Increase TQ6Product type profit measure TQ4914.44412.932.21 TQ211.524.972.31 The new Intel® Xeon® E7 v4 performanceThe Intel Xeon processor E7 v4 family is the newest member of data center processors, designed for your performance needs in your data platform. Compared to its predecessor, Intel Xeon E7 v4 family can have up to 48 logical cores (33% increase vs. Xeon E7-8890 v3) in terms of the maximum degree of parallelism that can be attained.In the graph below, we show performance of the same queries executing on exactly the same configuration, the only difference being the processors. Before each query run, we clear all SQL Server caches. Then, each query is run twice to warm up SQL Server column-store object pool cache. We plot the query run time of the second run. With the new Intel Xeon E7-8890 v4 processor, we see performance gains across the board. For the same 7 queries we used in our previous workOpens in a new window, we observed up to 36% performance gains on average compared to Xeon E7-8890 v3 for complex ad-hoc queries against a 100TB dataset. As simple as a processor upgrade, customers can unlock the potential of SQL Server 2016 by coupling it with Intel’s latest data platform CPUs. TQ3Important supplier stock identification TQ33761.522987.221.25 TQ9Shipping priority Average Performance Increase1.31x  We have installed a newer build of Microsoft Windows Server 2016. 36% more performance average on 7 ad-hoc data warehousing queries, comparing Intel® Xeon® Processor E7-8890 v4 to Intel® Xeon® Processor E7-8890 v3 on a 4 socket server with 6TB DDR4 memory, 100TB storage for data (64 Intel® SSDs DC3500 series) running Microsoft SQL Server 2016 RTM on Windows Server 2016. SQL Server 2016 supports Intel’s SSE4.2, AVX and AVX2 instruction set extensions. Up to 2.3x performance improvement and 25% average reduction in query run times on 7 ad-hoc data warehousing queries, comparing customized SQL Server 2016 RTM with AVX2 instruction set support to same engine without AVX2 support, on a 4 socket server with Intel® Xeon® Processor E7-8890 v4 processors, 6TB DDR4 memory, 100TB storage for data (64 Intel® SSDs DC3500 series) running on Windows Server 2016. Reduction in query run time calculated as: ( (1 – (AVX2 run time/ scalar run time) ) x 100). Average reduction in run time is geometric mean of reduction in query run times for the 7 queries. TQ10Local supplier volume Microsoft SQL SERVER 2016 optimized for Intel AVX2Our next experiment demonstrates yet another value of Intel® Xeon® processor family technology. With its latest release, SQL Server supports Intel® SIMD instruction sets in columnstore batch-mode operations. What does this mean? SQL Server 2016 has different versions of the same algorithms using vector instructions. When you upgrade your database software to SQL Server 2016, it will detect all vector instruction set extensions available on the CPU and choose the algorithms that provides the best performance.To show the performance benefits of using vector instructions in SQL Server’s columnstore engine, we designed an experiment. In this experiment, we run same queries, on the same platform with the only difference being the instruction set being used by SQL Server. We have implemented SQL Server 2016 prototype that can override selection of algorithms via a trace flag. By using this trace flag, we turn off AVX2 support for the first set of runs (i.e., scalar). In the second run, we let SQL Server 2016 choose the best version of algorithms, which uses AVX2 instructions in our case.Similar to our previous experiment, we clear SQL Server caches in-between each query run, and then run the queries twice. We use the run time from the second execution of the query. We plot the results in the chart below. In this chart, you can see the performance boost of up to 2.3x increase by utilizing Intel® AVX2 instruction set4.