While the previous section described optimizations applicable to all projects, this section details optimizations that should not be applied prior to gathering profiling data. This may be because the optimizations are labor-intensive to implement, may compromise code cleanliness or maintainability in favor of performance, or may resolve problems that only appear at certain magnitudes of scale.
As discussed on StackOverflow, it is generally more efficient to iterate over jagged arrays than over multidimensional arrays, because accessing an element of a multidimensional array involves a function call.
NOTES:

- Jagged arrays are arrays of arrays, and are declared as type[x][y] instead of type[x,y].
- The extra function call can be discovered by inspecting the IL generated when accessing a multidimensional array, using ILSpy or similar tools.
When profiled in Unity 5.3, 100 fully sequential iterations over a three-dimensional 100x100x100 array yielded the following timings, which were averaged over 10 runs of the test:
| Array type | Total time (100 iterations) |
|---|---|
| One-dimensional array | 660 ms |
| Jagged array | 730 ms |
| Multidimensional array | 3470 ms |
The disparity between multidimensional and one-dimensional array access shows the cost of the additional function call, while the smaller difference between jagged and one-dimensional array access shows the cost of iterating over a non-compact memory structure.
As demonstrated above, the cost of the additional function call heavily outweighs the cost imposed by using a non-compact memory structure.
For highly performance-sensitive operations, it is recommended to use a one-dimensional array. For all other cases where an array with multiple dimensions is required, use a jagged array. Multidimensional arrays should not be used.
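As a sketch of the patterns in question, the following compares the three declaration styles; the dimensions and values are illustrative, not taken from the benchmark above.

```csharp
// One-dimensional array: fastest option; the 2D index is computed manually.
int[] flat = new int[100 * 100];
flat[5 * 100 + 10] = 1;

// Jagged array (array of arrays): each row is allocated separately,
// so memory is non-compact, but element access is plain indexing.
int[][] jagged = new int[100][];
for (int x = 0; x < 100; x++)
    jagged[x] = new int[100];
jagged[5][10] = 1;

// Multidimensional array: element access compiles to a method call
// (Get/Set in the IL), which is why it profiles slowest.
int[,] multi = new int[100, 100];
multi[5, 10] = 1;
```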
When pooling Particle Systems, be aware that they consume at least 3500 bytes of memory. Memory consumption increases based on the number of modules activated on the Particle System. This memory is not released when Particle Systems are deactivated; it is only released when they are destroyed.
As of Unity 5.3, most Particle System settings can now be manipulated at runtime. For projects that must pool a large number of different particle effects, it may be more efficient to extract the configuration parameters of the Particle Systems out onto a data-carrier class or structure.
When a particle effect is needed, a pool of “generic” particle effects can then supply the requisite particle effect object. The configuration data can then be applied to the object to achieve the desired graphical effect.
This is substantially more memory-efficient than attempting to pool all possible variants & configurations of Particle Systems used in a given Scene, but requires substantial engineering effort to achieve.
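One possible shape for such a data carrier is sketched below. The class name, field set, and ApplyTo method are illustrative assumptions, not a Unity API; a real implementation would cover whichever module settings the project actually varies. (In Unity 5.3 these settings were properties directly on ParticleSystem; later versions expose them through the main module, as shown here.)

```csharp
using UnityEngine;

// Hypothetical data carrier holding the parameters that distinguish
// one particle effect from another in this project.
public class ParticleEffectConfig
{
    public float startLifetime;
    public float startSpeed;
    public float startSize;
    public Color startColor;

    // Copies the stored configuration onto a pooled, "generic"
    // Particle System before it is played.
    public void ApplyTo(ParticleSystem ps)
    {
        var main = ps.main;
        main.startLifetime = startLifetime;
        main.startSpeed = startSpeed;
        main.startSize = startSize;
        main.startColor = startColor;
    }
}
```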
Internally, Unity tracks lists of objects interested in its callbacks, such as Update, FixedUpdate and LateUpdate. These lists are maintained as intrusively-linked lists to ensure that list updates occur in constant time. MonoBehaviours are added to or removed from these lists when they are enabled or disabled, respectively.
While it is convenient to simply add the appropriate callbacks to the MonoBehaviours that require them, this becomes increasingly inefficient as the number of callbacks grows. There is a small but significant overhead to invoking managed-code callbacks from native code due to trampolining. This results both in degraded frame times when invoking large numbers of per-frame methods, and in degraded instantiation times when instantiating Prefabs that contain large numbers of MonoBehaviours (NOTE: The instantiation cost is due to the trampoline overhead of invoking Awake and OnEnable callbacks on each Component in a prefab.).
When the number of MonoBehaviours with per-frame callbacks grows into the hundreds or thousands, it is advantageous to remove these callbacks and instead have MonoBehaviours (or even standard C# objects) attach to a global manager singleton. The global manager singleton can then distribute Update, LateUpdate and other callbacks to interested objects. This has the additional benefit of allowing code to smartly unsubscribe from callbacks when they would otherwise no-op, thereby shrinking the sheer number of functions that must be called each frame.
The greatest saving is usually realized by eliminating callbacks which rarely execute. Consider the following pseudo-code:
```csharp
void Update() {
    if(!someVeryRareCondition) { return; }
    // ... some operation ...
}
```
If there are large numbers of MonoBehaviours with Update callbacks similar to the above, then a significant amount of the time spent running Update callbacks is spent trampolining into MonoBehaviours that then exit immediately. If these classes instead subscribed to a global Update manager only while someVeryRareCondition were true, and unsubscribed thereafter, time would be saved both on the trampolines and on the evaluation of the rare condition.
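A minimal sketch of such a manager is shown below. The class name, the IUpdatable interface and the use of a plain List are assumptions for illustration; handling of subscriptions that change mid-frame is omitted for brevity.

```csharp
using System.Collections.Generic;
using UnityEngine;

// Hypothetical interface implemented by objects that want per-frame ticks.
public interface IUpdatable
{
    void OnUpdate();
}

// A single MonoBehaviour receives Unity's Update callback and fans it
// out to subscribers, so only one native-to-managed trampoline crossing
// occurs per frame regardless of the number of subscribers.
public class UpdateManager : MonoBehaviour
{
    public static UpdateManager Instance { get; private set; }

    private readonly List<IUpdatable> subscribers = new List<IUpdatable>();

    void Awake() { Instance = this; }

    public void Subscribe(IUpdatable u)   { subscribers.Add(u); }
    public void Unsubscribe(IUpdatable u) { subscribers.Remove(u); }

    void Update()
    {
        // Iterate by index to avoid allocating an enumerator.
        for (int i = 0; i < subscribers.Count; i++)
            subscribers[i].OnUpdate();
    }
}
```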
It is tempting to use plain C# delegates to implement these callbacks. However, C#'s delegate implementation is optimized for a low rate of subscription and unsubscription, and for a low number of callbacks. A C# delegate performs a full deep copy of the callback list each time a callback is added or removed. Large lists of callbacks, or large numbers of callbacks subscribing/unsubscribing during a single frame, result in performance spikes in the internal Delegate.Combine method.
For cases where adds/removes occur at high frequencies, consider using a data structure designed for fast inserts/removes instead of delegates.
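One common pattern — offered here as an illustrative sketch, not the only option — is to buffer additions and removals and apply them in a single batch before each dispatch, so the active list is mutated at most once per frame instead of being deep-copied on every change:

```csharp
using System;
using System.Collections.Generic;

// Sketch of a callback list that tolerates frequent subscribe/unsubscribe:
// changes are queued and applied in one batch before dispatch, instead of
// copying the whole invocation list on every mutation as delegates do.
public class CallbackList
{
    private readonly List<Action> active = new List<Action>();
    private readonly List<Action> pendingAdds = new List<Action>();
    private readonly List<Action> pendingRemoves = new List<Action>();

    public void Subscribe(Action cb)   { pendingAdds.Add(cb); }
    public void Unsubscribe(Action cb) { pendingRemoves.Add(cb); }

    public void Invoke()
    {
        // Apply queued changes once, then dispatch to the active list.
        foreach (var cb in pendingRemoves) active.Remove(cb);
        foreach (var cb in pendingAdds) active.Add(cb);
        pendingRemoves.Clear();
        pendingAdds.Clear();

        for (int i = 0; i < active.Count; i++)
            active[i]();
    }
}
```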
Unity permits developers to control the priority of background threads that are being used to load data. This is particularly important when trying to stream AssetBundles onto disk in the background.
The main thread and the graphics thread both run at ThreadPriority.Normal – any threads with higher priority preempt the main/graphics threads and cause framerate hiccups, whereas threads with lower priority do not. If threads have a priority equal to the main thread's, the CPU attempts to give equal time to the threads, which generally results in framerate stuttering if multiple background threads are performing heavy operations, such as AssetBundle decompression.
Currently, this priority can be controlled in three places.
First, the default priority for Asset loading calls, such as Resources.LoadAsync and AssetBundle.LoadAssetAsync, is taken from the Application.backgroundLoadingPriority setting. As documented, this setting also limits the amount of time that the main thread spends integrating Assets (NOTE: Most types of Unity Assets must be "integrated" onto the main thread. During integration, Asset initialization is finalized and certain thread-safe operations are performed, including scripting callback invocations such as Awake callbacks. See the "Resource Management" guide for further details.), in order to limit the impact of Asset loading on frame time.
Second, each asynchronous Asset loading operation, as well as each UnityWebRequest request, returns an AsyncOperation object to monitor and manage the operation. This AsyncOperation object exposes a priority property that can be used to adjust an individual operation's priority.
Finally, WWW objects, such as those returned from a call to WWW.LoadFromCacheOrDownload, expose a threadPriority property. It is important to note that WWW objects do not automatically use the Application.backgroundLoadingPriority setting as their default value – WWW objects always default to ThreadPriority.Normal.
It's important to note that the under-the-hood systems used to decompress and load data differ between these APIs. Resources.LoadAsync and AssetBundle.LoadAssetAsync are serviced by Unity's internal PreloadManager system, which governs its own loading thread(s) and performs its own rate-limiting. UnityWebRequest uses its own dedicated thread pool. WWW spawns an entirely new thread each time a request is created.
While all other loading mechanisms have a built-in queuing system, WWW does not. Calling WWW.LoadFromCacheOrDownload on a very large number of compressed AssetBundles spawns an equivalent number of threads, which then compete with the main thread for CPU time. This can easily cause frame-rate stuttering. Therefore, when using WWW to load and decompress AssetBundles, it is considered best practice to set an appropriate threadPriority on each WWW object that is created.
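A brief sketch of the practice, assuming a background download in a coroutine; the URL and version number are placeholders:

```csharp
using System.Collections;
using UnityEngine;

public class BundleLoader : MonoBehaviour
{
    IEnumerator LoadBundle()
    {
        // Hypothetical AssetBundle URL and cache version.
        WWW www = WWW.LoadFromCacheOrDownload("http://example.com/bundle", 1);

        // WWW defaults to ThreadPriority.Normal, which competes with the
        // main thread for CPU time; lower the priority so decompression
        // yields to the frame loop.
        www.threadPriority = ThreadPriority.Low;

        yield return www;
        AssetBundle bundle = www.assetBundle;
    }
}
```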
As mentioned in the section on Transform Manipulation, moving large Transform hierarchies has a relatively high CPU cost due to the propagation of change messages. However, in real development environments, it is often impossible to collapse a hierarchy to a modest number of GameObjects.
At the same time, it is good development practice to run only enough behavior to maintain the believability of the game world while eliminating behavior the user will not notice – for example, in a Scene with a large number of characters, it is more efficient to run Mesh skinning and animation-driven Transform movement only for characters that are on-screen. There is no reason to waste CPU time calculating purely visual elements of the simulation for characters that are off-screen.
Both of these problems can be neatly addressed with an API first introduced in Unity 5.1: CullingGroups.
Instead of directly manipulating a large group of GameObjects in the scene, change the system to manipulate the Vector3 parameters of a group of BoundingSpheres within a CullingGroup. Each BoundingSphere serves as the authoritative repository for a single game-logical entity’s world-space position, and receives callbacks when the entity moves near/within the frustum of the CullingGroup’s main camera. These callbacks can then be used to activate/deactivate code or components (such as Animators) governing behavior that should only run while the entity is visible.
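A minimal sketch of the pattern follows; the sphere count, radius, and the component-toggling placeholders are arbitrary choices for illustration.

```csharp
using UnityEngine;

public class VisibilityManager : MonoBehaviour
{
    CullingGroup group;
    BoundingSphere[] spheres = new BoundingSphere[1000];

    void Start()
    {
        group = new CullingGroup();
        group.targetCamera = Camera.main;

        // One sphere per game-logical entity; game code updates sphere
        // positions instead of moving full Transform hierarchies.
        for (int i = 0; i < spheres.Length; i++)
            spheres[i] = new BoundingSphere(Vector3.zero, 1f);

        group.SetBoundingSpheres(spheres);
        group.SetBoundingSphereCount(spheres.Length);
        group.onStateChanged = OnStateChanged;
    }

    void OnStateChanged(CullingGroupEvent evt)
    {
        // Activate expensive visual behavior (Animators, Mesh skinning)
        // only while the entity's sphere is visible to the camera.
        if (evt.hasBecomeVisible)
        {
            // enable components for entity evt.index
        }
        else if (evt.hasBecomeInvisible)
        {
            // disable components for entity evt.index
        }
    }

    void OnDestroy()
    {
        group.Dispose();
        group = null;
    }
}
```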
C#'s string library provides an excellent case study in the cost of adding additional method calls to simple library code. In the section on the built-in string APIs String.StartsWith and String.EndsWith, it was mentioned that hand-coded replacements are 10–100 times faster than the built-in methods, even when unwanted locale coercion was suppressed.
The key reason for this performance difference is simply the cost of adding additional method calls to tight inner loops. Each method that is invoked must locate the address of the method in memory and push another frame onto the stack. Neither of these operations is free, but in most code they are sufficiently small to ignore.
However, when running small methods in tight loops, the overhead added by introducing additional method calls can become significant – and even dominant.
Consider the following two simple methods.
Example 1:
```csharp
int Accum { get; set; }
Accum = 0;
for(int i = 0; i < myList.Count; i++) {
    Accum += myList[i];
}
```
Example 2:
```csharp
int accum = 0;
int len = myList.Count;
for(int i = 0; i < len; i++) {
    accum += myList[i];
}
```
Both methods calculate the sum of all integers in a C# generic List&lt;int&gt;. The first example is a bit more "modern C#" in that it uses an automatically generated property to hold its data values.
While on the surface these two pieces of code appear equivalent, the difference is notable when the code is analyzed for method calls.
Example 1:
```csharp
int Accum { get; set; }
Accum = 0;
for(int i = 0;
    i < myList.Count; // call to List::get_Count
    i++) {
    Accum             // call to set_Accum
        +=            // call to get_Accum
        myList[i];    // call to List::get_Value
}
```
So there are four method calls each time the loop executes:

- myList.Count invokes the get method on the Count property.
- The get and set methods on the Accum property must both be called:
  - get to retrieve the current value of Accum so that it can be passed to the addition operation
  - set to assign the result of the addition operation to Accum
- myList[i] invokes the List's indexer get method to retrieve the value at index i.
Example 2:
```csharp
int accum = 0;
int len = myList.Count;
for(int i = 0;
    i < len;
    i++) {
    accum += myList[i]; // call to List::get_Value
}
```
In this second example, the call to get_Value remains, but all other methods have either been eliminated or no longer execute once per loop iteration. As accum is now a primitive value instead of a property, no method calls need to be made to set or retrieve its value. As myList.Count is assumed not to vary while the loop is running, its access has been hoisted outside of the loop's conditional statement, so it is no longer executed at the beginning of each loop iteration.
The timings for the two versions, each run 100,000 times on a modern desktop machine, reveal the true benefit of removing 75% of the method-call overhead from this specific snippet of code.
The primary issue here is that Unity performs very little method inlining, if any. Even under IL2CPP, many methods do not currently inline properly. This is especially true of properties. Further, virtual and interface methods cannot be inlined at all.
Therefore, a method call declared in the source C# is very likely to end up producing a method call in the final binary application.
Unity provides many “simple” constants on its data types for the convenience of developers. However, in light of the above, it is important to note that these constants are generally implemented as properties that return constant values.
Vector3.zero's property body is as follows:

```csharp
get { return new Vector3(0,0,0); }
```

Quaternion.identity is very similar:

```csharp
get { return new Quaternion(0,0,0,1); }
```
While the cost of accessing these properties is usually tiny compared to the actual code surrounding them, they can make a small difference when they are executed thousands of times per frame (or more).
For simple primitive types, use a const value instead. Const values are inlined at compile time – each reference to the const variable is replaced with its value.
Note: Because every reference to a const variable is replaced with its value, it is inadvisable to declare long strings or other large data types as const. This unnecessarily bloats the size of the final binary due to all the duplicated data in the final instruction code.
Wherever const isn't appropriate, use a static readonly variable instead. In some projects, even Unity's built-in trivial properties have been replaced with static readonly variables, resulting in small improvements in performance.
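A sketch of the pattern is shown below; the helper class and field names are assumptions for illustration.

```csharp
using UnityEngine;

public static class CachedConstants
{
    // Computed once at class initialization; subsequent reads are plain
    // static field loads, avoiding the property call and struct
    // construction performed by Vector3.zero and Quaternion.identity.
    public static readonly Vector3 ZeroVector = new Vector3(0f, 0f, 0f);
    public static readonly Quaternion IdentityRotation = new Quaternion(0f, 0f, 0f, 1f);
}
```

Code in hot loops would then reference CachedConstants.ZeroVector instead of Vector3.zero.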
Trivial methods are trickier. It is extremely useful to be able to declare functionality once and reuse it elsewhere. However, in tight inner loops, it may be necessary to depart from good coding practices and instead “manually inline” certain code.
Some methods can be eliminated outright. Consider Quaternion.Set, Transform.Translate or Vector3.Scale. These perform very trivial operations and can be replaced with simple assignment statements.
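For example, a Vector3.Scale call in a hot loop can be replaced with componentwise assignments, as in this sketch; whether the change is worthwhile should be confirmed by profiling.

```csharp
using UnityEngine;

public static class ManualInlineExample
{
    public static Vector3 Sum(Vector3[] values, Vector3 scale)
    {
        Vector3 total = new Vector3(0f, 0f, 0f);
        for (int i = 0; i < values.Length; i++)
        {
            // Instead of: total += Vector3.Scale(values[i], scale);
            // the componentwise multiply is written inline, so no
            // Scale method call occurs inside the loop body.
            total.x += values[i].x * scale.x;
            total.y += values[i].y * scale.y;
            total.z += values[i].z * scale.z;
        }
        return total;
    }
}
```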
For more complex methods, weigh the profiling evidence for manual inlining against the long-term cost of maintaining the more-performant code.