by Hannes Ricklefs
1 August 2017
Cloudy with a chance of Rendering
SIGGRAPH Talk 2017
by Daniel Bergel, Craig Dibble, Pauline Koh, James Pearson, Hannes Ricklefs
Disney’s The Jungle Book required MPC to deliver work of unprecedented visual complexity and quality. To enable Disney to fully realise their creative vision, MPC wanted to ensure it had burst compute capacity available through flexible and scalable cloud-based resources.
The major technical challenge was to provide this burst capacity whilst meeting the strict security requirements of our client, something which had not previously been achieved for a production of this scale or sensitivity. The project needed dedicated resources across Technology, Operations and Production to holistically capture and address everyone’s requirements and process constraints.
Across all these domains the project was considered a huge success. This talk presents the key challenges faced, including a technical overview of the architecture, the essential management tools, and the interaction with production, from identifying appropriate job types to effective utilisation of these virtual resources.
Even before this project, MPC and Technicolor (MPC’s parent company) had partnered with Google and their Cloud Platform Team (GCP) to discuss the use of cloud resources for large-scale VFX rendering. In parallel, Google was also engaging with Disney to seek security approval. MPC therefore decided to use GCP as the vendor to provide cloud-based compute capacity.
The goal for this project was to provide a minimum of 10,000 additional cores to the existing local render resources. To ensure these cores were provided in alignment with Disney’s strict security standards, Technicolor and Disney engaged Independent Security Evaluators (ISE) to perform design reviews and security audits of the proposed architecture. The initial phase of the project consisted of providing ISE with detailed documentation and diagrams of the system architecture, including system configuration, and then iteratively addressing any identified shortcomings between the design and implementation as well as any security vulnerabilities.
Changes to the design continued throughout the project, predominantly driven by the availability of appliances, such as Avere, and by knowledge gained from more production testing. All subsequent changes to the initial design and implementation were resubmitted to ISE for approval.
The final design provided burst capacity for MPC’s London site through a remote production zone. Working with Google, MPC identified their Belgium data center as the preferred location, with the lowest latency (min 9ms) of any of their locations. The remote production zone was limited to providing render resources, without Internet connectivity or storage of film content.
During the initial phases of the project all services such as storage, software and configuration were sourced from the London site. MPC uses a central NFS-based software server, which, given the 9ms latency, caused various third-party and custom software to fail. In particular, Python-based applications and libraries suffered due to the number of plugins that must be loaded when starting up the MPC production runtime environment. This was quickly mitigated through a dedicated software server within the remote production zone.
To achieve the required network bandwidth MPC and Technicolor partnered with SohoNet to build dedicated 10Gb/s VPN connectivity between MPC’s internal production zone and the GCP remote production zone.
One of the main security requirements was to disable all Internet connectivity from the remote and internal production zones. This became one of the major technical challenges because the Google API and the Avere vFXT appliance inherently require Internet access to talk to the underlying Google infrastructure. MPC engineered a Tiered Proxy solution that enabled the bootstrapping and API requirements for the GCP resources and Avere appliance whilst meeting the security guidelines and commitments.
There was great willingness from all parties to make this project a success, to the extent that Avere sent engineers on site to create a custom build of their appliance to meet the security requirements. This enabled MPC to optimise access to local production storage via the Avere vFXT appliance, a crucial component in removing the issues caused by the 9ms latency between the remote production zone and the local MPC site.
Compute Node Types
GCP compute nodes can be provisioned under a variety of machine types and availability models. MPC chose a mixture of two: pre-emptible (PVM) instances, which can be interrupted at any time and have a maximum runtime of 24h but a lower price point, and permanent (PERM) instances, which have guaranteed and unlimited uptime but a higher price point.
The initial machine configuration chosen was 8 core x 30GB PVM instances and 16 core x 60GB PERM instances. Fairly quickly the initial machine profile for PVMs was upgraded to 16 core x 104GB; these high-memory profiles enabled thousands of tasks to be completed in a very short time frame. The final phase utilised a balancing ratio of 40-60% PVM for the compute instances. The final Cloud setup contained 4 different node types:
- Render nodes: Dedicated to running the render tasks
- File server: Dedicated to providing space for temporary scratch content and to hosting the dedicated software server
- VPN Endpoints: Built in collaboration with SohoNet to provide managed VPN, according to network throughput requirements
- Avere Virtual Appliance: Custom MPC Avere build to create optimal performance and ensure that no production content is stored at rest in the Cloud.
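The machine profiles and PVM ratio described above translate into instance counts with simple arithmetic. The following is a minimal sketch, assuming every compute instance uses the 16-core profile and that the 40-60% ratio applies to instance counts; the function name `instance_mix` and its parameters are hypothetical and not part of MPC's tooling.

```python
import math


def instance_mix(target_cores, cores_per_instance, pvm_percent):
    """Return (pvm_count, perm_count) needed to reach target_cores.

    Assumption: all instances share one core count, and pvm_percent is
    the share of instances (not of cores or of cost) that are PVMs.
    """
    if not 0 <= pvm_percent <= 100:
        raise ValueError("pvm_percent must be between 0 and 100")
    total = math.ceil(target_cores / cores_per_instance)
    pvm = round(total * pvm_percent / 100)
    return pvm, total - pvm


# The 10,000-core minimum goal at both ends of the 40-60% PVM range.
for percent in (40, 60):
    pvm, perm = instance_mix(10_000, 16, percent)
    print(f"{percent}% PVM: {pvm} PVM + {perm} PERM instances")
```

Under these assumptions the 10,000-core goal corresponds to 625 sixteen-core instances, split between the two availability models according to the chosen ratio.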
Job Identification and Steering
The job types chosen for execution on GCP had to meet the criteria of limited IO and, ideally, heavy compute with limited output, to keep the cost of egress to a minimum. After various tests the decision was made to only run Katana-Pixar Renderman tasks on GCP, as they matched these criteria most closely in comparison to other task types such as simulations, compositing or automation through asset management [Butler et al. 2008].
For The Jungle Book MPC pushed for a lot of the pre-Lighting department’s visual quality control (QC) steps to be based on Pixar’s Renderman renders [Romeo et al. 2015], rather than standard playblasts or OpenGL-based representations of the shot. The benefit of catching any potential issues due to renderer differences early and removing the back and forth between departments outweighs the increased time in generating these QC images. GCP was predominantly used during the final phases of The Jungle Book to enable production to push through final deliveries using local render resources while continuing to utilise this newly established QC process through burst capacity.
Rough guidelines were put in place for job selection. During production it was recommended to choose the lower-third priority tasks, since for these there was minimal chance of interruption delays impacting internal deliveries. Ideally jobs should have short render times, to minimise the loss of investment if a network or pre-emption interruption occurs, combined with preferably high memory requirements, to free up valuable high-memory local capacity. In general QC render tasks fit this profile extremely well.
In practice, this worked out to be: lower priority (due in 1 week or more) Katana-Pixar Renderman tasks, previews or 1k monos, with render times between 1 and 5 hours.
The majority of the custom tooling developed for this project was built around identification and tagging of jobs to be run on GCP resources. For example, by default the job tagging script would fetch priority flagged jobs for the specified show and discipline (light) which were ready to go but unlikely to run locally that night (lower 1/3 of jobs). Additional filtering included scene, shot, time estimate, and title keyword matching. Scene and shot filters enabled quick identification of shots with similar assets to ensure best reuse of files stored within the Avere cache to reduce the load on the local storage and reduce latency.
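The selection rules described above can be sketched as a simple filter. This is a minimal illustration only, not MPC's tagging script: the `Job` fields, the `select_cloud_jobs` function, the task type label `katana-prman` and the convention that a lower `priority` value means less urgent are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Job:
    title: str
    show: str
    discipline: str
    task_type: str
    priority: int      # assumption: lower value = less urgent
    est_hours: float   # estimated render time
    shot: str


def select_cloud_jobs(jobs, show, discipline="light",
                      task_type="katana-prman",
                      min_hours=1.0, max_hours=5.0, shots=None):
    """Return jobs from the lower third of the queue that fit the profile."""
    queue = [j for j in jobs if j.show == show and j.discipline == discipline]
    # Lower third by priority: ready to go but unlikely to run locally tonight.
    queue.sort(key=lambda j: j.priority)
    lower_third = queue[: len(queue) // 3]
    return [
        j for j in lower_third
        if j.task_type == task_type
        and min_hours <= j.est_hours <= max_hours
        # Optional shot filter, to favour shots whose assets are already cached.
        and (shots is None or j.shot in shots)
    ]


# Hypothetical queue: only the first job is low priority and short enough.
jobs = [
    Job("qc_preview_a", "showA", "light", "katana-prman", 1, 2.0, "sh010"),
    Job("qc_mono_b", "showA", "light", "katana-prman", 2, 8.0, "sh010"),
    Job("fx_sim_c", "showA", "light", "simulation", 3, 3.0, "sh020"),
    Job("final_d", "showA", "light", "katana-prman", 7, 4.0, "sh020"),
    Job("final_e", "showA", "light", "katana-prman", 8, 4.0, "sh030"),
    Job("final_f", "showA", "light", "katana-prman", 9, 4.0, "sh030"),
]
print([j.title for j in select_cloud_jobs(jobs, "showA")])
```

Passing a set of shots restricts the selection to shots that share assets, which is the mechanism the text credits with improving reuse of the Avere cache.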
The other set of custom tools addressed the automatic re-provisioning of failed or downed PVMs. Each day, production defined limits for how many PVM and PERM resources were required. To keep to these limits, cron-based scripts counted the running instances and re-provisioned any terminated PVMs, so that the production backlog of tasks was worked on continuously within the defined limits.
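The counting and re-provisioning step described above can be sketched as a small reconciliation function that a cron job might call. This is a hedged illustration, not MPC's script: `reprovision_count`, `reconcile` and the `start_instance` callback are hypothetical names, and the mapping of instance names to status strings is assumed to have been fetched from the cloud API beforehand. The status strings follow the Compute Engine instance life cycle, where a pre-empted instance ends up as TERMINATED.

```python
# States that still count towards the daily limit (assumption: instances
# that are starting up should not be replaced).
ACTIVE_STATES = {"PROVISIONING", "STAGING", "RUNNING"}


def reprovision_count(instance_states, daily_limit):
    """Return how many PVMs must be started to get back up to the limit."""
    active = sum(1 for state in instance_states.values()
                 if state in ACTIVE_STATES)
    return max(0, daily_limit - active)


def reconcile(instance_states, daily_limit, start_instance):
    """Call start_instance() once per missing PVM; return the number started."""
    missing = reprovision_count(instance_states, daily_limit)
    for _ in range(missing):
        start_instance()
    return missing


# Hypothetical snapshot: one PVM has been pre-empted since the last run.
states = {"pvm-001": "RUNNING", "pvm-002": "TERMINATED", "pvm-003": "STAGING"}
started = reconcile(states, daily_limit=3,
                    start_instance=lambda: print("starting replacement PVM"))
print(f"re-provisioned {started} instance(s)")
```

Keeping the count separate from the side effect means the same logic can run in a dry-run mode, and never starting more than the limit allows keeps spend within what production defined for the day.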
Conclusion and Future Work
Overall the project was considered a great success across the Technology, Operations and Production teams. At peak, MPC was able to extend its compute capacity by over 14k cores.
Some of the biggest benefits of using cloud resources were the agility in provisioning different machine profiles and cost models in short spaces of time: minutes rather than the days or weeks that are standard for leased or purchased equipment. The backlog of task types for The Jungle Book shifted daily. Monitoring these changes on a daily basis, together with constant direct feedback from production on priorities, enabled the team to make daily decisions on the mix of PVM and PERM resources while accurately forecasting the cost and predicted delivery of these tasks. Additionally, testing different machine profiles enabled much quicker turnaround times for updating local compute resources with optimised configurations.
A lot of experience was gained throughout this process. MPC is continuing to push forward in the area of cloud computing as there are substantial benefits to be gained. Several areas will require further development. The first is testing and performance: appliances such as Avere were only available late in the project, and having this appliance has the potential to allow a much wider selection of task types to be run on GCP. The second is reporting and metrics, to provide more granular reports and visibility of resource spend and cost. The third is automation and management around the provisioning of cloud resources and dynamic scaling in line with the task backlog of the local dispatcher.
Greg Butler, Anders Langlands, and Hannes Ricklefs. 2008. A Pipeline for 800+ Shots. In ACM SIGGRAPH 2008 Talks (SIGGRAPH ’08). ACM, New York, NY, USA, Article 72, 1 page. DOI:http://dx.doi.org/10.1145/1401032.1401125
Marco Romeo, Jared Auty, and Damien Fagnou. 2015. Intelligent Rendering of Dailies: Automation, Layering and Reuse of Rendered Assets. In Proceedings of the 12th European Conference on Visual Media Production (CVMP ’15). ACM, New York, NY, USA, Article 15, 2 pages. DOI:http://dx.doi.org/10.1145/2824840.2824858