
Optimizing database spend with scheduled cluster scaling

Managing spend has become a high priority for almost all businesses in the current economic climate. One of the lesser-known features of the MongoDB Atlas platform is that you can scale your database infrastructure up and down through code using the MongoDB Atlas Administration API.
In this article, we'll take a look at how we can leverage that capability to optimize our database spend.
We've open-sourced our MongoDB Atlas Scheduled Scaling Agent and will discuss how you can quickly and easily implement it to manage your clusters. Before we get into the technical how-to, though, let's look at the two financial strategies the solution presents.

1. Reducing spend

Many database workloads, especially those where load is driven by human interaction (such as website or app engagement), have strongly cyclical load patterns. In these scenarios, there may be an opportunity to automatically scale a database cluster down during off-peak times and bank the savings (roughly 30% of spend).
An example of this strategy would be a business that runs an M40 cluster to meet its peak demand but can scale down to an M30 during off-peak periods when load is reduced.

2. Optimizing spend

Equally, given the same access patterns, a business could choose not to bank the savings but to optimize its spend instead: scale down a level during off-peak times when demand is low, and repurpose those savings to finance scaling up an additional level during high-demand periods. Implementing this strategy delivers twice the database processing capacity during high-demand periods for about the same spend.
An example of this strategy would be a business that runs an M40 cluster (~$245 per week) and is comfortable with its overall spend, but finds that the M40 is excessive during off-peak times yet marginal during high-load periods. In this case, scaling down to an M30 during off-peak periods (12 hours a day on weekdays and 24 hours a day on weekends) and using those savings to finance a more powerful M50 during high-demand periods (for a new overall spend of ~$249 per week) effectively repurposes the excess off-peak capacity as additional headroom for peak periods at almost no extra cost.
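To make the arithmetic behind both strategies concrete, here is a minimal sketch of the weekly-spend calculation. The hourly rates below are placeholders for illustration only; actual Atlas pricing depends on your cloud provider, region, and the current price list.
// Placeholder hourly rates for illustration only; substitute the current
// Atlas rates for your cloud provider and region.
const HOURLY_RATE: Record<'M30' | 'M40' | 'M50', number> = {
  M30: 0.73,
  M40: 1.46,
  M50: 2.92,
};

const HOURS_PER_WEEK = 24 * 7;                       // 168
const PEAK_HOURS = 12 * 5;                           // 12 hours a day on weekdays
const OFF_PEAK_HOURS = HOURS_PER_WEEK - PEAK_HOURS;  // weekday nights plus weekends

// Baseline: an always-on M40
const baseline = HOURS_PER_WEEK * HOURLY_RATE.M40;

// Strategy 1 (reducing spend): M40 at peak, M30 off-peak
const reduced = PEAK_HOURS * HOURLY_RATE.M40 + OFF_PEAK_HOURS * HOURLY_RATE.M30;

// Strategy 2 (optimizing spend): M50 at peak, M30 off-peak
const optimized = PEAK_HOURS * HOURLY_RATE.M50 + OFF_PEAK_HOURS * HOURLY_RATE.M30;

console.log({ baseline, reduced, optimized });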
Another option, when the working sets are large, is to scale down from the M-series instances to the R-series instances (e.g., M40 to R40), which deliver an equivalent amount of RAM per instance size but with half the CPU capacity.
Understanding your specific workload requirements and matching your scaling strategy to them is essential, and it's something only you can do. That said, the Scheduled Scaling Agent will execute whatever strategy you choose without much fuss.

Implementation

The code below shows the main body of the MongoDB Atlas Scheduled Scaling Agent. It's a simple application that performs the scaling actions you, the developer, define on an internal cron schedule, and it's easily deployed on a low-spec virtual machine or, better yet, in a container.
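// NOTE: Imports of the agent's helpers and types (CronManager, ProductionConsoleLogger,
// the validators, and the MongoDBAtlas* types) are omitted here for brevity; see the
// repository for the full source.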
// Atlas Cluster Parameters
const { ATLAS_API_PRIVATE_KEY, ATLAS_API_PUBLIC_KEY } = process.env;
const ATLAS_PROVIDER: MongoDBAtlasProvider = 'AWS';
const ATLAS_PROJECT_ID = 'Your Atlas Project ID';
const ATLAS_CLUSTER_NAME = 'Your Cluster Name';

// Scale up scheduling & configuration
const SCALE_UP_DAYS: CrontabDays = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri'];
const SCALE_UP_HOUR: string = '7';
const SCALE_UP_MINUTE: string = '0';
const SCALE_UP_INSTANCE_SIZE: MongoDBAtlasInstanceSize = 'M40';

// Scale down scheduling & configuration
const SCALE_DOWN_DAYS: CrontabDays = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri'];
const SCALE_DOWN_HOUR: string = '19';
const SCALE_DOWN_MINUTE: string = '0';
const SCALE_DOWN_INSTANCE_SIZE: MongoDBAtlasInstanceSize = 'M30';

const TIMEZONE = 'Pacific/Auckland';

const scaleUpCronExpression = `${SCALE_UP_MINUTE} ${SCALE_UP_HOUR} * * ${SCALE_UP_DAYS.join(',')}`;
const scaleDownCronExpression = `${SCALE_DOWN_MINUTE} ${SCALE_DOWN_HOUR} * * ${SCALE_DOWN_DAYS.join(',')}`;

ValidateApiKeys(ATLAS_API_PRIVATE_KEY, ATLAS_API_PUBLIC_KEY);
ValidateCronExpression(scaleUpCronExpression, 'scale-up');
ValidateCronExpression(scaleDownCronExpression, 'scale-down');

const scaleUpTask = CronManager.RegisterScaleUpClusterCronjob(scaleUpCronExpression, TIMEZONE, {
  apikey: { private: ATLAS_API_PRIVATE_KEY, public: ATLAS_API_PUBLIC_KEY },
  logger: ProductionConsoleLogger,
  projectId: ATLAS_PROJECT_ID,
  clusterName: ATLAS_CLUSTER_NAME,
  provider: ATLAS_PROVIDER,
  instanceSize: SCALE_UP_INSTANCE_SIZE,
});

const scaleDownTask = CronManager.RegisterScaleDownClusterCronjob(scaleDownCronExpression, TIMEZONE, {
  apikey: { private: ATLAS_API_PRIVATE_KEY, public: ATLAS_API_PUBLIC_KEY },
  logger: ProductionConsoleLogger,
  projectId: ATLAS_PROJECT_ID,
  clusterName: ATLAS_CLUSTER_NAME,
  provider: ATLAS_PROVIDER,
  instanceSize: SCALE_DOWN_INSTANCE_SIZE,
});

ProductionConsoleLogger.Write(
  LoggerMessageType.Info,
  'The Cloudize MongoDB Atlas Scheduled Scaling Agent has started and is running',
);

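// Stop both scheduled tasks cleanly when the agent is asked to shut down.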
RegisterShutdownHandler(ProductionConsoleLogger, async () => {
  scaleUpTask.stop();
  scaleDownTask.stop();
});
The first step in implementing the solution is to fork the repo, as you will be modifying a few settings that govern when the application performs scale-ups and scale-downs and which instance sizes it scales to. We'll deal with those settings here.

Authentication

The MongoDB Atlas Administration API uses API keys for access control. Each key consists of a public key and a private key. We advise granting the minimum permissions necessary to the API keys you generate (through the MongoDB Atlas web console) and restricting them to the IP blocks that represent your infrastructure, specifically where your Scheduled Scaling Agent will be running. The application expects your API keys to be provided via the following two environment variables within the container or VM where the Scheduled Scaling Agent will run:
ATLAS_API_PRIVATE_KEY // Your MongoDB Atlas private key 
ATLAS_API_PUBLIC_KEY  // Your MongoDB Atlas public key
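The agent validates these at startup via ValidateApiKeys. As a rough illustration of the kind of guard this implies (a minimal sketch, not the agent's actual implementation), you would fail fast if either variable is missing, rather than discovering the problem at the first scheduled scaling event:
// Illustrative sketch only; in the agent itself, ValidateApiKeys plays this role.
const { ATLAS_API_PRIVATE_KEY, ATLAS_API_PUBLIC_KEY } = process.env;

if (!ATLAS_API_PRIVATE_KEY || !ATLAS_API_PUBLIC_KEY) {
  // Refuse to start so that a misconfigured deployment is caught immediately.
  throw new Error('ATLAS_API_PRIVATE_KEY and ATLAS_API_PUBLIC_KEY must be set in the environment');
}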

Cluster Related Settings

In addition to providing the API keys, you will need to configure a few constants related to your Atlas cluster, as well as the scale-up and scale-down rules. We've tried to make them as simple and self-explanatory as possible. Specifically, these are the settings you should configure to match your environment:
// Atlas Cluster Parameters
const { ATLAS_API_PRIVATE_KEY, ATLAS_API_PUBLIC_KEY } = process.env;
const ATLAS_PROVIDER: MongoDBAtlasProvider = 'AWS';
const ATLAS_PROJECT_ID = 'Your Atlas Project ID';
const ATLAS_CLUSTER_NAME = 'Your Cluster Name';

// Scale up scheduling & configuration
const SCALE_UP_DAYS: CrontabDays = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri'];
const SCALE_UP_HOUR: string = '7';
const SCALE_UP_MINUTE: string = '0';
const SCALE_UP_INSTANCE_SIZE: MongoDBAtlasInstanceSize = 'M40';

// Scale down scheduling & configuration
const SCALE_DOWN_DAYS: CrontabDays = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri'];
const SCALE_DOWN_HOUR: string = '19';
const SCALE_DOWN_MINUTE: string = '0';
const SCALE_DOWN_INSTANCE_SIZE: MongoDBAtlasInstanceSize = 'M30';

const TIMEZONE = 'Pacific/Auckland';
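For reference, with the values above, the two cron expressions built in the agent's main body (fields: minute, hour, day of month, month, day of week) resolve to:
// scale-up:   0 7 * * Mon,Tue,Wed,Thu,Fri    (07:00 Pacific/Auckland on weekdays)
// scale-down: 0 19 * * Mon,Tue,Wed,Thu,Fri   (19:00 Pacific/Auckland on weekdays)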

Test and Monitor

The only thing left to do is run some tests and monitor the application's console output. Note that the log timestamps are in UTC, while the schedule fires in the configured timezone (Pacific/Auckland in this example).
2023-03-12T16:23:56.831Z INFO    The Cloudize MongoDB Atlas Scheduled Scaling Agent has started and is running
2023-03-12T18:00:00.463Z INFO    Scaling up the Testing cluster to M40
2023-03-12T18:00:02.389Z INFO    The update to the Testing cluster has been accepted and is in progress.
2023-03-13T06:00:00.625Z INFO    Scaling down the Testing cluster to M30
2023-03-13T06:00:02.459Z INFO    The update to the Testing cluster has been accepted and is in progress.

Final Comments and Considerations

  • All pricing quoted in this article is accurate as of the publication date and is for deployments within AWS in the Sydney region.
  • This strategy is not appropriate for scaling NVMe-based instances (due to the significant time and network costs involved in scaling them).
  • Lastly, bear in mind that every scale-up and scale-down operation effectively flushes the entire working set, so allow your database enough time to rehydrate its cache before the bulk of the load arrives. We recommend scheduling scale-ups at least two hours before you expect peak load.

If you have an idea that needs to go to the cloud, give us a call. We'd love to discuss it and how we can help.