I joined Renewalytics as a founding engineer in mid-2024. The product is a forecasting and energy trading platform for renewable plants across India. When I walked in, the whole thing was one monolithic Node.js app on MongoDB, running on a single DigitalOcean droplet. It worked, until it didn't.
This is what I learned cleaning that up and scaling it to serve 3,400+ MW of capacity.
Reflux, our forecasting engine, is the heart of the platform. We pull weather data, plant configs, and historical generation, then predict how much power a solar or wind plant will generate over the next 48 hours. That prediction goes straight into trading decisions. Over-predict and you pay penalties to the grid. Under-predict and you leave money on the table. There's no neutral mistake.
The pipeline is straightforward on paper: ingest weather APIs, normalize the data, run the models, push predictions to the trading dashboard. The reality is messier, because every step has failure modes. We target 95%+ accuracy on day-ahead forecasts and most of the engineering effort is just defending that number.
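How that accuracy number is computed matters as much as the target. A minimal sketch, assuming accuracy is defined as 100% minus the mean absolute error normalized by the plant's available capacity. That definition and the names `capacityNormalizedAccuracy`, `forecastMw`, `actualMw`, and `availableCapacityMw` are illustrative assumptions, not necessarily the exact metric Reflux uses.

```typescript
// Assumed metric: accuracy % = 100 - mean(|actual - forecast|) / available capacity * 100.
// All names here are hypothetical and chosen for illustration.
function capacityNormalizedAccuracy(
  forecastMw: number[],
  actualMw: number[],
  availableCapacityMw: number
): number {
  if (forecastMw.length === 0 || forecastMw.length !== actualMw.length) {
    throw new Error("forecast and actual series must be non-empty and the same length");
  }
  if (availableCapacityMw <= 0) {
    throw new Error("available capacity must be positive");
  }

  // Sum the absolute error over every time block.
  let totalAbsErrorMw = 0;
  for (let i = 0; i < forecastMw.length; i++) {
    totalAbsErrorMw += Math.abs(actualMw[i] - forecastMw[i]);
  }

  // Normalize by capacity rather than by actual generation, so blocks with
  // near-zero output (night, no wind) do not blow up the percentage.
  const meanAbsErrorPct = (totalAbsErrorMw * 100) / (forecastMw.length * availableCapacityMw);
  return 100 - meanAbsErrorPct;
}

// Four time blocks for a 100 MW plant: absolute errors of 5, 0, 10, and 5 MW.
const accuracyPct = capacityNormalizedAccuracy([50, 60, 40, 30], [45, 60, 50, 25], 100);
// accuracyPct is 95
```

Normalizing by capacity is a design choice: it keeps the metric stable overnight and in calm periods, but it also hides large relative misses at low output, so it is worth checking which convention your penalties are actually calculated against.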
Plant operators live in the portal. They see their forecasts, manage scheduling, send revisions (CTU/STU schedules to SLDC), and pull reports. I rebuilt the old React dashboard as a Next.js app in TypeScript, mostly because the old one was slow and operators were copy-pasting numbers into Excel to work around it.
I used to think the hard part was the ML models. It's not. The hard part is that the weather APIs you depend on have gaps, delays, and quiet outages. Half my "model accuracy" work has been fallback chains.

    // Walk the fallback chain: primary API, then secondary API, then interpolation from history.
    async function getWeatherData(location: Location, timestamp: Date) {
      // A failed call resolves to null so the chain can continue.
      const primary = await primaryAPI.get(location, timestamp).catch(() => null);
      if (primary && primary.confidence > 0.9) return { data: primary, source: "primary" };

      // Primary missing or low-confidence: use the secondary API and mark the result degraded.
      const secondary = await secondaryAPI.get(location, timestamp).catch(() => null);
      if (secondary) return { data: secondary, source: "secondary", degraded: true };

      // Last resort: interpolate from historical data.
      return interpolateFromHistory(location, timestamp);
    }

Every fallback step costs you 2-5% accuracy. So we track which one fired and surface that to the traders. They'd rather know the forecast is degraded than trust a number that's silently wrong.
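One way to surface that to traders is to summarize, per forecast window, which source answered each weather lookup. A minimal sketch, assuming a hypothetical `"history"` tag for the interpolation path (the function above does not show one). The names `WeatherSource`, `SourceSummary`, and `summarizeSources` are made up for illustration.

```typescript
// Hypothetical source tags; "history" stands in for the interpolation fallback.
type WeatherSource = "primary" | "secondary" | "history";

interface SourceSummary {
  counts: Record<WeatherSource, number>;
  degradedShare: number; // fraction of lookups not served by the primary API
}

function summarizeSources(sources: WeatherSource[]): SourceSummary {
  const counts: Record<WeatherSource, number> = { primary: 0, secondary: 0, history: 0 };
  for (const source of sources) {
    counts[source] += 1;
  }

  const degraded = counts.secondary + counts.history;
  return {
    counts,
    // Avoid dividing by zero when there were no lookups at all.
    degradedShare: sources.length === 0 ? 0 : degraded / sources.length,
  };
}

// Four lookups in a window, two of which fell back.
const summary = summarizeSources(["primary", "primary", "secondary", "history"]);
// summary.degradedShare is 0.5
```

A single share per window is deliberately coarse. It is enough to put a "degraded" badge next to a forecast without asking traders to read per-timestamp logs.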
In early 2025, a CVE dropped on a package three levels deep in our dependency tree. We had the patch out in under 4 hours, before anything tried to exploit it.
That wasn't luck. It was automated dependency scanning in CI plus a written incident playbook from a previous near-miss. The post-mortem for that one became the template for the next.
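The scanning step can be as small as a gate on the audit report. A minimal sketch, assuming the scanner is `npm audit --json` (the actual tool is not named here) and relying only on the `metadata.vulnerabilities` severity counts in its output. The helper names `countAtOrAbove` and `shouldFailBuild` are hypothetical.

```typescript
// Severity levels as reported by npm audit, lowest to highest.
type Severity = "info" | "low" | "moderate" | "high" | "critical";
const SEVERITY_ORDER: Severity[] = ["info", "low", "moderate", "high", "critical"];

// Only the part of the `npm audit --json` report this sketch reads.
interface AuditReport {
  metadata: { vulnerabilities: Record<Severity, number> };
}

// Count findings at the given severity or worse.
function countAtOrAbove(report: AuditReport, threshold: Severity): number {
  const start = SEVERITY_ORDER.indexOf(threshold);
  return SEVERITY_ORDER.slice(start).reduce(
    (sum, level) => sum + (report.metadata.vulnerabilities[level] ?? 0),
    0
  );
}

// CI would fail the build when this returns true.
function shouldFailBuild(report: AuditReport, threshold: Severity = "high"): boolean {
  return countAtOrAbove(report, threshold) > 0;
}

// Example report: two moderate findings and one high finding.
const exampleReport: AuditReport = {
  metadata: { vulnerabilities: { info: 0, low: 0, moderate: 2, high: 1, critical: 0 } },
};
const blocked = shouldFailBuild(exampleReport); // true, because of the one "high" finding
```

For the simple case, `npm audit --audit-level=high` gives the same pass/fail behavior through its exit code. Parsing the JSON is only worth it if you want to post the counts somewhere or apply different thresholds per service.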
The single-droplet setup broke once we crossed maybe 15 plants. I moved everything to Docker behind Nginx, split out the API, the forecasting engine, and the portal into separate services, and offloaded reads to replicas.
I did the migration over three weekends with zero downtime by running both stacks in parallel and shifting traffic with DNS weights. The first weekend was terrifying. By the third it was boring, which is exactly how it should feel.
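The traffic shift itself is a small state machine: raise the new stack's weight one step at a time, and drop it to zero if health checks fail. A minimal sketch under assumptions: the 0/10/50/100 ladder, the `DnsWeights` shape, and the function names are all hypothetical, and how the weights are applied depends on the DNS provider.

```typescript
// Hypothetical rollout ladder: percentage of traffic sent to the new stack.
const ROLLOUT_STEPS = [0, 10, 50, 100] as const;

interface DnsWeights {
  oldStack: number;
  newStack: number;
}

// Translate a target share into a pair of relative DNS record weights.
function weightsFor(newSharePct: number): DnsWeights {
  if (newSharePct < 0 || newSharePct > 100) {
    throw new RangeError("share must be between 0 and 100");
  }
  return { oldStack: 100 - newSharePct, newStack: newSharePct };
}

// Decide the next share: advance one step if healthy, otherwise roll back fully.
function nextShare(currentPct: number, newStackHealthy: boolean): number {
  if (!newStackHealthy) return 0;
  const idx = ROLLOUT_STEPS.findIndex((step) => step > currentPct);
  return idx === -1 ? 100 : ROLLOUT_STEPS[idx];
}

// Starting from a 10% share with a healthy new stack, the next step is 50%.
const next = nextShare(10, true);   // 50
const weights = weightsFor(next);   // { oldStack: 50, newStack: 50 }
```

DNS weighting is approximate: resolvers cache answers for the record's TTL, so the real split lags the configured one, and a rollback only takes full effect after cached answers expire. Keeping TTLs short during the migration window makes each step, and each rollback, land faster.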
Reliability beats features, every time. Operators don't care about your beautiful chart if yesterday's forecast was 10 minutes late. Fix the boring things first.
If a human has to do it, it will eventually be forgotten. Backups, SSL renewal, log rotation, dependency updates. Automate it or it will bite you the one week you're on leave.
Write the post-mortem even when nothing bad happened. The near-miss docs have saved me more than the real incident reports.
Domain knowledge compounds faster than you expect. After ~18 months I can talk to plant operators about inverter efficiency and curtailment patterns without translating. That changes what software I build, not just how I build it.