Job Summary
We are seeking a Mid-Level Monitoring Specialist to own and evolve our monitoring practice. In this dedicated monitoring role, you will configure and manage our Checkmk monitoring platform across multiple customer environments, ensuring proactive issue detection and delivering customer-centric reporting on system health. You will work closely with clients and internal teams to anticipate problems before they impact users and to communicate insights clearly.
Responsibilities
- Configure and Manage Checkmk: Install, configure, and maintain the Checkmk monitoring system for both our internal infrastructure and client environments. This includes deploying agents on customer servers, setting up monitoring plugins for various applications, and ensuring all systems are correctly enrolled in monitoring.
- Monitor Systems Proactively: Continuously watch dashboards and alerts to detect issues early. Tune thresholds and alert settings so that we catch performance degradations or failures in their infancy, allowing for rapid response before they escalate into customer-impacting incidents.
- Thresholds and Notifications: Define and adjust alert thresholds for metrics (CPU, memory, disk, network, etc.) based on each application's normal behaviour. Configure notifications (email, SMS, etc.) so the right on-call engineers and customer contacts are alerted with appropriate urgency. Regularly test and refine the notification processes to avoid alert fatigue while not missing critical events.
- Application & Log Monitoring: Extend monitoring beyond basic server metrics to include application-level performance (e.g., databases, web services, APIs) and log monitoring. Use Checkmk (and related tools) to track application-specific KPIs and to scan logs for errors or anomalies. Ensure that important application events and log warnings trigger alerts or are included in reports for review.
- Incident Investigation & Resolution: Investigate alerts and anomalies to identify root causes. Work alongside system administrators or cloud engineers to resolve underlying issues (e.g., restarting services, adjusting resource allocations, patching software). If an alert indicates a broader problem, coordinate with the support team to ensure timely resolution and document the findings.
- Client Consulting & Support: Act as a monitoring subject-matter expert in client engagements. During customer onboarding, gather monitoring requirements and configure Checkmk to meet each client’s needs. Provide professional services by advising clients on best practices for monitoring their specific applications and workloads. Be responsive to client inquiries related to monitoring and assist with any custom checks or reporting they request.
- Customer-Centric Reporting: Prepare and deliver regular monitoring reports for clients (weekly, monthly, or as needed). These reports should highlight key metrics, uptime/downtime statistics, trend analyses, and any notable incidents or near-misses. Translate technical data into clear insights and actionable recommendations that non-technical stakeholders can understand. During review meetings or calls, clearly communicate how the infrastructure is performing and where improvements could be made.
- Documentation & Knowledge Sharing: Document monitoring configurations, custom plugins, and standard operating procedures for common issues. Maintain an up-to-date knowledge base or runbook for the monitoring environment. Share knowledge with internal team members, conducting brief training on Checkmk usage or alert handling as needed, to ensure our 24/7 support team can interact with the monitoring system effectively.
- Continuous Improvement: Continually enhance our monitoring capabilities. Evaluate new features in Checkmk, consider integration with other monitoring or AIOps tools, and contribute ideas to improve efficiency (for example, scripting repetitive tasks or automating agent deployment). Help shape our monitoring roadmap and influence how we deliver high-value managed monitoring services to clients.
Requirements
- Experience: 3+ years of experience in IT infrastructure monitoring, systems administration, or a related role. A solid background in managing servers and understanding system health indicators is essential for this mid-level position.
- Monitoring Tools: Hands-on experience with monitoring platforms like Checkmk is highly preferred. Exposure to similar tools such as Nagios, Zabbix, Icinga, or Prometheus/Grafana is acceptable if you can quickly learn Checkmk. You should understand how to configure checks, set thresholds, and manage alert rules in at least one monitoring system.
- System Administration Skills: Strong sysadmin skills on Linux (and preferably Windows) systems. You are comfortable installing software agents, editing configuration files, managing services, and using the command line. Familiarity with managing plugins or extensions for monitoring various applications (databases, web servers, etc.) is important. Basic networking knowledge (TCP/IP, ping, firewall rules) is needed to troubleshoot connectivity between monitored nodes and the monitoring server.
- Analytical & Troubleshooting Ability: Proven skills in diagnosing system issues. When an alert comes in, you can interpret what the metric means, correlate it with other data (logs, recent changes, etc.), and pinpoint potential causes. You have a methodical approach to problem-solving and can handle incidents under time pressure, restoring services quickly when failures occur.
- Attention to Detail: Ability to fine-tune monitoring settings and notice patterns. For example, you might notice that a service regularly spikes at a particular time, then adjust its thresholds or investigate the cause. You ensure monitoring coverage is complete (no important system left unmonitored) and that false positives are minimized.
- Communication Skills: Excellent written and verbal communication skills. You must be able to write clear reports for customers and explain technical issues and solutions in plain language. Within the team, you document your work and can guide others on how to interpret monitoring data or handle alerts. Customer-facing experience is a big plus, as this role involves explaining reports and possibly teaching clients how to view their dashboard.
- Customer Focus: A customer-centric mindset. Our monitoring service is a product we deliver to clients, so understanding their business needs and tailoring the monitoring to meet those needs is crucial. You take pride in keeping clients informed and happy by ensuring their systems are stable and by being proactive in preventing issues.
- Education: A bachelor's degree in Computer Science, Information Systems, or a related field, or equivalent experience, is preferred. Relevant certifications (systems administration or cloud) can substitute for formal education.
Nice-to-Have Skills
- Scripting & Automation: Ability to write basic scripts in Python, Bash, or PowerShell to automate monitoring tasks. This might include automating agent deployments, writing custom Checkmk local checks or plugins, or parsing log files. While not mandatory, scripting skills can greatly enhance efficiency in this role.
- Cloud Platform Knowledge: Familiarity with public cloud environments (AWS, Azure, or Google Cloud). Since we operate a managed public cloud, knowing the services and metrics of cloud platforms (like CloudWatch, Azure Monitor) can help integrate cloud resource monitoring with Checkmk and understand our clients’ cloud infrastructure better.
- Log Management Tools: Experience with log analysis or SIEM tools (e.g., ELK/Elastic Stack, Splunk, Graylog). This experience can complement Checkmk's capabilities, especially if you have used log monitoring or alerting in those systems. It’s a plus for handling the log monitoring aspect of the role.
- Monitoring Certifications: Any certification or formal training in monitoring or IT operations (for example, a Checkmk certification, if available, or Nagios Certified Professional) is a bonus. Likewise, Linux or cloud certifications (such as RHCSA or AWS SysOps) can be advantageous.
- DevOps and IaC: Exposure to DevOps practices and Infrastructure-as-Code tools (Ansible, Terraform, etc.) is nice to have. Our environment might leverage automation for setting up monitoring across many servers, so understanding these concepts can be beneficial for scaling our monitoring solution.
- Client-Facing Experience: Prior experience in a consulting, professional services, or MSP (Managed Service Provider) context. If you have worked directly with external clients to deliver IT solutions, you’ll fit right in with our customer-focused culture.
Benefits
- Impactful Role: Play a key role in a small, agile company. As our dedicated Monitoring Specialist, you will have a high degree of ownership and your work will significantly influence our service quality. Your efforts directly translate into improved uptime and satisfaction for our customers.
- Leadership & Growth: Report directly to the CTO and collaborate with senior leadership on important decisions. This close mentorship and high visibility can accelerate your professional growth. You’ll have the chance to lead the monitoring function and potentially grow into a lead or architect role as the company expands.
- Diverse Tech Exposure: Work with a variety of technologies and client environments. Each customer might use different stacks (Linux/Windows, various applications, cloud services), giving you broad exposure and continuous learning opportunities. If you enjoy learning new systems and tools, this role offers that diversity.
- Training & Development: We invest in our team. Expect opportunities for training such as attending Checkmk workshops, obtaining relevant certifications, or participating in industry conferences. We encourage continuous learning to keep your skills sharp and up to date.
- Collaborative Culture: Join a close-knit team where your voice is heard. We foster an environment of knowledge sharing and teamwork. Even though you’ll own the monitoring area, you’ll work closely with our support engineers, cloud architects, and developers. Everyone pitches in to solve problems together.
- Competitive Package: We offer a competitive salary appropriate for a mid-level position, along with benefits such as health insurance, paid time off, and a flexible work schedule. We understand the importance of work-life balance and offer on-call rotations that are manageable, with comp time when you’ve had to respond to after-hours issues.
- Customer Engagement: Unique to this role, you will get to interact with our clients regularly and see first-hand how your work benefits them. There’s a great sense of accomplishment in providing a service that helps keep businesses running smoothly, and our customers often recognize and appreciate the monitoring insights you deliver.
If you are passionate about ensuring systems run flawlessly and enjoy the mix of hands-on technical work with customer-facing responsibilities, we'd love to hear from you. As our Monitoring Specialist, you will be the guardian of our clients’ IT health, using cutting-edge monitoring tools to deliver peace of mind in a proactive and consultative way.
Join us to take our monitoring services to the next level!