Try these strategies to satisfy auditors' requirements without wasting your time or testing your patience.
The beauty of building extra-large Linux clusters is that it's easy. Hadoop, OpenStack, hypervisor, and high-performance computing (HPC) installers enable you to build on commodity hardware and deal with node failure reasonably simply. Learning and managing Linux on a small scale involves basic day-to-day tasks; however, when you plan and scale production clusters to several thousand nodes, the work can take over your life, including your weekends and holidays.
Specific requirements for encrypting people-related data in transit and at rest have been heavily discussed elsewhere, so I won't be covering them here. Rather, we'll focus on preparations to keep an audit off the backs of your Linux admin team.
1. Fundamentals: Connecting your cluster to the world
It's tempting to build a cluster on a standalone network with admin access on a second corporate LAN interface. Like Oracle databases in the past, Hadoop and HPC clusters tend to execute all running tasks in a cluster with a single user identification (UID) account (e.g., "hadoop").
Auditors need to prove not only how personal data is stored, but also how it is manipulated, aggregated, or anonymized, and that includes who can create, change, or log in to these application-specific accounts. That puts you and your admin team in the spotlight.
2. Don't let software installers create accounts or Linux groups
Use your favorite configuration manager or identity manager to create the needed accounts on each cluster node (or in the directory) first. If the "hadoop" account and group already exist, the cluster software installer will use those instead of creating its own. There are several reasons we want this behavior, as outlined in the next three steps.
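As a rough illustration of creating accounts before the installer runs, the Python sketch below plans the `groupadd` and `useradd` commands for an application account only when the account or group is missing. The function names, the account name `hadoop`, and the ID 1500 are assumptions chosen for the example, not values any installer requires. In practice, your configuration manager would perform this step for you.

```python
import grp
import pwd


def plan_account_commands(user, uid, group, gid, user_exists, group_exists):
    """Return the commands needed to pre-create an application account.

    user_exists and group_exists are callables, so the planner can be
    exercised without touching the real account databases.
    """
    commands = []
    if not group_exists(group):
        commands.append(["groupadd", "--gid", str(gid), group])
    if not user_exists(user):
        commands.append(["useradd", "--uid", str(uid), "--gid", group, user])
    return commands


def system_user_exists(name):
    """True if the local account database already knows this user."""
    try:
        pwd.getpwnam(name)
        return True
    except KeyError:
        return False


def system_group_exists(name):
    """True if the local group database already knows this group."""
    try:
        grp.getgrnam(name)
        return True
    except KeyError:
        return False


if __name__ == "__main__":
    # Hypothetical values: pick IDs that fit your organization's scheme.
    planned = plan_account_commands(
        "hadoop", 1500, "hadoop", 1500, system_user_exists, system_group_exists
    )
    for command in planned:
        print(" ".join(command))
```

The sketch only prints the commands rather than running them, so the plan can be reviewed (and recorded) before anything changes on the node.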
3. Maintain UID/GID consistency everywhere
For traceability later, ensure your organization has a consistent UID/group identification (GID) strategy—a way individuals and groups can be identified within the system. For your cluster's software, the unique application UIDs and GIDs need to fit into that matrix across the organization's infrastructure, not just in your cluster.
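One way to check this kind of consistency is to compare the account entries from each node. The Python sketch below assumes you have already collected `/etc/passwd`-format text from every node by some means (the collection step, the node names, and the sample IDs are hypothetical) and reports any account name that maps to different UID/GID pairs on different nodes.

```python
from collections import defaultdict


def parse_passwd(text):
    """Map account name -> (uid, gid) from /etc/passwd-format text."""
    accounts = {}
    for line in text.splitlines():
        fields = line.strip().split(":")
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        try:
            accounts[fields[0]] = (int(fields[2]), int(fields[3]))
        except ValueError:
            continue  # skip entries without numeric IDs
    return accounts


def find_id_conflicts(passwd_by_node):
    """Return {account: {(uid, gid): [nodes]}} for accounts whose IDs differ."""
    seen = defaultdict(lambda: defaultdict(list))
    for node, text in passwd_by_node.items():
        for name, ids in parse_passwd(text).items():
            seen[name][ids].append(node)
    return {
        name: {ids: sorted(nodes) for ids, nodes in variants.items()}
        for name, variants in seen.items()
        if len(variants) > 1
    }


if __name__ == "__main__":
    # Hypothetical sample data standing in for files gathered from two nodes.
    sample = {
        "node01": "root:x:0:0:root:/root:/bin/bash\n"
                  "hadoop:x:1500:1500::/home/hadoop:/bin/bash\n",
        "node02": "root:x:0:0:root:/root:/bin/bash\n"
                  "hadoop:x:1001:1001::/home/hadoop:/bin/bash\n",
    }
    for account, variants in find_id_conflicts(sample).items():
        print(account, variants)
```

This only catches one account name with several IDs. A fuller check would also flag one UID shared by different names, and would compare against the rest of the organization's infrastructure, not just the cluster.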
4. Sudo, not Sudon't
If you are manually distributing sudoers files into your cluster or managing a site-specific scripting environment, it will be up to you and your team to prove exactly what state the sudoers file on cluster node 47 was in at a given moment several weeks ago. That is a headache we can all do without.
For self-protection, your team needs a strategy for managing sudoers files centrally and under version control. You can achieve this by using a tool like Ansible during node OS setup or by versioning machine images for auto-deployment.
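To complement central management, a simple drift check can confirm that each node still carries the approved file. The Python sketch below is a minimal example under stated assumptions: the approved sudoers content is the copy checked in to version control, and the per-node content has already been gathered by some other means. The function names, node names, and sudoers rules are hypothetical.

```python
import hashlib


def sha256_of(content):
    """Return the hex SHA-256 digest of a bytes object."""
    return hashlib.sha256(content).hexdigest()


def find_drifted_nodes(approved_content, content_by_node):
    """Return sorted node names whose sudoers content differs from the approved copy."""
    approved_digest = sha256_of(approved_content)
    return sorted(
        node
        for node, content in content_by_node.items()
        if sha256_of(content) != approved_digest
    )


if __name__ == "__main__":
    # Hypothetical approved file, as stored in version control.
    approved = b"root ALL=(ALL) ALL\n%hadoop-admins ALL=(hadoop) ALL\n"
    collected = {
        "node46": approved,
        # Hypothetical manual change made directly on one node.
        "node47": approved + b"tempuser ALL=(ALL) NOPASSWD: ALL\n",
    }
    print(find_drifted_nodes(approved, collected))
```

Logging the digest of each node's file alongside the commit ID of the approved version gives you a record to show an auditor later, rather than relying on anyone's memory of what node 47 looked like.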
This post by David Dingwall of Fox Technologies was originally published on Opensource.com. To read the last three tips, visit Opensource.com.