Minutes of the SIPB Meeting of 2010-01-30


The meeting was called to order at 2:05:24 by broder.

In attendance were
        Voting members: broder, ezyang, fawkes, gdb, jesstess, kaduk
                        quentin | pweaver, geofft, ccpost
        Associate members: arolfe, jhawk, kcr | mitchb
        Prospectives: dhanus
        Guests: '()

Discussion:
        broder: We have this machine room in W20 and we are acquiring some
                machine room space in W91. This meeting is about figuring
                out how that will work and how we can work with OIS to
                get this transition done.

                I have written on the board a list of SIPB services/machines
                in the machine room. I would like to keep this discussion
                as high level as possible. Technical details will be considered
                later.

                Board contents:
                XVM Prod                                RTFM
                XVM Dev                                 AFS
                Scripts/SQL                             SIPB Tor
                news                                    SIPB NOC
                SIPB VoIP                               LAMP
                charon                                  iBuild
                debuild
                stuff/lost-contact
                SIPBv6
                multics

                For starters, we will ignore LAMP since we handled that
                in email.
        jhawk: Would it be useful to categorize public-facing services and
               internal services?
        broder: Maybe.
        quentin: Perhaps it would be better to represent "high-availability"
                 and less-availability.

        broder: Let's actually just star things that we want to move:
                - XVM Prod
                - Scripts prod, half-starred
                - stuff
                - AFS

        broder: I would like to see charon move.
        jhawk: Why?
        broder: Nobody ever touches it, so it is not particularly important to
                keep it in W20.
        jhawk: It has pending work.
        quentin: If it doesn't get done before the move, I presume it can be
                 done as part of the move.
        jhawk: Alright, half-star it.
        broder: multics, for the same reason.
        jhawk: but ... uptime?
        [discussed.]

        broder: -IPv6 has to stay because it lives on W20 and provides service
                 to W20.
                -debuild is going away. crossed-off
                -iBuild is nowhere near production, so it should stay.
                -SIPB VoIP can move to XVM, maybe?
                -SIPB NOC: I presume you'd like something to the effect of
                          one server in W91, one in W20.
        quentin: Yes.

        kcr: I am skeptical of the desirability of keeping LAMP in W20. It is
             big and heavy.
        quentin: But it's not big and heavy from a power perspective.
        kcr: I am skeptical of the extent to which W20 will remain available,
             and LAMP probably wants to be there.
        broder: Could LAMP's webservice move to Scripts?
        quentin: Maybe.
        jhawk: Isn't LAMP's uptime higher than scripts?
        quentin: Not really.
        geofft: I have to question, with the increasing transition to digital
                TV whether LAMP will remain legal if analog disappears.
        quentin: It will not...
        [distinction made that there are concerns about both digital cable
         and IPTV]
        jhawk: There are not any plans to remove the analog infrastructure.
        quentin: Yes, there are good reasons not to move LAMP to W91. It is
                 large physically, but not power-wise. It is basically
                 immovably mounted to its racks.
        kcr: Staking out more space in W91 is an interesting consequence of
             moving LAMP.
        quentin: The CD changers are like glued into the racks.
        kcr: I'm aware of that, but the physical racks are moveable.
        quentin: I am pretty sure it is standard policy that things in W91
                 need to be on their racks.
        broder: I am fine with LAMP's changers remaining in W20 as long as the
                webservice moves to scripts.
        quentin: It has some special Apache configuration that scripts doesn't
                 support.


        quentin: There is some extent to which it is reasonable to say that
                 VoIP needs physical hardware for timing in the same way that
                 IPv6 does.
        geofft: It's possible that we could have a single machine running
                SIPBv6 and SIPB VoIP.

        quentin: News is moving to W91....
        broder: Onto the isilon ... which we don't have a mark for.
        quentin: Just give it a star.

        broder: I count approximately 30u of servers moving to W91. I'm
                intentionally counting on the high side.
        geofft: Can you mark these estimates on the board?
        broder: Sure.

        [general counting of services' physical sizes, see board contents
        at bottom of minutes]

        quentin: This might be able to fit in our current rack.
        jhawk: There is obviously desire for space for expansion.

        [discussion of physical size of services staying in W20 and further
         networking complications]
        quentin: It's important to note that magic network tricks can
                 include fiber-channel.

        [discussion of uses of the RAID: SCSI disks are expensive]
        arolfe: There are some 4-year old 1u servers my CSAIL group may
                be getting rid of, that take SATA drives.
        [argument about what machines might need new hardware / disks]
        broder: It doesn't really sound like we have any needs we don't
                have the resources to meet.
        geofft: Are we considering future services?
        broder: Sure.
        geofft: There has been discussion of at some point doing debuild for
                the world.
        quentin: Could this be a good use of the RAID?

                Another use that would be cool is to tie that to a Sun machine
                exporting ZFS. This would be a good way to make ZFS available.
        geofft: I would really like to see the RAID being used to provide
                space to SIPB services.
        broder: Does the RAID want to move to W91? If we want to keep talking
                about putting new services onto it that are sounding
                production-like, we want to consider moving it.
        quentin: Yes, but they are not going to start out that way and they
                 mostly don't exist yet. For development it makes sense to keep
                 it in W20.
        broder: OIS was very clear that they would not give us any upgrade to
                our network capacity in W20.
        quentin: So they WON'T give us any of the dark fiber already running
                 from W20 to W91? That's not an upgrade, it's already there.
        broder: We will talk to them about this.

        broder: Looking at the list of services, it's pretty clear that scripts
                is going to have to play stupid networking games, XVM is going
                to have to play slightly less stupid games.
        quentin: I would kind of like to move XVM without bringing any of the
                VMs down.
        broder: You realize that you will be the one sitting up all night to
                deal with that?
        quentin: Yes.

        mitchb: What the heck is the time scale here?
        broder: We can start moving in roughly 3 months or so.
        quentin: Scripts could move much sooner than that if networking tricks
                were in place.
        mitchb: As I mentioned at the previous meeting, I don't want to get
                in the trap of we've already moved completely and they change
                their mind and we can't ever get physical access.
        quentin: I think that we are only looking at moving things that don't
                 require unscheduled off-hours access. In general the only
                 thing that is likely to cause the need for that is a disk
                 failure.
        mitchb: The problem is that I don't want to have to run to W91 and
                then stand there for 2 hours waiting for them to let us in
                when there is an emergency.
        broder: Remember that we are looking at getting remote consoles too.
        mitchb: That doesn't help when the system is completely locked up
                and needs a hard reboot.
        quentin: We have pretty good redundancy these days. There is no service
                 that should be in a position that it will completely crash
                 if the machine isn't rebooted immediately. It's already
                 been said that we can schedule off-hours access for regular
                 maintenance.
        mitchb: The question is not only can we wait twelve hours, but also can
                the maintainers afford to do maintenance during work hours?
                I don't want to be surprised when the things that were hinted
                at but not explicitly said turn out different than we might
                have hoped.

                The real question here is how much flexibility we are working
                with on that scheduling. Is it a case where we call up DOST
                and say we need to do some maintenance and they say "oh sure,
                I'll stay late tonight," or one where they say "okay, next
                time I'll be staying late is in a couple weeks,
                see you then."
        geofft: When I've talked to DOST in the past, they seemed like they
                were very willing to work around our schedule, not make us
                conform to theirs.

                For things like disk replacement, can we get DOST to do it?
        broder: Yes.
        mitchb: But we don't want to be giving them the root passwords to our
               services so they can fsck the disk once it's installed, etc.
        jhawk: No services that are moving can't afford to wait twelve hours
               on a disk failure. What services don't feel comfortable moving
               in the absence of unscheduled off-hours maintenance?
        kaduk: If, for whatever reason, the direct card access NEVER pans out,
               which services can you not handle being there?
        mitchb: I would be uncomfortable with the entire AFS cell being in that
                position.

        quentin: For a lot of these, I would like the ability to have remote
                 reset. I suspect the power strips in W91 may already have
                 remote power management.
        [this is discussed at some technical length and added to the "to talk
         about with IS&T" list]
        mitchb: Having remote power management and remote console mostly
                mitigates my concern over lack of access. Though I certainly
                would still like to see things work out as promised. For
                political reasons, if nothing else.
        jhawk: I think it is valuable to remember that new maintainers should
               be familiar with the machines running their services. We don't
               want a world where we never have that physical access. It's
               further important to remember that some new people want to
               learn about hardware.
        geofft: I think this is largely mitigated by having dev services, but
                certainly we would all prefer to see physical access pan out,
                as there is certainly value to it. I would still like to see
                remote power for convenience.

        quentin: I question how many of the starred services have the ability
                 to configure remote console exporting at the BIOS level.
        jhawk: What's the value of BIOS redirection?
        quentin: It allows you to have a lot more control over things like
                 booting from removable media, etc.

        ezyang: OIS was very clear that it was okay for us to tell DOST
                to do "foo" over the phone. (this includes inserting media,
                one presumes)
        geofft: That's pretty standard.
        quentin: BIOS redirection still provides an appreciable amount of
                 convenience for certain operations.

        [Discussion of physical security. Not recorded because of sensitivity:
          general notion was the desire to set up some sort of approved access
          list. We don't grant all the membership access to the machine
          room now and with W91 employees arbitrating we don't want things to
          become more lax.]

        jhawk: It is useful to discuss a general figure for miscellaneous
               expenses related to the move, so that we can allocate for such
               parts as are our responsibility.
        broder: That is a good point. It is also important to remember that
                SIPB still has a decently large amount of money from Athena
                fees going away.

                What else should be discussed with OIS?
        quentin: Schedule. I think that SIPB could start moving things as
                soon as a month from now if VLAN tricks were handled.
        broder: I think what I would like to see is simply putting our racks
                on 18.181.
        quentin: I think that this may mean routing traffic through other
                 buildings, but there has been some discussion of making some
                 subnet location-independent.
        broder: I think I'd like to see .181 be that subnet.

        broder: Are there any objections to scheduling a meeting with OIS to
                discuss these things and move forward on this plan?
        mitchb: As long as things are stretched out enough that we aren't
                done moving until we know what the real answer about access is.
        broder: I think most of these things can be moved in a large move, but
                certainly some of these services are going to want to move
                incrementally.

        quentin: Mostly unrelated, can we talk about scheduling an OC11 tour?
        broder: Sure. We briefly discussed the general notion of some sort
                of tech talks to foster the relationship and awareness between
                OIS and SIPB. This is a good place to start.

        broder: What I would like to know is can I, with the board's approval,
                move forward on executing this plan. With, at the very least,
                starting the discussion of the technical aspects of the plan
                and hopefully implementing. What can I commit to,
                on behalf of the board, basically?
        quentin: I think we need a meeting with Mark, first.
        jhawk: I think that you are generally authorized to move things
               forward and make plans modulo a schedule. Once there is a
               schedule, this will need to be approved by the individual
               service maintainers.

        geofft: Do we want to have a formal motion about this delegating
                power to make these decisions?
        [consensus is 'not really']

        broder: With a longer outlook, I want to note that I am not running
                for re-election, since I am graduating. However, I do plan to
                work with my successor as much as possible, and to take point
                on this to the extent my successor will let me.

Final Board Contents:

        *  = move
        @  = half-move ["half star"]
        X  = move to XVM
        -- = kill with fire

        14u *XVM Prod                       RTFM
             XVM Dev                    6u *AFS
        6u  @Scripts/SQL                X   SIPB Tor
        2u  *news                       1u @SIPB NOC
        X?   SIPB VoIP                      LAMP
        1u  *charon                         -iBuild-
            -debuild-                   2u *Linerva
        2u  *stuff/lost-contact
        X?  SIPBv6                      Total: 35u
        ?   multics

        To Discuss With IS&T:
        ---------------------
        OC11 Tour
        VLANs
        Schedule
        Light up dark fiber
        Getting into W91
        Remote power
        Physical Access
        Allocation for miscellaneous expenses

Other Other:
        geofft: moira6 is down, moira7 is attempting to start.

The meeting was adjourned at 3:15:57.


        Minutes taken and submitted by fawkes.
