Monday, December 3, 2012

Grumpy Old Man Plays the Role of Query Optimizer

Another in the continuing saga of MongoDB. As you may have gleaned from other posts, I have yet to become an all-in fan of the product. I have some appreciation of its capabilities, but am still finding it syntactically and semantically hard going.
The class from the 10gen team is very well done. A few glitches, but nothing major. I have been on expensive classes where the material has been of lesser quality.
So kudos to 10gen - especially Andrew.
The strange bits. And this week there are two. First off, there is a thing called the "aggregation pipeline". Cool concept: you specify, as steps in a pipeline, what to do with the data, so you can do things like group, sort, and generally report usefully on the data contents. Cleverly (of course), the results from each stage of an execution pipeline are mongo (JSON) documents. So you can operate on them just like any others. Nice.
But, in order to do quite complicated queries, you have to specify every step of the pipeline yourself. And you have to figure out the order to perform them in (to get the right results and to optimize the performance). If I sort before I filter, for example, then I am probably doing too much work. In SQL databases - at least the better versions - the query optimizer is supposed to figure this out for you. So now I am having to be my own query optimizer. Not happy about that. Yes, there are reasons; sharding might be tricky to optimize (I don't know).
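To make that concrete, here is a sketch of doing the optimizer's job by hand (a hypothetical orders collection - my example, not the class's):

db.orders.aggregate([
    {"$match": {"status": "shipped"}},   // filter first, so...
    {"$sort": {"total": -1}}             // ...the sort only touches the survivors
])

Put the $sort stage first instead and Mongo will happily sort the whole collection before throwing most of it away.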
The second - and this is a quite uncomfortable feeling, but I dare say I will get used to it - is referencing. This is a bit tricky, so I will illustrate (I hope correctly!):
If I want to group by the value of category, where category is a key in a document,
I would have to write something like {"$group": {"_id": "$category", ...}}

I read this as: make the _id field the value obtained from the category key. Each name in quotes (programming in strings again, ugh), and then the $ to dereference the name of the category key so I can use its value. That's an awful lot of symbology to remember.
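Put together, the whole incantation comes out something like this (a hypothetical products collection, with a count thrown in for illustration):

db.products.aggregate([
    {"$group": {"_id": "$category", "count": {"$sum": 1}}}
])

One output document per distinct category value, each carrying its count.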

Grumpiness quotient has gone up this week!

Tuesday, November 20, 2012

Grumpy Old Man and MongoDB - Indexes and things

This week (week 4 in the excellent 10gen class on MongoDB) has us looking at things like indexes, profiling, etc.
I am getting used to the syntax (but still dislike the "programming in quotes" model and the use of cryptic special values for specifying sort sequence, etc.).
Lovely-looking feature for geospatial indexes, but quite tricky to use. At the base level, the distance measures on the spherical model are expressed in radians, so we have to do that conversion somewhere. PITA so far. I can see why, but that isn't exactly handy. I would like (and will build) some other mechanisms to sort that out.
If for no other reason, the radians-based model doesn't account for directionality. Maybe I want coffee shops within 10 miles north of me (because I am heading that direction), none south of me, and maybe within 1 mile of each side of the route. I am sure I could code that!
And then for some reason, the MongoDB shell treats the geospatial spherical model differently from other models. It is invoked through the db.runCommand(...) syntax and not the usual db.collection.find(...) syntax.
Also, since you can't specify which index to use in the db.runCommand(...) syntax, it fails if you happen to have two 2d indexes defined. Promising feature, but it could use work.
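For the record, the shape of the spherical query as best I understand it (a hypothetical shops collection with a 2d index on its loc field; 3959 is roughly the earth's radius in miles, which is where the radians conversion hides):

db.runCommand({
    geoNear: "shops",
    near: [-71.06, 42.36],     // [longitude, latitude]
    spherical: true,
    maxDistance: 10 / 3959     // 10 miles, expressed in radians
})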
The utilities are handy - mongotop and mongostat are helpful indeed.
In many ways MongoDB reminds me of the 1970s system ADABAS, but with updated syntax.
 

Wednesday, November 14, 2012

Grumpy old man and MongoDB - Transactions

I am beyond scared by the possibility of using MongoDB for any kind of meaningful transactional system.

We always have a balance between "getting stuff through the system" and "getting sufficient accuracy". Sufficient here is really key. ACID properties are vital to ensure that we don't see incomplete transactions WHEN THE KINDS OF TRANSACTIONS WE ARE PROCESSING MUST NOT BE SEEN UNTIL COMPLETE. (caps deliberate).

The "classic" example is the movement of money from one account to another. While the money is being moved, decisions based on the value of either account will be flawed. The "from account" will have a smaller balance than we think, and the "to account" a larger one. So we should probably wait until the transfer transaction has completed before allowing any process to make decisons based on the balance in either account.

In the MongoDB world the update to each account is itself atomic, but there appears to be no overarching transaction context. So it is possible (not very probable) for the document that represents the "from" account to show that the amount has been debited while the "to" account has yet to be credited. That assumes the system does debits before credits; it doesn't have to, of course, although I think it would be foolish not to.
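A minimal sketch of the exposure, using a hypothetical accounts collection:

db.accounts.update({_id: "from"}, {$inc: {balance: -100}})   // atomic, on its own
// anything reading both accounts here sees money in neither place
db.accounts.update({_id: "to"}, {$inc: {balance: 100}})      // also atomic, on its own

Each statement is safe by itself; it is the gap between them that has no protection.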

The designers of the major database management systems (relational or not) have thought carefully through those kinds of implications. They have made sure that records are somehow locked to prevent this kind of behavior. They ensure that updates on both the "from" and "to" sides of the transaction are both handled - or neither is.

Do I really trust a developer with the kind of skills I have to get this right in every case if I get no help from the underlying data management system? I don't think so. I would much rather see the transactional systems using transactional databases. And use these powerful engines (like MongoDB) for situations where I don't have to rely on transactional behaviors.

Now the number of cases where transactional behavior of this nature is actually required may be smaller than we think. Oftentimes we see a small transactional component (moving the money) not tied to the delivery of the goods. See this excellent post from Gregor Hohpe.

Friday, November 9, 2012

Grumpy old man and MongoDB - Database Design

It is week three in the MongoDB class put on by 10gen. The instructors have done a great job. The material flows well and is presented nicely. So kudos to the guys.

One of the privileges of being old and grumpy is that you learn that there are no mysteries in system design. However, there are new paradigms sometimes. We have that in the MongoDB world, and there are many cases where it can make a big difference. Essentially I am now beginning to think of MongoDB as a "relational database with embedded arrays". I don't know for sure (I haven't done the math, nor am I likely to) that MongoDB will support the relational calculus. It should (probably, but again, I have not done the math!) support SQL pretty well - especially a very vanilla form that doesn't use constraints, etc. I am not sure of the value of the DDL aspects of SQL, although I guess one could do that. Much more important would be the layering of SQL for data manipulation.
Even expressing a join would be fine - and if the data were embedded, so much the better. Think of it as SQL as a data access layer vs. SQL all the way through the storage subsystem.
There are some semantic changes, of course. Because an embedded document lacks a real "key", some of the join-like processing will potentially be a bit odd. Essentially we have to treat the values in an embedded document as we would in a materialized view.
SQL UPDATE and DELETE operations are less likely to behave as they do in an RDB, and the implications of deletion on embedded documents are subtle. However, I can see some great opportunities for some stereotypes here.
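A quick sketch of what I mean, with a hypothetical orders collection - updating one element of an embedded array needs the positional $ operator, and removing the parent silently takes every embedded document with it:

db.orders.update(
    {_id: 1, "items.sku": "A-100"},   // match the parent and the embedded element
    {$set: {"items.$.qty": 3}}        // $ targets the element the query matched
)
db.orders.remove({_id: 1})            // the embedded items go too - no cascade rules to consult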

This post by Bill Kent is one of the all-time great articles on thinking about choices in the representation of a simple 'fact'. The paper was written in 1988.

As a long time teacher of data modeling (my classes pre-date relational databases!), I have come to a couple of realizations:
  • The approach that I take to logical (E/R, not expressed as tables) modeling won't change with MongoDB
  • There should be some pretty simple guidelines for converting an E/R model to a MongoDB implementation
  • The best-looking uses for MongoDB are where something else has already done the validation and linking - insertion into MongoDB becomes an organizational exercise.
  • MongoDB gives some flexibility in order of insertion even when things are linked. So some of the convoluted exercises we have done when creating systems of references in conventional relational databases may go away.
  • The modeling tools (like Embarcadero and ER/WIN) are less help than they used to be - except maybe as pure diagramming tools. This one I am less sure of, since all I have ever seen from these tools is modeling as a relational exercise. If there are other ways possible, I haven't really seen them.
I am looking forward to week 4.

Friday, November 2, 2012

syntactic sucralose

In programming languages there is a concept called "syntactic sugar". As Wikipedia describes it: "In computer science, syntactic sugar is syntax within a programming language that is designed to make things easier to read or to express."
In some languages (especially the MongoDB shell), there is the reverse concept. There are language features that are present to "make it work" but have no bearing on anything in the program's context. I call these syntactic sucralose. They are the only things available to get the desired result, but leave a bitter taste in your mouth afterwards.
The case that riled me up today was the $unset "operator" in MongoDB's shell interface. To unset (eliminate a name/value pair in a MongoDB document), you write something of the following form for the second argument of the .update method:
{$unset: {foo: 1}}. The 1 in this case is a mandatory positional parameter that has no relation to the current value of foo. In fact you could put anything that is a legitimate value (string, date, ObjectId, integer, boolean...) in place of the 1. Indeed, whatever is placed there is evaluated. So, for example, the code fragment

x = 0
db.stuff.update({_id: 1}, {$unset: {foo: x++}})

does result in both foo becoming unset and x being incremented (here stuff is just some handy collection with a document whose _id is 1).

Even if the value is an unbound variable name, it still is acceptable.
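For completeness, the whole incantation looks something like this (the hypothetical stuff collection again):

db.stuff.insert({_id: 1, foo: "bar", keep: true})
db.stuff.update({_id: 1}, {$unset: {foo: 1}})   // the 1 is pure ballast
db.stuff.findOne()                              // {_id: 1, keep: true}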

Lots of scope for mischief. This should come with a government health warning.

Wednesday, October 31, 2012

MongoDB and the Relational Car

Sometimes you want your data all nicely normalized, and sometimes you don't. This has come into sharp relief as I go through MongoDB training this week. By way of background, I have a fair amount of experience with many types of databases, data modeling and data thinking in general, so it is interesting and fun to learn about new ways of thinking.
But first, a story. In the dark ages (maybe 1974 or 1975), I wondered about mailing lists. So much so that I devised a way of tracking some of the uses of data among companies - especially the early markets in buying and selling of information. I would sign up for a magazine using some unique variant of my name, keep track of which variant I used for which magazine, and then see what solicitations I would get through the mail using that name variant. Most instructive. American Express sent mail to the largest number of variants.
I do the same thing to this day - making up email addresses just long enough to validate that I want the service I have signed up for, and then waiting to see what else arrives. Of course everything that arrives is by definition spam. But I digress.
I also thought about the "relational car", i.e. what the world would be like if I normalized my vehicles. Kept the wheels with the wheels, the engines with the engines... You get the idea. I think it is likely that I would be late for work every day. First join all the piece parts together to make a suitable version (assuming that the children hadn't emptied the fuel tank the night before, thus putting the equivalent of a lock on the tank). Then drive off. After coming home, put the updated parts back. Updated? Yes, because the tires are now more worn...
Clearly from the primary use of having a vehicle as transport, the relational car is far from ideal. I am much better off with the assembled car.
That's kind of how I think in Mongo. Oftentimes the data are much more useful when put together by primary usage than when fully normalized and accessed with joins. But, of course, not all the time.
That led me on to thinking about the quality of "relationships" (among the data entities). Many others, in more learned writings than mine, have fussed about different qualities in relationships: composition is different from association, etc. A Purchase Order might be composed of many (at least 1) Purchase Order Line Items, so it seems reasonable to think of these line items as inherently bound up with their POs. So the document oriented approach looks pretty good. But things are less rosy when I think about the association between the Product and the PO. There it is probably unreasonable to bury the product data inside a PO document. And there are several different kinds of associations we might want to consider.
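In document terms the distinction might look something like this (a hypothetical po collection - the line items live inside the PO, while the product is referenced, not buried):

db.po.insert({
    _id: 5001,
    status: "open",
    lineItems: [                                 // composition: embedded in the PO
        {productId: 784, qty: 2, price: 9.99},   // association: a reference only
        {productId: 231, qty: 1, price: 4.50}
    ]
})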
So, when learning the document oriented DBMS (MongoDB), I am finding myself revisiting types of relationships and whether the distinctions are important. I have come down on this side: I probably care in my master systems, those systems of record that actually run the business. But in those which are simply systems of reference, maybe it makes a whole lot of sense not to worry about normalization and required schemas - the very freedoms that make the document oriented databases so interesting.
In the relational model the foreign key is the only relationship construct available. Even the creation of "link" relations relies on the foreign key. That doesn't seem to me to be a powerful enough construct to express the nuances of the kind of relationship and thus its associated semantics.
Oh, and circling around to something interesting about composition types of relationships: we do have some interesting delete anomalies. If we say that an A is composed of 1 or more Bs, what do we do when attempting to remove the last remaining B? That should somehow kill off the A, of course - or we expressed the rule incorrectly.
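MongoDB gives no help with that rule, so the application has to do the two-step itself - something like this sketch (the hypothetical po collection again):

db.po.update({_id: 5001}, {$pull: {lineItems: {productId: 784}}})   // remove one B
db.po.remove({_id: 5001, lineItems: {$size: 0}})                    // if it was the last B, kill the A

Which is, of course, exactly the kind of unprotected gap that worried me in the transactions post above.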

Thursday, July 12, 2012

Clouds and failures

Cloud infrastructure delivers us a utility view of computing. Models like Amazon's Elastic Compute Cloud (EC2) give us scalability on demand - and consumption-based pricing. But along with all of the benefits of the model, there are some downsides that are less well understood or considered. The first is that you are responsible for ensuring that you have chosen the right disaster recovery options. If (when) the cloud infrastructure hiccups, you do want to make sure your systems are still operational and available. You don't want the very public beatings that occur when Amazon takes a hit and sites hosted by Amazon become unavailable.
Second, there are many more players in the mix than you might think. There is a whole collection of management platforms that help you deal with the complexity of the underlying cloud. I'll use RightScale as an example here. In order to deploy/manage/maintain a specific deployment, the RightScale environment can be used. That is a whole bunch (technical term, I know) of software that essentially provides processes and tools to configure/start/stop/script/debug/deploy servers in the cloud. That is some fairly complex software. It can break too. Or it can become part of a maintenance window - after all, even management software has to be upgraded. So an outage in your management software could become problematic. You need to understand the location, uptime guarantees, etc. of your management software.
Above that - because even the management software environments can be a bit cumbersome - companies might add their own layers of management software. Think of them as super-processes that run canned "scripts" for performing the most common tasks. These can fail too. Often corporations include these as part of their internal infrastructure, providing some handy authentication/authorization services integrated with corporate LDAP or other directory services approaches. Where does that live? What's its failure model? What kind of downtime does it have to take?
There are many moving parts here - so while there are considerable benefits to moving certain kinds of applications and services to the cloud, there may well be more failure scenarios to plan for.

An era ends

On May 31, 2012 I was laid off from Progress Software. Several events led to this - the first being when Progress decided to reduce its focus on the Travel Industry. I switched from being the Industry Architect for Travel and Leisure to being a member of the Demonstration Systems Group. There I had the responsibility for deploying the various Progress offerings "in the cloud" so that developers could quickly get a properly configured instance up to speed for doing demonstrations and proofs of concept.

Progress then decided that it was no longer in the "Enterprise Integration" business and, as has been publicly announced, will be divesting several of the products that make up the Responsive Business Integration and Responsive Process Management offerings. That leaves behind the OpenEdge database product and development environment, plus the Apama/Corticon product lines for rules and analytics "in the cloud" and for capital markets.

From the public statements, it had become clear that Progress wanted to focus on the more profitable, partner-driven model that has historically been so successful.

Since I was a part of the RPM/RBI world of products, I along with several others became "surplus to requirements".

I enjoyed my time at Progress, and while I wish it had continued in the previous direction, I completely understand the driving forces.

I wish the company well in its future endeavors - and hope that the many friends that I made while there are successful.

Total Cost of Ownership

I think it is time to think about the TCO of living somewhere. The amount we have to expend (or is expended on our behalf) just for being somewhere.
The typical (and intensely political) approaches look at things like "The Deficit" and "Taxes" in giant categories. But what they don't do - nor do they make it easy for us individually to do - is look at the cost/waste in the system of being somewhere.
That cost/waste comes out of our hides somehow. So, for example, we can keep taxes low by providing few government services and somehow billing back the citizenry for the necessary services.
We can have legislation that frees up tax money for a specific purpose (e.g. a Flexible Spending Account as part of the healthcare options). Nice, you think - that will be taken off pre-tax. But because of abuse, we have to justify every dollar withdrawn from it - ensuring that we aren't buying things we shouldn't with pre-tax $. So a bureaucracy is established to handle this. This bureaucracy has to be paid for. Oh, and this is a job creation scheme too! Premiums rise to pay for the bureaucracy. The blame is passed onto the healthcare industry and the government gets away scot-free. As my dad would say, "A good game played slowly!"
So when we look at just the costs of being around, we need to look at those items that aren't taxes (i.e. not collected by government) but that somehow we have to pay in order to be here.
Of course much of this overhead comes about because so many individuals just cheat the system. So we create legislation, and then band-aid on anti-cheating measures which inevitably increase the price of the service, but the legislators can (correctly, but disingenuously) claim that they didn't increase taxes.

Thursday, January 19, 2012

Cost of switching

I am certainly not the first to make this observation, nor will I be the last. Switching from something that works OK to something that works a bit better probably won't happen. Huge generalization, yes. But we are mostly lazy and if switching requires lots of effort, we won't do it.
What brought this home? I was given an iPad by my employer (well, loaned actually). I already have a Motorola Xoom whose quirks I am finally used to.
The iPad is gorgeous. The screen is crystal clear, the rendering spot on, the speed of response to user actions is great, the battery life seems to be wonderful. So why haven't I switched?
There could be several reasons:
  • Switching makes me admit I bought the wrong device
  • Switching requires me to relearn the User Experience
  • Switching means giving up some features in return for others
  • Switching means giving yet another big brother some basic information about me
  • Switching is work!
There isn't a single compelling reason to switch. Sure, if I had not bought the Xoom and were just starting the tablet journey, I might well buy the iPad. But having a device already means that I have to overcome some inertia. How much inertia? In this case, a lot. The Xoom does pretty much what I want, so I am getting little more valuable functionality in return for the cost (of my time) that I would put into the switching. While the iPad is an emotional purchase, it hasn't stimulated my emotions enough to want to part with the Xoom.

That, and while I thought the Xoom soft keyboard was poor, the iPad one is hideous by comparison. Shift keys for each number? And we are told to have letters and numbers in passwords. It's too bloody difficult.

Tuesday, January 17, 2012

Human "Technology"

I have been bothered for a while by the calories in/calories out model of weight thinking. I have finally articulated, at least to myself, why I have difficulties:

  • My scale doesn't weigh "calories"; it is all about mass.
  • Energy seems to be a proxy for weight (and no, I am not in the E=mc^2 realm of thinking, because that deals at entirely the wrong level!)
  • The energy in/energy out model doesn't account for a few things (which I will detail in the body of this posting)
So, here's the fundamental set of thinking. Please comment and point out the holes.

On the assumption that I am interested in what the scales say, I should be thinking in terms of the total inputs/outputs - even without understanding any digestive processes.

Weight at any point in time = weight at previous point in time + weight of all inputs since that time  - weight of all outputs since that time.
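Or, in bare symbols (my own notation, nothing official): W(t2) = W(t1) + (sum of masses in) - (sum of masses out) over the interval. Drink a liter of water (about 1 kg in) and the scale goes up by about a kilogram on the spot, calories notwithstanding.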

It's the all inputs/all outputs that we need to worry about.

At the simplest level the inputs consist of:
  • All solids (food)
  • All liquids (drink)
  • All gases (air breathed in, water vapor in the air, etc.)
  • Supplements
At the simplest level the outputs consist of:
  • All solid (fecal) material 
  • All liquid (urine) material including dissolved solids
  • Vomit
  • Sweat
  • Respiration products (gases, especially nitrogen and carbon dioxide, but the total of all gases)
  • Water vapor in respiration products
  • Anything else we can think of (dead cells, sputum, tears, ear wax....)
It's no wonder we use energy as a proxy!

Now for the area where I have trouble with energy in/energy out. There is a move afoot to extract energy from human waste. That must mean that there is some residual energy in some of our outputs - especially, I suspect, in our feces. But maybe elsewhere too. So the big questions for me are:
  • How much of the available energy is in our waste products?
  • Does the amount of available energy in our waste depend on our eating habits? If so, how?
  • Does the amount of available energy in our waste depend on any conditions (e.g. celiac disease) we have? If so, how much effect?
Any thoughts/comments gratefully received.

Chris