Signup sheet!!
2 problem sets, 1 oral paper presentation, final research project
Scribe system (but not today)
Overall goal: to prepare students to do research in vision or medical imaging
Working definition: extracting useful information from images
In particular, information about image content
Certain formal problems [high-dimensional inverse problems with spatial constraints]
Elements of psychology, engineering, mathematics
From a technical point of view, interplay of statistics and geometry
From a very engineering point of view, getting info out of (typically) 512-by-512 arrays of 8- or 24-bit values
It certainly looks easy (any child can see…), but about 1/3 of your brain is doing it
Ill-posed and ill-defined problems
Inverse problems are ill-posed in the sense of Hadamard (a solution may fail to exist, to be unique, or to depend continuously on the data)
Perception problems are always ill-posed, since the goal is to recover info about the world
Vision is going from the (2D) image to the (3D) scene; graphics is the opposite
Worse still, the problems in vision are ill-defined without a task
There is no formal specification for nearly any vision problem, and few doable tasks
In terms of engineering, to build a vision system requires hooking together unreliable components whose lies cannot be checked
There are many ugly engineering problems, due to e.g. bad cameras, slow computers and buses, etc. (these are slowly being solved).
This is not an exhaustive course – there are many areas of vision you will never hear about in 664
This is mostly by choice – the Cornell vision group (= RDZ) has a fairly strong bias
Pro-algorithms, i.e. computational
Discrete math (much of vision is continuous)
Task-oriented
Try to make minimal assumptions, and reasonable ones (this is partly an AI legacy).
About 30 years old (the MIT Summer Vision Project dates from 1966)
Draws primarily from EE, then CS, then psych (but not in 664!)
About 1 major conference per year w/ 500 people, 2 major journals (see links page)
Originated largely in AI, but is now totally distinct (and proud of it…)
Somewhat related to image processing, 2D signal processing
There is no formal definition; here is an RDZ intuition
Sometimes the fields really do cross, e.g. MPEG-4
Tasks have evolved over time
Classic tasks:
Recognition of tanks
Recognition of characters (OCR)
Industrial inspection
Robotics?
In the last 5-10 years lots of interesting new tasks:
Search engines for images/image databases
Video surveillance
Security applications: identify people via faces, iris, fingerprints (in decreasing order of how much vision is involved)
HCI (Bill Gates’ favorite examples)
Multimedia apps, e.g. compression
Graphics applications!
A new emphasis this year on medical imaging applications
From a technical point of view, a huge amount of overlap with non-medical vision
But also some specific quirks
You can do a final project in any area of vision
There is a list of topics in the 1st day handout.
Order of topics is roughly “low level/early” to “high level/late”
Distinction: low level involves direct operations on the pixels;
high level involves intermediate representations
Most of computer vision is low level/early; high-level vision is perhaps premature as a field
A very brief overview of the 1st half of the course:
There is a classic vision problem (pixel labeling) that is extremely important
It’s vital for almost any application
It provides a nice intro to some of vision’s mathematical tools and techniques
So, we’ll be talking about it in detail, starting today.
Consider taking a point X-ray (photon) source, an object to be x-rayed, and a detector.
Detector counts photons per unit time, which is what we measure
A pixel value tells us the average density (to X-rays) of the material within a small solid angle
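The measurement above can be sketched numerically. This is a minimal model assuming the standard Beer-Lambert attenuation law (my assumption; the notes only say a pixel reflects average density): the photon count falls off exponentially with the integrated density along the ray.

```python
import math

def detected_count(source_rate, densities, step=1.0):
    """Photons per unit time reaching the detector after the beam
    passes through material samples with the given densities.
    (Hypothetical sketch; Beer-Lambert attenuation assumed.)"""
    line_integral = sum(d * step for d in densities)  # density integrated along the ray
    return source_rate * math.exp(-line_integral)

bone_ray = detected_count(1000.0, [0.5, 0.5, 0.5])    # dense path through bone
tissue_ray = detected_count(1000.0, [0.1, 0.1, 0.1])  # less dense soft-tissue path
assert bone_ray < tissue_ray  # denser material -> fewer photons counted
```

Displays conventionally invert this, which is why bone shows up bright even though it transmits fewer photons.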
Now consider a pinhole camera looking at a scene
The geometry is a little more complex, but basically similar
What we are measuring is the brightness of a patch of the world (scene element)
Back to the X-ray. Suppose that bone is very bright and soft tissue/air is very dark.
The picture we get should, perhaps, be 200s (bone) and 50s (other).
Ideally we would see only these two values. BUT, a wide variety of processes, which
we lump together as “noise”, give us slightly and randomly different values.
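A tiny numeric sketch of this situation (the values match the notes, but the additive Gaussian noise model is my illustrative assumption; the notes have not specified the noise yet):

```python
import random

random.seed(0)  # reproducible illustration

# "True" image: 200 at bone pixels, 50 elsewhere.
true_image = [200, 200, 50, 200, 50, 50]

# Observed image: true value plus noise (Gaussian here, an assumption).
observed = [t + random.gauss(0, 10) for t in true_image]
# Each observed value is close to, but almost never exactly, 200 or 50.
```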
Suppose we really want to know which pixels are bone and which are not.
(Why do we care? A good example comes from angiography, where you are looking at an artery into which you’ve injected some radio-opaque dye. To find a stenosis, or to measure its severity, you’d really like to know, for each individual pixel, whether it is blood or vessel.)
So here is our problem, usually known as “image restoration”, sometimes called “denoising”.
There is a “true” value at each pixel, which we are trying to figure out. What we get as input
is the true value plus some noise.
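Given that model, a first-cut restoration is per-pixel: label each pixel with whichever of the two true values is closer to the observation, i.e. threshold at the midpoint 125. This per-pixel maximum-likelihood framing is mine; the notes so far only pose the problem. Noise values are fixed below for illustration.

```python
BONE, OTHER = 200, 50  # the two possible "true" pixel values

true_image = [BONE, OTHER, OTHER, BONE, BONE, OTHER]
# Observed = true + noise; noise values fixed here for illustration.
observed = [212.3, 41.7, 66.0, 189.5, 230.1, 48.8]

midpoint = (BONE + OTHER) / 2  # 125: equidistant from the two values
restored = [BONE if x > midpoint else OTHER for x in observed]
assert restored == true_image  # succeeds when noise is small vs. the gap
```

With heavier noise this independent per-pixel rule makes isolated mistakes, which is where the spatial constraints mentioned earlier come in.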