Internet Audio Data Exchange

 

Project Objectives

Our component in the Internet telephony project was to implement an interface capable of transferring audio data between computers over a local or wide area network. Two complementary processes are necessary to accomplish this goal. First, raw audio samples must be read from either a file or the microphone on the local machine, packaged into Real-time Transport Protocol (RTP) packets, and then sent across the network to the remote destination. In a similar manner, RTP packets must be read from the network, reordered to account for losses, and then played out to a destination such as the computer speaker or a local file. These two processes should run continuously and independently to ensure smooth audio transfer between computers.

Encoding Analog Audio

Audio data must be sampled from the microphone at regularly spaced intervals to produce fixed-size binary samples. Initially these samples are encoded in the linear format produced by most PC audio cards. We intend to compress this raw audio into a format better suited for transmission of speech over the network, such as 8-bit μ-law.

 

Our goals are similar for retrieving data from a file. In this case we would expect the files to already contain 8-bit μ-law audio samples. Although it would be possible to offer a similar linear-to-logarithmic conversion for audio data stored in a file, the wave file format is already capable of holding μ-law data in a format called CCITT μ-law. When reading a file in this format, all that needs to be done is to verify the file header and move forward to read the audio data.

 

Real-time Transport Protocol (RTP) Encapsulation

Once we have samples in 8-bit μ-law, we must prepare them for transfer over a potentially unreliable link: the network. First we determine whether the sample contains any data. If the sample is silent, we do not need to send it at all; we can simply drop the packet and allow the remote side to reconstruct the silent packet. If the sample does contain data, we must send it over the network. To do this we encapsulate the audio sample in an RTP packet. Each sample is given a timestamp, sequence number, source identifier, and other information as part of the RTP packet header. When this header and the corresponding packet data are prepared, we are ready to send the RTP packets as datagrams over the network to the remote host specified to us through our exposed interfaces.
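To make the encapsulation concrete, the sketch below assembles the fixed twelve-byte RTP header defined in RFC 1889 in front of a μ-law payload. The field layout follows the RFC, but the method name and signature are illustrative, not our actual PacketFactory code.

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

static byte[] buildRtpPacket(short seq, int timestamp, int ssrc, byte[] mulawSamples)
        throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    out.writeByte(0x80);       // version 2, no padding, no extension, zero CSRCs
    out.writeByte(0);          // marker clear, payload type 0 (PCMU, i.e. mu-law)
    out.writeShort(seq);       // sequence number, incremented once per packet
    out.writeInt(timestamp);   // sampling instant of the first payload sample
    out.writeInt(ssrc);        // synchronization source identifier
    out.write(mulawSamples);   // the 8-bit mu-law audio payload
    return bytes.toByteArray();
}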

 

Reading and Resequencing RTP Packets

The next task is to receive RTP packets from the network and prepare them for playout (or redirection) on the local machine. After an RTP packet is read from the network, it is analyzed according to fields within the RTP header. Using the sequence number and timestamp fields, the packets are placed into a delay queue for playout. It is here that we must reconstruct silent packets. True multi-conferencing support would also occur at this step: incoming packets to be played at the same time would be mixed so that all transmitting parties are heard.

 

Finally, the queued samples should be played out at regular intervals to the speakers or to a file. This sampling interval will most likely need to be dynamically adjusted, and some packets may need to be dropped to account for reordering and delays introduced by transmission. The sample is extracted from the packet and, as in the recording phase, may need to be converted from the network format to the format necessary for playout on the speakers.

 

 

Project Achievements

 

At the time of the demo nearly all of the functionality described above had been implemented. We were able to successfully encode analog audio as 8-bit μ-law, encapsulate these audio samples into RTP packets, and read and resequence RTP packets. However, these subsystems were not available to the other groups because of problems exposing our interfaces through the Microsoft Java Virtual Machine.

 

Figure 1 on the following page gives a visual description of data flow and object hierarchy.

 

As you can see, functionality of the component is controlled solely through the CommunicationsFacade object and classes implementing IPipe. The user specifies a remote pipe and a local pipe; calling CommunicationsFacade.activate() begins the marshaling of data between the two data pipes. As the diagram shows, CommunicationsFacade encapsulates the pipes into two DataMediator objects, which in turn encapsulate the given data pipes into a PacketFactory or, optionally, a PlayoutMediator. Each DataMediator runs as a thread: one transfers data from the localPipe to the remotePipe, the other from remote to local. The normal usage model is for CommunicationsFacade to receive an AudioPipe for the local Pipe and a DatagramPipe for the remote Pipe. This causes data from the AudioPipe to be sent directly to the network (via the PacketFactory), and data from the network to be sent through a DataMediator and a PacketFactory before being given to the AudioPipe.

 

Figure 1: Simplified view of the Data interface and subsystems. CommunicationsFacade and IPipe are public classes.

Small arrowheads represent data encapsulation and thus data flow; large arrowheads mean "implements".

 

 

The Pipe hierarchy includes FilePipe, AudioPipe, DatagramPipe (UDP), and NetworkPipe (TCP), all of which allow both reading and writing to the underlying data source.

 

AudioPipe

This pipe is a native-method wrapper around a Win32 library where the actual work of recording from the microphone and playing to the speakers is done. The library consists of three major parts: a read method, a write method, and a message-handler thread to which read/write completion messages are posted by the Windows multimedia subsystem.

 

In order to read and write from the Windows multimedia subsystem smoothly, read and write calls had to be buffered in a queue so that the sound card was never starved for data to play or a place to write data into. Each direction (input and output) has an associated channel handle, circular buffer of data, and semaphore. This prevents a burst of writes from overwriting one another (write bursts are handled by the PlayoutMediator). In addition, data must be prepared in the correct format before playback or recording: before playback, logarithmic data must be converted to linear data, and after recording, linear data must be converted to logarithmic data. The functions to do this are relatively simple. They essentially represent the linear PCM samples in a format similar to the IEEE format for storing floating-point numbers. This results in higher precision for lower-amplitude sounds, which is desirable because our ears are much more sensitive to changes when the volume is low than when it is high.
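As a point of reference, a minimal Java sketch of the logarithmic-to-linear direction is shown below. It follows the standard G.711 μ-law expansion; our actual conversion lives in the native DLL, so this method is illustrative only. Note that each 8-bit μ-law sample expands to a 16-bit linear sample, matching the size increase noted in Figure 2.

static short mulawToLinear(byte b) {
    int u = ~b & 0xFF;                        // mu-law bytes are stored complemented
    int magnitude = ((u & 0x0F) << 3) + 0x84; // mantissa plus the 0x84 bias
    magnitude <<= (u & 0x70) >> 4;            // apply the three-bit exponent
    return (short) ((u & 0x80) != 0 ? 0x84 - magnitude : magnitude - 0x84);
}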

 

An example of the output case is presented below in Figure 2; the case for input is similar.

 

 

MultiMediaCallbackThreadProc() {
    while (running) {
        message = WaitForMessage();
        switch (message) {
        case MM_OUTPUT_DONE:
            // The sound card is finished with this buffer: return it to the
            // circular buffer and signal that another Write() may proceed.
            ReleaseBuffer(pOutputBuffer);
            ReleaseSemaphore(hOutputSemaphore);
            break;
        }
    }
}

Write(char* pData) {
    WaitForSemaphore(hOutputSemaphore);  // block until an output buffer is free
    pBuffer = GetFreeBuffer();           // claim a slot in the circular buffer
    mulawToLinear(pData);                // decode 8-bit mu-law to 16-bit linear
    Copy(pBuffer, pData);
    PlayBuffer(pBuffer);                 // hand the buffer to the sound card
}

Figure 2: Example code to write to the Windows multimedia system. GetFreeBuffer and ReleaseBuffer get memory from a circular buffer to minimize calls to new and delete. MulawToLinear decodes the μ-law data into sound-card-compatible linear data; each sample increases from 8 bits to 16 bits in this transition.

 

FilePipe

Files are read from and written to using FilePipe. The file format can be anything that contains 8-bit, 8 kHz μ-law samples, but it is assumed to be a wave file. For simplicity, 50 bytes at the beginning of the file are skipped before audio samples are read; these 50 bytes correspond to the approximate size of a wave file header generated by most sound recorders. If the header is larger or smaller than this, the amount of extra garbage audio or lost audio should be small enough to be unnoticeable. When writing, if the file is new, the FilePipe builds a complete wave header and then appends samples to the file. If the file already exists, FilePipe simply appends audio to the end of the file; with each sample written, the header is updated to reflect the increased number of samples in the file.
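A minimal sketch of the read path under these assumptions follows; the file name and the 20 ms packet size are illustrative, not FilePipe's actual values.

import java.io.FileInputStream;
import java.io.IOException;

class FileReadSketch {
    public static void main(String[] args) throws IOException {
        FileInputStream in = new FileInputStream("message.wav"); // hypothetical name
        in.skip(50);                    // skip the approximate wave-file header
        byte[] packet = new byte[160];  // 160 samples = 20 ms of 8 kHz, 8-bit mu-law
        int n;
        while ((n = in.read(packet)) > 0) {
            // hand the n mu-law samples just read to the next stage,
            // e.g. a PacketFactory
        }
        in.close();
    }
}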

 

DatagramPipe

Sending and receiving from the network is implemented by DatagramPipe. DatagramPipe simply uses a DatagramSocket to send packets to a given IP address and port; this IP address and port are usually set by the signaling code.
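The send path amounts to a few java.net calls. A sketch is below, where buildRtpPacket() is the hypothetical helper from the RTP sketch above:

import java.io.IOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;

static void sendRtp(DatagramSocket socket, byte[] rtp, String host, int port)
        throws IOException {
    DatagramPacket dgram = new DatagramPacket(
            rtp, rtp.length, InetAddress.getByName(host), port);
    socket.send(dgram);   // fire and forget: UDP provides no delivery guarantee
}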

 

Since some Pipes return raw audio samples and others return byte arrays representing RTP packets, we encapsulate Pipes in a PacketFactory. A PacketFactory holds two Pipes: one that is only read from and one that is only written to. Normally these two Pipes are the same object, for example a single AudioPipe. PacketFactory creates RTP packets from data read from the source Pipe, and accepts RTP packets to be written to the destination Pipe.

 

DataMediator runs as a thread and handles the transfer of data. It simply loops until told to stop, reading an RTP packet from a source PacketFactory and writing it to the destination PacketFactory. Within CommunicationsFacade there are two DataMediators running: one handles communication from the local factory to the remote factory, and the other the reverse. Multithreading with DataMediator is necessary because certain Pipes, and thus the PacketFactories that contain them, will block if there is no input; even if one direction is blocked, data still needs to flow in the other direction.
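In outline, the mediator is little more than the loop below. The PacketFactory method names (readPacket/writePacket) are illustrative rather than our published interface.

class DataMediator extends Thread {
    private final PacketFactory source;
    private final PacketFactory destination;
    private volatile boolean running = true;

    DataMediator(PacketFactory source, PacketFactory destination) {
        this.source = source;
        this.destination = destination;
    }

    public void run() {
        while (running) {
            byte[] packet = source.readPacket();   // may block waiting for input
            destination.writePacket(packet);       // forward to the other side
        }
    }

    void shutdown() { running = false; }
}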

 

Additionally, DataMediator can use a PacketAnalyzer between its source and destination. In this case, rather than writing directly to the destination PacketFactory, DataMediator writes to a PacketAnalyzer. The PacketAnalyzer is only used to handle packets coming in from the network. When it receives a packet, PacketAnalyzer looks up a profile stored about the sender and gives the packet to the PacketQueue. PacketQueue then calculates where in the buffer to put the packet, or whether it is an old packet and should be dropped. Additionally, PacketAnalyzer starts a PlayoutMediator thread, which takes packets off of the PacketQueue and plays them out in a steady stream to the destination PacketFactory.
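The heart of PacketQueue is the placement decision. A sketch of that logic is below, with illustrative field names and buffer size; packets behind the playout point are dropped.

class PacketQueue {
    private final byte[][] buffer = new byte[64][];  // delay buffer of payloads
    private int head = 0;             // slot holding the next packet to play
    private int nextPlayoutSeq = 0;   // sequence number expected at the head

    void enqueue(int sequenceNumber, byte[] payload) {
        int offset = sequenceNumber - nextPlayoutSeq;  // distance from playout point
        if (offset < 0 || offset >= buffer.length) {
            return;  // too late to play, or too far ahead for the buffer: drop it
        }
        buffer[(head + offset) % buffer.length] = payload;  // slot by sequence number
    }
}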

 

We did not have full multiparty conferencing support, since we could not mix audio samples. However, we know how to mix samples, and it should be a trivial addition to our code: when adding incoming packets to the buffer, PacketQueue would simply mix them with the existing samples rather than overwriting them. One mixing method we know about is simply to add the linear samples. We would first convert the samples to 16-bit linear, then average them, since just adding them might produce a sample out of the 16-bit range. Finally we would convert back to μ-law and place the result in the buffer. There are probably algorithms for mixing μ-law samples directly, but we did not have enough time to research them.
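A sketch of that convert-average-convert approach, assuming the mulawToLinear() helper sketched earlier and a corresponding (hypothetical) linearToMulaw():

static byte[] mix(byte[] a, byte[] b) {
    byte[] mixed = new byte[Math.min(a.length, b.length)];
    for (int i = 0; i < mixed.length; i++) {
        int sum = mulawToLinear(a[i]) + mulawToLinear(b[i]);  // 16-bit linear values
        mixed[i] = linearToMulaw((short) (sum / 2));  // average stays within 16 bits
    }
    return mixed;
}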

 

Problems

In developing the Data component, problems basically fell into two categories: algorithmic problems and communication problems. We will present some of the most challenging here.

 

Platform Problems

One major stumbling block was the choice of platform for the project. Although the majority of our group were undergraduates who, because of their course work, were more comfortable developing under Windows than under a UNIX system, our group chose Linux and the unsupported Linux JDK as our development platform. This immediately led to problems. For starters, the sound card that shipped with the Dell machine would not work under Linux. When we finally had a new sound card installed in the machine, the card turned out to be half-duplex, making use of the Data layer rather awkward. Rather late in the game (December) we decided to switch over to the Windows platform, but our group computer was to remain a Linux machine, leaving it useless for testing.

 

Win32 multimedia APIs

Another difficulty was the Win32 API. When we converted from UNIX to Win32 we had to completely re-engineer our sound subsystem. In Linux, reading and writing to the sound card is as easy as opening /dev/audio as a file descriptor and reading from and writing to it. We had assumed that a similar model would be used in Win32. Unfortunately, we needed a messaging thread, circular buffers, and two semaphores to make the system work correctly. Perhaps this part of the project should be made available as a simple COM DLL next semester; our implementation was a lot simpler to use than the implementation provided to us on the cs519 web page.

 

Coordination Problems

It was very difficult for our group to coordinate efforts. Our group had only three formal meetings during the semester (one of them over Thanksgiving, when some of the group members were at home). Because of this, each team developed its interfaces without consulting the other groups about requirements. We set our date for integration far too late in the semester (December 7th). When we finally did meet with Signaling and Gateway to work out details, much of our early effort was devoted to explaining interfaces, their proper usage, and their problems. However, once we started talking to people, several issues that should have been apparent much earlier in the process began to appear. For one, we discovered that the gateway was not sending us linear samples but μ-law samples. We saw some changes that needed to be made to our DatagramPipe to support multi-conferencing. We also found out about an RMI/JDirect compatibility issue, which is described below.

 

Problems testing our implementation

In order to do meaningful testing we required two machines with full-duplex sound cards communicating over a LAN with a reasonable amount of background activity. Thus, the best environment for testing would be during the day, when people were actively using network resources. Unfortunately, our group machine was a dedicated Linux box with a half-duplex sound card, making it useless as a test machine. So in order for us (and anyone else using audio) to test effectively, we had to obtain logons to two other group computers and wait until the groups that owned them were not using them. Although this was a major inconvenience, it was possible to find some times when we could use two machines to perform tests.

In order to make testing easier we created a version of our application that acted as a reflector: when RTP packets were received, they were immediately queued for retransmission to the source. This allowed us to do non-interactive testing using only one group computer and any other computer available in the lab.

The real difficulty came when we needed to perform tests with Gateway. For this to happen we needed a group machine (other than our own) to be free while our gateway team had reserved use of the gateway. Needless to say, this was often not possible, and we ended up using most of our testing time figuring out where and how to test. In general it took at least an hour for everyone to get set up, leaving a maximum of one hour for running tests. For each test we performed, we had to run the test, find where we thought the bug was, make changes to our code, and finally set up for another test. The end result was that we couldn't get more than a few real tests in during our two hours.

Another problem with testing with Gateway was that they were never in the same room as us; it would have been much easier to work out problems had both teams been in the same room while testing the code.

 

JDirect and RMI problems

A major disappointment was that the interfaces we had developed over the course of the semester could not be used by many of the other groups because of an incompatibility between the Microsoft Java Virtual Machine (MSJVM) and Sun's Java Virtual Machine (SJVM). In our last meeting of the semester (around December 1st), the management team decided that we could no longer use the Linux platform and explicitly stated that all applications would have to switch to the MSJVM. We went ahead with these changes and developed a system that used JDirect to access our Windows DLL, which in turn drives the sound card through the Win32 API.

JDirect is a native-method-call substitute that is significantly easier to develop with than Sun's JNI. Other teams (primarily Signaling) began using our interface in their code, and we began to work out the problems we found in our implementations. However, on December 16th, when a large-scale test was to be performed, we discovered that the Management, Directory Services, and Accounting code required Remote Method Invocation (RMI, Sun's RPC for Java). RMI is incompatible with the MSJVM and hence with the code we had developed.

Because the final and our demo were the next day, we needed an immediate solution to this problem. Re-engineering our code was not an option; the performance tweaks we had spent several weeks perfecting would likely have had to be redone. We attempted to recompile Sun's RMI Java sources, with no luck. We then decided that it would be possible to wrap our API and make it available using a socket. However, when we examined this option we realized that because function calls often took variable parameters, and some calls had to be made interactively, implementing such a system would mean creating a full-blown RPC system. Although this would have been the best choice from a technical standpoint (other than converting to JNI), we simply did not have the time to talk to all the groups, gather requirements, and create and then debug such a system.

As a "hack", we made use of a class we had used for debugging. It takes command-line parameters to decide which Pipes to use for a source and destination, then executes our code. All components that needed to call our interface would have to call Runtime.exec("Data.exe <parameters>") to launch our wrapper. This involved some technical challenges of its own. It turns out that subprocesses spawned by the current running process are not given a system console, so all calls to System.out.println() block indefinitely. We not only had to remove all System.out.println calls from our Java code to make it compatible with this system, but we also had to remove all printf calls from the DLL.

 

Difficulties Normalizing Playout

As we mentioned above, the most difficult subsystem to design, test, optimize, and retest was the PlayoutMediator and its associated PacketQueue. The goal of the system was to improve the playout quality of audio delivered from a network source to the computer speakers. We knew that this involved a few key factors: we had to delay packets so that out-of-order packets had time to arrive, we had to make sure that this delay was neither too big nor too small, and we had to know what to do if we never received a particular packet. The key issue in all of this was the delay between the times we took packets off of the buffer. We needed to pull packets off the queue only as fast as the local sound hardware would play them: if we went too fast we would play a lot of silent packets, and if we went too slow we would have choppy sound. Getting these details correct was difficult; they are described in the sections above.
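In steady state the behavior reduces to a paced loop like the sketch below, where nextPacket() is a hypothetical PacketQueue method that substitutes reconstructed silence for a missing packet. At 8 kHz and 8 bits per sample, a 160-byte packet covers 20 ms of audio.

static void playoutLoop(PacketQueue queue, PacketFactory destination)
        throws InterruptedException {
    while (true) {
        byte[] packet = queue.nextPacket();  // returns a silence packet on a gap
        destination.writePacket(packet);     // hand 20 ms of audio to the next stage
        Thread.sleep(20);                    // pull no faster than the playback rate
    }
}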

 

 

 

Knowledge Gained, Lessons Learned

During the semester we learned a great deal about audio. For one, we gained an understanding of audio encoding schemes, including μ-law and linear PCM: what their purpose is and why they work. This understanding allowed us to write a simple silence-detection function. To detect silence, we take the average of all the samples in the packet and check whether the average is below some threshold. After finding a good threshold by trial and error, we were able to cut the transmitted size of a file in half, and the transmitted file was indistinguishable from the original when silence was reconstructed.
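A sketch of that test, using the mulawToLinear() helper sketched earlier; averaging sample magnitudes rather than signed values is our assumption here, and SILENCE_THRESHOLD is a placeholder, not the value we actually used.

static final int SILENCE_THRESHOLD = 500;  // placeholder; found by trial and error

static boolean isSilent(byte[] packet) {
    long total = 0;
    for (byte b : packet) {
        total += Math.abs(mulawToLinear(b));  // magnitude of each linear sample
    }
    return (total / packet.length) < SILENCE_THRESHOLD;
}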

 

In addition, we gained a good understanding of wave files and how they are built. This understanding was necessary to output wave files readable by commercial wave players. Wave files are made of chunks, and each chunk has its own data fields; if any field contains a wrong value, an audio player may reject the wave file as corrupted.

 

We also got a great deal of hands-on experience with real-time communication over the Internet and saw the difficulties one would face even if the communication occurs over a LAN instead of a WAN. Because packets may arrive in random order, or not at all, it is very difficult to create a system that has both a small delay and reliable playback. Having thought about the problem for so long, we have been able to think of new, more efficient ways of ensuring quality. If we had more time for the project (or if we had started earlier), it might have been possible to do some kind of error control for dropped packets; possibly that would mean sending two different sets of samples per packet. Another idea we were toying with was the ability to fill in a dropped packet with one that more closely approximates the packets on either side of it, so that the dropped packet does not get played as silence. We are sure that many other tricks are possible, such as variable packet sizes, better compression techniques, and better error detection.

 

Besides the technical challenges, we also saw the value of a development process for a large development team. Unfortunately, our team did not adequately use many of the techniques that can speed up development cycles. During the last week or so of development, however, we entered into an iterative development process in which we would add components at each step and stabilize them before continuing. This worked very well between Data, Gateway, and Signaling, and could probably have been applied effectively to the entire development team.

 

Even more importantly, we could see the need for good management in a team as large as 20. Although management does not write any of the basic functionality, it has a large impact on the outcome of the project. Unfortunately, it is hard to come by good managers, and to blame our fellow students for managing poorly would be unreasonable.

 

Interface

The interface to our code is rather simple. First, instantiate two Pipes for local and remote data. Then pass these Pipes to the CommunicationsFacade and call activate() to begin communications. Use stop() to hang up the connection. Please see the attached JavaDoc for details on how to construct other types of Pipes.

 

Pipe local = new AudioPipe();                      // microphone in, speakers out
Pipe remote = new DatagramPipe("hostname", port);  // RTP datagrams over UDP
CommunicationsFacade data = new CommunicationsFacade();
data.initialize(local, remote);
data.activate();   // begin moving audio in both directions
// ... conversation in progress ...
data.stop();       // hang up

 

Project Suggestions

We believe that we had enough support for the technical issues in this project; resources were made available to get the technical part of the project done. However, we needed more structure in scheduling goals and tasks to be completed by certain dates. Without that kind of structure, we ended up doing a great deal of the work in the last several weeks of the semester.

 

Integration checkpoints are one possible way of providing such scheduling. Although the management group should impose them internally, it might be beneficial if they were mandated and used to encourage incremental development of the system. Milestones and corresponding dates should be assigned at the onset of the project. Not only would this encourage students to begin work sooner, it would also teach them how difficult it is to accurately determine the scope of a project.

 

Overall, management needs more guidance; perhaps a book such as Rapid Development by Steve McConnell could be recommended as suggested reading. It might also help for management teams to hold weekly meetings with the course staff and the other management teams to discuss each group's progress and help resolve any problems. In that way the management teams would have some support in leading a team of 18 developers.

 

Finally, unrelated to issues of scheduling and management, there were the problems with the Gateway that we mentioned above. Testing was difficult because time slots were only two hours long and no group could sign up for two consecutive slots, so we had to go through the overhead of setting up every two hours. To remedy this, either the signup segments should be longer (three or even four hours), or groups should be able to sign up for consecutive slots. One possible policy would be that a group can sign up for two or three slots but must then wait that same number of slots before signing up again.

 

References

File Formats, http://www.goice.co.jp/member/mo/formats/index.html

FreePhone from INRIA, http://www.inria.fr/rodeo/fphone/

Java Technology Home Page, http://www.javasoft.com

Kientzle, Tim. A Programmer's Guide to Sound. Addison-Wesley.

Programmer's Heaven Sound Programming Page, http://www.programmersheaven.com/users/nathan/mainsnd.htm

RFC 1889, RTP: A Transport Protocol for Real-Time Applications

RTP: About RTP and the Audio-Video Transport Group, http://www.cs.columbia.edu/~hgs/rtp/

Team reports

vat - LBNL Audio Conferencing Tool, http://www-nrg.ee.lbl.gov/vat/