html filter to ban pop ups of all kinds

Filtering out html content


Html Filter is a tool I wrote after being fed up with pop up windows of all kinds.

Preventing pop up windows from opening automatically is of course a very effective use of html filtering. What makes the technique more interesting is that it has other applications as well.


Html Filter ?

I have spent a lot of time trying to close them automatically using "third-party tools". Many of these tools periodically check the open windows against a known dictionary of banned window names. This works fine, as long as you're happy adding new entries every other day, since ad names change all the time. In addition, seeing all those windows open and then suddenly get destroyed makes you feel out of control, which is hardly acceptable.

So I had to find a more internal way of doing it. Having already tried to work with IE a couple of times, I knew there were many limits and weird things happening there, such as subscribed events which for some reason sometimes don't fire at all. I thought I had to find something more radical, and less coupled to the navigator I was using.

I finally came up with a proxy filter: a systray tool which, once configured, relays every HTTP packet back and forth, with the unique opportunity of seeing the Html content itself.

This opportunity is great. Applying predefined filtering rules makes it possible, for instance, to remove all kinds of nasty javascript known for bringing pop ups on screen; you know, those nasty (url, "doubleshit", ...) things.


Configuring the tool

Once installed, it starts listening on the default port, 8010. If you are already using this port, change it; that's what the dialog box is for. Of course, you must let the navigator know that you are listening there, so open the Windows control panel, then double-click on Internet Options. In the Connections tab, edit the Proxy Settings and click on Advanced. In the HTTP row, type the address of the machine running the tool in the Proxy address to use field, and type 8010 in the Port field. Apply. Ok, you're done. You can go back and surf the web as you did before, without notable changes (at least on the surface).

To start the tool on a port other than 8010, you can also pass it on the command line, for instance htmlfilter.exe 8020. The command line is useful if you intend (like I do) to add a registry entry so the tool starts at boot time.
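
As an illustration of the command-line behaviour described above, here is a minimal sketch of how the port argument could be read. ParsePortArg is a hypothetical helper, not taken from the tool's source, and falling back to the default port on a bad value is an assumption made for this sketch.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper (not from the tool's source): returns the port given
// as the first command-line argument, or defaultPort when the argument is
// missing or is not a valid TCP port number.
int ParsePortArg(int argc, const char* const* argv, int defaultPort)
{
    if (argc < 2 || argv[1] == nullptr || argv[1][0] == '\0')
        return defaultPort;

    char* end = nullptr;
    long value = std::strtol(argv[1], &end, 10);

    // Reject trailing garbage ("80x") and out-of-range values.
    if (*end != '\0' || value < 1 || value > 65535)
        return defaultPort;

    return static_cast<int>(value);
}
```

A main function would typically call it as ParsePortArg(argc, argv, 8010) and hand the result to StartProxy.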

If you are using Netscape or even Opera, just change the proxy settings using a similar procedure. For Netscape, go to Edit / Preferences, then Advanced / Proxy, and edit the HTTP Proxy field.

Now, you must also let the tool know whether you have a direct Internet connection or use the corporate proxy server at your workplace. For a direct Internet connection, just leave Use corporate proxy unchecked. For a corporate connection, check the box, then fill in the two fields: the proxy's DNS name and its listening port (3128, for instance). You are expected to know this information (check your LAN Internet settings, check the automatic detection script, ...).

The filter is activated by default, which means the Html content going through it is filtered, and rules are applied. The source code provided filters statements, replacing them with faked // and it is up to you to add any other relevant rules in the CHtmlFilterRules class implementation. To disable filtering, just right-click in the systray and choose the option.

I also wanted the tool not to slow down the surfing experience. This goal is achieved by using plain sockets instead of MFC wrappers such as CAsyncSocket (which in turn messes around a lot with _afxSockThreadState).


Technical details

This tool acts as a proxy server. It basically implements a double-threaded socket line. The code is based on Nish's pop proxy server. Let me explain the two types of connection: either you have a direct connection provided by your ISP (through a RAS setup, for instance), or you connect to the internet through the corporate proxy server at your workplace. The two cases are depicted below :

How the html filter works with a direct internet connection


How the html filter works with a corporate connection


With a direct Internet connection, the incoming requests are parsed by the tool so we get to know which web server is being asked to respond (Host: HTTP header). Once we have it, we can connect to that server, pass the request, and wait for a response.
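
To make the parsing step concrete, here is a minimal sketch of pulling the Host header out of a raw request. ExtractHost is a hypothetical helper written for this example; the tool's actual parsing code may look quite different.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper (not from the tool's source): returns the value of the
// Host header of a raw HTTP request, or an empty string if there is none.
std::string ExtractHost(const std::string& request)
{
    const std::string name = "host:";
    std::size_t lineStart = 0;

    while (lineStart < request.size())
    {
        std::size_t lineEnd = request.find("\r\n", lineStart);
        if (lineEnd == std::string::npos) lineEnd = request.size();
        if (lineEnd == lineStart) break; // empty line: end of the headers

        // Header names are case-insensitive, so compare in lowercase.
        bool match = (lineEnd - lineStart >= name.size());
        for (std::size_t i = 0; match && i < name.size(); ++i)
            match = std::tolower(static_cast<unsigned char>(request[lineStart + i])) == name[i];

        if (match)
        {
            // Trim spaces and tabs around the value.
            std::size_t first = lineStart + name.size();
            while (first < lineEnd && (request[first] == ' ' || request[first] == '\t')) ++first;
            std::size_t last = lineEnd;
            while (last > first && (request[last - 1] == ' ' || request[last - 1] == '\t')) --last;
            return request.substr(first, last - first);
        }

        lineStart = lineEnd + 2; // skip "\r\n"
    }
    return std::string();
}
```

The returned value may still carry a ":port" suffix, which the caller has to split off before connecting.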

With a corporate proxy connection, the incoming request is not parsed at all. We just pass it to the corporate server listening on its port, and this server makes the actual Internet connection. The response is then sent back to us. There is no difference in the code or behaviour for getting the response; the only difference to expect is the response time, which should be higher because there are two proxies between the client and the actual web server.
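
The choice between the two cases boils down to picking which machine to connect to. The sketch below shows one way to express it; Endpoint, ChooseUpstream and the host names used in the usage note are all hypothetical, and the sketch assumes the Host header value has already been extracted from the request.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical type, for illustration only.
struct Endpoint
{
    std::string host;
    int port;
};

// hostHeader is the value of the request's Host header ("name" or "name:port").
Endpoint ChooseUpstream(bool useCorporateProxy,
                        const std::string& corporateHost, int corporatePort,
                        const std::string& hostHeader)
{
    // Corporate case: the request is forwarded untouched to the proxy.
    if (useCorporateProxy)
        return Endpoint{ corporateHost, corporatePort };

    // Direct case: split an optional ":port" suffix, default to HTTP port 80.
    Endpoint e{ hostHeader, 80 };
    std::size_t colon = hostHeader.rfind(':');
    if (colon != std::string::npos)
    {
        int port = std::atoi(hostHeader.c_str() + colon + 1);
        if (port > 0 && port <= 65535)
        {
            e.host = hostHeader.substr(0, colon);
            e.port = port;
        }
    }
    return e;
}
```

For example, ChooseUpstream(false, "proxy.corp.example", 3128, "www.example.com:8080") yields www.example.com on port 8080, while passing true yields the corporate proxy regardless of the header.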


The main class is declared as below :
class CHttpProxyMT
{
  // Members
  SOCKET    m_HttpServerSocket;
  HANDLE    m_ServerThread;
  int       m_port;
  BOOL      m_bRunning;

  // Destructor
public :
  virtual ~CHttpProxyMT();

  // Methods
public :
  BOOL StartProxy(int port);
  BOOL IsRunning();
  void StopProxy();

  int GetProxyPort();
  int GetNBConnections();
  void EnableFiltering(BOOL bEnable=TRUE);

  // Internal
  // The thread that listens for connections
  DWORD MServerThread();
  static DWORD ServerThread(void *arg); // thread callback
  // The thread that receives incoming navigator requests
  void StartClientThread(SOCKET sock);
  static DWORD ClientThread(DWORD arg); // thread callback

  // The thread that sends and retrieves server responses
  static void StartDataThread(void *parm);
  static DWORD DataThread(void *parm); // thread callback
};


struct socket_pair
{
  socket_pair(SOCKET s1, SOCKET s2, BOOL bServerResponse)
  {
    srcsock = s1;
    dstsock = s2;
    bIsServerResponse = bServerResponse;
    n = 0;
  }

  SOCKET srcsock;
  SOCKET dstsock;
  BOOL bIsServerResponse;
  int n;
  char buff[16384+1];
};

What's funny is when you start working with threads. Suddenly, everything gets really messy. Indeed, every variable is under the potential fire of being accessed by several threads at the same time, making it harder to code practically anything. I ended up associating a socket_pair instance with each thread and referring to this object in basically every line of code, to make sure I was more or less thread-safe. But it's painful: what one needs at this point is an easy framework to attach variables to the running thread. It becomes so hard just because under WIN32 the thread callback is a static (read global) function, thus used and reused by every thread.
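
One common way around the static callback, sketched here under assumptions, is to pass a per-thread object as the callback argument and forward immediately to a member function, so all state lives in that object. ConnectionWorker and its members are hypothetical names. In a real Win32 program the static function would be handed to CreateThread with the object pointer as the thread parameter; the sketch calls it directly so it stays portable.

```cpp
#include <cassert>

// Hypothetical class, for illustration only.
class ConnectionWorker
{
public:
    explicit ConnectionWorker(int id) : m_id(id), m_bytesRelayed(0) {}

    // Static entry point: one function shared by every thread.
    static unsigned long ThreadEntry(void* arg)
    {
        // All per-thread state is reached through the argument.
        return static_cast<ConnectionWorker*>(arg)->Run();
    }

    int BytesRelayed() const { return m_bytesRelayed; }

private:
    unsigned long Run()
    {
        m_bytesRelayed += 100 * m_id; // stand-in for the real relay loop
        return 0;
    }

    int m_id;
    int m_bytesRelayed;
};
```

This is the same idea as the socket_pair approach: each thread touches only the object it was given, so no state is shared unless you share it on purpose.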

In the end, I have code like this when it comes to catching server responses and passing them back to the client :
DWORD CHttpProxyMT::DataThread(void *parm)
{
  socket_pair* spair = (socket_pair*) parm;

  // recv bytes from server and send them back to the client, once filtered
  while( (spair->n=recv(spair->srcsock, spair->buff, 16384, 0))>0 )
  {
    spair->buff[spair->n] = 0; // buff holds 16384+1 bytes, so this always fits

    if (g_bFilteringEnabled && spair->bIsServerResponse)
    {
      CHtmlFilterRules filter( spair->buff,spair->n );
      filter.ApplyRules(); // patches spair->buff in place
    }

    send(spair->dstsock, spair->buff, spair->n, 0);
  }

  return 0;
}


Which rules to apply depends on what you intend to do. Basically I wanted to comment out pop up window javascript code, but the concept can be used for many other applications. Using filtering to forbid pop up windows follows from the fact that, in Html, the way to open pop up windows is through the javascript command. Commenting out this line knocks it out, which is what we are looking for. Here is the code for it :

CHtmlFilterRules::CHtmlFilterRules(char *buffer, int nLength)
{
  m_cpBuffer = buffer;
  m_nLength = nLength;
}

BOOL CHtmlFilterRules::ApplyRules()
{
  if (!m_cpBuffer || !m_nLength) return FALSE; // we are already done!

  // copy the buffer, in order to be able to compare the strings regardless of the case
  char *buf = new char[m_nLength+1];
  if (!buf) return FALSE;
  memcpy(buf, m_cpBuffer, m_nLength);

  buf[m_nLength]=0; // force a terminating NUL to allow str C-routines to work
  strlwr(buf); // convert to lowercase (CPU time here)

  char *szPattern = buf;
  char *szFirstByte = buf;
  while ( (szPattern=strstr(szPattern,""))!=NULL )
  {
    // replace by // 
    // so the javascript code doesn't create any annoying popup
    m_cpBuffer[szPattern-szFirstByte+0] = '/';
    m_cpBuffer[szPattern-szFirstByte+1] = '/';

    szPattern++; // step past this match, otherwise strstr finds it again forever
  }

  delete [] buf;

  return TRUE;
}
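
For readers who want to try the idea outside the proxy, here is a self-contained sketch of the same technique in portable C++. CommentOutPattern is a hypothetical function, not part of the tool. The pattern is a parameter; the usage note passes "window.open(", which is my assumption about the javascript call being targeted, since the listing does not show the literal.

```cpp
#include <cassert>
#include <cctype>
#include <cstring>
#include <string>

// Hypothetical free function, for illustration only. The pattern must be
// given in lowercase and be at least two characters long.
// Returns the number of occurrences that were commented out.
int CommentOutPattern(char* buffer, int length, const char* pattern)
{
    const std::size_t patLen = std::strlen(pattern);
    if (!buffer || length <= 0 || patLen < 2) return 0;

    // Lowercase copy, so the search is case-insensitive while the original
    // buffer keeps its case everywhere except the two patched bytes.
    std::string lower(buffer, static_cast<std::size_t>(length));
    for (std::size_t i = 0; i < lower.size(); ++i)
        lower[i] = static_cast<char>(std::tolower(static_cast<unsigned char>(lower[i])));

    int count = 0;
    std::size_t pos = 0;
    while ((pos = lower.find(pattern, pos)) != std::string::npos)
    {
        buffer[pos] = '/';
        buffer[pos + 1] = '/';
        pos += patLen; // continue after this match
        ++count;
    }
    return count;
}
```

Called on a buffer containing Window.Open('ad.html') with the pattern "window.open(", it rewrites the call in place to //ndow.Open('ad.html'). The length of the buffer never changes, which is why the result can be sent to the client as is. Note that a pattern split across two received buffers would not be seen by this kind of per-buffer search.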

Code listing : (both VC6 and VC7 workspaces provided)


Update history

Oct 12 - code complete
Oct 13 - cmdline added
               corporate proxy added



Stephane Rodriguez - Oct 13 2002.