Bridging self-regulation and content-filtering

NCSR "Demokritos", Institute of Informatics & Telecommunications
Internet Content Filtering Group
Software and Knowledge Engineering Lab
2004

Research activities of i-config
- Language Technology
- Image Understanding
- Knowledge Discovery in Data
- User Modeling
- Multimedia Information Processing

Core technologies and applications
Technologies:
- Information extraction
- Information filtering
Filter generator:
- Web filtering
- Spam filtering
- Classifying financial news
- Filtering digital content
Original docs (e.g. messages, news from e-agencies, web pages) are sorted into category 1 (e.g. complaints, financial news), category 2 (e.g. tech support, sports news), and so on up to category n.

Internet Content
- ~98% safe
- Uncontrolled
- Total and direct access
- Lack of provenance (relative to absolute)
- High volatility

Unsafe content
- Illegal: Pedophiles, Nazism (DE)
- Offensive: Pornography, Racism, Violence
- Undesired: Online gambling, Day trading sites

Where and how to filter
- Self-regulation
- Filtering at the source
- Filtering during distribution
- Filtering at the last mile

Self-regulation
- Self-labeling: labeling by content authors / producers
- Browsers block according to user settings
- ICRA v.1.0/RSACi: (n 3 s 4 v 0 l 4)
- New, more expressive vocabulary incl. context: ICRA v. 2.0 (http://www. ...)

Filtering at the source / during distribution
- Literally impossible due to network structure, lack of provenance and routing method (cf. legal case against Yahoo! France)

Filtering at the last mile (consumer)
- List-based solutions lead to underblocking
- Shallow keyword-matching solutions lead to overblocking

FilterX: Web page filtering
FilterX is a Web proxy server that filters pornographic content on the Web. Having been trained with suitable examples, FilterX operates in real time. Combining natural language processing, image analysis and Web structure, FilterX analyses all the information available on the HTTP stream, not just the URL or title.
Using machine learning, FilterX considers the actual contribution of textual, structural and pictorial features. By creating a multimedia representation model for each document, FilterX achieves practically zero overblocking.

Last-mile applications of FilterX
Self-regulation and filters
- SIFT: use of filters when self- or 3rd-party labeling is absent or not trusted. Resulted in ICRAplus, a free platform bridging self-regulation with filtering software.
Protection of young students
- SCOFI: different content access according to student age, via smartcard

SIFT Platform
- Communication through W3C standards (HTTP, Label Bureau)
- Easy-to-implement public API
- Literally ANY filter can become a module
- High security: DSig
- Will run on Win OS
[Architecture diagram. Labelled components: Internet; TCP/IP; caching system; proxy; ICRAfilter, comprising a PICS-Rules based filter, security mechanisms, transparent proxy mechanism, traffic interception module, protocol blocking mechanism and Web-based user interface; FILTER 1, FILTER 2, ..., FILTER n, each with an HTTP adaptation module and a Label Bureau interface using the proxy label bureau protocol (HTTP); browser; user.]

- In the absence of author-provided ICRA labels, the requested page can still be blocked by co-operating filters.
- The public API can help filter vendors wrap their existing software into an ICRAplus module in no time.
- Users can override blocks temporarily or permanently.
- Filters can be combined per user profile either in a simple manner, with predetermined factory settings, or with full user control over the filtering procedure, incl. filter priority, filter weights and tie resolution.
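The per-profile combination of filters described above (priority, weights, tie resolution) can be illustrated with a minimal sketch. This is not the ICRAplus implementation or its public API: the `Verdict` record, the `combine` function, the filter names and the rule "weighted vote, ties decided by the highest-priority filter" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Verdict:
    """One filter module's opinion on a page (hypothetical structure)."""
    filter_name: str  # illustrative module name, not a real ICRAplus module
    block: bool       # True if this filter votes to block the page
    weight: float     # user-assigned weight of this filter
    priority: int     # lower number means higher priority


def combine(verdicts: List[Verdict], block_when_empty: bool = False) -> bool:
    """Return True if the page should be blocked.

    Assumed policy: sum the weights of blocking and allowing filters;
    the larger sum wins. On a tie, the highest-priority filter decides.
    With no verdicts at all, fall back to block_when_empty.
    """
    if not verdicts:
        return block_when_empty
    block_score = sum(v.weight for v in verdicts if v.block)
    allow_score = sum(v.weight for v in verdicts if not v.block)
    if block_score != allow_score:
        return block_score > allow_score
    # Tie resolution: defer to the filter with the highest priority.
    return min(verdicts, key=lambda v: v.priority).block


# Example profile: an image-based filter outweighs a label-based one.
profile_verdicts = [
    Verdict("label_filter", block=False, weight=1.0, priority=2),
    Verdict("image_filter", block=True, weight=2.0, priority=1),
]
decision = combine(profile_verdicts)
```

A "factory settings" profile would correspond to fixed weights and priorities shipped with the software, while a full-control profile would let the user edit them.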
More info and downloads at

[SCOFI architecture diagram. Labelled components: Client (PC) on Net 1 (classroom) with Web browser, SGO client and smart card; HTTP request / HTTP response; echp/auth handshake on 7751/udp; SCOFI server with packet filter, sproxyd (SCOFI/SGO proxy, 8080/tcp), SGO server, sad (SCOFI/SGO auth daemon, handling auth requests and echp/auth request/response), squix (Filterix-enhanced Squid proxy server, 8081/tcp) and access config; Net 2 (ISP); Internet; image server with image classifier.]
- SmartCard-based authentication and age profile
- Different levels of filtering
- High security: DSig
- External image analysis

FilterX revisited
- Trained on self-proclaimed porn sites
- Creation of page-level representation (multimedia and structural)
- Turn noise to our advantage by using it as a feature
- Models per language + language identifier
- Evaluated using multi-fold cross-validation (before SIFT & SCOFI)
- Can pass the decision to the user for thresholding

FilterX Results
[Figure: Filtering obscene Web pages. Obscene precision and obscene recall (axis from 70% to 100%) plotted against the number of retained attributes.]

A case for
- Instead of learning what is spam, learn what is legit
- User profile/model based on user's Inbox + spam
- Turn noise to our advantage by using it as a feature
- Models per language + language identifier
- Evaluated in house + multi-fold cross-validation

Spam filter results

Food for thought
- Decide where to put the bias: overblocking vs. underblocking
- Most of the harmful content wants to be found
- Possibility of hostile users! (enforced s/w!)
- Very hard to detect the intention of the content author
- Reverse the problem of filtering by centering on the effect on the user
- Technology only to solve problems it has created!

More info
Konstantinos Chandrinos
Internet Content Filtering Group (i-config)
NCSR Demokritos
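The thresholding option mentioned under "FilterX revisited" and the bias question under "Food for thought" can be illustrated with a small sketch. It assumes a classifier that outputs a score per page, and it uses made-up scores and labels; none of the numbers are FilterX results, and `precision_recall` is a hypothetical helper.

```python
from typing import Sequence, Tuple


def precision_recall(
    scores: Sequence[float], labels: Sequence[bool], threshold: float
) -> Tuple[float, float]:
    """Precision and recall of the 'obscene' class.

    A page is blocked when its score is >= threshold. labels[i] is True
    when page i really is obscene. By convention (an assumption here),
    precision is 1.0 when nothing is blocked and recall is 1.0 when
    there are no obscene pages.
    """
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y)      # correctly blocked
    fp = sum(1 for s, y in pairs if s >= threshold and not y)  # overblocked
    fn = sum(1 for s, y in pairs if s < threshold and y)       # underblocked
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Made-up classifier scores and ground truth, for illustration only.
scores = [0.95, 0.80, 0.60, 0.40, 0.10]
labels = [True, True, False, True, False]

for threshold in (0.3, 0.5, 0.9):
    p, r = precision_recall(scores, labels, threshold)
    print(f"threshold={threshold:.1f} precision={p:.2f} recall={r:.2f}")
```

On this toy data a strict threshold (0.9) blocks only one page, so nothing is overblocked but two obscene pages slip through. A lenient threshold (0.3) catches every obscene page at the cost of one overblocked page. Letting the user pick the threshold is one way of letting the user decide where the bias goes.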
