Home Page
Archive > Posts > Tags > Convert
Search:

Weird filename encoding issues on windows

So somehow all of the file names in my Rammstein music directory, and some in my Daft Punk, had characters with diacritics replaced with an invalid character. I pasted one of such filenames into a hex editor to evaluate what the problem was. First, I should note that Windows encodes its filenames (and pretty much everything) in UTF16. Everything else in the world (mostly) has settled on UTF8, which is a much better encoding for many reasons. So during some file copy/conversion at some point in the directories’ lifetime, the file names had done a freakish (utf16*)(utf16->utf8) rename, or something to that extent. I had noticed that all I needed to do was to replace the first 2 bytes of the diacritic character with a different byte. Namely “EF 8x” to “Cx”, and the rest of the bytes for the character were fine. So if anyone ever needs it, here is the bash script.

LANG=;
IFS=$'\n'
for i in `find -type f | grep -P '\xEF[\x80-\x8F]'`; do
	FROM="$i";
	TO=$(echo "$i" | perl -pi -e 's/\xEF([\x80-\x8F])/pack("C", ord($1)+(0xC0-0x80))/e');
	echo Renaming "'$FROM'" to "'$TO'"
	mv "$FROM" "$TO"
done

I may need to expand the range beyond the x80-x8F range, but am unsure at this point. I only confirmed the range x82-x83.

Getting HTML from Simple Machine Forum (SMF) Posts

When I first created my website 10 years ago, from scratch, I did not want to deal with writing a comment system with HTML markups. And in those days, there weren’t plugins for everything like there is today. My solution was setting up a forum which would contain a topic for every Project, Update, and Post, and have my pages mirror the linked topic’s posts.

I had just put in a quick hack at the time in which the pulled SMF message’s body had links converted from bbcode (there might have been 1 other bbcode I also hooked). I had done this with regular expressions, which was a nasty hack.

So anywho, I finally got around to writing a script that converts SMF messages’ bbcode to HTML and caches it. You can download it here, or see the code below. The script is optimized so that it only ever needs to load SMF code when a post has not yet been cached. Caching happens during the initial loading of an SMF post within the script’s main function, and is discarded if the post is changed.

The script requires that you run the query on line #3 of itself in your SMF database. Directly after that are 3 variables you need to set. The script assumes you are already logged in to the appropriate user. To use it, call “GFTP\GetForumTopicPosts($ForumTopicID)”. I have the functions split up so you can do individual posts too if needed (requires a little extra code).


<?
//This SQL command must be ran before using the script
//ALTER TABLE smf_messages ADD body_html text, ADD body_md5 char(32) DEFAULT NULL;

namespace GFTP;

//Forum database variables
global $ForumInfo;
$ForumInfo=Array(
    'DBName'=>'YourDatabase_smf',
    'Location'=>'/home/YourUser/www',
    'MessageTableName'=>'smf2_messages',
);

function GetForumTopicPosts($ForumTopicID)
{
    //Change to the forum database
    global $ForumInfo;
    $CurDB=mysql_fetch_row(mysql_query('SELECT database()'))[0];
    if($CurDB!=$ForumInfo['DBName'])
        mysql_select_db($ForumInfo['DBName']);
    $OldEncoding=SetEncoding(true);

    //Get the posts
    $PostsInfos=Array();
    $PostsQuery=mysql_query('SELECT '.implode(', ', PostFields())." FROM $ForumInfo[MessageTableName] WHERE id_topic='".intval($ForumTopicID).
        "' AND approved=1 ORDER BY id_msg ASC LIMIT 1, 9999999");
    if($PostsQuery) //If query failed, do not process
        while(($PostInfo=mysql_fetch_assoc($PostsQuery)) && ($PostsInfos[]=$PostInfo))
            if(md5($PostInfo['body'])!=$PostInfo['body_md5']) //If the body md5s do not match, get new value, otherwise, use cached value
                ProcessPost($PostsInfos[count($PostsInfos)-1]); //Process the lastest post as a reference

    //Restore from the forum database
    if($CurDB!=$ForumInfo['DBName'])
        mysql_select_db($CurDB);
    SetEncoding(false, $OldEncoding);

    //Return the posts
    return $PostsInfos;
}

function ProcessPost(&$PostInfo) //PostInfo must have fields id_msg, body, body_md5, and body_html
{
    //Load SMF
    global $ForumInfo;
    if(!defined('SMF'))
    {
        global $context;
        require_once(rtrim($ForumInfo['Location'], DIRECTORY_SEPARATOR).DIRECTORY_SEPARATOR.'SSI.php');
        mysql_select_db($ForumInfo['DBName']);
        SetEncoding();
    }

    //Update the cached body_html field
    $ParsedCode=$PostInfo['body_html']=parse_bbc($PostInfo['body']);
    $EscapedHTMLBody=mysql_escape_string($ParsedCode);
    $BodyMD5=md5($PostInfo['body']);
    mysql_query("UPDATE $ForumInfo[MessageTableName] SET body_html='$EscapedHTMLBody', body_md5='$BodyMD5' WHERE id_msg=$PostInfo[id_msg]");
}

//The fields to select in the Post query
function PostFields() { return Array('id_msg', 'poster_time', 'id_member', 'subject', 'poster_name', 'body', 'body_md5', 'body_html'); }

//Swap character encodings. Needs to be set to utf8
function SetEncoding($GetOld=false, $NewSet=Array('utf8', 'utf8', 'utf8'))
{
    //Get the old charset if required
    $CharacterVariables=Array('character_set_client', 'character_set_results', 'character_set_connection');
    $OldSet=Array();
    if($GetOld)
    {
        //Fill in variables with default in case they are not found
        foreach($CharacterVariables as $Index => $Variable)
            $OldSet[$Variable]='utf8';

        //Query for the character sets and update the OldSet array
        $Query=mysql_query('SHOW VARIABLES LIKE "character_%"');
        while($VariableInfo=mysql_fetch_assoc($Query))
            if(isset($OldSet[$VariableInfo['Variable_name']]))
                $OldSet[$VariableInfo['Variable_name']]=$VariableInfo['Value'];

        $OldSet=array_values($OldSet); //Turn back into numerical array
    }

    //Change to the new database encoding
    $CompiledSets=Array();
    foreach($CharacterVariables as $Index => $Variable)
        $CompiledSets[$Index]=$CharacterVariables[$Index].'="'.mysql_escape_string($NewSet[$Index]).'"';
    mysql_query('SET '.implode(', ', $CompiledSets));

    //If requested, return the previous values
    return $OldSet;
}
?>
Pulling HTML from Github markdown for external use
Although, converting to markdown is a time consuming pain

So I started getting on the Github bandwagon FINALLY. I figured that while I was going to the trouble of remaking readme files for the projects into github markdown files, I might as well duplicate the compiled HTML for my website.

The below code is a simple PHP script to pull in the converted HTML from Github’s API and then do some more modifications to facilitate directly inserting it into a website.


Usage:
  • The variables that can be updated are all at the top of the file.
  • The script will always output the finished result to the user’s browser, but can also optionally save it to an external file by setting the $SaveFileName variable.
  • Stylesheet:
    • The script automatically includes a specified stylesheet from the $StylesheetLocation variable.
    • The stylesheet I used is from https://gist.github.com/somebox/1082608. I’m not too happy with its coloring scheme, but it’ll do for now.
    • The required modifications that need to be made to the css are to change “body” to “.GHMarkdown”, and then add “.GHMarkdown” before all other rules.
    • This is the one I am currently using for my website, but it also has a few modifications made specifically for my layouts.
  • Modifications
    • In my markdowns, I like to link to internal sections by first creating a bookmark as “<div name="BOOKMARK_NAME">...</div>” and then linking via “[LinkName](#BOOKMARK_NAME)”. While this works on github, the bookmark’s names are actually changed to something like “user-content-BOOKMARK-NAME”, which is not useable outside of github. The first $RegexModifications item therefore updates the bookmarks back to their original name, and turns them into <span>s (which github does not support).
    • The second rule just removes the “aria-hidden” attributes, which my W3C checking scripts throw a warning on.
  • Note that sometimes, the script may return an error of “transfer closed with XXX bytes remaining to read”. This means that github denied the request (probably due to too many requests in too short a timespan), but the input is too large so github prematurely terminated the connection. If this happens, try sending a tiny input and see if you get back a proper error.

<?php
//Variables
$SaveFileName='Output.html'; //Optionally save output to a file. Comment out to not save
$InputFile='Input.md';
$StylesheetLocation='github-markdown.css';
$RegexModifications=Array(
        '/<div name="user-content-(.*?)"(.*?)<\/div>/s'=>'<span id="$1"$2</span>', //Change <div name="user-contentXXX ---TO--- <span name="XXX
        '/ ?aria-hidden="true"/'=>'' //Remove aria-hidden attribute
);

//Set the curl options
$CurlHandle=curl_init(); //Init curl
curl_setopt_array($CurlHandle, Array(
        CURLOPT_URL=>           'https://api.github.com/markdown/raw', //Markdown/raw takes and returns plain text input and output
        CURLOPT_FAILONERROR=>   false,
        CURLOPT_FOLLOWLOCATION=>1,
        CURLOPT_RETURNTRANSFER=>1, //Return result as a string
        CURLOPT_TIMEOUT=>       300,
        CURLOPT_POST=>          1,
        CURLOPT_POSTFIELDS=>    file_get_contents($InputFile), //Pull in the requested file
        CURLOPT_HTTPHEADER=>    Array('Content-type: text/plain'), //Github expects the given data to be plaintext
        CURLOPT_SSL_VERIFYPEER=>0, //In case there are problems with the PHP ssl chain (often the case in Windows), ignore the error
        CURLOPT_USERAGENT=>     'Curl/PHP' //Github now requires a useragent to process the request
));

//Pull in the html converted markdown from Github
$Return=curl_exec($CurlHandle);
if(curl_errno($CurlHandle)) //Check for error
        $Return=curl_error($CurlHandle);
curl_close($CurlHandle);

//Make regex modifications
$Return=preg_replace(array_keys($RegexModifications), array_values($RegexModifications), $Return);

//Generate the final HTML. It will also be output here if not saving to a file
header('Content-Type: text/html; charset=utf-8');
if(isset($SaveFileName)) //If saving to a file, buffer output
        ob_start();
?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>Markdown pull</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<link href="<?=$StylesheetLocation?>" rel=stylesheet type="text/css">
</head><body><div class=GHMarkdown>
<?=$Return?>
</div></body></html>
<?php
//Save to a file if requested
if(isset($SaveFileName))
        file_put_contents($SaveFileName, ob_get_flush()); //Actual output happens here too when saving to a file
?>
Format Text [to HTML] Script

After writing the documentation in plaintext format for DSQL just now, I needed to convert it into HTML for the project’s page. I’ve done this before manually and it’s always very daunting, so I decided to really quickly write a script to do most of the work for me, which can be downloaded here, or the code seen below.

It has the following:
  • Input text box with HTML data that is instantly shown as HTML in a below section when modified.
    Both sections take up half the vertical screen space
  • Undo/redo buffer for the text box (very primitive functionality)
  • “Open in new page” button, which opens a new window with just the HTML data (useful for validation [W3C or whatnot]).
    This is disabled by default because it is a dangerous option (XSS exploitable, so the script would need to be secured/password protected if this was on)
  • “Escape HTML” escapes HTML characters so they are not improperly interpreted (e.x. “<” becomes “&lt;”)
  • Listize:
    • Turns tabbed lists into HTML
    • For example:
      1
      	2
      	3
      		4
      5
      would become:
      1
      • 2
      • 3
        • 4
      5


I realized while making the script that I should probably instead just start making my documentation in a markup (like GitHub’s) and then have that converted to HTML and text files. Oh well.



Code:
<? header('Content-Type: text/html; charset=utf-8'); ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>Format Text</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<?
$AllowRenderText=true; //Set to true only if this is in a secure environment, as directly outputting a given value can lead to XSS
if(isset($_REQUEST['RenderText']))
	return print '</head><body>'.($AllowRenderText ? $_REQUEST['RenderText'] : 'Rendering of text not allowed').'</body></html>';
?>
<style type="text/css">
html, body { width:100%; height:100%; margin:0; padding:0; }
.HalfScreen { display:block; width:calc(100% - 2px); height:calc(50% - 2px - 30px/2); margin:0; border:1px solid black; }
#RenderForm { overflow:hidden; }
#RenderText { margin:0; border:0; width:100%; height:100%; }
#RenderHTML { overflow-x:hidden; overflow-y:scroll; }
.TopBar { height:30px; background-color:grey; }
.Hide { position:absolute; visibility:hidden; top:-10000px; }
</style>

<script type="text/javascript" src="https://code.jquery.com/jquery-2.1.1.min.js"></script>
<script type="text/javascript">$(document).ready(function() {

//History for undoing
var UndoBuf=[], RedoBuf=[];
function Undo()
{
	if(!UndoBuf.length)
		return;
	RedoBuf.push(UndoBuf.pop());
	$('#RenderText').val(UndoBuf[UndoBuf.length-1]);
	$('#RenderHTML').html(UndoBuf[UndoBuf.length-1]);
}
function Redo()
{
	if(!RedoBuf.length)
		return;
	$('#RenderText').val(RedoBuf[RedoBuf.length-1]);
	$('#RenderHTML').html(RedoBuf[RedoBuf.length-1]);
	UndoBuf.push(RedoBuf.pop());
}
$('#Undo').click(function(e) { e.preventDefault(); Undo(); });
$('#Redo').click(function(e) { e.preventDefault(); Redo(); });

//Render HTML
function Render() {
	//Do the render
	var MyVal=$('#RenderText').val();
	$('#RenderHTML').html(MyVal);

	//Save current value to the history
	//*Better history functionality here would be real nice (using smart currentTarget.selectionStart/End calculations), along with an undo/redo button, but not within the scope of this project
	if(RedoBuf.length) //Empty redo buffer
		RedoBuf=[];
	UndoBuf.push(MyVal);
	if(UndoBuf.length>100) //Limit history buffer
		UndoBuf.shift();
}
$('#RenderText').on('keypress paste', function(e) { setTimeout(Render, 1); }); //Automatic update on paste requires a timeout

//Open in new page
$('#OpenInNewPage').click(function(e) {
	e.preventDefault();
	$('#RenderForm').submit();
});

//Escape HTML
$('#EscapeHTML').click(function(e) {
	e.preventDefault();
	$('#RenderText').val(function(index, value) {
		$.each({"&amp;":/&/g, "&lt;":/</g, "&gt;":/>/g, "&quot;":/"/g, "&#039;":/'/g}, function(HTMLStr, ReplStr) {
			value=value.replace(ReplStr, HTMLStr); });
		return value;
	});
	Render();
});

//Listize based on tabbing
//If a successive line is tabbed over beyond the current, it is made inside a new nested list.
//Tabbing over more than once on a successive line will create multiple nests
//Having @@@ at the beginning of a line will include it in the previous line item, no matter the tabbing
//Make sure to have @@@ blank lines tabbed over to the proper nested level
$('#Listize').click(function(e) {
	//Get the text to replace
	e.preventDefault();
	var T=$('#RenderText').val();

	//Go over each line and if the next line is tabbed beyond it, make it a new nested list. Blank
	var CurTabLevel=0, NewLines=[]; //NewLines is 2 items per line: the original string and the new html tags
	$.each(T.split(/\r?\n/), function(Index, Str) {
		//Check for a continued line item
		if(Str.substr(0, 3)=='@@@')
			return NewLines.push('<br>', Str.substr(3));

		//In/de-dent as needed
		var Tags='';
		var NewTabLevel=/^\t*/.exec(Str)[0].length, PreLevel=CurTabLevel; //Get the nested level
		for(;NewTabLevel>CurTabLevel;CurTabLevel++)
			Tags+='<ul><li>';
		for(;NewTabLevel<CurTabLevel;CurTabLevel--)
			Tags+='</li></ul>';

		//Fill out the rest of the line
		if(NewTabLevel==0) //Breaks between top level new lines
			Tags+=(Index && PreLevel==0 ? '<br>' : '');
		else if(PreLevel>=NewTabLevel) //If previous item needs to be ended (new level is not greater and not 0)
			Tags+='</li><li>';

		NewLines.push(Tags, Str);
	});

	//Finish de-dent as needed
	var Final=[NewLines.shift()];
	var EndLine='';
	while(CurTabLevel--)
		EndLine+='</li></ul>';
	NewLines.push(EndLine);

	//Combine each line with the tags
	for(var i=0;i<NewLines.length;i+=2)
		Final.push(NewLines[i+0]+NewLines[i+1]);



	//Update from the replaced text
	$('#RenderText').val(Final.join("\n"));
	Render();
});

});</script>

</head>
<body>
	<div class=TopBar>
		<input type=button id=EscapeHTML value="Escape HTML">
		<input type=button id=Listize value="Listize">
		<? if($AllowRenderText) { ?> <input type=button id=OpenInNewPage value="Open In New Page"> <? } ?>
		<input type=button id=Undo value="Undo">
		<input type=button id=Redo value="Redo">
	</div>
	<form action="FormatText.php" method=post id=RenderForm target="_blank" class=HalfScreen>
		<textarea id=RenderText name=RenderText></textarea>
		<input type=submit class=Hide>
	</form>
	<div id=RenderHTML class=HalfScreen></div>
</body>
</html>
Converting Videos to HTML5 Formats

The following is a Linux script I threw together to convert an mpeg4 (.mp4, .mpv, .mpeg4, .mpg) file into ogg vorbis (.ogg, .ogv) and flash video (.flv) maintaining the same bit rate. It also doesn’t hurt to have a .webm format for some older devices. This assumes you already have the appropriate packages installed, including ffmpeg. The script also wasn’t made to be foolproof, and might not be able to read the bitrate on some videos. The script takes a list of mpeg4 files to convert.

for filename in "$@"
do
  extension="${filename##*.}"
  filename="${filename%.*}"

  export BITRATE=`ffmpeg -i $filename.$extension 2>&1 | grep -oP 'Video.*\d+ kb/s' | grep -oP '\d+ kb/s' | grep -oP '\d+'`
  ffmpeg2theora -V $BITRATE -o $filename.ogv $filename.$extension
  ffmpeg -i $filename.$extension -ar 44100 -b ${BITRATE}k -f flv $filename.flv #Add an optional "-threads #" to make this faster
done